From a224b8e6793837b0848fae5fe0489b3118e042c2 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Sat, 20 Sep 2025 20:23:47 -0400 Subject: [PATCH 1/7] Add ORCIDS for 2024.findings-acl --- data/xml/2024.findings.xml | 2864 ++++++++++++++++++------------------ 1 file changed, 1432 insertions(+), 1432 deletions(-) diff --git a/data/xml/2024.findings.xml b/data/xml/2024.findings.xml index 5d42b72134..81138ab654 100644 --- a/data/xml/2024.findings.xml +++ b/data/xml/2024.findings.xml @@ -6120,7 +6120,7 @@ Controllable Data Augmentation for Few-Shot Text Mining with Chain-of-Thought Attribute Manipulation LetianPeng - YuweiZhangUniversity of California, San Diego + YuweiZhangUniversity of California, San Diego JingboShangUniversity of California, San Diego 1-16 Prompting large language models (LLMs) for data augmentation has recently become a common practice in few-shot NLP tasks. In this paper, we propose Chain-of-Thought Attribute Manipulation (CoTAM), a novel approach that generates new data from existing examples by only tweaking in the user-provided, task-specific attribute, e.g., sentiment polarity or topic in movie reviews. Instead of conventional latent representation controlling, we leverage the chain-of-thought prompting to directly edit the text in three steps, (1) attribute decomposition, (2) manipulation proposal, and (3) sentence reconstruction. Extensive results on various tasks, such as text (pair) classification and aspect-based sentiment analysis, verify the superiority of CoTAM over other LLM-based augmentation methods with the same number of training examples for both fine-tuning and in-context learning. Remarkably, the 2D visualization of the augmented dataset using principle component analysis revealed a human-recognizable decision boundary that is likely hinted by the attribute manipulation, demonstrating the potential of our proposed approach. @@ -6130,9 +6130,9 @@ Match More, Extract Better! 
Hybrid Matching Model for Open Domain Web Keyphrase Extraction - MingyangSongTencent + MingyangSongTencent LipingJingBeijing Jiaotong University - YiFeng + YiFeng 17-27 Keyphrase extraction aims to automatically extract salient phrases representing the critical information in the source document. Identifying salient phrases is challenging because there is a lot of noisy information in the document, leading to wrong extraction. To address this issue, in this paper, we propose a hybrid matching model for keyphrase extraction, which combines representation-focused and interaction-based matching modules into a unified framework for improving the performance of the keyphrase extraction task. Specifically, HybridMatch comprises (1) a PLM-based Siamese encoder component that represents both candidate phrases and documents, (2) an interaction-focused matching (IM) component that estimates word matches between candidate phrases and the corresponding document at the word level, and (3) a representation-focused matching (RM) component captures context-aware semantic relatedness of each candidate keyphrase at the phrase level. Extensive experimental results on the OpenKP dataset demonstrate that the performance of the proposed model HybridMatch outperforms the recent state-of-the-art keyphrase extraction baselines. Furthermore, we discuss the performance of large language models in keyphrase extraction based on recent studies and our experiments. 
2024.findings-acl.2 @@ -6145,7 +6145,7 @@ SichengZhang ShijieCaoMicrosoft Research Asia DaYouDuHKUST(GZ) - JianyuWei + JianyuWei TingCaoMicrosoft Research NingyiXuShanghai Jiaotong University 28-36 @@ -6168,7 +6168,7 @@ Overcoming Catastrophic Forgetting by Exemplar Selection in Task-oriented Dialogue System ChenChenNanyang Technological University - RuizheLiUniversity of Aberdeen + RuizheLiUniversity of Aberdeen YuchenHu YuanyuanChenNanyang Technological University ChengweiQinNanyang Technological University @@ -6193,7 +6193,7 @@ AlexGuMassachusetts Institute of Technology Wen-DingLiCornell University NamanJainUniversity of California, Berkeley - TheoOlaussonMassachusetts Institute of Technology + TheoOlaussonMassachusetts Institute of Technology CelineLeeCornell University KoushikSenUC Berkeley, University of California, Berkeley ArmandoSolar-LezamaMassachusetts Institute of Technology @@ -6211,7 +6211,7 @@ BaileyKuehl ChenhaoTanUniversity of Chicago DavidWaddenAllen Institute for Artificial Intelligence - LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence + LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence AakankshaNaikAllen Institute for Artificial Intelligence and National Institutes of Health 118-132 Literature review requires researchers to synthesize a large amount of information and is increasingly challenging as the scientific literature expands. In this work, we investigate the potential of LLMs for producing hierarchical organizations of scientific studies to assist researchers with literature review. We define hierarchical organizations as tree structures where nodes refer to topical categories and every node is linked to the studies assigned to that category. Our naive LLM-based pipeline for hierarchy generation from a set of studies produces promising yet imperfect hierarchies, motivating us to collect CHIME, an expert-curated dataset for this task focused on biomedicine. 
Given the challenging and time-consuming nature of building hierarchies from scratch, we use a human-in-the-loop process in which experts correct errors (both links between categories and study assignment) in LLM-generated hierarchies. CHIME contains 2,174 LLM-generated hierarchies covering 472 topics, and expert-corrected hierarchies for a subset of 100 topics. Expert corrections allow us to quantify LLM performance, and we find that while they are quite good at generating and organizing categories, their assignment of studies to categories could be improved. We attempt to train a corrector model with human feedback which improves study assignment by 12.6 F1 points. We release our dataset and models to encourage research on developing better assistive tools for literature review. @@ -6221,16 +6221,16 @@ Which Side Are You On? A Multi-task Dataset for End-to-End Argument Summarisation and Evaluation - HaoLi - YupingWu + HaoLi + YupingWu ViktorSchlegelImperial College London RizaBatista-NavarroUniversity of Manchester - TharinduMadusanka + TharinduMadusanka IqraZahid JiayanZeng XiaochiWang XinranHe - YizhiLiUniversity of Manchester and University of Sheffield + YizhiLiUniversity of Manchester and University of Sheffield GoranNenadicUniversity of Manchester 133-150 With the recent advances of large language models (LLMs), it is no longer infeasible to build an automated debate system that helps people to synthesise persuasive arguments. Previous work attempted this task by integrating multiple components. 
In our work, we introduce an argument mining dataset that captures the end-to-end process of preparing an argumentative essay for a debate, which covers the tasks of claim and evidence identification (Task 1 ED), evidence convincingness ranking (Task 2 ECR), argumentative essay summarisation and human preference ranking (Task 3 ASR) and metric learning for automated evaluation of resulting essays, based on human feedback along argument quality dimensions (Task 4 SQE). Our dataset contains 14k examples of claims that are fully annotated with various properties supporting the aforementioned tasks. We evaluate multiple generative baselines for each of these tasks, including representative LLMs. We find, that while they show promising results on individual tasks in our benchmark, their end-to-end performance on all four tasks in succession deteriorates significantly, both in automated measures as well as in human-centred evaluation. This challenge presented by our proposed dataset motivates future research on end-to-end argument mining and summarisation. The repository of this project is available at https://github.com/HarrywillDr/ArgSum-Datatset. 
@@ -6245,7 +6245,7 @@ SarathkrishnaSwaminathanInternational Business Machines AsafYehudai SubhajitChaudhuryInternational Business Machines - RaduFlorianInternational Business Machines + RaduFlorianInternational Business Machines RamónAstudilloInternational Business Machines AsimMunawarInternational Business Machines 151-162 @@ -6256,16 +6256,16 @@ Graph Chain-of-Thought: Augmenting Large Language Models by Reasoning on Graphs - BowenJin + BowenJin ChulinXieUniversity of Illinois, Urbana Champaign JiaweiZhang Kashob KumarRoyDepartment of Computer Science - YuZhangTexas A&M University - College Station + YuZhangTexas A&M University - College Station ZhengLiAmazon RuiruiLi XianfengTangAmazon - SuhangWangPennsylvania State University - YuMengUniversity of Virginia + SuhangWangPennsylvania State University + YuMengUniversity of Virginia JiaweiHan 163-184 Large language models (LLMs), while exhibiting exceptional performance, suffer from hallucinations, especially on knowledge-intensive tasks. Existing works propose to augment LLMs with individual text units retrieved from external knowledge corpora to alleviate the issue. However, in many domains, texts are interconnected (e.g., academic papers in a bibliographic graph are linked by citations and co-authorships) which form a (text-attributed) graph. The knowledge in such graphs is encoded not only in single texts/nodes but also in their associated connections. To facilitate the research of augmenting LLMs with graphs, we manually construct a Graph Reasoning Benchmark dataset called GRBench, containing 1,740 questions that can be answered with the knowledge from 10 domain graphs. Then, we propose a simple and effective framework called Graph Chain-of-thought (Graph-CoT) to augment LLMs with graphs by encouraging LLMs to reason on the graph iteratively. Each Graph-CoT iteration consists of three sub-steps: LLM reasoning, LLM-graph interaction, and graph execution. 
We conduct systematic experiments with three LLM backbones on GRBench, where Graph-CoT outperforms the baselines consistently. The code is available at https://github.com/PeterGriffinJin/Graph-CoT/. @@ -6277,7 +6277,7 @@ <fixed-case>T</fixed-case>ext2<fixed-case>DB</fixed-case>: Integration-Aware Information Extraction with Large Language Model Agents YizhuJiaoUIUC ShaLiUniversity of Illinois, Urbana Champaign - SizheZhou + SizheZhou HengJiUniversity of Illinois, Urbana-Champaign JiaweiHan 185-205 @@ -6304,7 +6304,7 @@ MahmoudSalemCerebras Systems, Inc ShreyasSaxenaCerebras Systems, Inc Chen-YuLeong - JoelHestnessCerebras Systems, Inc + JoelHestnessCerebras Systems, Inc SeanLieCerebras Systems, Inc 214-230 Large language models (LLMs) are typically trained on general source data for various domains, but a recent surge in domain-specific LLMs has shown their potential to outperform general-purpose models in domain-specific tasks (e.g., biomedicine). Although domain-specific pre-training enhances efficiency and leads to smaller models, the computational costs of training these LLMs remain high, posing budgeting challenges. We introduce MediSwift, a suite of biomedical LMs that leverage sparse pre-training on domain-specific biomedical text data. By inducing up to 75% weight sparsity during the pre-training phase, MediSwift achieves a 2-2.5x reduction in training FLOPs. Notably, all sparse pre-training was performed on the Cerebras CS-2 system, which is specifically designed to realize the acceleration benefits from unstructured weight sparsity, thereby significantly enhancing the efficiency of the MediSwift models. Through subsequent dense fine-tuning and strategic soft prompting, MediSwift models outperform existing LLMs up to 7B parameters on biomedical tasks, setting new benchmarks w.r.t efficiency-accuracy on tasks such as PubMedQA.
Our results show that sparse pre-training, along with dense fine-tuning and soft prompting, offers an effective method for creating high-performing, computationally efficient models in specialized domains. @@ -6325,11 +6325,11 @@ <fixed-case>P</fixed-case>-<fixed-case>TA</fixed-case>: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models - ShuoYang - ChenchenYuan + ShuoYang + ChenchenYuan YaoRong - FelixSteinbauerDepartment of Informatics, Technische Universität München - GjergjiKasneciTechnische Universität München and University of Tuebingen + FelixSteinbauerDepartment of Informatics, Technische Universität München + GjergjiKasneciTechnische Universität München and University of Tuebingen 248-264 A multitude of industries depend on accurate and reasonable tabular data augmentation for their business processes. Contemporary methodologies in generating tabular data revolve around utilizing Generative Adversarial Networks (GAN) or fine-tuning Large Language Models (LLM). However, GAN-based approaches are documented to produce samples with common-sense errors attributed to the absence of external knowledge. On the other hand, LLM-based methods exhibit a limited capacity to capture the disparities between synthesized and actual data distribution due to the absence of feedback from a discriminator during training. Furthermore, the decoding of LLM-based generation introduces gradient breakpoints, impeding the backpropagation of loss from a discriminator, thereby complicating the integration of these two approaches. To solve this challenge, we propose using proximal policy optimization (PPO) to apply GANs, guiding LLMs to enhance the probability distribution of tabular features. This approach enables the utilization of LLMs as generators for GANs in synthesizing tabular data.
Our experiments demonstrate that PPO leads to an approximately 4% improvement in the accuracy of models trained on synthetically generated data over state-of-the-art across three real-world datasets. 2024.findings-acl.16 @@ -6353,7 +6353,7 @@ ShuohangWang YangLiu ChenguangZhuZoom - JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego 283-294 Large language models (LLMs) such as GPT-3 and GPT-4 are powerful but their weights are often publicly unavailable and their immense sizes make the models difficult to be tuned with common hardware. As a result, effectively tuning these models with large-scale supervised data can be challenging. As an alternative, In-Context Learning (ICL) can only use a small number of supervised examples due to context length limits. In this paper, we propose Super In-Context Learning (SuperICL) which allows black-box LLMs to work with locally fine-tuned smaller models, resulting in superior performance on supervised tasks. Our experiments demonstrate that SuperICL can improve performance beyond state-of-the-art fine-tuned models while addressing the instability problem of in-context learning. 2024.findings-acl.18 @@ -6362,7 +6362,7 @@ Are self-explanations from Large Language Models faithful? 
- AndreasMadsenMontreal Institute for Learning Algorithms, École Polytechnique de Montréal, Université de Montréal and Mila + AndreasMadsenMontreal Institute for Learning Algorithms, École Polytechnique de Montréal, Université de Montréal and Mila SarathChandarPolytechnique Montreal SivaReddyMila, McGill University and Mila, McGill University 295-337 @@ -6376,10 +6376,10 @@ Henry PengZou VinaySamuel YueZhou - WeizhiZhang + WeizhiZhang LianchengFang ZiheSong - Philip S.Yu + Philip S.Yu CorneliaCaragea 338-354 Existing datasets for attribute value extraction (AVE) predominantly focus on explicit attribute values while neglecting the implicit ones, lack product images, are often not publicly available, and lack an in-depth human inspection across diverse domains. To address these limitations, we present ImplicitAVE, the first, publicly available multimodal dataset for implicit attribute value extraction. ImplicitAVE, sourced from the MAVE dataset, is carefully curated and expanded to include implicit AVE and multimodality, resulting in a refined dataset of 68k training and 1.6k testing data across five domains. We also explore the application of multimodal large language models (MLLMs) to implicit AVE, establishing a comprehensive benchmark for MLLMs on the ImplicitAVE dataset. Six recent MLLMs with eleven variants are evaluated across diverse settings, revealing that implicit value extraction remains a challenging task for MLLMs. The contributions of this work include the development and release of ImplicitAVE, and the exploration and benchmarking of various MLLMs for implicit AVE, providing valuable insights and potential future research directions. Dataset and code are available at https://github.com/HenryPengZou/ImplicitAVE. 
@@ -6407,7 +6407,7 @@ UtkarshTyagi SSakshi SanjoyChowdhuryUniversity of Maryland, College Park - DineshManochaUniversity of Maryland, College Park + DineshManochaUniversity of Maryland, College Park 386-406 Neural image classifiers can often learn to make predictions by overly relying on non-predictive features that are spuriously correlated with the class labels in the training data. This leads to poor performance in real-world atypical scenarios where such features are absent. This paper presents ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval), a simple yet effective solution for supplementing the training dataset with images without spurious features, for robust learning against spurious correlations via better generalization. ASPIRE, guided by language at various steps, can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set. Precisely, we employ LLMs to first extract foreground and background features from textual descriptions of an image, followed by advanced language-guided image editing to discover the features that are spuriously correlated with the class label. Finally, we personalize a text-to-image generation model using the edited images to generate diverse in-domain images without spurious features. ASPIRE is complementary to all prior robust training methods in literature, and we demonstrate its effectiveness across 4 datasets and 9 baselines and show that ASPIRE improves the worst-group classification accuracy of prior methods by 1% - 38%. We also contribute a novel test set for the challenging Hard ImageNet dataset. 
2024.findings-acl.22 @@ -6416,14 +6416,14 @@ Tables as Texts or Images: Evaluating the Table Reasoning Ability of <fixed-case>LLM</fixed-case>s and <fixed-case>MLLM</fixed-case>s - NaihaoDeng + NaihaoDeng ZhenjieSun RuiqiHe AmanSikkaUniversity of Michigan - Ann Arbor YulongChenUniversity of Cambridge - LinMaUniversity of Michigan - Ann Arbor - YueZhangWestlake University - RadaMihalceaUniversity of Michigan + LinMaUniversity of Michigan - Ann Arbor + YueZhangWestlake University + RadaMihalceaUniversity of Michigan 407-426 Tables contrast with unstructured text data by its structure to organize the information.In this paper, we investigate the efficiency of various LLMs in interpreting tabular data through different prompting strategies and data formats. Our analysis extends across six benchmarks for table-related tasks such as question-answering and fact-checking. We pioneer in the assessment of LLMs’ performance on image-based table representation. Specifically, we compare five text-based and three image-based table representations, revealing the influence of representation and prompting on LLM performance. We hope our study provides researchers insights into optimizing LLMs’ application in table-related tasks. 2024.findings-acl.23 @@ -6462,7 +6462,7 @@ <fixed-case>LLM</fixed-case>-<fixed-case>QAT</fixed-case>: Data-Free Quantization Aware Training for Large Language Models ZechunLiuMeta Inc. BarlasOguzMeta - ChangshengZhaoMeta Inc. + ChangshengZhaoMeta Inc. ErnieChangMeta AI PierreStockFacebook YasharMehdadFacebook @@ -6479,15 +6479,15 @@ <fixed-case>I</fixed-case>nfi<fixed-case>MM</fixed-case>: Advancing Multimodal Understanding with an Open-Sourced Visual Language Model HaogengLiu QuanzengYouByteDance - YiqiWang - XiaotianHanByteDance + YiqiWang + XiaotianHanByteDance BohanZhaiSnowflake YongfeiLiuBytedance WentaoChenByteDance Inc. YirenJianByteDance Inc. 
YunzheTaoByteDance JianboYuanBytedance - RanHeInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + RanHeInstitute of automation, Chinese academy of science, Chinese Academy of Sciences HongxiaYang 485-492 In this work, we present InfiMM, an advanced Multimodal Large Language Model that adapts to intricate vision-language tasks. InfiMM, inspired by the Flamingo architecture, distinguishes itself through the utilization of large-scale training data, comprehensive training strategies, and diverse large language models. This approach ensures the preservation of Flamingo’s foundational strengths while simultaneously introducing augmented capabilities. Empirical evaluations across a variety of benchmarks underscore InfiMM’s remarkable capability in multimodal understanding. The code can be found at: https://anonymous.4open.science/r/infimm-zephyr-F60C/. @@ -6501,7 +6501,7 @@ YixinCaoFudan University LiangmingPan YuboMaSchool of Computer Science and Engineering, Nanyang Technological University - AixinSunNanyang Technological University + AixinSunNanyang Technological University 493-516 Although achieving great success, Large Language Models (LLMs) usually suffer from unreliable hallucinations. Although language attribution can be a potential solution, there are no suitable benchmarks and evaluation metrics to attribute LLMs to structured knowledge. In this paper, we define a new task of Knowledge-aware Language Model Attribution (KaLMA) that improves upon three core concerns with conventional attributed LMs. First, we extend attribution source from unstructured texts to Knowledge Graph (KG), whose rich structures benefit both the attribution performance and working scenarios. Second, we propose a new “Conscious Incompetence” setting considering the incomplete knowledge repository, where the model identifies the need for supporting knowledge beyond the provided KG. 
Third, we propose a comprehensive automatic evaluation metric encompassing text quality, citation quality, and text citation alignment. To implement the above innovations, we build a dataset in biography domain BioKaLMA via evolutionary question generation strategy, to control the question complexity and necessary knowledge to the answer. For evaluation, we develop a baseline solution and demonstrate the room for improvement in LLMs’ citation generation, emphasizing the importance of incorporating the “Conscious Incompetence” setting, and the critical role of retrieval accuracy. 2024.findings-acl.28 @@ -6515,7 +6515,7 @@ VipulRahejaColumbia University, Grammarly and International Institute of Information Technology Hyderabad Jong InnPark Zae MyungKimUniversity of Minnesota - Twin Cities - DongyeopKangUniversity of Minnesota + DongyeopKangUniversity of Minnesota 517-545 Large Language Models (LLMs) have recently been shown to be effective as automatic evaluators with simple prompting and in-context learning. In this work, we assemble 16 LLMs encompassing four different size ranges and evaluate their output responses by preference ranking from the other LLMs as evaluators, such as System Star is better than System Square. We then evaluate the quality of ranking outputs introducing the Cognitive Bias Benchmark for LLMs as Evaluators (CoBBLer), a benchmark to measure six different cognitive biases in LLM evaluation outputs, such as the Egocentric bias where a model prefers to rank its own outputs highly in evaluation. We find that LLMs are biased text quality evaluators, exhibiting strong indications on our bias benchmark (40% of comparisons made by all models) within each of their evaluations that question their robustness as evaluators. Furthermore, we examine the correlation between human and machine preferences and calculate the average Rank-Biased Overlap (RBO) score to be 44%, indicating that machine preferences are misaligned with humans. 
According to our findings, LLMs may still be unable to be utilized for automatic annotation aligned with human preferences. 2024.findings-acl.29 @@ -6527,7 +6527,7 @@ ChongLiInstitute of automation, Chinese Academy of Sciences WenYangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences JiajunZhangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences - JinliangLuInstitute of automation, Chinese Academy of Sciences + JinliangLuInstitute of automation, Chinese Academy of Sciences ShaonanWang ChengqingZongInstitute of automation, Chinese academy of science, Chinese Academy of Sciences 546-566 @@ -6538,11 +6538,11 @@ Muffin: Mitigating Unhelpfulness in Emotional Support Conversations with Multifaceted <fixed-case>AI</fixed-case> Feedback - JiashuoWang + JiashuoWang ChunpuXu Chak TouLeongHong Kong Polytechnic University - WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University - JingLiThe Hong Kong Polytechnic University + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + JingLiThe Hong Kong Polytechnic University 567-585 2024.findings-acl.31 wang-etal-2024-muffin @@ -6554,7 +6554,7 @@ IvanKobyzevHuawei Noah’s Ark Lab PengLuUniversity of Montreal MehdiRezagholizadeh - BangLiuUniversity of Montreal + BangLiuUniversity of Montreal 586-598 This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. 
Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD position better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications. 2024.findings-acl.32 @@ -6564,13 +6564,13 @@ <fixed-case>M</fixed-case>ed<fixed-case>A</fixed-case>gents: Large Language Models as Collaborators for Zero-shot Medical Reasoning XiangruTangYale University - AnniZou - ZhuoshengZhangShanghai Jiao Tong University + AnniZou + ZhuoshengZhangShanghai Jiao Tong University ZimingLi YilunZhaoYale University XingyaoZhangAlibaba Group ArmanCohanYale University and Allen Institute for Artificial Intelligence - MarkGersteinYale University + MarkGersteinYale University 599-621 Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare. This field faces unique challenges such as domain-specific terminologies and reasoning over specialized knowledge. To address these issues, we propose MedAgents, a novel multi-disciplinary collaboration framework for the medical domain. MedAgents leverages LLM-based agents in a role-playing setting that participate in a collaborative multi-round discussion, thereby enhancing LLM proficiency and reasoning capabilities. 
This training-free framework encompasses five critical steps: gathering domain experts, proposing individual analyses, summarising these analyses into a report, iterating over discussions until a consensus is reached, and ultimately making a decision. Our work focuses on the zero-shot setting, which is applicable in real-world scenarios. Experimental results on nine datasets (MedQA, MedMCQA, PubMedQA, and six subtasks from MMLU) establish that our proposed MedAgents framework excels at mining and harnessing the medical expertise within LLMs, as well as extending its reasoning abilities. Our code can be found at https://github.com/gersteinlab/MedAgents. 2024.findings-acl.33 @@ -6579,11 +6579,11 @@ Meta-Reasoning: Semantics-Symbol Deconstruction for Large Language Models - YimingWangShanghai Jiao Tong University - ZhuoshengZhangShanghai Jiao Tong University + YimingWangShanghai Jiao Tong University + ZhuoshengZhangShanghai Jiao Tong University PeiZhangAlibaba Group BaosongYang - RuiWangShanghai Jiao Tong University + RuiWangShanghai Jiao Tong University 622-643 Neural-symbolic methods have demonstrated efficiency in enhancing the reasoning abilities of large language models (LLMs). However, existing methods mainly rely on syntactically mapping natural languages to complete formal languages like Python and SQL. Those methods require that reasoning tasks be convertible into programs, which cater to the computer execution mindset and deviate from human reasoning habits. To broaden symbolic methods’ applicability and adaptability in the real world, we propose Meta-Reasoning from a linguistic perspective. This method empowers LLMs to deconstruct reasoning-independent semantic information into generic symbolic representations, thereby efficiently capturing more generalized reasoning knowledge. 
We conduct extensive experiments on more than ten datasets encompassing conventional reasoning tasks like arithmetic, symbolic, and logical reasoning, and the more complex interactive reasoning tasks like theory-of-mind reasoning. Experimental results demonstrate that Meta-Reasoning significantly enhances in-context reasoning accuracy, learning efficiency, out-of-domain generalization, and output stability compared to the Chain-of-Thought technique. 2024.findings-acl.34 @@ -6593,13 +6593,13 @@ <fixed-case>DPDLLM</fixed-case>: A Black-box Framework for Detecting Pre-training Data from Large Language Models BaohangZhouNankai University - ZezhongWang - LingzhiWangThe Chinese University of Hong Kong - HongruWangThe Chinese University of Hong Kong + ZezhongWang + LingzhiWangThe Chinese University of Hong Kong + HongruWangThe Chinese University of Hong Kong YingZhangNankai University KehuiSong XuhuiSui - Kam-FaiWongThe Chinese University of Hong Kong + Kam-FaiWongThe Chinese University of Hong Kong 644-653 The success of large language models (LLM) benefits from large-scale model parameters and large amounts of pre-training data. However, the textual data for training LLM can not be confirmed to be legal because they are crawled from different web sites. For example, there are copyrighted articles, personal reviews and information in the pre-training data for LLM which are illegal. To address the above issue and develop legal LLM, we propose to detect the pre-training data from LLM in a pure black-box way because the existing LLM services only return the generated text. The previous most related works are the membership inference attack (MIA) on machine learning models to detect the training data from them. But the existing methods are based on analyzing the output probabilities of models which are unrealistic to LLM services. 
To tackle the problem, we firstly construct the benchmark datasets by collecting textual data from different domains as the seen and unseen pre-training data for LLMs. Then, we investigate a black-box framework named DPDLLM, with the only access to the generated texts from LLM for detecting textual data whether was used to train it. In the proposed framework, we exploit GPT-2 as the reference model to fit the textual data and feed the generated text from LLM into it to acquire sequence probabilities as the significant feature for detection. The experimental results on the benchmark datasets demonstrate that DPDLLM is effective on different popular LLMs and outperforms the existing methods. 2024.findings-acl.35 @@ -6610,9 +6610,9 @@ <fixed-case>PACIT</fixed-case>: Unlocking the Power of Examples for Better In-Context Instruction Tuning TianciXue ZiqiWang - YixiaLi + YixiaLi YunChenShanghai University of Finance and Economics - GuanhuaChenSouthern University of Science and Technology + GuanhuaChenSouthern University of Science and Technology 654-665 Instruction tuning enhances the instruction following ability of large language models by finetuning with supervised instruction data. Previous work proposes in-context instruction tuning (ICIT) where specific positive or negative examples are incorporated into the prompt for better performance. In this work, we propose PACIT, a simple and effective in-context instruction tuning method, inspired by the pedagogical concept of desirable difficulty. The PACIT method unlocks the power of examples by encouraging the model to actively learn to grasp the distinctions between the positive and negative examples instead of merely reading. The model is expected to first verify the correctness of the provided example according to the task description, which is then set as the condition for generating a better response to the task instance. 
Our extensive experiments prove the effectiveness of PACIT, outperforming ICIT baseline on both in-domain and out-domain tasks up to 9.16 and 3.14 average ROUGE-L scores, respectively. Moreover, PACIT can notably enhance the performance of instruction tuning even when all positive and negative examples are generated with a self-instruct method. 2024.findings-acl.36 @@ -6626,7 +6626,7 @@ ChengweiQinNanyang Technological University QiushiZhu EngSiongChngNanyang Technological University - RuizheLiUniversity of Aberdeen + RuizheLiUniversity of Aberdeen 666-679 Recent advances in large language models (LLMs) have promoted generative error correction (GER) for automatic speech recognition (ASR), which aims to predict the ground-truth transcription from the decoded N-best hypotheses. Thanks to the strong language generation ability of LLMs and rich information in the N-best list, GER shows great effectiveness in enhancing ASR results. However, it still suffers from two limitations: 1) LLMs are unaware of the source speech during GER, which may lead to results that are grammatically correct but violate the source speech content, 2) N-best hypotheses usually only vary in a few tokens, making it redundant to send all of them for GER, which could confuse LLM about which tokens to focus on and thus lead to increased miscorrection. In this paper, we propose ClozeGER, a new paradigm for ASR generative error correction. First, we introduce a multimodal LLM (i.e., SpeechGPT) to receive source speech as extra input to improve the fidelity of correction output. Then, we reformat GER as a cloze test with logits calibration to remove the input information redundancy and simplify GER with clear instructions. Experiments show that ClozeGER achieves a new breakthrough over vanilla GER on 9 popular ASR datasets. 
2024.findings-acl.37 @@ -6638,7 +6638,7 @@ HaoYue ShaopengLaiAlibaba Group ChengyiYang - LiangZhang + LiangZhang JunfengYaoXiamen University JinsongSuXiamen University 680-691 @@ -6662,15 +6662,15 @@ <fixed-case>C</fixed-case>ode<fixed-case>M</fixed-case>: Less Data Yields More Versatility via Ability Matrix - DaoguangZan + DaoguangZan AilunYu WeiLiu BoShen ShaoxinLin - YongshunGongShandong University + YongshunGongShandong University YafenYao YanLiu - BeiGuan + BeiGuan WeihuaLuoAlibaba Group YongjiWang QianxiangWangPeking University @@ -6685,9 +6685,9 @@ Do <fixed-case>LVLM</fixed-case>s Understand Charts? Analyzing and Correcting Factual Errors in Chart Captioning Kung-HsiangHuangSalesForce.com MingyangZhou - Hou PongChanAlibaba Group + Hou PongChanAlibaba Group YiFung - ZhenhailongWang + ZhenhailongWang LingyuZhangDuke University Shih-FuChangColumbia, Columbia University, Columbia University, Columbia University, Columbia University, Columbia University and Columbia University HengJiUniversity of Illinois, Urbana-Champaign @@ -6699,10 +6699,10 @@ <fixed-case>BIDER</fixed-case>: Bridging Knowledge Inconsistency for Efficient Retrieval-Augmented <fixed-case>LLM</fixed-case>s via Key Supporting Evidence - JiajieJinRenmin University of China - YutaoZhu - YujiaZhouTsinghua University, Tsinghua University - ZhichengDouRenmin University of China + JiajieJinRenmin University of China + YutaoZhu + YujiaZhouTsinghua University, Tsinghua University + ZhichengDouRenmin University of China 750-761 Retrieval-augmented large language models (LLMs) have demonstrated efficacy in knowledge-intensive tasks such as open-domain QA, addressing inherent challenges in knowledge update and factual inadequacy.However, inconsistencies between retrieval knowledge and the necessary knowledge for LLMs, leading to a decline in LLM’s answer quality. 
This paper introduces BIDER, an approach that refines retrieval documents into Key Supporting Evidence (KSE) through knowledge synthesis, supervised fine-tuning (SFT), and preference alignment. We train BIDER by learning from crafting KSE, while maximizing its output to align with LLM’s information acquisition preferences through reinforcement learning. Evaluations across five datasets show BIDER boosts LLMs’ answer quality by 7% while reducing input content length in retrieval documents by 80%, outperforming existing methods. The proposed KSE simulation effectively equips LLMs with essential information for accurate question answering. 2024.findings-acl.42 @@ -6711,7 +6711,7 @@ Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions - WenxuanWangNational Lab of Pattern Recognition, Institute of Automation,Chinese Academy of Sciences and Beijing Academy of Artificial Intelligence + WenxuanWangNational Lab of Pattern Recognition, Institute of Automation,Chinese Academy of Sciences and Beijing Academy of Artificial Intelligence YisiZhangUniversity of Science and Technology Beijing XingjianHe, Institute of automation, Chinese academy of science YichenYan @@ -6727,10 +6727,10 @@ Incremental Sequence Labeling: A Tale of Two Shifts ShengjieQiu - JunhaoZheng + JunhaoZheng ZhenLiu YichengLuo - QianliMaSouth China University of Technology + QianliMaSouth China University of Technology 777-791 The incremental sequence labeling task involves continuously learning new classes over time while retaining knowledge of the previous ones. Our investigation identifies two significant semantic shifts: E2O (where the model mislabels an old entity as a non-entity) and O2E (where the model labels a non-entity or old entity as a new entity). Previous research has predominantly focused on addressing the E2O problem, neglecting the O2E issue. 
This negligence results in a model bias towards classifying new data samples as belonging to the new class during the learning process. To address these challenges, we propose a novel framework, Incremental Sequential Labeling without Semantic Shifts (IS3). Motivated by the identified semantic shifts (E2O and O2E), IS3 aims to mitigate catastrophic forgetting in models. As for the E2O problem, we use knowledge distillation to maintain the model’s discriminative ability for old entities. Simultaneously, to tackle the O2E problem, we alleviate the model’s bias towards new entities through debiased loss and optimization levels. Our experimental evaluation, conducted on three datasets with various incremental settings, demonstrates the superior performance of IS3 compared to the previous state-of-the-art method by a significant margin. 2024.findings-acl.44 @@ -6745,8 +6745,8 @@ TingjianZhangTsinghua University, Tsinghua University LunyiuNieUniversity of Texas at Austin LinmeiHuBeijing Institute of Technology - LeiHouTsinghua University, Tsinghua University - JuanziLi + LeiHouTsinghua University, Tsinghua University + JuanziLi 792-815 Knowledge Base Question Answering (KBQA) aims to answer natural language questions based on facts in knowledge bases. A typical approach to KBQA is semantic parsing, which translates a question into an executable logical form in a formal language. Recent works leverage the capabilities of large language models (LLMs) for logical form generation to improve performance. However, although it is validated that LLMs are capable of solving some KBQA problems, there has been little discussion on the differences in LLMs’ proficiency in formal languages used in semantic parsing. In this work, we propose to evaluate the understanding and generation ability of LLMs to deal with differently structured logical forms by examining the inter-conversion of natural and formal language through in-context learning of LLMs.
Extensive experiments with models of different sizes show that state-of-the-art LLMs can understand formal languages as well as humans, but generating correct logical forms given a few examples remains a challenge. Most importantly, our results also indicate that LLMs exhibit considerable sensitivity. In general, the formal language with a lower formalization level, i.e., the more similar it is to natural language, is more friendly to LLMs. Code and data can be found at https://github.com/Matthewlliu/structure_probe. 2024.findings-acl.45 @@ -6784,7 +6784,7 @@ BlaineHillUniversity of Illinois at Urbana-Champaign BoxinDuAmazon FeiWangAmazon - HanghangTong + HanghangTong 839-850 Conversational question answering (ConvQA) over knowledge graphs (KGs) involves answering multi-turn natural language questions about information contained in a KG. State-of-the-art methods of ConvQA often struggle with inexplicit question-answer pairs. These inputs are easy for human beings to understand given a conversation history, but hard for a machine to interpret, which can degrade ConvQA performance. To address this problem, we propose a reinforcement learning (RL) based model, CoRnNet, which utilizes question reformulations generated by large language models (LLMs) to improve ConvQA performance. CoRnNet adopts a teacher-student architecture where a teacher model learns question representations using human writing reformulations, and a student model to mimic the teacher model’s output via reformulations generated by LLMs. The learned question representation is then used by a RL model to locate the correct answer in a KG. Extensive experimental results show that CoRnNet outperforms state-of-the-art ConvQA models. 
2024.findings-acl.48 @@ -6794,7 +6794,7 @@ Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step by Step LiZhong - ZilongWangUniversity of California, San Diego + ZilongWangUniversity of California, San Diego JingboShangUniversity of California, San Diego 851-870 Large language models (LLMs) are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce Large Language Model Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM selections.
@@ -6804,10 +6804,10 @@ Effective In-Context Example Selection through Data Compression - ZhongXiangSun + ZhongXiangSun KepuZhang HaoyuWang - XiaoZhang + XiaoZhang JunXuRenmin University of China 871-877 In-context learning has been extensively validated in large language models. However, the mechanism and selection strategy for in-context example selection, which is a crucial ingredient in this approach, lacks systematic and in-depth research. In this paper, we propose a data compression approach to the selection of in-context examples. We introduce a two-stage method that can effectively choose relevant examples and retain sufficient information about the training dataset within the in-context examples. Our method shows a significant improvement of an average of 5.90% across five different real-world datasets using four language models. @@ -6821,7 +6821,7 @@ ChongYangByteDance Inc. TuHu XinhaoChen - ManLan + ManLan LiCaiGuizhou University XinlinZhuang XuanLinAnt Group @@ -6835,9 +6835,9 @@ Knowledgeable Preference Alignment for <fixed-case>LLM</fixed-case>s in Domain-specific Question Answering - YichiZhang - ZhuoChenZhejiang University - YinFang + YichiZhang + ZhuoChenZhejiang University + YinFang YanxiLu LiFangming WenZhangZhejiang University @@ -6851,7 +6851,7 @@ <fixed-case>MARIO</fixed-case>: <fixed-case>MA</fixed-case>th Reasoning with code Interpreter Output - A Reproducible Pipeline MinpengLiao - ChengxiLiAlibaba Group + ChengxiLiAlibaba Group WeiLuo WuJingAlibaba Group KaiFanAlibaba Group @@ -6863,8 +6863,8 @@ <fixed-case>D</fixed-case>iffus<fixed-case>P</fixed-case>oll: Conditional Text Diffusion Model for Poll Generation - LeCheng - ShuangyinLi + LeCheng + ShuangyinLi 925-935 Online social media platforms often gather user feedback through polls to enhance user engagement. Automatically generating polls from social media and its context can decrease the labor expenses of media workers and enhance workplace productivity. 
However, on social media platforms, there are internet water armies that manipulate public opinion through sheer numbers and causing the comments to be biased, drowning out minority views. In such circumstances, polls created based on biased comments often have limited types of options and poor coverage. Therefore, it is crucial to diversify the poll options and try to listen to the voices of the minority. To achieve this, we introduce DiffusPoll, a novel paradigm for poll generation based on a non-autoregressive diffusion model that can generate diversified and high-quality samples. Under the new paradigm, we design a task-specific mask strategy tailored to the inherent logic of polls to optimize controlled generation. Furthermore, we also leverage additional attribute tags from comments to enhance the generation quality. Experimental results indicate that DiffusPoll has achieved state-of-the-art performance in both the quality and diversity of poll generation tasks, and is more likely to hit the voices of minority. 2024.findings-acl.54 @@ -6875,7 +6875,7 @@ Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data HaolongLi YuMaByteDance Inc. - YinqiZhangByteDance Inc. and East China Normal University + YinqiZhangByteDance Inc. and East China Normal University ChenYeTongji University JieChen 936-946 @@ -6887,7 +6887,7 @@ Implanting <fixed-case>LLM</fixed-case>’s Knowledge via Reading Comprehension Tree for Toxicity Detection HankunKang - TieyunQianWuhan University + TieyunQianWuhan University 947-962 Toxicity detection plays a crucial role in maintaining the peace of the society. Existing methods can be roughly categorized as small language model (SLM) based and large language model (LLM) based. 
However, due to the limitation of SLMs on general knowledge and the potential embedded bias in LLMs despite their large amount of knowledge, it is not a good idea to detect toxicity only with either SLM or LLM based method. In this work, we propose to implant LLM’s knowledge into SLM based methods such that we can stick to both types of models’ strengths. To this end, we develop a reading comprehension (RC) tree to transfer knowledge between two models. Specifically, we first construct the RC tree, from an extensive to intensive reading perspective, to capture the local and global information in the text. We then model samples encoded by SLM and knowledge extracted from LLM as two distributions using the constructed RC tree. We finally transfer knowledge via optimal transportation between two distributions. Extensive experiments prove the effectiveness of our method on real-world and machine-generated datasets. 2024.findings-acl.56 @@ -6898,17 +6898,17 @@ <fixed-case>LLML</fixed-case>ingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression ZhuoshiPan QianhuiWuMicrosoft - HuiqiangJiangMicrosoft + HuiqiangJiangMicrosoft MenglinXiaMicrosoft XufangLuoMicrosoft Research JueZhangMicrosoft - QingweiLinMicrosoft Research - VictorRühleMicrosoft + QingweiLinMicrosoft Research + VictorRühleMicrosoft YuqingYangResearch, Microsoft Chin-YewLinMicrosoft H. VickyZhaoTsinghua University, Tsinghua University LiliQiuMicrosoft - DongmeiZhangMicrosoft and Microsoft + DongmeiZhangMicrosoft and Microsoft 963-981 This paper focuses on task-agnostic prompt compression for better generalizability and efficiency. Considering the redundancy in natural language, existing approaches compress prompts by removing tokens or lexical units according to their information entropy obtained from a causal language model such as LLaMa-7B.
The challenge is that information entropy may be a suboptimal compression metric: (i) it only leverages unidirectional context and may fail to capture all essential information needed for prompt compression; (ii) it is not aligned with the prompt compression objective. To address these issues, we propose a data distillation procedure to derive knowledge from an LLM to compress prompts without losing crucial information, and meantime, introduce an extractive text compression dataset. We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one, and use a Transformer encoder as the base architecture to capture all essential information for prompt compression from the full bidirectional context. Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT. We evaluate our method on both in-domain and out-of-domain datasets, including MeetingBank, LongBench, ZeroScrolls, GSM8K, and BBH. Despite its small size, our model shows significant performance gains over strong baselines and demonstrates robust generalization ability across different LLMs. Additionally, our model is 3x-6x faster than existing prompt compression methods, while accelerating the end-to-end latency by 1.6x-2.9x with compression ratios of 2x-5x. 2024.findings-acl.57 @@ -6917,7 +6917,7 @@ <fixed-case>E</fixed-case>con<fixed-case>NLI</fixed-case>: Evaluating Large Language Models on Economics Reasoning - YueGuoHong Kong University of Science and Technology + YueGuoHong Kong University of Science and Technology YiYangHong Kong University of Science and Technology 982-994 Large Language Models (LLMs) are widely used for writing economic analysis reports or providing financial advice, but their ability to understand economic knowledge and reason about potential results of specific economic events lacks systematic evaluation.
To address this gap, we propose a new dataset, natural language inference on economic events (EconNLI), to evaluate LLMs’ knowledge and reasoning abilities in the economic domain. We evaluate LLMs on (1) their ability to correctly classify whether a premise event will cause a hypothesis event and (2) their ability to generate reasonable events resulting from a given premise. Our experiments reveal that LLMs are not sophisticated in economic reasoning and may generate wrong or hallucinated answers. Our study raises awareness of the limitations of using LLMs for critical decision-making involving economic reasoning and analysis. The dataset and codes are available at https://github.com/Irenehere/EconNLI. @@ -6928,7 +6928,7 @@ Better Late Than Never: Model-Agnostic Hallucination Post-Processing Framework Towards Clinical Text Summarization SongdaLi - YunqiZhang + YunqiZhang ChunyuanDengRice University YakeNiu HuiZhaoEast China Normal University @@ -6942,9 +6942,9 @@ Finding and Editing Multi-Modal Neurons in Pre-Trained Transformers HaowenPanUniversity of Science and Technology of China YixinCaoFudan University - XiaozhiWangDepartment of Computer Science and Technology, Tsinghua University - XunYangUniversity of Science and Technology of China - MengWangHefei University of Technology + XiaozhiWangDepartment of Computer Science and Technology, Tsinghua University + XunYangUniversity of Science and Technology of China + MengWangHefei University of Technology 1012-1037 Understanding the internal mechanisms by which multi-modal large language models (LLMs) interpret different modalities and integrate cross-modal representations is becoming increasingly critical for continuous improvements in both academia and industry. In this paper, we propose a novel method to identify key neurons for interpretability — how multi-modal LLMs bridge visual and textual concepts for captioning. 
Our method improves conventional works upon efficiency and applied range by removing needs of costly gradient computation. Based on those identified neurons, we further design a multi-modal knowledge editing method, beneficial to mitigate sensitive words or hallucination. For rationale of our design, we provide theoretical assumption. For empirical evaluation, we have conducted extensive quantitative and qualitative experiments. The results not only validate the effectiveness of our methods, but also offer insightful findings that highlight three key properties of multi-modal neurons: sensitivity, specificity and causal-effect, to shed light for future research. 2024.findings-acl.60 @@ -6967,7 +6967,7 @@ Controllable Text Generation with Residual Memory Transformer HanqingZhang SiSunTsinghua University, Tsinghua University - HaimingWuBeijing Institute of Technology + HaimingWuBeijing Institute of Technology DaweiSongBeijing Institute of Technology and Open University 1048-1066 Large-scale Causal Language Models (CLMs), e.g., GPT3 and ChatGPT, have brought great success in text generation. However, it is still an open challenge to effectively control the generation process of a CLM while balancing the flexibility, control granularity, and generation efficiency. In this paper, we provide a new alternative for controllable text generation (CTG), by designing a non-intrusive, lightweight control plugin, namely Residual Memory Transformer (RMT), to accompany the generation of CLM at arbitrary time steps. With an encoder-decoder setup, RMT can accept any types of control conditions and cooperate with the base CLM through a residual learning paradigm, to achieve a more flexible, general, and efficient CTG. Extensive experiments are carried out on various control tasks, in the form of both automatic and human evaluations. The results demonstrate the superiority of RMT over a wide range of state-of-the-art CTG approaches. 
The code implementation of our work is available at: https://github.com/Residual_Memory_Transformer. @@ -6978,10 +6978,10 @@ Prompt-Based Length Controlled Generation with Multiple Control Types RenlongJieNorthwest Polytechnical University Xi’an - XiaojunMengNoah’s Ark Lab, Huawei Technologies Ltd. + XiaojunMengNoah’s Ark Lab, Huawei Technologies Ltd. LifengShangHuawei Technologies Ltd. - XinJiang - QunLiuHuawei Noah’s Ark Lab + XinJiang + QunLiuHuawei Noah’s Ark Lab 1067-1085 Large language models (LLMs) have attracted great attention given their strong performance on a wide range of NLP tasks. In practice, users often expect generated texts to fall within a specific length range, making length controlled generation an important topic, especially for GPT-style models. Existing length control methods mostly focus on a simple control type of “equal to” a target length. Different from them, we propose a prompt-based method to achieve length controlled generation under different control types with high accuracy. In particular, we adopt reinforcement learning (RL) and sample filtering with the reward signal given by rule-based reward models, which enhances the length control ability of models by rewarding outputs that follow certain control instructions. In addition, we introduce a standard prompt extractor to parse arbitrary users’ input into standard control instructions. Experiments show that our method significantly improves the accuracy of prompt-based length control on popular summarization datasets like CNNDM and NYT under multiple control types. Moreover, both the standard prompt extractor and RL-tuned model show strong generalization to unseen control prompt templates. 
2024.findings-acl.63 @@ -6993,7 +6993,7 @@ LiangChen YichiZhang ShuhuaiRen - HaozheZhao + HaozheZhao ZefanCai YuchiWang PeiyiWang @@ -7012,10 +7012,10 @@ MinjuKim HanaKimYonsei University Beong-wooKwakYonsei University - SeongKuKangUniversity of Illinois Urbana-Champaign + SeongKuKangUniversity of Illinois Urbana-Champaign YoungjaeYuYonsei University - JinyoungYeoYonsei University - DonghaLeeYonsei University + JinyoungYeoYonsei University + DonghaLeeYonsei University 1105-1120 Conversational recommender systems are an emerging area that has garnered increasing interest in the community, especially with the advancements in large language models (LLMs) that enable sophisticated handling of conversational input. Despite the progress, the field still has many aspects left to explore. The currently available public datasets for conversational recommendation lack specific user preferences and explanations for recommendations, hindering high-quality recommendations. To address such challenges, we present a novel conversational recommendation dataset named PEARL, synthesized with persona- and knowledge-augmented LLM simulators. We obtain detailed persona and knowledge from real-world reviews and construct a large-scale dataset with over 57k dialogues. Our experimental results demonstrate that PEARL contains more specific user preferences, show expertise in the target domain, and provides recommendations more relevant to the dialogue context than those in prior datasets. Furthermore, we demonstrate the utility of PEARL by showing that our downstream models outperform baselines in both human and automatic evaluations. We release our dataset and code. 
2024.findings-acl.65 @@ -7026,8 +7026,8 @@ <fixed-case>C</fixed-case>o<fixed-case>LL</fixed-case>a<fixed-case>VO</fixed-case>: Crayon Large Language and Vision m<fixed-case>O</fixed-case>del Byung-KwanLeeKorea Advanced Institute of Science and Technology BeomchanParkKAIST - Chae WonKim - Yong ManRoKorea Advanced Institute of Science and Technology + Chae WonKim + Yong ManRoKorea Advanced Institute of Science and Technology 1121-1138 The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from ‘what objects are in the image?’ or ‘which object corresponds to a specified bounding box?’. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting. 
2024.findings-acl.66 @@ -7036,7 +7036,7 @@ Modelling Variability in Human Annotator Simulation - WenWu + WenWu WenlinChenUniversity of Cambridge and Max Planck Institute for Intelligent Systems ChaoZhangTsinghua University and University College London PhilWoodlandUniversity of Cambridge @@ -7051,7 +7051,7 @@ SheikhShafayatKAIST HHasan MinhajurMahim - RifkiPutriKorea Advanced Institute of Science & Technology + RifkiPutriKorea Advanced Institute of Science & Technology JamesThorneKAIST AliceOhKorea Advanced Institute of Science and Technology 1158-1177 @@ -7063,7 +7063,7 @@ <fixed-case>MORE</fixed-case>: Multi-m<fixed-case>O</fixed-case>dal <fixed-case>RE</fixed-case>trieval Augmented Generative Commonsense Reasoning WanqingCui - KepingBiChinese Academy of Sciences + KepingBiChinese Academy of Sciences JiafengGuoInstitute of Computing Technolgy, Chinese Academy of Sciences XueqiCheng, Chinese Academy of Sciences 1178-1192 @@ -7090,15 +7090,15 @@ <fixed-case>B</fixed-case>io<fixed-case>T</fixed-case>5+: Towards Generalized Biological Understanding with <fixed-case>IUPAC</fixed-case> Integration and Multi-task Tuning - QizhiPei - LijunWu - KaiyuanGao + QizhiPei + LijunWu + KaiyuanGao XiaozhuanLiangZhejiang University - YinFang + YinFang JinhuaZhu ShufangXieRenmin University of China TaoQin - RuiYanRenmin University of China + RuiYanRenmin University of China 1216-1240 Recent research trends in computational biology have increasingly focused on integrating text and bio-entity modeling, especially in the context of molecules and proteins. However, previous efforts like BioT5 faced challenges in generalizing across diverse tasks and lacked a nuanced understanding of molecular structures, particularly in their textual representations (e.g., IUPAC). This paper introduces BioT5+, an extension of the BioT5 framework, tailored to enhance biological research and drug discovery. 
BioT5+ incorporates several novel features: integration of IUPAC names for molecular understanding, inclusion of extensive bio-text and molecule data from sources like bioRxiv and PubChem, the multi-task instruction tuning for generality across tasks, and a numerical tokenization technique for improved processing of numerical data. These enhancements allow BioT5+ to bridge the gap between molecular representations and their textual descriptions, providing a more holistic understanding of biological entities, and largely improving the grounded reasoning of bio-text and bio-sequences. The model is pre-trained and fine-tuned with a large number of experiments, including 3 types of problems (classification, regression, generation), 15 kinds of tasks, and 21 total benchmark datasets, demonstrating the remarkable performance and state-of-the-art results in most cases. BioT5+ stands out for its ability to capture intricate relationships in biological data, thereby contributing significantly to bioinformatics and computational biology. Our code is available at https://github.com/QizhiPei/BioT5. 2024.findings-acl.71 @@ -7107,9 +7107,9 @@ <fixed-case>SIBO</fixed-case>: A Simple Booster for Parameter-Efficient Fine-Tuning - ZhihaoWen + ZhihaoWen JieZhang - YuanFangSingapore Management University + YuanFangSingapore Management University 1241-1257 Fine-tuning all parameters of large language models (LLMs) necessitates substantial computational power and extended time. Latest advancements in parameter-efficient fine-tuning (PEFT) techniques, such as Adapter tuning and LoRA, allow for adjustments to only a minor fraction of the parameters of these LLMs. Concurrently, it has been noted that the issue of over-smoothing diminishes the effectiveness of these Transformer-based LLMs, resulting in suboptimal performances in downstream tasks. In this paper, we present SIBO, which is a SImple BOoster to enhance PEFT, by injecting an initial residual. 
SIBO is straightforward and readily extensible to a range of state-of-the-art PEFT techniques to alleviate over-smoothing and enhance performance. Extensive experiments on 22 benchmark datasets demonstrate that SIBO significantly enhances the performance of various strong baselines, achieving up to 15.7% and 23.5% improvement over existing PEFT methods on the arithmetic and commonsense reasoning tasks, respectively. 2024.findings-acl.72 @@ -7118,11 +7118,11 @@ <fixed-case>G</fixed-case>eo<fixed-case>E</fixed-case>val: Benchmark for Evaluating <fixed-case>LLM</fixed-case>s and Multi-Modal Models on Geometry Problem-Solving - JiaxinZhangUniversity of Strathclyde + JiaxinZhangUniversity of Strathclyde Zhong-ZhiLi Ming-LiangZhang FeiYin, Institute of automation, Chinese academy of science - Cheng-LinLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + Cheng-LinLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences YasharMoshfeghiUniversity of Strathclyde 1258-1276 Recent advancements in large language models (LLMs) and multi-modal models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in tackling geometry math problems, which necessitates an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection that includes a main subset of 2,000 problems, a 750 problems subset focusing on backward reasoning, an augmented subset of 2,000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs in solving geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67% accuracy rate on the main subset but only a 6.00% accuracy on the hard subset.
This highlights the critical need for testing models against datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased, suggesting a promising method for enhancing model capabilities. @@ -7133,13 +7133,13 @@ Boosting Textural <fixed-case>NER</fixed-case> with Synthetic Image and Instructive Alignment JiahaoWang - WenjunKeSoutheast University + WenjunKeSoutheast University PengWang - HangZhang + HangZhang DongNieMeta Inc. - JiajunLiuSoutheast University + JiajunLiuSoutheast University GuozhengLiSoutheast University - ZiyuShang + ZiyuShang 1277-1287 Named entity recognition (NER) is a pivotal task reliant on textual data, often impeding the disambiguation of entities due to the absence of context. To tackle this challenge, conventional methods often incorporate images crawled from the internet as auxiliary information. However, the images often lack sufficient entities or would introduce noise. Even with high-quality images, it is still challenging to efficiently use images as auxiliaries (i.e., fine-grained alignment with texts). We introduce a novel method named InstructNER to address these issues. Leveraging the rich real-world knowledge and image synthesis capabilities of a large pre-trained stable diffusion (SD) model, InstructNER transforms the text-only NER into a multimodal NER (MNER) task. A selection process automatically identifies the best synthetic image by comparing fine-grained similarities with internet-crawled images through a visual bag-of-words strategy. Note, during the image synthesis, a cross-attention matrix between synthetic images and raw text emerges, which inspires a soft attention guidance alignment (AGA) mechanism. AGA optimizes the MNER task and concurrently facilitates instructive alignment in MNER. 
Empirical experiments on prominent MNER datasets show that our method surpasses all text-only baselines, improving F1-score by 1.4% to 2.3%. Remarkably, even when compared to fully multimodal baselines, our approach maintains competitive. Furthermore, we open-source a comprehensive synthetic image dataset and the code to supplement existing raw dataset. The code and datasets are available in https://github.com/Heyest/InstructNER. 2024.findings-acl.74 @@ -7152,7 +7152,7 @@ Neurons in Large Language Models: Dead, N-gram, Positional ElenaVoitaFAIR at Meta AI and University of Amsterdam JavierFerrando - ChristoforosNalmpantis + ChristoforosNalmpantis 1288-1301 We analyze a family of large language models in such a lightweight manner that can be done on a single GPU. Specifically, we focus on the OPT family of models ranging from 125m to 66b parameters and rely only on whether an FFN neuron is activated or not. First, we find that the early part of the network is sparse and represents many discrete features. Here, many neurons (more than in some layers of the 66b model) are “dead”, i.e. they never activate on a large collection of diverse data. At the same time, many of the alive neurons are reserved for discrete features and act as token and n-gram detectors. Interestingly, their corresponding FFN updates not only promote next token candidates as could be expected, but also explicitly focus on removing the information about triggering them tokens, i.e., current input. To the best of our knowledge, this is the first example of mechanisms specialized at removing (rather than adding) information from the residual stream. With scale, models become more sparse in a sense that they have more dead neurons and token detectors. Finally, some neurons are positional: them being activated or not depends largely (or solely) on position and less so (or not at all) on textual data. 
We find that smaller models have sets of neurons acting as position range indicators while larger models operate in a less explicit manner. 2024.findings-acl.75 @@ -7180,7 +7180,7 @@ ThanitTativannarat ChawanPiansaddhayanonChulalongkorn University AttapolRutherfordChulalongkorn University - EkapolChuangsuwanichChulalongkorn University + EkapolChuangsuwanichChulalongkorn University 1319-1329 Learning job title representation is a vital process for developing automatic human resource tools. To do so, existing methods primarily rely on learning the title representation through skills extracted from the job description, neglecting the rich and diverse content within. Thus, we propose an alternative framework for learning job titles through their respective job description (JD) and utilize a Job Description Aggregator component to handle the lengthy description and bidirectional contrastive loss to account for the bidirectional relationship between the job title and its description. We evaluated the performance of our method on both in-domain and out-of-domain settings, achieving a superior performance over the skill-based approach. 2024.findings-acl.77 @@ -7205,9 +7205,9 @@ Flexible Weight Tuning and Weight Fusion Strategies for Continual Named Entity Recognition YahanYuKyoto University, Kyoto University - DuzhenZhang + DuzhenZhang XiuyiChen - ChenhuiChuKyoto University + ChenhuiChuKyoto University 1351-1358 Continual Named Entity Recognition (CNER) is dedicated to sequentially learning new entity types while mitigating catastrophic forgetting of old entity types. Traditional CNER approaches commonly employ knowledge distillation to retain old knowledge within the current model. However, because only the representations of old and new models are constrained to be consistent, the reliance solely on distillation in existing methods still suffers from catastrophic forgetting. 
To further alleviate the forgetting issue of old entity types, this paper introduces flexible Weight Tuning (WT) and Weight Fusion (WF) strategies for CNER. The WT strategy, applied at each training step, employs a learning rate schedule on the parameters of the current model. After learning the current task, the WF strategy dynamically integrates knowledge from both the current and previous models for inference. Notably, these two strategies are model-agnostic and seamlessly integrate with existing State-Of-The-Art (SOTA) models. Extensive experiments demonstrate that the WT and WF strategies consistently enhance the performance of previous SOTA methods across ten CNER settings in three datasets. 2024.findings-acl.79 @@ -7217,11 +7217,11 @@ Unveiling the Achilles’ Heel of <fixed-case>NLG</fixed-case> Evaluators: A Unified Adversarial Framework Driven by Large Language Models YimingChennational university of singaore, National University of Singapore - ChenZhangNational University of Singapore + ChenZhangNational University of Singapore DanqingLuoNational University of Singapore - Luis FernandoD’HaroUniversidad Politécnica de Madrid - RobbyTanNational University of Singapore - HaizhouLiThe Chinese University of Hong Kong (Shenzhen); National University of Singapore and National University of Singapore + Luis FernandoD’HaroUniversidad Politécnica de Madrid + RobbyTanNational University of Singapore + HaizhouLiThe Chinese University of Hong Kong (Shenzhen); National University of Singapore and National University of Singapore 1359-1375 The automatic evaluation of natural language generation (NLG) systems presents a long-lasting challenge. Recent studies have highlighted various neural metrics that align well with human evaluations. Yet, the robustness of these evaluators against adversarial perturbations remains largely under-explored due to the unique challenges in obtaining adversarial data for different NLG evaluation tasks. 
To address the problem, we introduce AdvEval, a novel black-box adversarial framework against NLG evaluators. AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators. Specifically, inspired by the recent success of large language models (LLMs) in text generation and evaluation, we adopt strong LLMs as both the data generator and gold evaluator. Adversarial data are automatically optimized with feedback from the gold and victim evaluator. We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation. The results show that AdvEval can lead to significant performance degradation of various victim metrics, thereby validating its efficacy. 2024.findings-acl.80 @@ -7256,7 +7256,7 @@ ShenZhou YongqiLi XinMiao - TieyunQianWuhan University + TieyunQianWuhan University 1410-1423 Continual relation extraction (CRE) aims to continuously learn relations in new tasks without forgetting old relations in previous tasks.Current CRE methods are all rehearsal-based which need to store samples and thus may encounter privacy and security issues.This paper targets rehearsal-free continual relation extraction for the first time and decomposes it into task identification and within-task prediction sub-problems. Existing rehearsal-free methods focus on training a model (expert) for within-task prediction yet neglect to enhance models’ capability of task identification.In this paper, we propose an Ensemble-of-Experts (EoE) framework for rehearsal-free continual relation extraction. Specifically, we first discriminatively train each expert by augmenting analogous relations across tasks to enhance the expert’s task identification ability. 
We then propose a cascade voting mechanism to form an ensemble of experts for effectively aggregating their abilities. Extensive experiments demonstrate that our method outperforms current rehearsal-free methods and is even better than rehearsal-based CRE methods. 2024.findings-acl.83 @@ -7265,7 +7265,7 @@ Temporal Validity Change Prediction - GeorgWenzel + GeorgWenzel AdamJatowt 1424-1446 Temporal validity is an important property of text that has many downstream applications, such as recommender systems, conversational AI, and user status tracking. Existing benchmarking tasks often require models to identify the temporal validity duration of a single statement. However, many data sources contain additional context, such as successive sentences in a story or posts on a social media profile. This context may alter the duration for which the originally collected statement is expected to be valid. We propose Temporal Validity Change Prediction, a natural language processing task benchmarking the capability of machine learning models to detect context statements that induce such change. We create a dataset consisting of temporal target statements sourced from Twitter and crowdsource corresponding context statements. We then benchmark a set of transformer-based language models on our dataset. Finally, we experiment with a multitasking approach to improve the state-of-the-art performance. @@ -7276,7 +7276,7 @@ <fixed-case>RIFF</fixed-case>: Learning to Rephrase Inputs for Few-shot Fine-tuning of Language Models SaeedNajafiUniversity of Alberta - AlonaFysheUniversity of Alberta + AlonaFysheUniversity of Alberta 1447-1466 Pre-trained Language Models (PLMs) can be accurately fine-tuned for downstream text processing tasks. Recently, researchers have introduced several parameter-efficient fine-tuning methods that optimize input prompts or adjust a small number of model parameters (e.g., LoRA). 
In this study, we explore the impact of altering the input text of the original task in conjunction with parameter-efficient fine-tuning methods. To most effectively rewrite the input text, we train a few-shot paraphrase model with a Maximum-Marginal Likelihood objective. Using six few-shot text classification datasets, we show that enriching data with paraphrases at train and test time enhances the performance beyond what can be achieved with parameter-efficient fine-tuning alone. The code used for our experiments can be found at https://github.com/SaeedNajafi/RIFF. 2024.findings-acl.85 @@ -7286,9 +7286,9 @@ Modelling Commonsense Commonalities with Multi-Facet Concept Embeddings HananeKteich - NaLiSchool of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology + NaLiSchool of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology UsashiChatterjeeCardiff University - ZiedBouraouiCRIL Univ-Artois & CNRS + ZiedBouraouiCRIL Univ-Artois & CNRS StevenSchockaertCardiff University 1467-1480 Concept embeddings offer a practical and efficient mechanism for injecting commonsense knowledge into downstream tasks. Their core purpose is often not to predict the commonsense properties of concepts themselves, but rather to identify commonalities, i.e. sets of concepts which share some property of interest. Such commonalities are the basis for inductive generalisation, hence high-quality concept embeddings can make learning easier and more robust. Unfortunately, standard embeddings primarily reflect basic taxonomic categories, making them unsuitable for finding commonalities that refer to more specific aspects (e.g. the colour of objects or the materials they are made of). In this paper, we address this limitation by explicitly modelling the different facets of interest when learning concept embeddings. 
We show that this leads to embeddings which capture a more diverse range of commonsense properties, and consistently improves results in downstream tasks such as ultra-fine entity typing and ontology completion. @@ -7323,7 +7323,7 @@ SaifeiLiao VictoriaNg SimonDe Montigny - GeraldPennDepartment of Computer Science, University of Toronto + GeraldPennDepartment of Computer Science, University of Toronto 1521-1533 The task of temporal relation extraction (TRE) involves identifying and extracting temporal relations between events from narratives. We identify two primary issues with TRE systems. First, by formulating TRE as a simple text classification task where every temporal relation is independent, it is hard to enhance the TRE model’s representation of meaning of temporal relations, and its facility with the underlying temporal calculus. We solve the issue by proposing a novel Temporally Contrastive learning model (ConTempo) that increase the model’s awareness of the meaning of temporal relations by leveraging their symmetric or antisymmetric properties. Second, the reusability of innovations has been limited due to incompatibilities in model architectures. Therefore, we propose a unified framework and show that ConTempo is compatible with all three main branches of TRE research. Our results demonstrate that the performance gains of ConTempo are more pronounced, with the total combination achieving state-of-the-art performance on the widely used MATRES and TBD corpora. We furthermore identified and corrected a large number of annotation errors present in the test set of MATRES, after which the performance increase brought by ConTempo becomes more apparent. 2024.findings-acl.89 @@ -7333,10 +7333,10 @@ <fixed-case>CHARP</fixed-case>: Conversation History <fixed-case>A</fixed-case>wa<fixed-case>R</fixed-case>eness Probing for Knowledge-grounded Dialogue Systems AbbasGhaddarHuawei Technologies Ltd. - DavidAlfonso-HermeloHuawei Technologies Ltd. 
- PhilippeLanglaisUniversité de Montréal + DavidAlfonso-HermeloHuawei Technologies Ltd. + PhilippeLanglaisUniversité de Montréal MehdiRezagholizadeh - BoxingChenHuawei Technologies Ltd. + BoxingChenHuawei Technologies Ltd. PrasannaParthasarathiHuawei Technologies Ltd. 1534-1551 In this work, we dive deep into one of the popular knowledge-grounded dialogue benchmarks that focus on faithfulness, FaithDial. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a testbed, designed for evaluating supposedly non-hallucinatory models trained on the FaithDial dataset. Our extensive analysis reveals that models primarily exhibit poor performance on CHARP due to their inability to effectively attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings, neglecting the conversational history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring the progress in this particular research area. Data, models, and source code will be publicly available upon acceptance. @@ -7349,9 +7349,9 @@ ZichengLin ZhibinGou TianLiang - RuilinLuo + RuilinLuo HaoweiLiuUniversity of Hong Kong - YujiuYangGraduate School at Shenzhen,Tsinghua University + YujiuYangGraduate School at Shenzhen,Tsinghua University 1552-1587 The ability of Large Language Models (LLMs) to critique and refine their reasoning is crucial for their application in evaluation, feedback provision, and self-improvement. This paper introduces CriticBench, a comprehensive benchmark designed to assess LLMs’ abilities to critique and rectify their reasoning across a variety of tasks. 
CriticBench encompasses five reasoning domains: mathematical, commonsense, symbolic, coding, and algorithmic. It compiles 15 datasets and incorporates responses from three LLM families. Utilizing CriticBench, we evaluate and dissect the performance of 17 LLMs in generation, critique, and correction reasoning, i.e., GQC reasoning. Our findings reveal: (1) a linear relationship in GQC capabilities, with critique-focused training markedly enhancing performance; (2) a task-dependent variation in correction effectiveness, with logic-oriented tasks being more amenable to correction; (3) GQC knowledge inconsistencies that decrease as model size increases; and (4) an intriguing inter-model critiquing dynamic, where stronger models are better at critiquing weaker ones, while weaker models can surprisingly surpass stronger ones in their self-critique. We hope these insights into the nuanced critique-correct reasoning of LLMs will foster further research in LLM critique and self-improvement. 2024.findings-acl.91 @@ -7361,10 +7361,10 @@ <fixed-case>DAFN</fixed-case>et: Dynamic Auxiliary Fusion for Sequential Model Editing in Large Language Models TaolinZhangAlibaba Group - QizhouChen + QizhouChen DongyangLiEast China Normal University ChengyuWangAlibaba Group - XiaofengHeEast China Normal University + XiaofengHeEast China Normal University LongtaoHuangAlibaba Group HuiXue’ JunHuang @@ -7376,7 +7376,7 @@ Controllable Text Summarization: Unraveling Challenges, Approaches, and Prospects - A Survey - AshokUrlanaTata Consultancy Services Limited, India + AshokUrlanaTata Consultancy Services Limited, India PruthwikMishraIIIT-Hyderabad TathagatoRoy RahulMishraInternational Institute of Information Technology Hyderabad @@ -7394,7 +7394,7 @@ SongtaoWang HongfuLiuNational University of Singapore HaoWangRutgers University - YeWang + YeWang 1624-1637 Traditional applications of natural language processing (NLP) in healthcare have predominantly focused on patient-centered services, 
enhancing patient interactions and care delivery, such as through medical dialogue systems. However, the potential of NLP to benefit inexperienced doctors, particularly in areas such as communicative medical coaching, remains largely unexplored. We introduce “ChatCoach”, a human-AI cooperative framework designed to assist medical learners in practicing their communication skills during patient consultations. ChatCoach differentiates itself from conventional dialogue systems by offering a simulated environment where medical learners can practice dialogues with a patient agent, while a coach agent provides immediate, structured feedback. This is facilitated by our proposed Generalized Chain-of-Thought (GCoT) approach, which fosters the generation of structured feedback and enhances the utilization of external knowledge sources. Additionally, we have developed a dataset specifically for evaluating Large Language Models (LLMs) within the ChatCoach framework on communicative medical coaching tasks. Our empirical results validate the effectiveness of ChatCoach. 2024.findings-acl.94 @@ -7408,11 +7408,11 @@ LuWangMicrosoft YongXu MinghuaMaMicrosoft - WeiZhangEast China Normal University + WeiZhangEast China Normal University SiQinMicrosoft SaravanRajmohanMicrosoft - QingweiLinMicrosoft Research - DongmeiZhangMicrosoft and Microsoft + QingweiLinMicrosoft Research + DongmeiZhangMicrosoft and Microsoft 1638-1662 This paper introduce a novel thought prompting approach called ”Everything of Thoughts” (XoT) for Large Language Models (LLMs) to defy the law of ”Penrose triangle” of existing thought paradigms, to achieve three key perspectives in thought generation simultaneously: performance, efficiency, and flexibility. XoT leverages pretrained reinforcement learning and Monte Carlo Tree Search (MCTS) to incorporate external domain knowledge and planning capability into thoughts, thereby enhancing LLMs’ decision-making capabilities. 
Through the MCTS-LLM collaborative thought revision framework, XoT autonomously produces high-quality comprehensive cognitive mappings with minimal LLM interactions. Additionally, XoT empowers LLMs to utilize flexible cognitive mappings for solving problems with multiple solutions. We evaluate XoT on several challenging problem-solving tasks, including Game of 24, 8-Puzzle, and Pocket Cube. Our results demonstrate that XoT significantly outperforms existing approaches in various dimensions, showcasing its remarkable proficiency in addressing complex problems across diverse domains. The data and code are available at https://github.com/microsoft/Everything-of-Thoughts-XoT. 2024.findings-acl.95 @@ -7422,10 +7422,10 @@ <fixed-case>SPAGHETTI</fixed-case>: Open-Domain Question Answering from Heterogeneous Data Sources with Retrieval and Semantic Parsing HeidiZhangStanford University - SinaSemnaniStanford University + SinaSemnaniStanford University FarhadGhassemiComputer Science Department, Stanford University JialiangXu - ShichengLiuStanford University + ShichengLiuStanford University MonicaLamStanford University 1663-1678 We introduce SPAGHETTI: Semantic Parsing Augmented Generation for Hybrid English information from Text Tables and Infoboxes, a hybrid question-answering (QA) pipeline that utilizes information from heterogeneous knowledge sources, including knowledge base, text, tables, and infoboxes. Our LLM-augmented approach achieves state-of-the-art performance on the Compmix dataset, the most comprehensive heterogeneous open-domain QA dataset, with 56.5% exact match (EM) rate. More importantly, manual analysis on a sample of the dataset suggests that SPAGHETTI is more than 90% accurate, indicating that EM is no longer suitable for assessing the capabilities of QA systems today. 
@@ -7440,7 +7440,7 @@ RuochenZhao TianzeLuo XinzeLiSchool of Computer Science and Engineering, Nanyang Technological University - GuizhenChen + GuizhenChen WenhanXia JunjieHuUniversity of Wisconsin, Madison Anh TuanLuuNanyang Technological University @@ -7490,7 +7490,7 @@ <fixed-case>C</fixed-case>ee<fixed-case>BERT</fixed-case>: Cross-Domain Inference in Early Exit <fixed-case>BERT</fixed-case> Divya JyotiBajpai - Manjesh KumarHanawal + Manjesh KumarHanawal 1736-1748 Pre-trained Language Models (PLMs), like BERT, with self-supervision objectives exhibit remarkable performance and generalization across various tasks. However, they suffer in inference latency due to their large size. To address this issue, side branches are attached at intermediate layers, enabling early inference of samples without requiring them to pass through all layers. However, the challenge is to decide which layer to infer and exit each sample so that the accuracy and latency are balanced. Moreover, the distribution of the samples to be inferred may differ from that used for training necessitating cross-domain adaptation. We propose an online learning algorithm named Cross-Domain Inference in Early Exit BERT (CeeBERT) that dynamically determines early exits of samples based on the level of confidence at each exit point. CeeBERT learns optimal thresholds from domain-specific confidence observed at intermediate layers on the fly, eliminating the need for labeled data. Experimental results on five distinct datasets with BERT and ALBERT models demonstrate CeeBERT’s ability to improve latency by reducing unnecessary computations with minimal drop in performance. By adapting to the threshold values, CeeBERT can speed up the BERT/ALBERT models by 2\times - 3.1\times with minimal drop in accuracy. The anonymized source code is available at https://github.com/Div290/CeeBERT. 
2024.findings-acl.101 @@ -7513,7 +7513,7 @@ MehakDhaliwalUniversity of California, Santa Barbara PeterFrischAmazon TobiasDomhanAmazon - MarcelloFedericoAmazon + MarcelloFedericoAmazon 1763-1775 We show that content on the web is often translated into many languages, and the low quality of these multi-way translations indicates they were likely created using Machine Translation (MT). Multi-way parallel, machine generated content not only dominates the translations in lower resource languages; it also constitutes a large fraction of the total web content in those languages. We also find evidence of a selection bias in the type of content which is translated into many languages, consistent with low quality English content being translated en masse into many lower resource languages, via MT. Our work raises serious concerns about training models such as multilingual large language models on both monolingual and bilingual data scraped from the web. 2024.findings-acl.103 @@ -7522,7 +7522,7 @@ <fixed-case>R</fixed-case>ank<fixed-case>M</fixed-case>ean: Module-Level Importance Score for Merging Fine-tuned <fixed-case>LLM</fixed-case> Models - GabrielPerin + GabrielPerin XuxiChenUniversity of Texas at Austin ShusenLiuLawrence Livermore National Labs BhavyaKailkhuraLawrence Livermore National Laboratory @@ -7562,11 +7562,11 @@ Towards Safer Large Language Models through Machine Unlearning - ZheyuanLiuUniversity of Notre Dame + ZheyuanLiuUniversity of Notre Dame GuangyaoDou - ZhaoxuanTanUniversity of Notre Dame - YijunTian - MengJiangUniversity of Notre Dame + ZhaoxuanTanUniversity of Notre Dame + YijunTian + MengJiangUniversity of Notre Dame 1817-1829 The rapid advancement of Large Language Models (LLMs) has demonstrated their vast potential across various domains, attributed to their extensive pretraining knowledge and exceptional generalizability. However, LLMs often encounter challenges in generating harmful content when faced with problematic prompts. 
To address this problem, existing work attempted to implement a gradient ascent based approach to prevent LLMs from producing harmful output. While these methods can be effective, they frequently impact the model utility in responding to normal prompts. To address this gap, we introduce Selective Knowledge negation Unlearning (SKU), a novel unlearning framework for LLMs, designed to eliminate harmful knowledge while preserving utility on normal prompts. Specifically, SKU consists of two stages: a harmful knowledge acquisition stage and a knowledge negation stage. The first stage aims to identify and acquire harmful knowledge within the model, whereas the second is dedicated to removing this knowledge. SKU selectively isolates and removes harmful knowledge in model parameters, ensuring the model’s performance remains robust on normal prompts. Our experiments conducted across various LLM architectures demonstrate that SKU identifies a good balance point between removing harmful information and preserving utility. 2024.findings-acl.107 @@ -7581,7 +7581,7 @@ HaiyanZhaoNew Jersey Institute of Technology WenyueHuaRutgers University, New Brunswick YandaMengUniversity of Exeter - YongfengZhangRutgers University + YongfengZhangRutgers University MengnanDuNew Jersey Institute of Technology 1830-1842 Chain of Thought (CoT) is significant in improving the reasoning abilities of large language models (LLMs). However, the correlation between the effectiveness of CoT and the length of reasoning steps in prompts remains largely unknown. To shed light on this, we have conducted several empirical experiments to explore the relations. Specifically, we design experiments that expand and compress the rationale reasoning steps within CoT demonstrations, while keeping all other factors constant. We have the following key findings. 
First, the results indicate that lengthening the reasoning steps in prompts, even without adding new information into the prompt, considerably enhances LLMs’ reasoning abilities across multiple datasets. Alternatively, shortening the reasoning steps, even while preserving the key information, significantly diminishes the reasoning abilities of models. This finding highlights the importance of the number of steps in CoT prompts and provides practical guidance to make better use of LLMs’ potential in complex problem-solving scenarios. Second, we also investigated the relationship between the performance of CoT and the rationales used in demonstrations. Surprisingly, the result shows that even incorrect rationales can yield favorable outcomes if they maintain the requisite length of inference. Third, we observed that the advantages of increasing reasoning steps are task-dependent: simpler tasks require fewer steps, whereas complex tasks gain significantly from longer inference sequences. @@ -7607,14 +7607,14 @@ <fixed-case>SKGS</fixed-case>um: Structured Knowledge-Guided Document Summarization - QiqiWangUniversity of Auckland + QiqiWangUniversity of Auckland RuofanWang - KaiqiZhaoUniversity of Auckland + KaiqiZhaoUniversity of Auckland RobertAmorUniversity of Auckland BenjaminLiu JiamouLiuThe University of Auckland XiandaZhengUniversity of Auckland - ZijianHuangUniversity of Auckland + ZijianHuangUniversity of Auckland 1857-1871 A summary structure is inherent to certain types of texts according to the Genre Theory of Linguistics. Such structures aid readers in efficiently locating information within summaries. However, most existing automatic summarization methods overlook the importance of summary structure, resulting in summaries that emphasize the most prominent information while omitting essential details from other sections. 
While a few summarizers recognize the importance of summary structure, they rely heavily on the predefined labels of summary structures in the source document and ground truth summaries. To address these shortcomings, we developed a Structured Knowledge-Guided Summarization (SKGSum) and its variant, SKGSum-W, which do not require structure labels. Instead, these methods rely on a set of automatically extracted summary points to generate summaries. We evaluate the proposed methods using three real-world datasets. The results indicate that our methods not only improve the quality of summaries, in terms of ROUGE and BERTScore, but also broaden the types of documents that can be effectively summarized. 2024.findings-acl.110 @@ -7651,7 +7651,7 @@ YixinYangPeking University ZhengLi QingxiuDong - HemingXia + HemingXia ZhifangSuiPeking University 1898-1912 Understanding the deep semantics of images is essential in the era dominated by social media. However, current research works primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models’ (LMMs) capacities of visual deep semantics. DEEPEVAL includes human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating the fundamental challenges remaining in developing LMMs. 
@@ -7663,7 +7663,7 @@ Harvesting Events from Multiple Sources: Towards a Cross-Document Event Extraction Paradigm QiangGao ZixiangMeng - BoboLiWuhan University + BoboLiWuhan University JunZhouWuhan University FeiLiWuhan University ChongTeng @@ -7679,7 +7679,7 @@ EunJeongHwangUniversity of British Columbia VeredShwartz DanGutfreundMIT-IBM Watson AI Lab - VeronikaThostInternational Business Machines + VeronikaThostInternational Business Machines 1928-1942 Reasoning about subjective natural language descriptions, such as opinions and preferences, is a challenging topic that largely remains unsolved to date. In particular, state-of-the-art large language models (LLMs) perform disappointingly in this task, show strong biases, and do not meet the interpretability requirements often needed in these kinds of applications. We propose a novel approach for reasoning about subjective knowledge that integrates potential and implicit meanings and explicitly models the relational nature of the information. We apply supervised graph learning, offer explanations for the model’s reasoning, and show that our model performs well across all 15 topics of OpinionQA, outperforming several prominent LLMs. Our detailed analysis further shows its unique advantages and the complementary nature it offers in comparison to LLMs. 2024.findings-acl.115 @@ -7695,8 +7695,8 @@ ZhiyuanLiuNational University of Singapore SihangLi KunWangUniversity of Science and Technology of China - WenjieDu - XiangWangUniversity of Science and Technology of China + WenjieDu + XiangWangUniversity of Science and Technology of China 1943-1958 Molecular Relational Learning (MRL), aiming to understand interactions between molecular pairs, plays a pivotal role in advancing biochemical research. Recently, the adoption of large language models (LLMs), known for their vast knowledge repositories and advanced logical inference capabilities, has emerged as a promising way for efficient and effective MRL. 
Despite their potential, these methods predominantly rely on textual data, thus not fully harnessing the wealth of structural information inherent in molecular graphs. Moreover, the absence of a unified framework exacerbates the issue of insufficient data exploitation, as it hinders the sharing of interaction mechanism learned across various datasets. To address these challenges, this work proposes a novel LLM-based multi-modal framework for molecular interaction modeling following Chain-of-Thought (CoT) theory, termed MolTC, which effectively integrate graphical information of two molecules in pair. To train this integrated framework efficiently, we introduce a *multi-hierarchical CoT theory* to refine its training paradigm, and conduct a comprehensive *Molecular Interactive Instructions* dataset for the development of biochemical LLMs involving MRL.Our experiments,conducted across various datasets involving over 4,000,000 molecular pairs, exhibit the superiority of our method over current GNN and LLM-based baselines. Code is available at https://github.com/MangoKiller/MolTC. 2024.findings-acl.116 @@ -7728,7 +7728,7 @@ <fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case> Meets Dropout under a Unified Framework - ShengWang + ShengWang LihengChen JiyueJiang BoyangXue @@ -7742,7 +7742,7 @@ Enhancing Text-to-<fixed-case>SQL</fixed-case> Parsing through Question Rewriting and Execution-Guided Refinement - WenxinMao + WenxinMao RuiqiWang JiyuGuo JichuanZeng @@ -7759,7 +7759,7 @@ The Knowledge Alignment Problem: Bridging Human and External Knowledge for Large Language Models ShuoZhang LiangmingPan - JunzhouZhaoXi’an Jiaotong University + JunzhouZhaoXi’an Jiaotong University William YangWangUC Santa Barbara 2025-2038 Large language models often necessitate grounding on external knowledge to generate faithful and reliable answers. 
Yet even with the correct groundings in the reference, they can ignore them and rely on wrong groundings or their inherent biases to hallucinate when users, being largely unaware of the specifics of the stored information, pose questions that might not directly correlate with the retrieved groundings. In this work, we formulate this knowledge alignment problem and introduce MixAlign, a framework that interacts with both the human user and the knowledge base to obtain and integrate clarifications on how the user question relates to the stored information. MixAlign employs a language model to achieve automatic knowledge alignment and, if necessary, further enhances this alignment through human user clarifications. Experimental results highlight the crucial role of knowledge alignment in boosting model performance and mitigating hallucination, with improvements noted up to 22.2% and 27.1% respectively. We also demonstrate the effectiveness of MixAlign in improving knowledge alignment by producing high-quality, user-centered clarifications. 
@@ -7769,17 +7769,17 @@ <fixed-case>C</fixed-case>hat<fixed-case>KBQA</fixed-case>: A Generate-then-Retrieve Framework for Knowledge Base Question Answering with Fine-tuned Large Language Models - HaoranLuo + HaoranLuo HaihongEBeijing University of Post and Telecommunication - ZichenTangBeijing University of Posts and Telecommunications + ZichenTangBeijing University of Posts and Telecommunications ShiyaoPeng - YikaiGuo + YikaiGuo WentaiZhang ChenghaoMa GuantingDongRenmin University of China MeinaSongBeijing University of Posts and Telecommunications WeiLin - YifanZhuBeijing University of Posts and Telecommunications + YifanZhuBeijing University of Posts and Telecommunications Anh TuanLuuNanyang Technological University 2039-2056 Knowledge Base Question Answering (KBQA) aims to answer natural language questions over large-scale knowledge bases (KBs), which can be summarized into two crucial steps: knowledge retrieval and semantic parsing. However, three core challenges remain: inefficient knowledge retrieval, mistakes of retrieval adversely impacting semantic parsing, and the complexity of previous KBQA methods. To tackle these challenges, we introduce ChatKBQA, a novel and simple generate-then-retrieve KBQA framework, which proposes first generating the logical form with fine-tuned LLMs, then retrieving and replacing entities and relations with an unsupervised retrieval method, to improve both generation and retrieval more directly. Experimental results show that ChatKBQA achieves new state-of-the-art performance on standard KBQA datasets, WebQSP, and CWQ. This work can also be regarded as a new paradigm for combining LLMs with knowledge graphs (KGs) for interpretable and knowledge-required question answering. 
@@ -7805,10 +7805,10 @@ <fixed-case>INTERVENOR</fixed-case>: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair HanbinWang ZhenghaoLiuNortheastern University - ShuoWang + ShuoWang GanquCui NingDingTsinghua University, Tsinghua University - ZhiyuanLiuTsinghua University + ZhiyuanLiuTsinghua University GeYu 2081-2107 This paper introduces INTERVENOR (INTERactiVE chaiN Of Repair), a system designed to emulate the interactive code repair processes observed in humans, encompassing both code diagnosis and code repair. INTERVENOR prompts Large Language Models (LLMs) to play distinct roles during the code repair process, functioning as both a Code Learner and a Code Teacher. Specifically, the Code Learner is tasked with adhering to instructions to generate or repair code, while the Code Teacher is responsible for crafting a Chain-of-Repair (CoR) to serve as guidance for the Code Learner. During generating the CoR, the Code Teacher needs to check the generated codes from Code Learner and reassess how to address code bugs based on error feedback received from compilers. Experimental results demonstrate that INTERVENOR surpasses baseline models, exhibiting improvements of approximately 18% and 4.3% over GPT-3.5 in code generation and code translation tasks, respectively. Our further analyses show that CoR is effective to illuminate the reasons behind bugs and outline solution plans in natural language. With the feedback of code compilers, INTERVENOR can accurately identify syntax errors and assertion errors and provide precise instructions to repair codes. All data and codes are available at [https://github.com/NEUIR/INTERVENOR](https://github.com/NEUIR/INTERVENOR). 
@@ -7820,8 +7820,8 @@ <fixed-case>S</fixed-case>ocial<fixed-case>B</fixed-case>ench: Sociality Evaluation of Role-Playing Conversational Agents HongzhanChenSUN YAT-SEN UNIVERSITY HehongChen - MingYan - WenshenXu + MingYan + WenshenXu GaoXing WeizhouShen XiaojunQuanSUN YAT-SEN UNIVERSITY @@ -7836,9 +7836,9 @@ From Model-centered to Human-Centered: Revision Distance as a Metric for Text Evaluation in <fixed-case>LLM</fixed-case>s-based Applications - YongqiangMaWuhan University + YongqiangMaWuhan University LizhiQingAlibaba Group - JiaweiLiuWuhan University + JiaweiLiuWuhan University YangyangKangAlibaba Group YueZhangAlibaba Group WeiLu @@ -7858,7 +7858,7 @@ XuanLinAnt Group LiubinWang DaqianLiDaqianLi - YongruiChen + YongruiChen 2138-2148 Incomplete utterance rewriting (IUR) aims to reconstruct the utterance with omitted information and pronouns to be standalone and complete based on the context. The existing works predominantly focus on simple ellipsis and coreference problems in brief multi-turn dialogues. But in actual scenarios: 1) the context of the dialogues frequently comprises multiple similar candidates for ellipsis and coreference resolution, pouring to confuse. 2) the number of turns tends to be more extensive, while the content with various topics also grows more complex. This paper proposes a novel method called CaT to address these issues. In particular, we first devise a tacker model, distilled from GPT4-turbo, to adopt Context Tracking that dynamically updates a list of key phrases turn by turn, as accurate candidates for ellipsis and coreference resolution. Second, we further present the Dynamic Context Introduction mechanism to filter irrelevant preceding contexts that are not relied on by any element within the key phrase list to condense extended dialogues. Comprehensive experiments indicate that our solution provides a significant improvement over the existing baselines, and achieves state-of-the-art on three benchmarks. 
2024.findings-acl.127 @@ -7869,9 +7869,9 @@ <fixed-case>E</fixed-case>motion<fixed-case>Q</fixed-case>ueen: A Benchmark for Evaluating Empathy of Large Language Models YuyanChen SongzhouYan - SijiaLiu + SijiaLiu YuezeLi - YanghuaXiaoFudan University + YanghuaXiaoFudan University 2149-2176 Emotional intelligence in large language models (LLMs) is of great importance in Natural Language Processing. However, the previous research mainly focus on basic sentiment analysis tasks, such as emotion recognition, which is not enough to evaluate LLMs’ overall emotional intelligence. Therefore, this paper presents a novel framework named EmotionQueen for evaluating the emotional intelligence of LLMs. The framework includes four distinctive tasks: Key Event Recognition, Mixed Event Recognition, Implicit Emotional Recognition, and Intention Recognition. LLMs are requested to recognize important event or implicit emotions and generate empathetic response.We also design two metrics to evaluate LLMs’ capabilities in recognition and response for emotion-related statements. Experiments yield significant conclusions about LLMs’ capabilities and limitations in emotion intelligence. 2024.findings-acl.128 @@ -7880,15 +7880,15 @@ Plum: Prompt Learning using Metaheuristics - RuiPanThe Hong Kong University of Science and Technology + RuiPanThe Hong Kong University of Science and Technology ShuoXingTexas A&M University - College Station ShizheDiaoHong Kong University of Science and Technology WenheSun XiangLiu - KaShunShum + KaShunShum JipengZhang RenjiePi - TongZhangUIUC + TongZhangUIUC 2177-2197 Since the emergence of large language models, prompt learning has become a popular method for optimizing and customizing these models. Special prompts, such as Chain-of-Thought, have even revealed previously unknown reasoning capabilities within these models. However, the progress of discovering effective prompts has been slow, driving a desire for general prompt optimization methods. 
Unfortunately, few existing prompt learning methods satisfy the criteria of being truly “general”, i.e., automatic, discrete, black-box, gradient-free, and interpretable all at once. In this paper, we introduce metaheuristics, a branch of discrete non-convex optimization methods with over 100 options, as a promising approach to prompt learning. Within our paradigm, we test six typical methods: hill climbing, simulated annealing, genetic algorithms with/without crossover, tabu search, and harmony search, demonstrating their effectiveness in white-box and black-box prompt learning. Furthermore, we show that these methods can be used to discover more human-understandable prompts that were previously unknown in both reasoning and image generation tasks, opening the door to a cornucopia of possibilities in prompt optimization. 2024.findings-acl.129 @@ -7902,7 +7902,7 @@ QingpeiGuoAnt Group JiyuanJiasouthern university of science and technology ZhixuLi - YanghuaXiaoFudan University + YanghuaXiaoFudan University 2198-2224 In the era of social media video platforms, popular “hot-comments” play a crucial role in attracting user impressions of short-form videos, making them vital for marketing and branding purpose. However, existing research predominantly focuses on generating descriptive comments or “danmaku” in English, offering immediate reactions to specific video moments. Addressing this gap, our study introduces HOTVCOM, the largest Chinese video hot-comment dataset, comprising 94k diverse videos and 137 million comments. We also present the ComHeat framework, which synergistically integrates visual, auditory, and textual data to generate influential hot-comments on the Chinese video dataset. Empirical evaluations highlight the effectiveness of our framework, demonstrating its excellence on both the newly constructed and existing datasets. 
2024.findings-acl.130 @@ -7914,9 +7914,9 @@ YuyanChen YuezeLi SongzhouYan - SijiaLiu + SijiaLiu JiaqingLiangFudan University - YanghuaXiaoFudan University + YanghuaXiaoFudan University 2225-2238 The evaluation of the problem-solving capability under incomplete information scenarios of Large Language Models (LLMs) is increasingly important, encompassing capabilities such as questioning, knowledge search, error detection, and path planning. Current research mainly focus on LLMs’ problem-solving capability such as “Twenty Questions”.However, these kinds of games do not require recognizing misleading cues which are necessary in the incomplete information scenario.Moreover, the existing game such as “Who is undercover” are highly subjective, making it challenging for evaluation.Therefore, in this paper, we introduce a novel game named BrainKing based on the “Who is undercover” and “Twenty Questions” for evaluating LLM capabilities under incomplete information scenarios. It requires LLMs to identify target entities with limited yes-or-no questions and potential misleading answers. By setting up easy, medium, and hard difficulty modes, we comprehensively assess the performance of LLMs across various aspects. Our results reveal the capabilities and limitations of LLMs in BrainKing, providing significant insights of LLM problem-solving levels. 
2024.findings-acl.131 @@ -7936,7 +7936,7 @@ Into the Unknown: Generating Geospatial Descriptions for New Environments TzufPaz-ArgamanBar-Ilan University - JohnPalowitchGoogle + JohnPalowitchGoogle SayaliKulkarniResearch, Google and Google ReutTsarfatyGoogle and Bar-Ilan University, Technion JasonBaldridgeGoogle @@ -7962,14 +7962,14 @@ Length-aware Byte Pair Encoding for Mitigating Over-segmentation in <fixed-case>K</fixed-case>orean Machine Translation - JungseobLeeKorea University - HyeonseokMoonKorea University + JungseobLeeKorea University + HyeonseokMoonKorea University SeungjunLeeKorea University - ChanjunParkUpstage - SugyeongEoKorea University + ChanjunParkUpstage + SugyeongEoKorea University HyunwoongKo - JaehyungSeo - SeungyoonLeeKorea University + JaehyungSeo + SeungyoonLeeKorea University HeuiseokLimKorea University 2287-2303 Byte Pair Encoding is an effective approach in machine translation across several languages. However, our analysis indicates that BPE is prone to over-segmentation in the morphologically rich language, Korean, which can erode word semantics and lead to semantic confusion during training. This semantic confusion, stemming from over-segmentation, ultimately contributes to a degradation of overall translation quality. To address this issue, we introduce Length-aware Subword Vocabulary Construction (LeVoC), a novel approach strategically incorporating longer words into the vocabulary. By utilizing an external monolingual Korean corpus, LeVoC extracts and integrates long words, effectively preserving morphological information and reducing semantic confusion. Our experiments demonstrate that LeVoC not only significantly outperforms BPE, but also can be applied to and surpass current state-of-the-art morpheme-aware subword tokenization methods. 
We provide evidence that the difficulty in translating sentences with long words in Korean is associated with morphological compositionality, and LeVoC’s ability to reduce semantic confusion during training leads to improved translation quality. @@ -7997,8 +7997,8 @@ ShitaoXiao PeitianZhang KunLuo - DefuLianUniversity of Science and Technology of China - ZhengLiu + DefuLianUniversity of Science and Technology of China + ZhengLiu 2318-2335 In this paper, we introduce a new embedding model called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It provides a uniform support for the semantic retrieval of more than 100 working languages. It can simultaneously accomplish the three common retrieval functionalities: dense retrieval, multi-vector retrieval, and sparse retrieval. Besides, it is also capable of processing inputs of different granularities, spanning from short sentences to long documents of up to 8,192 tokens. The effective training of M3-Embedding presents a series of technical contributions. Notably, we propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, which enables a large batch size and high training throughput to improve the discriminativeness of embeddings. M3-Embedding exhibits a superior performance in our experiment, leading to new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks. 
2024.findings-acl.137 @@ -8012,7 +8012,7 @@ ZhengWangUniversity of Leeds HongyuZhangUniversity of Newcastle, Australia BatuGuan - FangxinLu + FangxinLu ZiliZhang YuleiSuiUniversity of New South Wales HaiJinHuazhong University of Science and Technology @@ -8026,8 +8026,8 @@ An Element is Worth a Thousand Words: Enhancing Legal Case Retrieval by Incorporating Legal Elements ChenlongDengRenmin University of China - ZhichengDouRenmin University of China - YujiaZhouTsinghua University, Tsinghua University + ZhichengDouRenmin University of China + YujiaZhouTsinghua University, Tsinghua University PeitianZhang KelongMao 2354-2365 @@ -8041,10 +8041,10 @@ XinnongZhangFudan University HaoyuKuangFudan University XinyiMou - HanjiaLyuUniversity of Rochester + HanjiaLyuUniversity of Rochester KunWu SimingChenFudan University - JieboLuoUniversity of Rochester and University of Rochester + JieboLuoUniversity of Rochester and University of Rochester XuanjingHuangFudan University ZhongyuWeiFudan University 2366-2389 @@ -8055,10 +8055,10 @@ <fixed-case>K</fixed-case>o<fixed-case>C</fixed-case>ommon<fixed-case>GEN</fixed-case> v2: A Benchmark for Navigating <fixed-case>K</fixed-case>orean Commonsense Reasoning Challenges in Large Language Models - JaehyungSeo - JaewookLeeKorea University - ChanjunParkUpstage - SeongTaeHongKorea University + JaehyungSeo + JaewookLeeKorea University + ChanjunParkUpstage + SeongTaeHongKorea University SeungjunLeeKorea University HeuiseokLimKorea University 2390-2415 @@ -8098,11 +8098,11 @@ Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback ChengfengDou - YingZhang - ZhiJinPeking University and Peking University - WenpinJiaoPeking University + YingZhang + ZhiJinPeking University and Peking University + WenpinJiaoPeking University HaiyanZhaoPeking University - YongqiangZhao + YongqiangZhao ZhengweiTao 2453-2473 The utilization of large language models for medical dialogue generation has attracted 
considerable attention due to its potential to enhance response richness and coherence. While previous studies have made strides in optimizing model performance, there is a pressing need to bolster the model’s capacity for diagnostic logic to ensure patient safety. In response to this need, we propose an approach termed preference learning from process feedback (PLPF), which involves integrating the doctor’s diagnostic logic into LLMs. PLPF encompasses three key components: rule modeling, preference data generation, and preference alignment. These components collectively serve to train the model to adhere to the diagnostic process. Our experimental results, utilizing Standardized Patient Testing, demonstrate that PLPF enhances the diagnostic accuracy of the baseline model in medical conversations by 17.6%, surpassing the performance of traditional approaches. Moreover, PLPF exhibits effectiveness in both multi-round and single-round dialogue tasks, thereby highlighting its potential in improving medical dialogue generation. Our dataset is available at https://github.com/Chengfeng-Dou/SpTesting. @@ -8113,7 +8113,7 @@ <fixed-case>LM</fixed-case>-Cocktail: Resilient Tuning of Language Models via Model Merging ShitaoXiao - ZhengLiu + ZhengLiu PeitianZhang XingrunXing 2474-2488 @@ -8127,7 +8127,7 @@ XinMiao YongqiLi ShenZhou - TieyunQianWuhan University + TieyunQianWuhan University 2489-2511 Large language models (LLMs) have achieved satisfactory performance in counterfactual generation. However, confined by the stochastic generation process of LLMs, there often are misalignments between LLMs and humans which hinder LLMs from handling complex tasks like relation extraction. As a result, LLMs may generate commonsense-violated counterfactuals like ‘eggs were produced by a box’. To bridge this gap, we propose to mimick the episodic memory retrieval, the working mechanism of human hippocampus, to align LLMs’ generation process with that of humans. 
In this way, LLMs can derive experience from their extensive memory, which keeps in line with the way humans gain commonsense. We then implement two central functions in the hippocampus, i.e., pattern separation and pattern completion, to retrieve the episodic memory from LLMs and generate commonsense counterfactuals for relation extraction. Experimental results demonstrate the improvements of our framework over existing methods in terms of the quality of counterfactuals. 2024.findings-acl.146 @@ -8137,32 +8137,32 @@ <fixed-case>S</fixed-case>em<fixed-case>R</fixed-case>el2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages NedjmaOusidhoumCardiff University - ShamsuddeenMuhammadBayero University, Kano-Nigeria + ShamsuddeenMuhammadBayero University, Kano-Nigeria MohamedAbdallaUniversity of Alberta - IdrisAbdulmuminAhmadu Bello University - IbrahimAhmadNortheastern University + IdrisAbdulmuminAhmadu Bello University + IbrahimAhmadNortheastern University SanchitAhujaResearch, Microsoft AlhamAjiMohamed bin Zayed University of Artificial Intelligence and Amazon - VladimirAraujoKU Leuven - AbinewAyeleBahir Dar University, Universität Hamburg + VladimirAraujoKU Leuven + AbinewAyeleBahir Dar University, Universität Hamburg PavanBaswani MeriemBeloucifUppsala University ChrisBiemannU Hamburg SofiaBourhim - ChristineKockUniversity of Melbourne + ChristineKockUniversity of Melbourne GenetDekebo OumaimaHourrane GopichandKanumolu LokeshMadasu SamuelRutunda - ManishShrivastavaInternational Institute of Information Technology Hyderabad, India + ManishShrivastavaInternational Institute of Information Technology Hyderabad, India ThamarSolorioMohamed bin Zayed University of Artificial Intelligence and University of Houston NirmalSurangeInternational Institute of Information Technology Hyderabad - HailegnawTilayeKotebe University of Education + HailegnawTilayeKotebe University of Education KrishnapriyaVishnubhotla GentaWinataCapital One AI Foundations - 
SeidYimamUniversität Hamburg - SaifMohammadNational Research Council Canada + SeidYimamUniversität Hamburg + SaifMohammadNational Research Council Canada 2512-2530 Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP. 2024.findings-acl.147 @@ -8181,7 +8181,7 @@ <fixed-case>VISP</fixed-case>ool: Enhancing Transformer Encoders with Vector Visibility Graph Neural Networks - TunaAlikaşifoğlu + TunaAlikaşifoğlu ArdaAras AykutKocBilkent University 2547-2556 @@ -8195,7 +8195,7 @@ KrishnapriyaVishnubhotla AdamHammondUniversity of Toronto GraemeHirstUniversity of Toronto - SaifMohammadNational Research Council Canada + SaifMohammadNational Research Council Canada 2557-2574 Stories are rich in the emotions they exhibit in their narratives and evoke in the readers. The emotional journeys of the various characters within a story are central to their appeal. 
Computational analysis of the emotions of novels, however, has rarely examined the variation in the emotional trajectories of the different characters within them, instead considering the entire novel to represent a single story arc. In this work, we use character dialogue to distinguish between the emotion arcs of the narration and the various characters. We analyze the emotion arcs of the various characters in a dataset of English literary novels using the framework of Utterance Emotion Dynamics. Our findings show that the narration and the dialogue largely express disparate emotions through the course of a novel, and that the commonalities or differences in the emotional arcs of stories are more accurately captured by those associated with individual characters. 2024.findings-acl.150 @@ -8215,8 +8215,8 @@ Dictionary-Aided Translation for Handling Multi-Word Expressions in Low-Resource Languages AntoniosDimakisUniversity of Athens - StellaMarkantonatou - AntoniosAnastasopoulosAthena Research Center and George Mason University + StellaMarkantonatou + AntoniosAnastasopoulosAthena Research Center and George Mason University 2588-2595 Multi-word expressions (MWEs) present unique challenges in natural language processing (NLP), particularly within the context of translation systems, due to their inherent scarcity, non-compositional nature, and other distinct lexical and morphosyntactic characteristics, issues that are exacerbated in low-resource settings.In this study, we elucidate and attempt to address these challenges by leveraging a substantial corpus of human-annotated Greek MWEs. 
To address the complexity of translating such phrases, we propose a novel method leveraging an available out-of-context lexicon.We assess the translation capabilities of current state-of-the-art systems on this task, employing both automated metrics and human evaluators.We find that by using our method when applicable, the performance of current systems can be significantly improved, however these models are still unable to produce translations comparable to those of a human speaker. 2024.findings-acl.152 @@ -8228,7 +8228,7 @@ Zhong-ZhiLi Ming-LiangZhang FeiYin, Institute of automation, Chinese academy of science - Cheng-LinLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences + Cheng-LinLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences 2596-2608 Geometry problem solving (GPS) is a challenging mathematical reasoning task requiring multi-modal understanding, fusion, and reasoning. Existing neural solvers take GPS as a vision-language task but are short in the representation of geometry diagrams that carry rich and complex layout information. In this paper, we propose a layout-aware neural solver named LANS, integrated with two new modules: multimodal layout-aware pre-trained language module (MLA-PLM) and layout-aware fusion attention (LA-FA). MLA-PLM adopts structural-semantic pre-training (SSP) to implement global relationship modeling, and point-match pre-training (PMP) to achieve alignment between visual points and textual points. LA-FA employs a layout-aware attention mask to realize point-guided cross-modal fusion for further boosting layout awareness of LANS. Extensive experiments on datasets Geometry3K and PGPS9K validate the effectiveness of the layout-aware modules and superior problem-solving performance of our LANS solver, over existing symbolic and neural solvers. We have made our code and data publicly available. 
2024.findings-acl.153 @@ -8238,12 +8238,12 @@ Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models WenxuanDingHong Kong University of Science and Technology - ShangbinFengUniversity of Washington + ShangbinFengUniversity of Washington YuhanLiu - ZhaoxuanTanUniversity of Notre Dame + ZhaoxuanTanUniversity of Notre Dame VidhishaBalachandranResearch, Microsoft TianxingHe - YuliaTsvetkovDepartment of Computer Science, University of Washington + YuliaTsvetkovDepartment of Computer Science, University of Washington 2609-2636 We propose Knowledge Crosswords, a geometric knowledge reasoning benchmark consisting of incomplete knowledge networks bounded by structured factual constraints, where LLMs are tasked with inferring the missing facts to meet all constraints. The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA, such as backtracking, verifying facts and constraints, reasoning with uncertainty, and more. Knowledge Crosswords contains 2,101 individual problems, covering diverse knowledge domains, and is further divided into three difficulty levels. We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords. Results demonstrate that baseline approaches struggle with larger knowledge networks and semantically-equivalent entity distractors. In light of their limitations, we propose two new approaches, Staged Prompting and Verify-All, to augment LLMs’ abilities for error-aware backtracking and constraint verification. Our Verify-All significantly outperforms prior methods and is more robust towards problems in the hard subset. Further analysis shows that geometric knowledge reasoning poses new challenges to LLMs’ knowledge abilities, particularly in robustness towards varying option orders, complex structural constraints in knowledge networks, “none of the above” scenarios, and more. 
2024.findings-acl.154 @@ -8253,11 +8253,11 @@ <fixed-case>DELL</fixed-case>: Generating Reactions and Explanations for <fixed-case>LLM</fixed-case>-Based Misinformation Detection HerunWan - ShangbinFengUniversity of Washington - ZhaoxuanTanUniversity of Notre Dame + ShangbinFengUniversity of Washington + ZhaoxuanTanUniversity of Notre Dame HengWang - YuliaTsvetkovDepartment of Computer Science, University of Washington - MinnanLuoXi’an Jiaotong University + YuliaTsvetkovDepartment of Computer Science, University of Washington + MinnanLuoXi’an Jiaotong University 2637-2667 Large language models are limited by challenges in factuality and hallucinations to be directly employed off-the-shelf for judging the veracity of news articles, where factual accuracy is paramount. In this work, we propose DELL that identifies three key stages in misinformation detection where LLMs could be incorporated as part of the pipeline: 1) LLMs could generate news reactions to represent diverse perspectives and simulate user-news interaction networks; 2) LLMs could generate explanations for proxy tasks (e.g., sentiment, stance) to enrich the contexts of news articles and produce experts specializing in various aspects of news understanding; 3) LLMs could merge task-specific experts and provide an overall prediction by incorporating the predictions and confidence scores of varying experts. Extensive experiments on seven datasets with three LLMs demonstrate that DELL outperforms state-of-the-art baselines by up to 16.8% in macro f1-score. Further analysis reveals that the generated reactions and explanations are greatly helpful in misinformation detection, while our proposed LLM-guided expert merging helps produce better-calibrated predictions. 
2024.findings-acl.155 @@ -8273,7 +8273,7 @@ JingyuZhangJohns Hopkins University HaoranXuJohns Hopkins University BoyuanZhengOhio State University, Columbus - PhilippKoehnJohns Hopkins University + PhilippKoehnJohns Hopkins University DanielKhashabiJohns Hopkins University 2668-2680 As the influence of large language models (LLMs) spans across global communities, their safety challenges in multilingual settings become paramount for alignment research. This paper examines the variations in safety challenges faced by LLMs across different languages and discusses approaches to alleviating such concerns. By comparing how state-of-the-art LLMs respond to the same set of malicious prompts written in higher- vs. lower-resource languages, we observe that (1) LLMs tend to generate unsafe responses much more often when a malicious prompt is written in a lower-resource language, and (2) LLMs tend to generate more irrelevant responses to malicious prompts in lower-resource languages. To understand where the discrepancy can be attributed, we study the effect of instruction tuning with reinforcement learning from human feedback (RLHF) or supervised finetuning (SFT) on the HH-RLHF dataset. Surprisingly, while training with high-resource languages improves model alignment, training in lower-resource languages yields minimal improvement. This suggests that the bottleneck of cross-lingual alignment is rooted in the pretraining stage. Our findings highlight the challenges in cross-lingual LLM safety, and we hope they inform future research in this direction. 
@@ -8285,9 +8285,9 @@ Self-Specialization: Uncovering Latent Expertise within Large Language Models JunmoKangGeorgia Institute of Technology HongyinLuoMassachusetts Institute of Technology - YadaZhuIBM Research + YadaZhuIBM Research JacobHansen - JamesGlass + JamesGlass DavidCoxInternational Business Machines AlanRitterGeorgia Institute of Technology RogerioFerisInternational Business Machines @@ -8303,8 +8303,8 @@ FredXuUniversity of California, Los Angeles SongJiangFAIR ZijieHuangUniversity of California, Los Angeles - XiaoLuoUniversity of California, Los Angeles - ShichangZhangHarvard Business School + XiaoLuoUniversity of California, Los Angeles + ShichangZhangHarvard Business School YuanzhouChen, University of California, Los Angeles YizhouSunUniversity of California, Los Angeles 2707-2720 @@ -8315,7 +8315,7 @@ Chain of Logic: Rule-Based Reasoning with Large Language Models - SergioServantez + SergioServantez JoeBarrowPattern Data KristianHammond RajivJainAdobe Systems @@ -8350,11 +8350,11 @@ Simulated Misinformation Susceptibility (<fixed-case>SMISTS</fixed-case>): Enhancing Misinformation Research with Large Language Model Simulations - WeichengMaDartmouth College + WeichengMaDartmouth College ChunyuanDengRice University AramMoossavi LiliWang - SoroushVosoughiDartmouth College + SoroushVosoughiDartmouth College DiyiYangStanford University 2774-2788 Psychological inoculation, a strategy designed to build resistance against persuasive misinformation, has shown efficacy in curbing its spread and mitigating its adverse effects at early stages. Despite its effectiveness, the design and optimization of these inoculations typically demand substantial human and financial resources, primarily due to the need for repeated experimental trials. To address these challenges, this paper introduces Simulated Misinformation Susceptibility Tests (SMISTs), leveraging Large Language Models (LLMs) to simulate participant responses in misinformation studies. 
SMIST employs a life experience-driven simulation methodology, which accounts for various aspects of participants’ backgrounds, to mitigate common issues of caricatures and stereotypes in LLM simulations and enhance response diversity. Our extensive experimentation demonstrates that SMIST, utilizing GPT-4 as the backend model, yields results that align closely with those obtained from human-subject studies in misinformation susceptibility. This alignment suggests that LLMs can effectively serve as proxies in evaluating the impact of psychological inoculations. Moreover, SMIST offers the critical benefit of being applicable to emerging or anticipated misinformation scenarios without exposing human participants to potentially harmful content. This characteristic of SMIST not only preserves participant safety but also expands the scope of misinformation research to include more sensitive or speculative topics. @@ -8388,8 +8388,8 @@ <fixed-case>MODABS</fixed-case>: Multi-Objective Learning for Dynamic Aspect-Based Summarization - XiaoboGuo - SoroushVosoughiDartmouth College + XiaoboGuo + SoroushVosoughiDartmouth College 2814-2827 The rapid proliferation of online content necessitates effective summarization methods, among which dynamic aspect-based summarization stands out. Unlike its traditional counterpart, which assumes a fixed set of known aspects, this approach adapts to the varied aspects of the input text. We introduce a novel multi-objective learning framework employing a Longformer-Encoder-Decoder for this task. The framework optimizes aspect number prediction, minimizes disparity between generated and reference summaries for each aspect, and maximizes dissimilarity across aspect-specific summaries. Extensive experiments show our method significantly outperforms baselines on three diverse datasets, largely due to the effective alignment of generated and reference aspect counts without sacrificing single-aspect summarization quality. 
2024.findings-acl.165 @@ -8409,15 +8409,15 @@ Medical Dialogue System: A Survey of Categories, Methods, Evaluation and Challenges XiaomingShiEast China Normal University - ZemingLiu + ZemingLiu LiDu YuxuanWangZhejiang Lab, Zhejiang Lab - HongruWangThe Chinese University of Hong Kong + HongruWangThe Chinese University of Hong Kong YuhangGuo TongRuan - JieXu + JieXu XiaofanZhangShanghai Jiaotong University - ShaotingZhangUniversity of North Carolina at Charlotte + ShaotingZhangUniversity of North Carolina at Charlotte 2840-2861 This paper surveys and organizes research works of medical dialog systems, which is an important yet challenging task. Although these systems have been surveyed in the medical community from an application perspective, a systematic review from a rigorous technical perspective has to date remained noticeably absent. As a result, an overview of the categories, methods, evaluation of medical dialogue systems remain limited and underspecified, hindering the further improvement of this area. To fill this gap, we investigate an initial pool of 325 papers from well-known computer science, natural language processing conferences and journals, and make an overview. Recently, large language models have shown strong model capacity on downstream tasks, which also reshape medical dialog systems’ foundation. Despite the alluring practical application value, current medical dialogue systems still suffer from problems. To this end, this paper lists grand challenges of medical dialog systems, especially of large language models. 
2024.findings-acl.167 @@ -8426,8 +8426,8 @@ Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs - Minh-VuongNguyen - LinhaoLuo + Minh-VuongNguyen + LinhaoLuo FatemehShiri DinhPhung Yuan-FangLi @@ -8454,9 +8454,9 @@ Self-Supervised Position Debiasing for Large Language Models ZhongkunLiu - ZhengChen - MengqiZhangShandong University - ZhaochunRenLeiden University + ZhengChen + MengqiZhangShandong University + ZhaochunRenLeiden University PengjieRenShandong University ZhuminChenShandong University 2897-2917 @@ -8469,8 +8469,8 @@ <fixed-case>H</fixed-case>yper<fixed-case>CL</fixed-case>: A Contrastive Learning Framework for Hyper-Relational Knowledge Graph Embedding with Hierarchical Ontology YuhuanLu WeijianYu - XinJing - DingqiYangUniversity of Macau + XinJing + DingqiYangUniversity of Macau 2918-2929 2024.findings-acl.171 lu-etal-2024-hypercl @@ -8479,8 +8479,8 @@ Encoding Hierarchical Schema via Concept Flow for Multifaceted Ideology Detection SongtaoLiu - BangWangHuazhong University of Science and Technology - WeiXiangHuazhong University of Science and Technology + BangWangHuazhong University of Science and Technology + WeiXiangHuazhong University of Science and Technology HanXuHuazhong University of Science and Technology MinghuaXuHuazhong University of Science and Technology 2930-2942 @@ -8501,9 +8501,9 @@ <fixed-case>A</fixed-case>lign<fixed-case>RE</fixed-case>: An Encoding and Semantic Alignment Approach for Zero-Shot Relation Extraction - ZehanLiNortheastern University - FuZhangNortheastern University - JingweiChengNortheastern University, China + ZehanLiNortheastern University + FuZhangNortheastern University + JingweiChengNortheastern University, China 2957-2966 Zero-shot Relation Extraction (ZSRE) aims to predict unseen relations between entity pairs from input sentences. 
Existing prototype-based ZSRE methods encode relation descriptions into prototype embeddings and predict by measuring the similarity between sentence embeddings and prototype embeddings. However, these methods often overlook abundant side information of relations and suffer from a significant encoding gap between prototypes and sentences, limiting performance. To this end, we propose a framework named AlignRE, based on two Alignment methods for ZSRE. Specifically, we present a novel perspective centered on encoding schema alignment to enhance prototype-based ZSRE methods. We utilize well-designed prompt-tuning to bridge the encoding gap. To improve prototype quality, we explore and leverage multiple side information and propose a prototype aggregation method based on semantic alignment to create comprehensive relation prototype representations. We conduct experiments on FewRel and Wiki-ZSL datasets and consistently outperform state-of-the-art methods. Moreover, our method exhibits substantially faster performance and reduces the need for extensive manual labor in prototype construction. Code is available at https://github.com/lizehan1999/AlignRE. 2024.findings-acl.174 @@ -8516,7 +8516,7 @@ DengCaiTencent AI Lab LemaoLiuTencent ShumingShiTencent AI Lab - RuiYanRenmin University of China + RuiYanRenmin University of China 2967-2985 Supervised fine-tuning (SFT) on instruction-following corpus is a crucial approach toward the alignment of large language models (LLMs). However, the performance of LLMs on standard knowledge and reasoning benchmarks tends to suffer from deterioration at the latter stage of the SFT process, echoing the phenomenon of alignment tax. Through our pilot study, we put a hypothesis that the data biases are probably one cause behind the phenomenon. To address the issue, we introduce a simple disperse-then-merge framework. 
To be concrete, we disperse the instruction-following data into portions and then train multiple sub-models using different data portions. Lastly, we merge multiple models into a single one via model merging techniques. Despite its simplicity, our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks. 2024.findings-acl.175 @@ -8541,8 +8541,8 @@ Towards Precise Localization of Critical Errors in Machine Translation - DahyunJungKorea University - SugyeongEoKorea University + DahyunJungKorea University + SugyeongEoKorea University HeuiseokLimKorea University 3000-3012 The advent of large language models has experienced a remarkable improvement in the field of machine translation. However, machine translation is still vulnerable to critical meaning deviations, which may incur catastrophic issues in social or ethical contexts. In particular, existing critical error detection primarily focuses on identifying sentence-level errors, leaving the precise localization of such errors within the sentence unaddressed. In this paper, we introduce a new task, word-level critical error detection (WCED), to detect critical errors at a fine-grained level in machine translation sentences. The task aims to identify the parts of a machine translation that contain catastrophic meaning distortions. We hypothesize that the ability to determine errors at the sentence level will positively influence the detection of more granular errors. We propose a sentence-level error detection module to predict which words in a sentence have critical errors. Experimental results demonstrate that our method outperforms existing methodologies and LLM in En-De, Zh-En, En-Ru, and En-Ko. Our method is helpful for determining the fine-grained location of errors. We hope that such studies will improve the capacity to address critical errors adeptly. 
@@ -8568,7 +8568,7 @@ Speculative Decoding via Early-exiting for Faster <fixed-case>LLM</fixed-case> Inference with <fixed-case>T</fixed-case>hompson Sampling Control Mechanism JiahaoLiuMeituan - QifanWangMeta AI + QifanWangMeta AI JingangWangMeituan XunliangCai 3027-3043 @@ -8597,8 +8597,8 @@ MingdaoLiu RuiLu BowenWangTsinghua University, Tsinghua University - XiaoLiu - YuxiaoDongTsinghua University + XiaoLiu + YuxiaoDongTsinghua University JieTangTsinghua University, Tsinghua University 3053-3077 Open large language models (LLMs) with great performance in various tasks have significantly advanced the development of LLMs. However, they are far inferior to commercial models such as ChatGPT and GPT-4 when acting as agents to tackle complex tasks in the real world. These agent tasks employ LLMs as the central controller responsible for planning, memorization, and tool utilization, necessitating both fine-grained prompting methods and robust LLMs to achieve satisfactory performance. Though many prompting methods have been proposed to complete particular agent tasks, there is lack of research focusing on improving the agent capabilities of LLMs themselves without compromising their general abilities. In this work, we present AgentTuning, a simple and general method to enhance the agent abilities of LLMs while maintaining their general LLM capabilities. We construct AgentInstruct, a lightweight instruction-tuning dataset containing high-quality interaction trajectories. We employ a hybrid instruction-tuning strategy by combining AgentInstruct with open-source instructions from general domains. AgentTuning is used to instruction-tune the Llama 2 series, resulting in AgentLM. Our evaluations show that AgentTuning enables LLMs’ agent capabilities without compromising general abilities. The AgentLM-70B is comparable to GPT-3.5-turbo on unseen agent tasks, demonstrating generalized agent capabilities. 
We open source the AgentInstruct and AgentLM-7B, 13B, and 70B models at https://anonymous.4open.science/r/AgentTuning, serving open and powerful alternatives to commercial LLMs for agent tasks. @@ -8634,13 +8634,13 @@ A <fixed-case>C</fixed-case>hinese Dataset for Evaluating the Safeguards in Large Language Models YuxiaWang ZenanZhai - HaonanLi - XudongHanUniversity of Melbourne + HaonanLi + XudongHanUniversity of Melbourne ShomLin ZhenxuanZhang AngelaZhao - PreslavNakovMohamed bin Zayed University of Artificial Intelligence - TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + PreslavNakovMohamed bin Zayed University of Artificial Intelligence + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne 3106-3119 Many studies have demonstrated that large language models (LLMs) can produce harmful responses, exposing users to unexpected risks. Previous studies have proposed comprehensive taxonomies of LLM risks, as well as corresponding prompts that can be used to examine LLM safety. However, the focus has been almost exclusively on English. We aim to broaden LLM safety research by introducing a dataset for the safety evaluation of Chinese LLMs, and extending it to better identify false negative and false positive examples in terms of risky prompt rejections. We further present a set of fine-grained safety assessment criteria for each risk type, facilitating both manual annotation and automatic evaluation in terms of LLM response harmfulness. Our experiments over five LLMs show that region-specific risks are the prevalent risk type. Warning: this paper contains example data that may be offensive, harmful, or biased. Our data is available at https://github.com/Libr-AI/do-not-answer. 
2024.findings-acl.184 @@ -8651,7 +8651,7 @@ <fixed-case>LLMF</fixed-case>actor: Extracting Profitable Factors through Prompts for Explainable Stock Movement Prediction MeiyunWang KiyoshiIzumi - HirokiSakajiHokkaido University + HirokiSakajiHokkaido University 3120-3131 Recently, Large Language Models (LLMs) have attracted significant attention for their exceptional performance across a broad range of tasks, particularly in text analysis. However, the finance sector presents a distinct challenge due to its dependence on time-series data for complex forecasting tasks. In this study, we introduce a novel framework called LLMFactor, which employs Sequential Knowledge-Guided Prompting (SKGP) to identify factors that influence stock movements using LLMs. Unlike previous methods that relied on keyphrases or sentiment analysis, this approach focuses on extracting factors more directly related to stock market dynamics, providing clear explanations for complex temporal changes. Our framework directs the LLMs to create background knowledge through a fill-in-the-blank strategy and then discerns potential factors affecting stock prices from related news. Guided by background knowledge and identified factors, we leverage historical stock prices in textual format to predict stock movement. An extensive evaluation of the LLMFactor framework across four benchmark datasets from both the U.S. and Chinese stock markets demonstrates its superiority over existing state-of-the-art methods and its effectiveness in financial time-series forecasting. 2024.findings-acl.185 @@ -8660,7 +8660,7 @@ You Only Look at Screens: Multimodal Chain-of-Action Agents - ZhuoshengZhangShanghai Jiao Tong University + ZhuoshengZhangShanghai Jiao Tong University AstonZhangMeta 3132-3149 Autonomous graphical user interface (GUI) agents aim to facilitate task automation by interacting with the user interface without manual intervention. 
Recent studies have investigated eliciting the capabilities of large language models (LLMs) for effective engagement in diverse environments. To align with the input-output requirement of LLMs, most existing approaches are developed under a sandbox setting where they rely on external tools and application-specific APIs to parse the environment into textual elements and interpret the predicted actions. Consequently, those approaches often grapple with inference inefficiency and error propagation risks. To mitigate the challenges, we introduce Auto-GUI, a multimodal solution that directly interacts with the interface, bypassing the need for environment parsing or reliance on application-dependent APIs. Moreover, we propose a chain-of-action technique—leveraging a series of intermediate previous action histories and future action plans—to help the agent decide what action to execute. We evaluate our approach on a new device-control benchmark AITW with 30K unique instructions, spanning multi-step tasks such as application operation, web searching, and web shopping. Experimental results show that Auto-GUI achieves state-of-the-art performance with an action type prediction accuracy of 90% and an overall action success rate of 74%. Code is publicly available at https://github.com/cooelf/Auto-GUI. 
@@ -8685,7 +8685,7 @@ <fixed-case>GENDEX</fixed-case>: Generative Data Augmentation Strategy Leveraging External Data for Abstractive Dialogue Summarization SangwonParkGwangju Institute of Science and Technology - HongseokChoiElectronics and Telecommunications Research Institute + HongseokChoiElectronics and Telecommunications Research Institute DonghaChoiGwangju Institute of Science and Technology HyunjuLeeGwangju Institute of Science and Technology 3171-3185 @@ -8719,7 +8719,7 @@ Refine, Align, and Aggregate: Multi-view Linguistic Features Enhancement for Aspect Sentiment Triplet Extraction GuixinSu - MingminWu + MingminWu ZhongqiangHuang YongchengZhang TongguanWang @@ -8733,8 +8733,8 @@ Pro-Woman, Anti-Man? Identifying Gender Bias in Stance Detection - YingjieLiWestlake University - YueZhangWestlake University + YingjieLiWestlake University + YueZhangWestlake University 3229-3236 Gender bias has been widely observed in NLP models, which has the potential to perpetuate harmful stereotypes and discrimination. In this paper, we construct a dataset GenderStance of 36k samples to measure gender bias in stance detection, determining whether models consistently predict the same stance for a particular gender group. We find that all models are gender-biased and prone to classify sentences that contain male nouns as Against and those with female nouns as Favor. Moreover, extensive experiments indicate that sources of gender bias stem from the fine-tuning data and the foundation model itself. We will publicly release our code and dataset. 2024.findings-acl.192 @@ -8744,7 +8744,7 @@ Likelihood-based Mitigation of Evaluation Bias in Large Language Models MasanariOhi - MasahiroKanekoMohamed bin Zayed University of Artificial Intelligence and Tokyo Institute of Technology, Tokyo Institute of Technology + MasahiroKanekoMohamed bin Zayed University of Artificial Intelligence and Tokyo Institute of Technology, Tokyo Institute of Technology RyutoKoike MengsayLoemSansan, Inc. 
NaoakiOkazakiTokyo Institute of Technology @@ -8785,7 +8785,7 @@ From Role-Play to Drama-Interaction: An <fixed-case>LLM</fixed-case> Solution - WeiqiWu + WeiqiWu HongqiuWu LaiJiang XingyuanLiu @@ -8802,10 +8802,10 @@ JaewooAhnSeoul National University TaehyunLeeSeoul National University JunyoungLimSeoul National University - Jin-HwaKimSeoul National University and NAVER + Jin-HwaKimSeoul National University and NAVER SangdooYunNAVER - HwaranLeeNAVER AI Lab - GunheeKimSeoul National University + HwaranLeeNAVER AI Lab + GunheeKimSeoul National University 3291-3325 While Large Language Models (LLMs) can serve as agents to simulate human behaviors (i.e., role-playing agents), we emphasize the importance of point-in-time role-playing. This situates characters at specific moments in the narrative progression for three main reasons: (i) enhancing users’ narrative immersion, (ii) avoiding spoilers, and (iii) fostering engagement in fandom role-playing. To accurately represent characters at specific time points, agents must avoid character hallucination, where they display knowledge that contradicts their characters’ identities and historical timelines. We introduce TimeChara, a new benchmark designed to evaluate point-in-time character hallucination in role-playing LLMs. Comprising 10,895 instances generated through an automated pipeline, this benchmark reveals significant hallucination issues in current state-of-the-art LLMs (e.g., GPT-4o). To counter this challenge, we propose Narrative-Experts, a method that decomposes the reasoning steps and utilizes narrative experts to reduce point-in-time character hallucinations effectively. Still, our findings with TimeChara highlight the ongoing challenges of point-in-time character hallucination, calling for further study. 
2024.findings-acl.197 @@ -8815,11 +8815,11 @@ Red Teaming Visual Language Models MukaiLi - LeiLiUniversity of Hong Kong - YuweiYin + LeiLiUniversity of Hong Kong + YuweiYin MasoodAhmed ZhenguangLiuZhejiang University - QiLiuUniversity of Hong Kong + QiLiuUniversity of Hong Kong 3326-3342 VLMs (Vision-Language Models) extend the capabilities of LLMs (Large Language Models) to accept multimodal inputs. Since it has been verified that LLMs can be induced to generate harmful or inaccurate content through specific test cases (termed as Red Teaming), how VLMs perform in similar scenarios, especially with their combination of textual and visual inputs, remains a question. To explore this problem, we present a novel red teaming dataset RTVLM, which encompasses 12 subtasks (e.g., image misleading, multi-modal jailbreaking, face fairness, etc) under 4 primary aspects (faithfulness, privacy, safety, fairness). Our RTVLM is the first red teaming dataset to benchmark current VLMs in terms of these 4 different aspects. Detailed analysis shows that 10 prominent open-sourced VLMs struggle with the red teaming in different degrees and have up to 31% performance gap with GPT-4V. Additionally, we simply apply red teaming alignment to LLaVA-v1.5 with Supervised Fine-tuning (SFT) using RTVLM, and this bolsters the models’ performance with 10% in RTVLM test set, 13% in MM-hallu, and without noticeable decline in MM-Bench, overpassing other LLaVA-based models in similar size with regular alignment data. This reveals that current open-sourced VLMs still lack red teaming alignment. Our code and datasets will be open-sourced. 2024.findings-acl.198 @@ -8832,7 +8832,7 @@ DapengChenHuawei Technologies Ltd. YajingSun RongjunLi - ZhiyongFengTianjin University + ZhiyongFengTianjin University WeiPengHuawei Technologies Ltd. 
3343-3353 A Large Language Model (LLM) tends to generate inconsistent and sometimes contradictory outputs when presented with a prompt that has equivalent semantics but is expressed differently from the original prompt. To achieve semantic consistency of an LLM, one of the key approaches is to finetune the model with prompt-output pairs with semantically equivalent meanings. Despite its effectiveness, a data-driven finetuning method incurs substantial computation costs in data preparation and model optimization. In this regime, an LLM is treated as a “black box”, restricting our ability to gain deeper insights into its internal mechanism. In this paper, we are motivated to enhance the semantic consistency of LLMs through a more interpretable method (i.e., model editing) to this end. We first identify the model components (i.e., attention heads) that have a key impact on the semantic consistency of an LLM. We subsequently inject biases into the output of these model components along the semantic-consistency activation direction. It is noteworthy that these modifications are cost-effective, without reliance on mass manipulations of the original model parameters. Through comprehensive experiments on the constructed NLU and open-source NLG datasets, our method demonstrates significant improvements in the semantic consistency and task performance of LLMs. Additionally, our method exhibits promising generalization capabilities by performing well on tasks beyond the primary tasks. @@ -8846,7 +8846,7 @@ SeungHyunKim YoungsooJangLG AI Research MoontaeLeeUniversity of Illinois, Chicago - HongukWoo + HongukWoo 3354-3376 In embodied instruction-following (EIF), the integration of pretrained language models (LMs) as task planners emerges as a significant branch, where tasks are planned at the skill level by prompting LMs with pretrained skills and user instructions. 
However, grounding these pretrained skills in different domains remains challenging due to their intricate entanglement with the domain-specific knowledge. To address this challenge, we present a semantic skill grounding (SemGro) framework that leverages the hierarchical nature of semantic skills. SemGro recognizes the broad spectrum of these skills, ranging from short-horizon low-semantic skills that are universally applicable across domains to long-horizon rich-semantic skills that are highly specialized and tailored for particular domains. The framework employs an iterative skill decomposition approach, starting from the higher levels of semantic skill hierarchy and then moving downwards, so as to ground each planned skill to an executable level within the target domain. To do so, we use the reasoning capabilities of LMs for composing and decomposing semantic skills, as well as their multi-modal extension for assessing the skill feasibility in the target domain. Our experiments in the VirtualHome benchmark show the efficacy of SemGro in 300 cross-domain EIF scenarios. 2024.findings-acl.200 @@ -8857,7 +8857,7 @@ <fixed-case>LIRE</fixed-case>: listwise reward enhancement for preference alignment MingyeZhu YiLiuState Key Laboratory of Communication Content Cognition - LeiZhangUniversity of Science and Technology of China + LeiZhangUniversity of Science and Technology of China JunboGuoPeople’s Daily Online ZhendongMaoUniversity of Science and Technology of China 3377-3394 @@ -8873,7 +8873,7 @@ Seung HwanKimLG AI Research SoonyoungLee BumsooKimLG AI Research - GunheeKimSeoul National University + GunheeKimSeoul National University 3395-3405 3D dense captioning is a task to localize objects in a 3D scene and generate descriptive sentences for each object. Recent approaches in 3D dense captioning have adopted transformer encoder-decoder frameworks from object detection to build an end-to-end pipeline without hand-crafted components. 
However, these approaches struggle with contradicting objectives where a single query attention has to simultaneously view both the tightly localized object regions and contextual environment. To overcome this challenge, we introduce SIA (See-It-All), a transformer pipeline that engages in 3D dense captioning with a novel paradigm called late aggregation. SIA simultaneously decodes two sets of queries—context query and instance query. The instance query focuses on localization and object attribute descriptions, while the context query versatilely captures the region-of-interest of relationships between multiple objects or with the global scene, then aggregated afterwards (i.e., late aggregation) via simple distance-based measures. To further enhance the quality of contextualized caption generation, we design a novel aggregator to generate a fully informed caption based on the surrounding context, the global environment, and object instances. Extensive experiments on two of the most widely-used 3D dense captioning datasets demonstrate that our proposed method achieves a significant improvement over prior methods. 2024.findings-acl.202 @@ -8883,7 +8883,7 @@ <tex-math>\texttt{DARA}</tex-math>: Decomposition-Alignment-Reasoning Autonomous Language Agent for Question Answering over Knowledge Graphs HaishuoFangTechnische Universität Darmstadt - XiaodanZhuQueen’s University + XiaodanZhuQueen’s University IrynaGurevychMohamed bin Zayed University of Artificial Intelligence and Technical University of Darmstadt 3406-3432 Answering Questions over Knowledge Graphs (KGQA) is key to well-functioning autonomous language agents in various real-life applications. To improve the neural-symbolic reasoning capabilities of language agents powered by Large Language Models (LLMs) in KGQA, we propose the Decomposition-Alignment-Reasoning Agent (DARA) framework. 
DARA effectively parses questions into formal queries through a dual mechanism: high-level iterative task decomposition and low-level task grounding. Importantly, DARA can be efficiently trained with a small number of high-quality reasoning trajectories. Our experimental results demonstrate that DARA fine-tuned on LLMs (e.g. Llama-2-7B, Mistral) outperforms both in-context learning-based agents with GPT-4 and alternative fine-tuned agents, across different benchmarks, making such models more accessible for real-life applications. We also show that DARA attains performance comparable to state-of-the-art enumerating-and-ranking-based methods for KGQA. @@ -8905,7 +8905,7 @@ Compositional Generalization with Grounded Language Models SondreWold - ÉtienneSimon + ÉtienneSimon LucasCharpentierUniversity of Oslo EgorKostylevUniversity of Oslo, Norway ErikVelldalUniversity of Oslo @@ -8919,10 +8919,10 @@ Rethinking Negative Instances for Generative Named Entity Recognition YuyangDing - JuntaoLiSoochow University, China + JuntaoLiSoochow University, China PinzhengWang - ZechengTangSoochow University - YanBowen + ZechengTangSoochow University + YanBowen MinZhangHarbin Institute of Technology, Shenzhen 3461-3475 Large Language Models (LLMs) have demonstrated impressive capabilities for generalizing in unseen tasks. In the Named Entity Recognition (NER) task, recent advancements have seen the remarkable improvement of LLMs in a broad range of entity domains via instruction tuning, by adopting entity-centric schema. In this work, we explore the potential enhancement of the existing methods by incorporating negative instances into training. Our experiments reveal that negative instances contribute to remarkable improvements by (1) introducing contextual information, and (2) clearly delineating label boundaries. 
Furthermore, we introduce an efficient longest common subsequence (LCS) matching algorithm, which is tailored to transform unstructured predictions into structured entities. By integrating these components, we present GNER, a Generative NER system that shows improved zero-shot performance across unseen entity domains. Our comprehensive evaluation illustrates our system’s superiority, surpassing state-of-the-art (SoTA) methods by 9 F_1 score in zero-shot evaluation. @@ -8970,9 +8970,9 @@ How Much Does Nonverbal Communication Conform to Entropy Rate Constancy?: A Case Study on Listener Gaze in Interaction YuWangUniversität Bielefeld - YangXuSouthern University of Science and Technology - GabrielSkantzeKTH Royal Institute of Technology, Stockholm, Sweden - HendrikBuschmeierUniversität Bielefeld + YangXuSouthern University of Science and Technology + GabrielSkantzeKTH Royal Institute of Technology, Stockholm, Sweden + HendrikBuschmeierUniversität Bielefeld 3533-3545 According to the Entropy Rate Constancy (ERC) principle, the information density of a text is approximately constant over its length. Whether this principle also applies to nonverbal communication signals is still under investigation. We perform empirical analyses of video-recorded dialogue data and investigate whether listener gaze, as an important nonverbal communication signal, adheres to the ERC principle. Results show (1) that the ERC principle holds for listener gaze; and (2) that the two linguistic factors syntactic complexity and turn transition potential are weakly correlated with local entropy of listener gaze. 
2024.findings-acl.210 @@ -9013,12 +9013,12 @@ Measuring Bargaining Abilities of <fixed-case>LLM</fixed-case>s: A Benchmark and A Buyer-Enhancement Method TianXia - ZhiweiHeShanghai Jiao Tong University + ZhiweiHeShanghai Jiao Tong University TongRen YiboMiao - ZhuoshengZhangShanghai Jiao Tong University + ZhuoshengZhangShanghai Jiao Tong University YangYang - RuiWangShanghai Jiao Tong University + RuiWangShanghai Jiao Tong University 3579-3602 Bargaining is an important and unique part of negotiation between humans. As LLM-driven agents learn to negotiate and act like real humans, how to evaluate agents’ bargaining abilities remains an open problem. For the first time, we formally described the Bargaining task as an asymmetric incomplete information game, defining the gains of the Buyer and Seller in multiple bargaining processes. It allows us to quantitatively assess an agent’s performance in the Bargain task. We collected a real product price dataset, AmazonHistoryPrice, and conducted evaluations of various LLM agents’ bargaining abilities. We find that playing a Buyer is much harder than a Seller, and increasing model size can not effectively improve the Buyer’s performance. To address the challenge, we propose a novel approach called OG-Narrator that integrates a deterministic Offer Generator to control the price range of Buyer’s offers, and an LLM Narrator to create natural language sentences for generated offers. Experimental results show that OG-Narrator improves the buyer’s deal rates from 26.67% to 88.88% and brings a ten times multiplication of profits on all baselines, even a model that has not been aligned.
2024.findings-acl.213 @@ -9041,7 +9041,7 @@ XuanmingZhang YuqiZhu YihongDongPeking University - ZhiJinPeking University and Peking University + ZhiJinPeking University and Peking University BinhuaLi FeiHuangAlibaba Group YongbinLiAlibaba Group @@ -9070,7 +9070,7 @@ Aligning Speech Segments Beyond Pure Semantics KevinHeffernanFacebook ArtyomKozhevnikov - LoicBarrault + LoicBarrault AlexandreMourachkoResearch, Facebook HolgerSchwenk 3626-3635 @@ -9088,7 +9088,7 @@ YicongLi Jay ZhangjieWuNational University of Singapore Cong-DuyNguyenSchool of Computer Science and Engineering, Nanyang Technological University - See-KiongNgNational University of Singapore + See-KiongNgNational University of Singapore Anh TuanLuuNanyang Technological University 3636-3657 Humans use multiple senses to comprehend the environment. Vision and language are two of the most vital senses since they allow us to easily communicate our thoughts and perceive the world around us. There has been a lot of interest in creating video-language understanding systems with human-like senses since a video-language pair can mimic both our linguistic medium and visual environment with temporal dynamics. In this survey, we review the key tasks of these systems and highlight the associated challenges. Based on the challenges, we summarize their methods from model architecture, model training, and data perspectives. We also conduct performance comparison among the methods, and discuss promising directions for future research. 
@@ -9111,12 +9111,12 @@ A + <fixed-case>B</fixed-case>: A General Generator-Reader Framework for Optimizing <fixed-case>LLM</fixed-case>s to Unleash Synergy Potential - WeiTang + WeiTang YixinCaoFudan University JiahaoYing - BoWangSchool of Computer Science & Technology, Beijing Institute of Technology + BoWangSchool of Computer Science & Technology, Beijing Institute of Technology YuyueZhao - YongLiaoUniversity of Science and Technology of China and China Academic of Electronics and Information Technology + YongLiaoUniversity of Science and Technology of China and China Academic of Electronics and Information Technology PengZhouAarhus University 3670-3685 Retrieval-Augmented Generation (RAG) is an effective solution to supplement necessary knowledge to large language models (LLMs). Targeting its bottleneck of retriever performance, “generate-then-read” pipeline is proposed to replace the retrieval stage with generation from the LLM itself. Although promising, this research direction is underexplored and still cannot work in the scenario when source knowledge is given. In this paper, we formalize a general “A + B” framework with varying combinations of foundation models and types for systematic investigation. We explore the efficacy of the base and chat versions of LLMs and found their different functionalities suitable for generator A and reader B, respectively. Their combinations consistently outperform single models, especially in complex scenarios. Furthermore, we extend the application of the “A + B” framework to scenarios involving source documents through continuous learning, enabling the direct integration of external knowledge into LLMs. This approach not only facilitates effective acquisition of new knowledge but also addresses the challenges of safety and helpfulness post-adaptation. The paper underscores the versatility of the “A + B” framework, demonstrating its potential to enhance the practical application of LLMs across various domains. 
@@ -9137,7 +9137,7 @@ Adversarial Preference Optimization: Enhancing Your Alignment via <fixed-case>RM</fixed-case>-<fixed-case>LLM</fixed-case> Game - PengyuChengTencent + PengyuChengTencent YifanYangTencent AI Lab JianLiTencent YongDaiTencent AI Lab @@ -9159,7 +9159,7 @@ ChenweiZhangUniversity of Hong Kong ZhechaoZhu ZehaiZhou - XiangjieKong + XiangjieKong 3717-3726 Aspect sentiment quad prediction (ASQP) has garnered significant attention in aspect-based sentiment analysis (ABSA). Current ASQP research primarily relies on pre-trained generative language models to produce templated sequences, often complemented by grid-based auxiliary methods. Despite these efforts, the persistent challenge of generation instability remains unresolved and the effectiveness of grid methods remains underexplored in current studies. To this end, we introduce Grid Noise Diffusion Pinpoint Network (GDP), a T5-based generative model aiming to tackle the issue of generation instability. The model consists of three novel modules, including Diffusion Vague Learning (DVL) to facilitate effective model learning and enhance overall robustness; Consistency Likelihood Learning (CLL) to discern the characteristics and commonalities of sentiment elements and thus reduce the impact of distributed noise; and GDP-FOR, a novel generation template, to enable models to generate outputs in a more natural way. Extensive experiments on four datasets demonstrate the remarkable effectiveness of our approach in addressing ASQP tasks. 
2024.findings-acl.222 @@ -9170,9 +9170,9 @@ Continual Contrastive Spoken Language Understanding UmbertoCappellazzo EnricoFiniApple - MuqiaoYang + MuqiaoYang DanieleFalavignaFondazione Bruno Kessler - AlessioBruttiFondazione Bruno Kessler + AlessioBruttiFondazione Bruno Kessler BhikshaRajCarnegie Mellon University, Carnegie Mellon University and Mohamed bin Zayed University of Artificial Intelligence 3727-3741 Recently, neural networks have shown impressive progress across diverse fields, with speech processing being no exception. However, recent breakthroughs in this area require extensive offline training using large datasets and tremendous computing resources. Unfortunately, these models struggle to retain their previously acquired knowledge when learning new tasks continually. In this paper, we investigate the problem of learning sequence-to-sequence models for spoken language understanding in a class-incremental learning (CIL) setting and we propose COCONUT, a CIL method that relies on the combination of experience replay and contrastive learning. Through a modified version of the standard supervised contrastive loss, COCONUT preserves the learned representations by pulling closer samples from the same class and pushing away the others. Moreover, we leverage a multimodal contrastive loss that helps the model learn more discriminative representations of the new data by aligning audio and text features. We also investigate different contrastive designs to combine the strengths of the contrastive loss with teacher-student architectures used for distillation. Experiments on two established SLU datasets reveal the effectiveness of our proposed approach and significant improvements over the baselines. We also show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further metrics improvements. 
@@ -9185,7 +9185,7 @@ KaiWang YuweiXu ZhiyongWuShanghai Artificial Intelligence Laboratory - SiqiangLuoNanyang Technological University + SiqiangLuoNanyang Technological University 3742-3759 Knowledge Graph (KG) inductive reasoning, which aims to infer missing facts from new KGs that are not seen during training, has been widely adopted in various applications. One critical challenge of KG inductive reasoning is handling low-resource scenarios with scarcity in both textual and structural aspects. In this paper, we attempt to address this challenge with Large Language Models (LLMs). Particularly, we utilize the state-of-the-art LLMs to generate a graph-structural prompt to enhance the pre-trained Graph Neural Networks (GNNs), which brings us new methodological insights into the KG inductive reasoning methods, as well as high generalizability in practice. On the methodological side, we introduce a novel pretraining and prompting framework ProLINK, designed for low-resource inductive reasoning across arbitrary KGs without requiring additional training. On the practical side, we experimentally evaluate our approach on 36 low-resource KG datasets and find that ProLINK outperforms previous methods in three-shot, one-shot, and zero-shot reasoning tasks, exhibiting average performance improvements by 20%, 45%, and 147%, respectively. Furthermore, ProLINK demonstrates strong robustness for various LLM promptings as well as full-shot scenarios. 2024.findings-acl.224 @@ -9195,8 +9195,8 @@ Unsupervised Parsing by Searching for Frequent Word Sequences among Sentences with Equivalent Predicate-Argument Structures JunjieChenthe University of Tokyo - XianghengHe - DanushkaBollegalaAmazon and University of Liverpool + XianghengHe + DanushkaBollegalaAmazon and University of Liverpool YusukeMiyaoThe University of Tokyo 3760-3772 Unsupervised constituency parsing focuses on identifying word sequences that form a syntactic unit (i.e., constituents) in target sentences. 
Linguists identify the constituent by evaluating a set of Predicate-Argument Structure (PAS) equivalent sentences where we find the constituent appears more frequently than non-constituents (i.e., the constituent corresponds to a frequent word sequence within the sentence set). However, such frequency information is unavailable in previous parsing methods that identify the constituent by observing sentences with diverse PAS. In this study, we empirically show that constituents correspond to frequent word sequences in the PAS-equivalent sentence set. We propose a frequency-based parser, span-overlap, that (1) computes the span-overlap score as the word sequence’s frequency in the PAS-equivalent sentence set and (2) identifies the constituent structure by finding a constituent tree with the maximum span-overlap score. The parser achieves state-of-the-art level parsing accuracy, outperforming existing unsupervised parsers in eight out of ten languages. Additionally, we discover a multilingual phenomenon: participant-denoting constituents tend to have higher span-overlap scores than equal-length event-denoting constituents, meaning that the former tend to appear more frequently in the PAS-equivalent sentence set than the latter. The phenomenon indicates a statistical difference between the two constituent types, laying the foundation for future labeled unsupervised parsing research. @@ -9208,9 +9208,9 @@ Data-Centric Explainable Debiasing for Improving Fairness in Pre-trained Language Models YingjiLiJilin University MengnanDuNew Jersey Institute of Technology - RuiSongJilin University - XinWangJilin University - YingWangJilin University + RuiSongJilin University + XinWangJilin University + YingWangJilin University 3773-3786 Human-like social bias of pre-trained language models (PLMs) on downstream tasks have attracted increasing attention. The potential flaws in the training data are the main factor that causes unfairness in PLMs. 
Existing data-centric debiasing strategies mainly leverage explicit bias words (defined as sensitive attribute words specific to demographic groups) for counterfactual data augmentation to balance the training data. However, they lack consideration of implicit bias words potentially associated with explicit bias words in complex distribution data, which indirectly harms the fairness of PLMs. To this end, we propose a **Data**-Centric **Debias**ing method (named Data-Debias), which uses an explainability method to search for implicit bias words to assist in debiasing PLMs. Specifically, we compute the feature attributions of all tokens using the Integrated Gradients method, and then treat the tokens that have a large impact on the model’s decision as implicit bias words. To make the search results more precise, we iteratively train a biased model to amplify the bias with each iteration. Finally, we use the implicit bias words searched in the last iteration to assist in debiasing PLMs. Extensive experimental results on multiple PLMs debiasing on three different classification tasks demonstrate that Data-Debias achieves state-of-the-art debiasing performance and strong generalization while maintaining predictive abilities. 
2024.findings-acl.226 @@ -9220,7 +9220,7 @@ Knowledge-Driven Cross-Document Relation Extraction MonikaJainIndraprastha Institute of Information Technology, Delhi - RaghavaMutharajuIndraprastha Institute of Information Technology, Delhi, India + RaghavaMutharajuIndraprastha Institute of Information Technology, Delhi, India KuldeepSinghCerence GmbH RamakanthKavuluruUniversity of Kentucky 3787-3797 @@ -9241,9 +9241,9 @@ <fixed-case>KG</fixed-case>-Adapter: Enabling Knowledge Graph Integration in Large Language Models through Parameter-Efficient Fine-Tuning - ShiyuTian + ShiyuTian YangyangLuoAlibaba Group - TianzeXuBeijing University of Posts and Telecommunications + TianzeXuBeijing University of Posts and Telecommunications CaixiaYuan HuixingJiangLi Auto ChenWei @@ -9261,7 +9261,7 @@ PengliLiukuaishou QingyangLi YanGong - JunchenWan + JunchenWan FuzhengZhang ZhongyuanWangKuaishou Inc. and Kuaishou DiZhangKuaishou Technology @@ -9288,11 +9288,11 @@ Improving In-Context Learning with Prediction Feedback for Sentiment Analysis HonglingXuHarbin Institute of Technology - QianlongWang + QianlongWang YiceZhang MinYangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences XiZeng - BingQinHarbin Institute of Technology + BingQinHarbin Institute of Technology RuifengXuHarbin Institute of Technology 3879-3890 Large language models (LLMs) have achieved promising results in sentiment analysis through the in-context learning (ICL) paradigm. However, their ability to distinguish subtle sentiments still remains a challenge. Inspired by the human ability to adjust understanding via feedback, this paper enhances ICL by incorporating prior predictions and feedback, aiming to rectify sentiment misinterpretation of LLMs. 
Specifically, the proposed framework consists of three steps: (1) acquiring prior predictions of LLMs, (2) devising predictive feedback based on correctness, and (3) leveraging a feedback-driven prompt to refine sentiment understanding. Experimental results across nine sentiment analysis datasets demonstrate the superiority of our framework over conventional ICL methods, with an average F1 improvement of 5.95%. @@ -9304,10 +9304,10 @@ Can Large Language Models Mine Interpretable Financial Factors More Effectively? A Neural-Symbolic Factor Mining Agent Model ZhiweiLiRenmin University of China RanSongKunmimg University of Science and Technology - CaihongSunRenmin University of China + CaihongSunRenmin University of China WeiXu - ZhengtaoYuKunming University of Science and Technology - Ji-RongWenRenmin University of China + ZhengtaoYuKunming University of Science and Technology + Ji-RongWenRenmin University of China 3891-3902 Finding interpretable factors for stock returns is the most vital issue in the empirical asset pricing domain. As data-driven methods, existing factor mining models can be categorized into symbol-based and neural-based models. Symbol-based models are interpretable but inefficient, while neural-based approaches are efficient but lack interpretability. Hence, mining interpretable factors effectively presents a significant challenge. Inspired by the success of Large Language Models (LLMs) in various tasks, we propose a FActor Mining Agent (FAMA) model that enables LLMs to integrate the strengths of both neural and symbolic models for factor mining. In this paper, FAMA consists of two main components: Cross-Sample Selection (CSS) and Chain-of-Experience (CoE). CSS addresses the homogeneity challenges in LLMs during factor mining by assimilating diverse factors as in-context samples, whereas CoE enables LLMs to leverage past successful mining experiences, expediting the mining of effective factors. 
Experimental evaluations on real-world stock market data demonstrate the effectiveness of our approach by surpassing the SOTA RankIC by 0.006 and RankICIR by 0.105 in predicting S&P 500 returns. Furthermore, the investment simulation shows that our model can achieve superior performance with an annualized return of 38.4% and a Sharpe ratio of 667.2%. 2024.findings-acl.233 @@ -9331,12 +9331,12 @@ <fixed-case>SALAD</fixed-case>-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models LijunLiShanghai Artificial Intelligence Laboratory - BowenDong + BowenDong RuohuiWang XuhaoHu WangmengZuoHarbin Institute of Technology DahuaLinThe Chinese University of Hong Kong - YuQiao + YuQiao JingShaoShanghai AI Laboratory 3923-3954 In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice. To effectively manage the inherent complexity, we introduce an innovative evaluators: the LLM-based MD-Judge for QA pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. Above components extend SALAD-Bench from standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring the joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. 
Data and evaluator are released under https://github.com/OpenSafetyLab/SALAD-BENCH @@ -9348,9 +9348,9 @@ Extracting and Encoding: Leveraging Large Language Models and Medical Knowledge to Enhance Radiological Text Representation PabloMessina ReneVidalUniversity of Pennsylvania and Amazon - DenisParraPontificia Universidad Catolica de Chile + DenisParraPontificia Universidad Catolica de Chile AlvaroSoto - VladimirAraujoKU Leuven + VladimirAraujoKU Leuven 3955-3986 Advancing representation learning in specialized fields like medicine remains challenging due to the scarcity of expert annotations for text and images. To tackle this issue, we present a novel two-stage framework designed to extract high-quality factual statements from free-text radiology reports in order to improve the representations of text encoders and, consequently, their performance on various downstream tasks.In the first stage, we propose a Fact Extractor that leverages large language models (LLMs) to identify factual statements from well-curated domain-specific datasets. In the second stage, we introduce a Fact Encoder (CXRFE) based on a BERT model fine-tuned with objective functions designed to improve its representations using the extracted factual data. Our framework also includes a new embedding-based metric (CXRFEScore) for evaluating chest X-ray text generation systems, leveraging both stages of our approach. Extensive evaluations show that our fact extractor and encoder outperform current state-of-the-art methods in tasks such as sentence ranking, natural language inference, and label extraction from radiology reports. Additionally, our metric proves to be more robust and effective than existing metrics commonly used in the radiology report generation literature. The code of this project is available at https://github.com/PabloMessina/CXR-Fact-Encoder. 
2024.findings-acl.236 @@ -9360,8 +9360,8 @@ <fixed-case>GNN</fixed-case>avi: Navigating the Information Flow in Large Language Models by Graph Neural Network ShuzhouYuan - ErcongNie - MichaelFärberTechnische Universität Dresden + ErcongNie + MichaelFärberTechnische Universität Dresden HelmutSchmidCenter for Information and Language Processing HinrichSchuetze 3987-4001 @@ -9372,12 +9372,12 @@ <fixed-case>M</fixed-case>-<fixed-case>QALM</fixed-case>: A Benchmark to Assess Clinical Reading Comprehension and Knowledge Recall in Large Language Models via Question Answering - AnandSubramanian + AnandSubramanian ViktorSchlegelImperial College London AbhinavRamesh Kashyap Thanh-TungNguyenasus Vijay PrakashDwivedi - StefanWinklerNational University of Singapore + StefanWinklerNational University of Singapore 4002-4042 There is vivid research on adapting Large Language Models (LLMs) to perform a variety of tasks in high-stakes domains such as healthcare. Despite their popularity, there is a lack of understanding of the extent and contributing factors that allow LLMs to recall relevant knowledge and combine it with presented information in the clinical and biomedical domain: a fundamental pre-requisite for success on down-stream tasks.Addressing this gap, we use Multiple Choice and Abstractive Question Answering to conduct a large-scale empirical study on 22 datasets in three generalist and three specialist biomedical sub-domains. Our multifaceted analysis of the performance of 15 LLMs, further broken down by sub-domain, source of knowledge and model architecture, uncovers success factors such as instruction tuning that lead to improved recall and comprehension. We further show that while recently proposed domain-adapted models may lack adequate knowledge, directly fine-tuning on our collected medical knowledge datasets shows encouraging results, even generalising to unseen specialist sub-domains. 
We complement the quantitative results with a skill-oriented manual error analysis, which reveals a significant gap between the models’ capabilities to simply recall necessary knowledge and to integrate it with the presented context.To foster research and collaboration in this field we share M-QALM, our resources, standardised methodology, and evaluation results, with the research community to facilitate further advancements in clinical knowledge representation learning within language models. 2024.findings-acl.238 @@ -9387,7 +9387,7 @@ <fixed-case>M</fixed-case>ovie<fixed-case>S</fixed-case>um: An Abstractive Summarization Dataset for Movie Screenplays RohitSaxenaUniversity of Edinburgh, University of Edinburgh - FrankKellerUniversity of Edinburgh + FrankKellerUniversity of Edinburgh 4043-4050 Movie screenplay summarization is challenging, as it requires an understanding of long input contexts and various elements unique to movies. Large language models have shown significant advancements in document summarization, but they often struggle with processing long input contexts. Furthermore, while television transcripts have received attention in recent studies, movie screenplay summarization remains underexplored. To stimulate research in this area, we present a new dataset, MovieSum, for abstractive summarization of movie screenplays. This dataset comprises 2200 movie screenplays accompanied by their Wikipedia plot summaries. We manually formatted the movie screenplays to represent their structural elements. Compared to existing datasets, MovieSum possesses several distinctive features: 1) It includes movie screenplays which are longer than scripts of TV episodes. 2) It is twice the size of previous movie screenplay datasets. 3) It provides metadata with IMDb IDs to facilitate access to additional external knowledge. We also show the results of recently released large language models applied to summarization on our dataset to provide a detailed baseline. 
2024.findings-acl.239 @@ -9396,18 +9396,18 @@ Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality - JiahuanPeiCentrum voor Wiskunde en Informatica - IreneViola + JiahuanPeiCentrum voor Wiskunde en Informatica + IreneViola HaochenHuang - JunxiaoWangKing Abdullah University of Science and Technology + JunxiaoWangKing Abdullah University of Science and Technology MoonisaAhsan FanghuaYe JiangYiming YaoSai - DiWangKAUST + DiWangKAUST ZhuminChenShandong University PengjieRenShandong University - PabloCesarDelft University of Technology and Centrum Wiskunde & Informatica (CWI) + PabloCesarDelft University of Technology and Centrum Wiskunde & Informatica (CWI) 4051-4066 Autonomous artificial intelligence (AI) agents have emerged as promising protocols for automatically understanding the language-based environment, particularly with the exponential development of large language models (LLMs). However, a fine-grained, comprehensive understanding of multimodal environments remains under-explored. This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment. Specifically, we design a cerebral language agent that integrates LLM with memory, planning, and interaction with XR tools and a vision-language agent, enabling agents to decide their actions based on past experiences. Furthermore, we introduce LEGO-MRTA, a multimodal fine-grained assembly dialogue dataset synthesized automatically in the workflow served by a commercial LLM. This dataset comprises multimodal instruction manuals, conversations, XR responses, and vision question answering. Last, we present several prevailing open-resource LLMs as benchmarks, assessing their performance with and without fine-tuning on the proposed dataset. 
We anticipate that the broader impact of this workflow will advance the development of smarter assistants for seamless user interaction in XR environments, fostering research in both AI and HCI communities. 2024.findings-acl.240 @@ -9431,7 +9431,7 @@ AbhaySheshadri VictorLevoso PaulSwobodaHeinrich-Heine University Düsseldorf - ChristianBarteltUniversität Mannheim + ChristianBarteltUniversität Mannheim 4082-4102 Transformers demonstrate impressive performance on a range of reasoning benchmarks. To evaluate the degree to which these abilities are a result of actual reasoning, existing work has focused on developing sophisticated benchmarks for behavioral studies. However, these studies do not provide insights into the internal mechanisms driving the observed capabilities. To improve our understanding of the internal mechanisms of transformers, we present a comprehensive mechanistic analysis of a transformer trained on a synthetic reasoning task. We identify a set of interpretable mechanisms the model uses to solve the task, and validate our findings using correlational and causal evidence. Our results suggest that it implements a depth-bounded recurrent mechanisms that operates in parallel and stores intermediate results in selected token positions. We anticipate that the motifs we identified in our synthetic setting can provide valuable insights into the broader operating principles of transformers and thus provide a basis for understanding more complex models. 
2024.findings-acl.242 @@ -9441,11 +9441,11 @@ Optimal Transport Guided Correlation Assignment for Multimodal Entity Linking ZefengZhang - JiaweiShengInstitute of Information Engineering, Chinese Academy of Sciences + JiaweiShengInstitute of Information Engineering, Chinese Academy of Sciences ZhangChuang - LiangyunzhiLiangyunzhi - WenyuanZhang - SiqiWang + LiangyunzhiLiangyunzhi + WenyuanZhang + SiqiWang TingwenLiuInstitute of Information Engineering, Chinese Academy of Sciences 4103-4117 Multimodal entity linking (MEL) aims to link ambiguous mentions in multimodal contexts to entities in a multimodal knowledge graph. A pivotal challenge is to fully leverage multi-element correlations between mentions and entities to bridge modality gap and enable fine-grained semantic matching. Existing methods attempt several local correlative mechanisms, relying heavily on the automatically learned attention weights, which may over-concentrate on partial correlations. To mitigate this issue, we formulate the correlation assignment problem as an optimal transport (OT) problem, and propose a novel MEL framework, namely OT-MEL, with OT-guided correlation assignment. Thereby, we exploit the correlation between multimodal features to enhance multimodal fusion, and the correlation between mentions and entities to enhance fine-grained matching. To accelerate model prediction, we further leverage knowledge distillation to transfer OT assignment knowledge to attention mechanism. Experimental results show that our model significantly outperforms previous state-of-the-art baselines and confirm the effectiveness of the OT-guided correlation assignment. @@ -9456,7 +9456,7 @@ On Efficiently Representing Regular Languages as <fixed-case>RNN</fixed-case>s AnejSveteDepartment of Computer Science, ETHZ - ETH Zurich - RobinChan + RobinChan RyanCotterellSwiss Federal Institute of Technology 4118-4135 Recent work by Hewitt et al. 
(2020) provides an interpretation of the empirical success of recurrent neural networks (RNNs) as language models (LMs). It shows that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language.This suggests that RNNs’ success might be linked to their ability to model hierarchy. However, a closer inspection of hewitt-etal-2020-rnns construction shows that it is not inherently limited to hierarchical structures. This poses a natural question: What other classes of LMs RNNs can efficiently represent? To this end, we generalize Hewitt et al.’s (2020) construction and show that RNNs can efficiently represent a larger class of LMs than previously claimed—specifically, those that can be represented by a pushdown automaton with a bounded stack and a specific stack update function. Altogether, the efficiency of representing this diverse class of LMs with RNN LMs suggests novel interpretations of their inductive bias. @@ -9469,7 +9469,7 @@ InesReinig MariaBeckerRuprecht-Karls-Universität Heidelberg InesRehbeinUniversität Mannheim - SimonePonzettoUniversity of Mannheim + SimonePonzettoUniversity of Mannheim 4136-4155 In this survey, we provide a systematic review of recent work on modelling morality in text, an area of research that has garnered increasing attention in recent years. Our survey is motivated by the importance of modelling decisions on the created resources, the models trained on these resources and the analyses that result from the models’ predictions. We review work at the interface of NLP, Computational Social Science and Psychology and give an overview of the different goals and research questions addressed in the papers, their underlying theoretical backgrounds and the methods that have been applied to pursue these goals. 
We then identify and discuss challenges and research gaps, such as the lack of a theoretical framework underlying the operationalisation of morality in text, the low IAA reported for manyhuman-annotated resulting resources and the lack of validation of newly proposed resources and analyses. 2024.findings-acl.245 @@ -9499,12 +9499,12 @@ YiningYe YujiaQin XinCong - YankaiLinRenmin University of China + YankaiLinRenmin University of China YinxuPan YesaiWu HuiHaotian LiuWeichuanSiemens Corporate Research - ZhiyuanLiuTsinghua University + ZhiyuanLiuTsinghua University MaosongSun 4173-4198 Large Language Models (LLMs) have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs’ debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce ‘DebugBench’, an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging. 
@@ -9515,9 +9515,9 @@ <fixed-case>POP</fixed-case>-<fixed-case>CEE</fixed-case>: Position-oriented Prompt-tuning Model for Causal Emotion Entailment ZhihanZhouJilin University - XueGuUniversidade do Minho - YujieZhao - HaoXuJilin University + XueGuUniversidade do Minho + YujieZhao + HaoXuJilin University 4199-4210 The objective of the Causal Emotion Entailment (CEE) task is to identify the causes of the target emotional utterances in a given conversation. Most existing studies have focused on a fine-tuning paradigm based on a pretrained model, e.g., the BERT model. However, there are gaps between the pretrained task and the CEE task. Although a pretrained model enhances contextual comprehension to some extent, it cannot acquire specific knowledge that is relevant to the CEE task. In addition, in a typical CEE task, there are peculiarities in the distribution of the positions with different emotion types of emotion utterances and cause utterances in conversations. Existing methods employ a fixed-size window to capture the relationship between neighboring conversations; however, these methods ignore the specific semantic associations between emotions and cause utterances. To address these issues, we propose the Position-oriented Prompt-tuning (POP-CEE) model to solve the CEE task in an end-to-end manner. Specifically, we can model the CEE task by designing prompts with multiple unified goals and by exploring the positional relationship between emotion and cause utterances using a position constraint module. Experimental results demonstrate that the proposed POP-CEE model achieves state-of-the-art performance on a benchmark dataset. Ourcode and data can be found at: https://github.com/Zh0uzh/POP-CEE. 
2024.findings-acl.248 @@ -9526,8 +9526,8 @@ Context Length Extension via Generalized Extrapolation Scale - LinhanLi - ZhangHuapingBeijing Institute of Technology + LinhanLi + ZhangHuapingBeijing Institute of Technology 4211-4218 2024.findings-acl.249 li-huaping-2024-context @@ -9537,8 +9537,8 @@ Selectively Answering Visual Questions JulianEisenschlosGoogle DeepMind HernánMainaUniversidad Nacional de Córdoba, Argentina - GuidoIvettaUniversidad Nacional de Córdoba - LucianaBenottiUniversidad nacional de Córdoba + GuidoIvettaUniversidad Nacional de Córdoba + LucianaBenottiUniversidad nacional de Córdoba 4219-4229 Recently, large multi-modal models (LMMs) have emerged with the capacity to perform vision tasks such as captioning and visual question answering (VQA) with unprecedented accuracy. Applications such as helping the blind or visually impaired have a critical need for precise answers. It is specially important for models to be well calibrated and be able to quantify their uncertainty in order to selectively decide when to answer and when to abstain or ask for clarifications. We perform the first in-depth analysis of calibration methods and metrics for VQA with in-context learning LMMs. Studying VQA on two answerability benchmarks, we show that the likelihood score of visually grounded models is better calibrated than in their text-only counterparts for in-context learning, where sampling based methods are generally superior, but no clear winner arises. We propose Avg BLEU, a calibration score combining the benefits of both sampling and likelihood methods across modalities. 
2024.findings-acl.250 @@ -9552,8 +9552,8 @@ JinzhengHeZhejiang University GangSun RanShen - XizeCheng - ZhouZhaoZhejiang University and Zhejiang University + XizeCheng + ZhouZhaoZhejiang University and Zhejiang University 4230-4242 We release a multi-accent dataset and propose speech-programming and gradient reversal classifier to improve the generalization.Abstract: Speech-to-SQL (S2SQL) aims to convert spoken questions into SQL queries given relational databases, which has been traditionally implemented in a cascaded manner while facing the following challenges: 1) model training is faced with the major issue of data scarcity, where limited parallel data is available; and 2) the systems should be robust enough to handle diverse out-of-domain speech samples that differ from the source data. In this work, we propose the direct generalizable speech-to-SQL parsing model Wav2SQL which avoids error compounding across cascaded systems. Specifically, 1) to accelerate speech-driven SQL parsing research in the community, we release a large-scale and multi-accent dataset MASpider; 2) leveraging the recent progress in the large-scale pre-training, we show that it alleviates the data scarcity issue and allow for direct speech-to-SQL parsing; and 3) we include the speech re-programming and gradient reversal classifier techniques to reduce acoustic variance and learned style-agnostic representation, improving generalization to unseen out-of-domain custom data. Experimental results demonstrate that Wav2SQL avoids error compounding and achieves state-of-the-art results by up to 4.7% accuracy improvement over the baseline. 
2024.findings-acl.251 @@ -9564,7 +9564,7 @@ E2-<fixed-case>LLM</fixed-case>: Efficient and Extreme Length Extension of Large Language Models JiahengLiu ZhiqiBaiZhiqiBai - YuanxingZhang + YuanxingZhang ChenchenZhangBeijing University of Posts and Telecommunications YuangZhYuangZh GeZhang @@ -9572,10 +9572,10 @@ HaoranQue YukangChen WenboSu - TiezhengGeAlibaba Group - JieFuHong Kong University of Science and Technology + TiezhengGeAlibaba Group + JieFuHong Kong University of Science and Technology WenhuChenUniversity of Waterloo and Google - BoZhengAlibaba Group + BoZhengAlibaba Group 4243-4253 Training Large Language Models (LLMs) to process extensive context lengths incurs prohibitive computational costs. Prevailing techniques for extending context capabilities in LLMs typically require not only additional training procedures but also access to datasets with long context (e.g., sequences of 32K tokens), presupposing substantial GPU expenditures. To address the aforementioned issues, we introduce a novel solution named Efficient and Extreme length extension for Large Language Models (E2-LLM). E2-LLM entails a singular training process over considerably short sequences (e.g., 4K tokens), which greatly mitigates the cost of continual-pretraining or fine-tuning. Within the training phase, we incorporate a dual augmentation strategy with Rotary Position Embeddings (RoPE) that adjusts the scale and position indices across distinct training samples. E 2 -LLM is meticulously designed to enhance the model’s robustness to diverse relative positions. The experimental results on multiple benchmark datasets demonstrate the superior performance of E 2 -LLM on demanding tasks of processing long contexts. 2024.findings-acl.252 @@ -9586,7 +9586,7 @@ Are Female Carpenters like Blue Bananas? A Corpus Investigation of Occupation Gender Typicality DaJuFacebook KarenUllrichMeta AI - AdinaWilliamsFAIR (Meta Platforms Inc.) + AdinaWilliamsFAIR (Meta Platforms Inc.) 
4254-4274 People tend to use language to mention surprising properties of events: for example, when a banana is blue, we are more likely to mention color than when it is yellow. This fact is taken to suggest that yellowness is somehow a typical feature of bananas, and blueness is exceptional. Similar to how a yellow color is typical of bananas, there may also be genders that are typical of occupations. In this work, we explore this question using information theoretic techniques coupled with corpus statistic analysis. In two distinct large corpora, we do not find strong evidence that occupations and gender display the same patterns of mentioning as do bananas and color. Instead, we find that gender mentioning is correlated with femaleness of occupation in particular, suggesting perhaps that woman-dominated occupations are seen as somehow “more gendered” than male-dominated ones, and thereby they encourage more gender mentioning overall. 2024.findings-acl.253 @@ -9598,13 +9598,13 @@ SitaoCheng ZiyuanZhuang YongXu - FangkaiYangMicrosoft + FangkaiYangMicrosoft ChaoyunZhang - XiaotingQinMicrosoft + XiaotingQinMicrosoft XiangHuang LingChen - QingweiLinMicrosoft Research - DongmeiZhangMicrosoft and Microsoft + QingweiLinMicrosoft Research + DongmeiZhangMicrosoft and Microsoft SaravanRajmohanMicrosoft QiZhang 4275-4295 @@ -9615,12 +9615,12 @@ Legal Judgment Reimagined: <fixed-case>P</fixed-case>red<fixed-case>E</fixed-case>x and the Rise of Intelligent <fixed-case>AI</fixed-case> Interpretation in <fixed-case>I</fixed-case>ndian Courts - Shubham KumarNigamIIT Kanpur + Shubham KumarNigamIIT Kanpur AnuragSharmaIISER Kolkata DanushKhanna NoelShallumSymbiosis Law School Pune KripabandhuGhoshIndian Institute of Science Education and Research Kolkata - ArnabBhattacharyaIIT Kanpur + ArnabBhattacharyaIIT Kanpur 4296-4315 In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity 
of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment prediction and explanation in the Indian context, featuring over 15,000 annotations. This groundbreaking corpus significantly enhances the training and evaluation of AI models in legal analysis, with innovations including the application of instruction tuning to LLMs. This method has markedly improved the predictive accuracy and explanatory depth of these models for legal judgments. We employed various transformer-based models, tailored for both general and Indian legal contexts. Through rigorous lexical, semantic, and expert assessments, our models effectively leverage PredEx to provide precise predictions and meaningful explanations, establishing it as a valuable benchmark for both the legal profession and the NLP community. 2024.findings-acl.255 @@ -9643,7 +9643,7 @@ Multi-Objective Linguistic Control of Large Language Models DangNguyenUniversity of Maryland, College Park JiuhaiChen - TianyiZhouUniversity of Maryland, College Park + TianyiZhouUniversity of Maryland, College Park 4336-4347 Large language models (LLMs), despite their breakthroughs on many challenging benchmark tasks, prefer to generate verbose responses and lack the controllability of output complexity, which is usually preferred by human users in practice. In this paper, we study how to precisely control multiple linguistic complexities of LLM output by finetuning using off-the-shelf data. To this end, we propose multi-control tuning (MCTune), which includes multiple linguistic complexity values of ground-truth responses as controls in the input for instruction tuning. We finetune LLaMA2-7B on Alpaca-GPT4 and WizardLM datasets. 
Evaluations on widely used benchmarks demonstrate that our method does not only improve LLMs’ multi-complexity controllability substantially but also retains or even enhances the quality of the responses as a side benefit. 2024.findings-acl.257 @@ -9653,7 +9653,7 @@ Evaluating the Smooth Control of Attribute Intensity in Text Generation with <fixed-case>LLM</fixed-case>s ShangZhou - FengYao + FengYao ChengyuDongUniversity of California, San Diego ZihanWang JingboShangUniversity of California, San Diego @@ -9670,14 +9670,14 @@ JianqiaoLu QiZhu JiahuiGao - WeiwenLiuHuawei Technologies Ltd. - YutaiHou + WeiwenLiuHuawei Technologies Ltd. + YutaiHou XingshanZengHuawei Technologies Ltd. YashengWang LifengShangHuawei Technologies Ltd. - XinJiang + XinJiang RuifengXuHarbin Institute of Technology - QunLiuHuawei Noah’s Ark Lab + QunLiuHuawei Noah’s Ark Lab 4363-4400 The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs’ ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset. 
Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool. 2024.findings-acl.259 @@ -9688,7 +9688,7 @@ Do Androids Know They’re Only Dreaming of Electric Sheep? SkyCH-WangColumbia University BenjaminVan DurmeJohns Hopkins University, Johns Hopkins University, Johns Hopkins University and Microsoft - JasonEisnerMicrosoft and Johns Hopkins University + JasonEisnerMicrosoft and Johns Hopkins University ChrisKedzieRasa Technologies, Inc. 4401-4420 We design probes trained on the internal representations of a transformer language model to predict its hallucinatory behavior on three grounded generation tasks. To train the probes, we annotate for span-level hallucination on both sampled (organic) and manually edited (synthetic) reference outputs. Our probes are narrowly trained and we find that they are sensitive to their training domain: they generalize poorly from one task to another or from synthetic to organic hallucinations. However, on in-domain data, they can reliably detect hallucinations at many transformer layers, achieving 95% of their peak performance as early as layer 4. Here, probing proves accurate for evaluating hallucination, outperforming several contemporary baselines and even surpassing an expert human annotator in response-level detection F1. Similarly, on span-level labeling, probes are on par or better than the expert annotator on two out of three generation tasks. Overall, we find that probing is a feasible and efficient alternative to language model hallucination evaluation when model states are available. 
@@ -9699,9 +9699,9 @@ <fixed-case>URG</fixed-case>: A Unified Ranking and Generation Method for Ensembling Language Models BoLv - ChenTang - YananZhang - XinLiu + ChenTang + YananZhang + XinLiu PingLuoInstitute of Computing Technology, Chinese Academy of Sciences YueYuNational University of Defense Technology and PengCheng Lab 4421-4434 @@ -9728,7 +9728,7 @@ <fixed-case>L</fixed-case>ora<fixed-case>R</fixed-case>etriever: Input-Aware <fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case> Retrieval and Composition for Mixed Tasks in the Wild - ZiyuZhao + ZiyuZhao LeileiGanZhejiang University GuoyinWangBytedance WangchunshuZhouAIWaves Inc. @@ -9743,11 +9743,11 @@ <fixed-case>ELAD</fixed-case>: Explanation-Guided Large Language Models Active Distillation - YifeiZhangEmory University - BoPan - ChenLing - YuntongHuEmory University - LiangZhaoEmory University + YifeiZhangEmory University + BoPan + ChenLing + YuntongHuEmory University + LiangZhaoEmory University 4463-4475 The deployment and application of Large Language Models (LLMs) is hindered by their memory inefficiency, computational demands, and the high costs of API inferences. Traditional distillation methods, which transfer the capabilities of LLMs to smaller models, often fail to determine whether the knowledge has been sufficiently transferred, potentially resulting in high costs or incomplete distillation. In this paper, we propose an Explanation-Guided LLMs Active Distillation (ELAD) framework that employs an active learning strategy to optimize the balance between annotation costs and model performance. To improve the efficiency of sample selection, we introduce an explanation-guided sample selection method that identifies samples challenging its reasoning by exploiting uncertainties in reasoning explanation steps. Additionally, we present a customized LLM-annotated explanation revision technique where the teacher model detects and corrects flaws in the student model’s reasoning. 
Our experiments across various reasoning datasets demonstrate that our framework significantly enhances the efficiency of LLMs knowledge distillation. 2024.findings-acl.264 @@ -9756,8 +9756,8 @@ Evaluating the Elementary Multilingual Capabilities of Large Language Models with <fixed-case>M</fixed-case>ulti<fixed-case>Q</fixed-case> - CarolinHoltermannUniversität Hamburg - PaulRöttgerBocconi University + CarolinHoltermannUniversität Hamburg + PaulRöttgerBocconi University TimmDillUniversität Hamburg AnneLauscherUniversität Hamburg 4476-4494 @@ -9770,7 +9770,7 @@ Semantics or spelling? Probing contextual word embeddings with orthographic noise Jacob A.Matthews John R.Starr - Martenvan Schijndel + Martenvan Schijndel 4495-4504 Pretrained language model (PLM) hidden states are frequently employed as contextual word embeddings (CWE): high-dimensional representations that encode semantic information given linguistic context. Across many areas of computational linguistics research, similarity between CWEs is interpreted as semantic similarity. However, it remains unclear exactly what information is encoded in PLM hidden states. We investigate this practice by probing PLM representations using minimal orthographic noise. We expect that if CWEs primarily encode semantic information, a single character swap in the input word will not drastically affect the resulting representation, given sufficient linguistic context. Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data, and that this sensitivity is related to subword tokenization: the fewer tokens used to represent a word at input, the more sensitive its corresponding CWE. This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data. We conclude that these PLM-derived CWEs may not be reliable semantic proxies, and that caution is warranted when interpreting representational similarity. 
2024.findings-acl.266 @@ -9784,10 +9784,10 @@ PengfeiHeMichigan State University YidingLiuBaidu YueXingMichigan State University - HanXuUniversity of Arizona + HanXuUniversity of Arizona JieRenBaidu and Michigan State University - YiChangJilin University, China - ShuaiqiangWang + YiChangJilin University, China + ShuaiqiangWang DaweiYinBaidu JiliangTangMichigan State University 4505-4524 @@ -9798,13 +9798,13 @@ <fixed-case>E</fixed-case>mpathic<fixed-case>S</fixed-case>tories++: A Multimodal Dataset for Empathy Towards Personal Experiences - JocelynShenMassachusetts Institute of Technology - YubinKimMassachusetts Institute of Technology + JocelynShenMassachusetts Institute of Technology + YubinKimMassachusetts Institute of Technology MohitHulse WazeerZulfikar SharifaAlghowinem - CynthiaBreazeal - HaeParkAmazon and Massachusetts Institute of Technology + CynthiaBreazeal + HaeParkAmazon and Massachusetts Institute of Technology 4525-4536 Modeling empathy is a complex endeavor that is rooted in interpersonal and experiential dimensions of human interaction, and remains an open problem within AI. Existing empathy datasets fall short in capturing the richness of empathy responses, often being confined to in-lab or acted scenarios, lacking longitudinal data, and missing self-reported labels. We introduce a new multimodal dataset for empathy during personal experience sharing: the EmpathicStories++ dataset containing 53 hours of video, audio, and text data of 41 participants sharing vulnerable experiences and reading empathically resonant stories with an AI agent. EmpathicStories++ is the first longitudinal dataset on empathy, collected over a month-long deployment of social robots in participants’ homes, as participants engage in natural, empathic storytelling interactions with AI agents. 
We then introduce a novel task of predicting individuals’ empathy toward others’ stories based on their personal experiences, evaluated in two contexts: participants’ own personal shared story context and their reflections on stories they read. We benchmark this task using state-of-the-art models to pave the way for future improvements in contextualized and longitudinal empathy modeling. Our work provides a valuable resource for further research in developing empathetic AI systems and understanding the intricacies of human empathy within genuine, real-world settings. 2024.findings-acl.268 @@ -9825,9 +9825,9 @@ <fixed-case>S</fixed-case>yntax<fixed-case>S</fixed-case>hap: Syntax-aware Explainability Method for Text Generation - KenzaAmara + KenzaAmara RitaSevastjanovaETHZ - ETH Zurich - MennatallahEl-AssadyDepartment of Computer Science, ETHZ - ETH Zurich + MennatallahEl-AssadyDepartment of Computer Science, ETHZ - ETH Zurich 4551-4566 To harness the power of large language models in safety-critical domains, we need to ensure the explainability of their predictions. However, despite the significant attention to model interpretability, there remains an unexplored domain in explaining sequence-to-sequence tasks using methods tailored for textual data. This paper introduces *SyntaxShap*, a local, model-agnostic explainability method for text generation that takes into consideration the syntax in the text data. The presented work extends Shapley values to account for parsing-based syntactic dependencies. Taking a game theoric approach, SyntaxShap only considers coalitions constraint by the dependency tree. We adopt a model-based evaluation to compare SyntaxShap and its weighted form to state-of-the-art explainability methods adapted to text generation tasks, using diverse metrics including faithfulness, coherency, and semantic alignment of the explanations to the model. 
We show that our syntax-aware method produces explanations that help build more faithful and coherent explanations for predictions by autoregressive models. Confronted with the misalignment of human and AI model reasoning, this paper also highlights the need for cautious evaluation strategies in explainable AI. 2024.findings-acl.270 @@ -9837,11 +9837,11 @@ Automated Detection and Analysis of Data Practices Using A Real-World Corpus MukundSrinath - PranavNarayanan Venkit + PranavNarayanan Venkit MariaBadillo FlorianSchaubUniversity of Michigan - Ann Arbor C.GilesPennsylvania State University - ShomirWilsonPennsylvania State University + ShomirWilsonPennsylvania State University 4567-4574 Privacy policies are crucial for informing users about data practices, yet their length and complexity often deter users from reading them. In this paper, we propose an automated approach to identify and visualize data practices within privacy policies at different levels of detail. Leveraging crowd-sourced annotations from the ToS;DR platform, we experiment with various methods to match policy excerpts with predefined data practice descriptions. We further conduct a case study to evaluate our approach on a real-world policy, demonstrating its effectiveness in simplifying complex policies. Experiments show that our approach accurately matches data practice descriptions with policy excerpts, facilitating the presentation of simplified privacy information to users. 
2024.findings-acl.271 @@ -9852,7 +9852,7 @@ Enhancing Hyperbolic Knowledge Graph Embeddings via Lorentz Transformations XiranFanVISA MinghuaXu - HuiyuanChenVISA + HuiyuanChenVISA YuzhongChen MahashwetaDas HaoYangVisa Research @@ -9876,7 +9876,7 @@ Probing the Uniquely Identifiable Linguistic Patterns of Conversational <fixed-case>AI</fixed-case> Agents IqraZahid - TharinduMadusanka + TharinduMadusanka RizaBatista-NavarroUniversity of Manchester YouchengSunThe University of Manchester 4612-4628 @@ -9898,7 +9898,7 @@ <fixed-case>X</fixed-case>-Shot: A Unified System to Handle Frequent, Few-shot and Zero-shot Learning Simultaneously in Classification HanziXu - MuhaoChenUniversity of California, Davis and University of Southern California + MuhaoChenUniversity of California, Davis and University of Southern California LifuHuangVirginia Tech SlobodanVuceticTemple University and Temple University WenpengYinPennsylvania State University @@ -9911,10 +9911,10 @@ <fixed-case>SPIN</fixed-case>: Sparsifying and Integrating Internal Neurons in Large Language Models for Text Classification DifanJiao - YilunLiuTechnische Universität München - ZhenweiTangUniversity of Toronto - DanielMatterTechnische Universität München - JürgenPfefferTechnische Universität München + YilunLiuTechnische Universität München + ZhenweiTangUniversity of Toronto + DanielMatterTechnische Universität München + JürgenPfefferTechnische Universität München AshtonAndersonDepartment of Computer Science, University of Toronto 4666-4682 Among the many tasks that Large Language Models (LLMs) have revolutionized is text classification. Current text classification paradigms, however, rely solely on the output of the final layer in the LLM, with the rich information contained in internal neurons largely untapped. In this study, we present SPIN: a model-agnostic framework that sparsifies and integrates internal neurons of intermediate layers of LLMs for text classification. 
Specifically, SPIN sparsifies internal neurons by linear probing-based salient neuron selection layer by layer, avoiding noise from unrelated neurons and ensuring efficiency. The cross-layer salient neurons are then integrated to serve as multi-layered features for the classification head. Extensive experimental results show our proposed SPIN significantly improves text classification accuracy, efficiency, and interpretability. @@ -9925,8 +9925,8 @@ Decomposing Co-occurrence Matrices into Interpretable Components as Formal Concepts AkihiroMaedaJapan Advanced Institute of Science and Technology - TakumaToriiTokyo Denki University, Tokyo Institute of Technology - ShoheiHidakaJapan Advanced Institute of Science and Technology, Tokyo Institute of Technology + TakumaToriiTokyo Denki University, Tokyo Institute of Technology + ShoheiHidakaJapan Advanced Institute of Science and Technology, Tokyo Institute of Technology 4683-4700 This study addresses the interpretability of word representations through an investigation of a count-based co-occurrence matrix. Employing the mathematical methodology of Formal Concept Analysis, we reveal an underlying structure that is amenable to human interpretation. Furthermore, we unveil the emergence of hierarchical and geometrical structures within word vectors as consequences of word usage. Our experiments on the PPMI matrix demonstrate that the formal concepts that we identified align with interpretable categories, as shown in the category completion task. 
2024.findings-acl.278 @@ -9959,7 +9959,7 @@ YanmingLiu XinyuePeng XuhongZhangZhejiang University - WeihaoLiu + WeihaoLiu JianweiYinZhejiang University JiannanCao TianyuDuZhejiang University @@ -9973,7 +9973,7 @@ <fixed-case>M</fixed-case>r<fixed-case>R</fixed-case>ank: Improving Question Answering Retrieval System through Multi-Result Ranking Model DanupatKhamnuansinChulalongkorn University and KASIKORN Business-Technology Group TawunratChalothornKASIKORN Business-Technology Group - EkapolChuangsuwanichChulalongkorn University + EkapolChuangsuwanichChulalongkorn University 4750-4762 Large Language Models (LLMs) often struggle with hallucinations and outdated information. To address this, Information Retrieval (IR) systems can be employed to augment LLMs with up-to-date knowledge. However, existing IR techniques contain deficiencies, posing a performance bottleneck. Given the extensive array of IR systems, combining diverse approaches presents a viable strategy. Nevertheless, prior attempts have yielded restricted efficacy. In this work, we propose an approach that leverages learning-to-rank techniques to combine heterogeneous IR systems. We demonstrate the method on two Retrieval Question Answering (ReQA) tasks. Our empirical findings exhibit a significant performance enhancement, outperforming previous approaches and achieving state-of-the-art results on ReQA SQuAD. 
2024.findings-acl.282 @@ -9984,7 +9984,7 @@ Chain-of-Question: A Progressive Question Decomposition Approach for Complex Knowledge Base Question Answering PengYixingUniversity of Science and Technology of China QuanWangBeijing University of Posts and Telecommunications - LichengZhang + LichengZhang YiLiuState Key Laboratory of Communication Content Cognition ZhendongMaoUniversity of Science and Technology of China 4763-4776 @@ -10032,7 +10032,7 @@ Locating and Extracting Relational Concepts in Large Language Models ZijianWang BritneyWhyteUniversity of New South Wales - ChangXuUniversity of Sydney + ChangXuUniversity of Sydney 4818-4832 Relational concepts are indeed foundational to the structure of knowledge representation, as they facilitate the association between various entity concepts, allowing us to express and comprehend complex world knowledge.By expressing relational concepts in natural language prompts, people can effortlessly interact with large language models (LLMs) and recall desired factual knowledge. However, the process of knowledge recall lacks interpretability, and representations of relational concepts within LLMs remain unknown to us. In this paper, we identify hidden states that can express entity and relational concepts through causal mediation analysis in fact recall processes. Our finding reveals that at the last token position of the input prompt, there are hidden states that solely express the causal effects of relational concepts. Based on this finding, we assume that these hidden states can be treated as relational representations and we can successfully extract them from LLMs. The experimental results demonstrate high credibility of the relational representations: they can be flexibly transplanted into other fact recall processes, and can also be used as robust entity connectors. Moreover, we also show that the relational representations exhibit significant potential for controllable fact recall through relation rewriting. 
2024.findings-acl.287 @@ -10055,8 +10055,8 @@ <fixed-case>S</fixed-case>entic<fixed-case>V</fixed-case>ec: Toward Robust and Human-Centric Neurosymbolic Sentiment Analysis XulangZhang - RuiMao - ErikCambriaNanyang Technological University + RuiMao + ErikCambriaNanyang Technological University 4851-4863 The success of state-of-the-art Natural Language Processing (NLP) systems heavily depends on deep neural networks, which excel in various tasks through strong data fitting and latent feature modeling abilities. However, certain challenges linked to deep neural networks and supervised deep learning deserve considerations, e.g., extensive computing resources, knowledge forgetting, etc. Previous research attempted to tackle these challenges individually through irrelative techniques. However, they do not instigate fundamental shifts in the learning paradigm. In this work, we propose a novel neurosymbolic method for sentiment analysis to tackle these issues. We also propose a novel sentiment-pragmatic knowledge base that places emphasis on human subjectivity within varying domain annotations. We conducted extensive experiments to show that our neurosymbolic framework for sentiment analysis stands out for its lightweight nature, robustness across domains and languages, efficient few-shot training, and rapid convergence. 2024.findings-acl.289 @@ -10068,10 +10068,10 @@ ChenQian JieZhang WeiYao - DongruiLiuShanghai Artificial Intelligence Laboratory - ZhenfeiYinUniversity of Sydney and Shanghai AI Laboratory - YuQiao - YongLiuRenmin University of China and Institute of information engineering, CAS + DongruiLiuShanghai Artificial Intelligence Laboratory + ZhenfeiYinUniversity of Sydney and Shanghai AI Laboratory + YuQiao + YongLiuRenmin University of China and Institute of information engineering, CAS JingShaoShanghai AI Laboratory 4864-4888 Ensuring the trustworthiness of large language models (LLMs) is crucial. 
Most studies concentrate on fully pre-trained LLMs to better understand and improve LLMs’ trustworthiness. In this paper, to reveal the untapped potential of pre-training, we pioneer the exploration of LLMs’ trustworthiness during this period, focusing on five key dimensions: reliability, privacy, toxicity, fairness, and robustness. To begin with, we apply linear probing to LLMs. The high probing accuracy suggests that LLMs in early pre-training can already distinguish concepts in each trustworthiness dimension. Therefore, to further uncover the hidden possibilities of pre-training, we extract steering vectors from a LLM’s pre-training checkpoints to enhance the LLM’s trustworthiness. Finally, inspired by the theoretical result that mutual information estimation is bounded by linear probing accuracy, we also probe LLMs with mutual information to investigate the dynamics of trustworthiness during pre-training. We are the first to observe a similar two-phase phenomenon: fitting and compression. This research provides an initial exploration of trustworthiness modeling during LLM pre-training, seeking to unveil new insights and spur further developments in the field. @@ -10082,9 +10082,9 @@ Language Models can Evaluate Themselves via Probability Discrepancy TingyuXia - BowenYuAlibaba Group + BowenYuAlibaba Group YuanWuJilin University - YiChangJilin University, China + YiChangJilin University, China ChangZhou 4889-4901 In this paper, we begin by illustrating that, when presented with a query, Large Language Models (LLMs) capable of providing accurate responses tend to exhibit a more uniform probability distribution compared to their less proficient counterparts. Building upon this observation, we introduce a novel self-assessment criterion termed ProbDiff for evaluating the performance of diverse LLMs. This method eliminates the need for training an additional evaluation model or relying on external proprietary models such as GPT-4 as a judger. 
Instead, it solely relies on the LLMs under evaluation to compute the probability discrepancy between the original response generation and its revised versions. A higher discrepancy in two LLMs for the same query suggests a relatively weaker ability. We discover that ProbDiff yields comparable results to mainstream GPT-4-based evaluations on various scenarios including NLG tasks like translation and summarization, as well as LLM evaluation benchmarks such as AlignBench, MT-Bench, and AlpacaEval, across LLMs of different sizes. @@ -10094,12 +10094,12 @@ Evaluating the Validity of Word-level Adversarial Attacks with Large Language Models - HuichiZhou + HuichiZhou ZhaoyangWangMicrosoft HongtaoWangNorth China Electric Power University DongpingChen WenhanMu - FangyuanZhang + FangyuanZhang 4902-4922 Deep neural networks exhibit vulnerability to word-level adversarial attacks in natural language processing. Most of these attack methods adopt synonymous substitutions to perturb original samples for crafting adversarial examples while attempting to maintain semantic consistency with the originals. Some of them claim that they could achieve over 90% attack success rate, thereby raising serious safety concerns. However, our investigation reveals that many purportedly successful adversarial examples are actually invalid due to significant changes in semantic meanings compared to their originals. Even when equipped with semantic constraints such as BERTScore, existing attack methods can generate up to 87.9% invalid adversarial examples. Building on this insight, we first curate a 13K dataset for adversarial validity evaluation with the help of GPT-4. Then, an open-source large language model is fine-tuned to offer an interpretable validity score for assessing the semantic consistency between original and adversarial examples. Finally, this validity score can serve as a guide for existing adversarial attack methods to generate valid adversarial examples. 
Comprehensive experiments demonstrate the effectiveness of our method in evaluating and refining the quality of adversarial examples. 2024.findings-acl.292 @@ -10113,11 +10113,11 @@ ZhiZhongSony Group Corporation Chieh-HsinLaiSony AI YuhtaTakidaSony AI - NaokiMurataSony AI and Sony Group Corporation + NaokiMurataSony AI and Sony Group Corporation Wei-HsiangLiaoSony Corporation - TakashiShibuyaSony AI + TakashiShibuyaSony AI HiromiWakakiSony Group Corporation - YukiMitsufujiSony AI, Sony Group Corporation, Tokyo Institute of Technology, Tokyo Institute of Technology and Sony Group Corporation + YukiMitsufujiSony AI, Sony Group Corporation, Tokyo Institute of Technology, Tokyo Institute of Technology and Sony Group Corporation 4923-4940 Contrastive cross-modal models such as CLIP and CLAP aid various vision-language (VL) and audio-language (AL) tasks. However, there has been limited investigation of and improvement in their language encoder – the central component of encoding natural language descriptions of image/audio into vector representations. We extensively evaluate how unsupervised and supervised sentence embedding training affect language encoder quality and cross-modal task performance. In VL pretraining, we found that sentence embedding training enhances language encoder quality and aids in cross-modal tasks, improving contrastive VL models such as CyCLIP. Sentence embedding training benefits AL tasks when the amount of training data is large. We analyze the representation spaces to understand the strengths of sentence embedding training, and find that it improves text-space uniformity, at the cost of decreased cross-modal alignment. 
2024.findings-acl.293 @@ -10140,10 +10140,10 @@ Anchor-based Large Language Models JianhuiPang FanghuaYe - DerekWongUniversity of Macau + DerekWongUniversity of Macau XinHe WanshunChen - LongyueWang + LongyueWang 4958-4976 Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading to an urgent need for more efficient methods of information storage and processing. This study introduces Anchor-based LLMs (AnLLMs), which utilize an innovative anchor-based self-attention network (AnSAN) and also an anchor-based inference strategy. This approach enables LLMs to compress sequence information into an anchor token, reducing the keys/values cache and enhancing inference efficiency. Experiments on question-answering benchmarks reveal that AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference. Despite a minor compromise in accuracy, the substantial enhancements of AnLLMs employing the AnSAN technique in resource utilization and computational efficiency underscore their potential for practical LLM applications. 
2024.findings-acl.295 @@ -10152,15 +10152,15 @@ <fixed-case>ML</fixed-case>e<fixed-case>VLM</fixed-case>: Improve Multi-level Progressive Capabilities based on Multimodal Large Language Model for Medical Visual Question Answering - DexuanXu - YanyuanChen - JieyiWang + DexuanXu + YanyuanChen + JieyiWang YueHuang HanpinWangPeking University - ZhiJinPeking University and Peking University - HongxingWangCapital Medical University + ZhiJinPeking University and Peking University + HongxingWangCapital Medical University WeihuaYue - JingHe + JingHe HangLiPeking University First Hospital YuHuangPeking University 4977-4997 @@ -10185,8 +10185,8 @@ <fixed-case>MIKE</fixed-case>: A New Benchmark for Fine-grained Multimodal Entity Knowledge Editing JiaqiLiSoutheast University MiaozengDu - ChuanyiZhangHohai University - YongruiChen + ChuanyiZhangHohai University + YongruiChen NanHuSoutheast University GuilinQi HaiyunJiangSUN YAT-SEN UNIVERSITY @@ -10215,8 +10215,8 @@ <fixed-case>M</fixed-case>eme<fixed-case>MQA</fixed-case>: Multimodal Question Answering for Memes via Rationale-Based Inferencing SiddhantAgarwal ShivamSharmaIndian Institute of Technology, Delhi - PreslavNakovMohamed bin Zayed University of Artificial Intelligence - TanmoyChakrabortyIndian Institute of Technology, Delhi + PreslavNakovMohamed bin Zayed University of Artificial Intelligence + TanmoyChakrabortyIndian Institute of Technology, Delhi 5042-5078 Memes have evolved as a prevalent medium for diverse communication, ranging from humour to propaganda. With the rising popularity of image-focused content, there is a growing need to explore its potential harm from different aspects. Previous studies have analyzed memes in closed settings - detecting harm, applying semantic labels, and offering natural language explanations. To extend this research, we introduce MemeMQA, a multimodal question-answering framework aiming to solicit accurate responses to structured questions while providing coherent explanations. 
We curate MemeMQACorpus, a new dataset featuring 1,880 questions related to 1,122 memes with corresponding answer-explanation pairs. We further propose ARSENAL, a novel two-stage multimodal framework that leverages the reasoning capabilities of LLMs to address MemeMQA. We benchmark MemeMQA using competitive baselines and demonstrate its superiority - ~18% enhanced answer prediction accuracy and distinct text generation lead across various metrics measuring lexical and semantic alignment over the best baseline. We analyze ARSENAL’s robustness through diversification of question-set, confounder-based evaluation regarding MemeMQA’s generalizability, and modality-specific assessment, enhancing our understanding of meme interpretation in the multimodal communication landscape. 2024.findings-acl.300 @@ -10227,7 +10227,7 @@ Improving Attributed Text Generation of Large Language Models via Preference Learning DongfangLiHarbin Institute of Technology ZetianSun - BaotianHuHarbin Institute of Technology, Shenzhen + BaotianHuHarbin Institute of Technology, Shenzhen ZhenyuLiu XinshuoHu XueboLiuHarbin Institute of Technolgy, Shenzhen @@ -10243,7 +10243,7 @@ SungHoKimKorea University JuhyeongParkKorea University YeachanKimKorea University - SangKeunLeeKorea University + SangKeunLeeKorea University 5102-5119 The Korean writing system, Hangeul, has a unique character representation rigidly following the invention principles recorded in Hunminjeongeum. However, existing pre-trained language models (PLMs) for Korean have overlooked these principles. In this paper, we introduce a novel framework for Korean PLMs called KOMBO, which firstly brings the invention principles of Hangeul to represent character. Our proposed method, KOMBO, exhibits notable experimental proficiency across diverse NLP tasks. In particular, our method outperforms the state-of-the-art Korean PLM by an average of 2.11% in five Korean natural language understanding tasks. 
Furthermore, extensive experiments demonstrate that our proposed method is suitable for comprehending the linguistic features of the Korean language. Consequently, we shed light on the superiority of using subcharacters over the typical subword-based approach for Korean PLMs. Our code is available at: https://github.com/SungHo3268/KOMBO. 2024.findings-acl.302 @@ -10254,7 +10254,7 @@ Tree-Planted Transformers: Unidirectional Transformer Language Models with Implicit Syntactic Supervision RyoYoshidaThe University of Tokyo TaigaSomeya - YoheiOsekiUniversity of Tokyo + YoheiOsekiUniversity of Tokyo 5120-5134 Syntactic Language Models (SLMs) can be trained efficiently to reach relatively high performance; however, they have trouble with inference efficiency due to the explicit generation of syntactic structures. In this paper, we propose a new method dubbed tree-planting: instead of explicitly generating syntactic structures, we “plant” trees into attention weights of unidirectional Transformer LMs to implicitly reflect syntactic structures of natural language. Specifically, unidirectional Transformer LMs trained with tree-planting will be called Tree-Planted Transformers (TPT), which inherit the training efficiency from SLMs without changing the inference efficiency of their underlying Transformer LMs. Targeted syntactic evaluations on the SyntaxGym benchmark demonstrated that TPTs, despite the lack of explicit generation of syntactic structures, significantly outperformed not only vanilla Transformer LMs but also various SLMs that generate hundreds of syntactic structures in parallel. This result suggests that TPTs can learn human-like syntactic knowledge as data-efficiently as SLMs while maintaining the modeling space of Transformer LMs unchanged. 
2024.findings-acl.303 @@ -10268,7 +10268,7 @@ YiLiu JunjieWang QingWangInstitute of Software, Chinese Academy of Sciences - YangLiuNanyang Technological University + YangLiuNanyang Technological University 5135-5147 With the development of LLMs, the security threats of LLMs are getting more and more attention. Numerous jailbreak attacks have been proposed to assess the security defense of LLMs. Current jailbreak attacks primarily utilize scenario camouflage techniques. However their explicitly mention of malicious intent will be easily recognized and defended by LLMs. In this paper, we propose an indirect jailbreak attack approach, Puzzler, which can bypass the LLM’s defensive strategies and obtain malicious response by implicitly providing LLMs with some clues about the original malicious query. In addition, inspired by the wisdom of “When unable to attack, defend” from Sun Tzu’s Art of War, we adopt a defensive stance to gather clues about the original malicious query through LLMs. The experimental results indicate that the Query Success Rate of the Puzzler is 14.0%-82.7% higher than baselines on the most prominent LLMs. Furthermore, when tested against the state-of-the-art jailbreak detection approaches, Puzzler proves to be more effective at evading detection compared to baselines. 
2024.findings-acl.304 @@ -10278,18 +10278,18 @@ Publicly Shareable Clinical Large Language Model Built on Synthetic Clinical Notes SunjunKweon - JunuKimKorea Advanced Institute of Science & Technology + JunuKimKorea Advanced Institute of Science & Technology JiyounKimKorea Advanced Institute of Science & Technology SujeongImKorea Advanced Institute of Science & Technology EunbyeolChoKorea Advanced Institute of Science & Technology SeongsuBaeKorea Advanced Institute of Science and Technology - JungwooOhKorea Advanced Institute of Science and Technology + JungwooOhKorea Advanced Institute of Science and Technology GyubokLeeKorea Advanced Institute of Science and Technology Jong HakMoonKorea Advanced Institute of Science & Technology Seng ChanYouYonsei University SeungjinBaekYonsei university - Chang HoonHan - Yoon BinJungYonsei University + Chang HoonHan + Yoon BinJungYonsei University YohanJoSeoul National University EdwardChoiKorea Advanced Institute of Science and Technology 5148-5168 @@ -10301,12 +10301,12 @@ Extending Context Window of Large Language Models via Semantic Compression WeizhiFeiThe Department of Mathematics, Tsinghua University - XueyanNiuHuawei Technologies Ltd. + XueyanNiuHuawei Technologies Ltd. PingyiZhouHuawei Technologies Ltd. LuHouHuawei Technologies Ltd. BoBai LeiDeng - WeiHanHuawei Tech. Investment Co., Limited + WeiHanHuawei Tech. Investment Co., Limited 5169-5181 Transformer based Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses due to the quadratic complexity. These constraints restrict their applicability in long text scenarios. In this paper, we propose a novel semantic compression method that enables generalization to texts that are 6-8 times longer without incurring significant computational costs or requiring fine-tuning. 
Our proposed framework draws inspiration from source coding in information theory and employs a pre-trained model to reduce the semantic redundancy of long inputs before passing them to the LLMs for downstream tasks. Experimental results demonstrate that our method effectively extends the context window of LLMs across a range of tasks including question answering, summarization, few-shot learning, and information retrieval. Furthermore, the proposed semantic compression method exhibits consistent fluency in text generation while reducing the associated computational overhead. 2024.findings-acl.306 @@ -10317,7 +10317,7 @@ Plausible Extractive Rationalization through Semi-Supervised Entailment Signal YeoWei JieSchool of Computer Science and Engineering, Nanyang Technological University RanjanSatapathy - ErikCambriaNanyang Technological University + ErikCambriaNanyang Technological University 5182-5192 The increasing use of complex and opaque black box models requires the adoption of interpretable measures, one such option is extractive rationalizing models, which serve as a more interpretable alternative. These models, also known as Explain-Then-Predict models, employ an explainer model to extract rationales and subsequently condition the predictor with the extracted information. Their primary objective is to provide precise and faithful explanations, represented by the extracted rationales. In this paper, we take a semi-supervised approach to optimize for the plausibility of extracted rationales. We adopt a pre-trained natural language inference (NLI) model and further fine-tune it on a small set of supervised rationales (10%). The NLI predictor is leveraged as a source of supervisory signals to the explainer via entailment alignment. We show that, by enforcing the alignment agreement between the explanation and answer in a question-answering task, the performance can be improved without access to ground truth labels. 
We evaluate our approach on the ERASER dataset and show that our approach achieves comparable results with supervised extractive models and outperforms unsupervised approaches by > 100%. 2024.findings-acl.307 @@ -10327,7 +10327,7 @@ Translation Deserves Better: Analyzing Translation Artifacts in Cross-lingual Visual Question Answering ChaeHunPark - KoanhoLeeKorea Advanced Institute of Science & Technology + KoanhoLeeKorea Advanced Institute of Science & Technology HyesuLimKorea Advanced Institute of Science & Technology JaeseokKimKorea Telecom Research JunmoParkSaltlux @@ -10356,7 +10356,7 @@ Fast Randomized Low-Rank Adaptation of Pre-trained Language Models with <fixed-case>PAC</fixed-case> Regularization ZijianLeiHong Kong Baptist University DongQianLinköping University - WilliamCheungHong Kong Baptist University + WilliamCheungHong Kong Baptist University 5236-5249 Low-rank adaptation (LoRA) achieves parameter efficient fine-tuning for large language models (LLMs) by decomposing the model weight update into a pair of low-rank projection matrices. Yet, the memory overhead restricts it to scale up when the model size increases. We propose Randomized LoRA (RLoRA) which adopts Randomized Walsh-Hadamard Transform to achieve significant reduction in the size of trainable parameters compared to LoRA. At the same time, it allows a PAC-Bayes regularizer to be efficiently incorporated to improve generalization. We evaluate the effectiveness of RLoRA on LLMs RoBERTa, GPT-2 and LLaMA-7B using GLUE, E2E and math reasoning benchmarks. With a much lower memory requirement, RLoRA can give similar performance as the SOTA low-rank adaptation methods for these three tasks and significantly better performance under few-shot settings. 
2024.findings-acl.310 @@ -10366,8 +10366,8 @@ <fixed-case>SDA</fixed-case>: Semantic Discrepancy Alignment for Text-conditioned Image Retrieval YuchenYang - YuWangShanghai Jiao Tong University - YanfengWangShanghai Jiao Tong University + YuWangShanghai Jiao Tong University + YanfengWangShanghai Jiao Tong University 5250-5261 In the realm of text-conditioned image retrieval, models utilize a query composed of a reference image and modification text to retrieve corresponding images. Despite its significance, this task is fraught with challenges, including small-scale datasets due to labeling costs and the complexity of attributes in modification texts. These challenges often result in models learning a generalized representation of the query, thereby missing the semantic correlations of image and text attributes.In this paper, we introduce a general boosting framework designed to address these issues by employing semantic discrepancy alignment. Our framework first leverages the ChatGPT to augment text data by modifying the original modification text’s attributes. The augmented text is then combined with the original reference image to create an augmented composed query. Then we generate corresponding images using GPT-4 for the augmented composed query.We realize the cross-modal semantic discrepancy alignment by formulating distance consistency and neighbor consistency between the image and text domains. Through this novel approach, attribute in the text domain can be more effectively transferred to the image domain, enhancing retrieval performance. Extensive experiments on three prominent datasets validate the effectiveness of our approach, with state-of-the-art results on a majority of evaluation metrics compared to various baseline methods. 
2024.findings-acl.311 @@ -10393,7 +10393,7 @@ Generation Meets Verification: Accelerating Large Language Model Inference with Smart Parallel Auto-Correct Decoding HanlingYi - FengLinIntelliFusion Co., Ltd + FengLinIntelliFusion Co., Ltd HongbinLi NingPeiyangIntellifusion Inc. XiaotianYu @@ -10424,7 +10424,7 @@ XinweiWu WeilongDong ShaoyangXu - DeyiXiongTianjin University + DeyiXiongTianjin University 5319-5332 Protecting privacy leakage in large language models remains a paramount challenge. In this paper, we reveal Privacy Seesaw in LLM privacy safeguarding, a phenomenon where measures to secure specific private information inadvertently heighten exposure risks for other privacy. Through comprehensive analysis, we identify the amount of targeted privacy data and the volume of edited privacy neurons as the two central triggers to this issue. To mitigate privacy seesaw, we propose Augmented Privacy Neuron Editing via Activation Patching (APNEAP), a novel framework designed to well balance model performance with privacy protection. The proposed APNEAP augments collected private data by automatically synthesizing new private data, which deactivates the first trigger to the privacy seesaw issue. Additionally, it adapts activation patching to privacy neuron editing for switching off the second trigger to the privacy seesaw problem. Experimental results show that the proposed APNEAP is capable of alleviating the privacy seesaw phenomenon and offers a more stable and reliable approach to privacy protection in LLMs than previous methods. 
2024.findings-acl.315 @@ -10446,7 +10446,7 @@ <fixed-case>B</fixed-case>ad<fixed-case>A</fixed-case>cts: A Universal Backdoor Defense in the Activation Space BiaoYi SishuoChenAlibaba Group - YimingLiNanyang Technological University + YimingLiNanyang Technological University TongLiNankai University BaoleiZhang ZheliLiu @@ -10462,10 +10462,10 @@ YaoruiShi AnZhangNational University of Singapore SihangLi - EnzhiZhang - XiangWangUniversity of Science and Technology of China + EnzhiZhang + XiangWangUniversity of Science and Technology of China KenjiKawaguchiNational University of Singapore - Tat-SengChuaNational University of Singapore + Tat-SengChuaNational University of Singapore 5353-5377 Molecule-text modeling, which aims to facilitate molecule-relevant tasks with a textual interface and textual knowledge, is an emerging research direction. Beyond single molecules, studying reaction-text modeling holds promise for helping the synthesis of new materials and drugs. However, previous works mostly neglect reaction-text modeling: they primarily focus on modeling individual molecule-text pairs or learning chemical reactions without texts in context. Additionally, one key task of reaction-text modeling – experimental procedure prediction – is less explored due to the absence of an open-source dataset. The task is to predict step-by-step actions of conducting chemical experiments and is crucial to automating chemical synthesis. To resolve the challenges above, we propose a new pretraining method, ReactXT, for reaction-text modeling, and a new dataset, OpenExp, for experimental procedure prediction. Specifically, ReactXT features three types of input contexts to incrementally pretrain LMs. Each of the three input contexts corresponds to a pretraining task to improve the text-based understanding of either reactions or single molecules. 
ReactXT demonstrates consistent improvements in experimental procedure prediction and molecule captioning and offers competitive results in retrosynthesis. Our code is available at https://github.com/syr-cn/ReactXT. 2024.findings-acl.318 @@ -10476,7 +10476,7 @@ Multi-modal Concept Alignment Pre-training for Generative Medical Visual Question Answering QuanYanCentral South University JunwenDuanCentral South University, China - JianxinWangCentral South University + JianxinWangCentral South University 5378-5389 Medical Visual Question Answering (Med-VQA) seeks to accurately respond to queries regarding medical images, a task particularly challenging for open-ended questions. This study unveils the Multi-modal Concept Alignment Pre-training (MMCAP) approach for generative Med-VQA, leveraging a knowledge graph sourced from medical image-caption datasets and the Unified Medical Language System. MMCAP advances the fusion of visual and textual medical knowledge via a graph attention network and a transformer decoder. Additionally, it incorporates a Type Conditional Prompt in the fine-tuning phase, markedly boosting the accuracy and relevance of answers to open-ended questions. Our tests on benchmark datasets illustrate MMCAP’s superiority over existing methods, demonstrating its high efficiency in data-limited settings and effective knowledge-image alignment capability. 2024.findings-acl.319 @@ -10504,8 +10504,8 @@ HangJiang RuiYang QingchengZengNorthwestern University, Northwestern University - JinghuiLuByteDance Inc. - MoritzBlum + JinghuiLuByteDance Inc. 
+ MoritzBlum TianweiSheThe University of Tokyo, Tokyo Institute of Technology YuangJiang IreneLi @@ -10518,8 +10518,8 @@ The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse WanliYang - FeiSunInstitute of Computing Technology, Chinese Academy of Sciences - XinyuMaBaidu + FeiSunInstitute of Computing Technology, Chinese Academy of Sciences + XinyuMaBaidu XunLiu DaweiYinBaidu XueqiCheng, Chinese Academy of Sciences @@ -10546,7 +10546,7 @@ BowenLi BowenQinShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences NanHuo - ChenhaoMaThe Chinese University of Hong Kong, Shenzhen + ChenhaoMaThe Chinese University of Hong Kong, Shenzhen ReynoldCheng 5456-5471 Large Language Models (LLMs) driven by In-Context Learning (ICL) have significantly improved the performance of text-to-SQL. Previous methods generally employ a two-stage reasoning framework, namely 1) schema linking and 2) logical synthesis, making the framework not only effective but also interpretable. Despite these advancements, the inherent bad nature of the generalization of LLMs often results in hallucinations, which limits the full potential of LLMs. In this work, we first identify and categorize the common types of hallucinations at each stage in text-to-SQL. We then introduce a novel strategy, Task Alignment (TA), designed to mitigate hallucinations at each stage. TA encourages LLMs to take advantage of experiences from similar tasks rather than starting the tasks from scratch. This can help LLMs reduce the burden of generalization, thereby mitigating hallucinations effectively. We further propose TA-SQL, a text-to-SQL framework based on this strategy. The experimental results and comprehensive analysis demonstrate the effectiveness and robustness of our framework. 
Specifically, it enhances the performance of the GPT-4 baseline by 21.23% relatively on BIRD dev and it yields significant improvements across six models and four mainstream, complex text-to-SQL benchmarks. @@ -10558,8 +10558,8 @@ Translatotron-<fixed-case>V</fixed-case>(ison): An End-to-End Model for In-Image Machine Translation ZhibinLan LiqiangNiu - FandongMengWeChat AI, Tencent Inc. - JieZhou + FandongMengWeChat AI, Tencent Inc. + JieZhou MinZhangHarbin Institute of Technology, Shenzhen JinsongSuXiamen University 5472-5485 @@ -10572,9 +10572,9 @@ FarhadNooralahzadehUniversity of Zurich and ZHAW - Zürcher Hochschule für Angewandte Wissenschaften YiZhangUniversity of Zurich and ZHAW - Zürcher Hochschule für Angewandte Wissenschaften EllerySmith - SabineMaennelETHZ - ETH Zurich - CyrilMatthey-DoretEPFL - EPF Lausanne - RaphaëlDe FondevilleFederal Office of Statistics + SabineMaennelETHZ - ETH Zurich + CyrilMatthey-DoretEPFL - EPF Lausanne + RaphaëlDe FondevilleFederal Office of Statistics KurtStockingerZHAW - Zürcher Hochschule für Angewandte Wissenschaften 5486-5507 The potential for improvements brought by Large Language Models (LLMs) in Text-to-SQL systems is mostly assessed on monolingual English datasets. However, LLMs’ performance for other languages remains vastly unexplored. In this work, we release the StatBot.Swiss dataset, the first bilingual benchmark for evaluating Text-to-SQL systems based on real-world applications. The StatBot.Swiss dataset contains 455 natural language/SQL-pairs over 35 big databases with varying level of complexity for both English and German.We evaluate the performance of state-of-the-art LLMs such as GPT-3.5-Turbo and mixtral-8x7b-instruct for the Text-to-SQL translation task using an in-context learning approach. Our experimental analysis illustrates that current LLMs struggle to generalize well in generating SQL queries on our novel bilingual dataset. 
@@ -10610,7 +10610,7 @@ Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning ShuzhengSiTsinghua University HelanHu - HaozheZhao + HaozheZhao ShuangZeng KaikaiAn ZefanCai @@ -10626,7 +10626,7 @@ HarriRowlandsInfluenceMap GakuMorioHitachi America, Ltd., Stanford University and Hitachi, ltd. DylanTanner - ChristopherManningComputer Science Department, Stanford University + ChristopherManningComputer Science Department, Stanford University 5547-5558 Social media advertising offers a platform for fossil fuel value chain companies and their agents to reinforce their narratives, often emphasizing economic, labor market, and energy security benefits to promote oil and gas policy and products. Whether such narratives can be detected automatically and the extent to which the cost of human annotation can be reduced is our research question. We introduce a task of classifying narratives into seven categories, based on existing definitions and data.Experiments showed that RoBERTa-large outperforms other methods, while GPT-4 Turbo can serve as a viable annotator for the task, thereby reducing human annotation costs. Our findings and insights provide guidance to automate climate-related ad analysis and lead to more scalable ad scrutiny. 
2024.findings-acl.330 @@ -10637,10 +10637,10 @@ <fixed-case>SSS</fixed-case>: Editing Factual Knowledge in Language Models towards Semantic Sparse Space HuazhengWang HaifengSunBeijing University of Posts and Telecommunications, Beijing University of Posts and Telecommunications and Beijing University of Posts and Telecommunications - JingyuWangBeijing University of Post and Telecommunication, Tsinghua University + JingyuWangBeijing University of Post and Telecommunication, Tsinghua University QiQiBeijing University of Posts and Telecommunications ZixuanXiaBeijing University of Posts and Telecommunications - MenghaoZhangBeijing University of Posts and Telecommunications + MenghaoZhangBeijing University of Posts and Telecommunications JianxinLiao 5559-5570 Language Models (LMs) acquire factual knowledge during pre-training and store it in the parameters, which can be valuable for downstream tasks. As world evolves, some facts may be incorrectly induced or become obsolete over time. Various model editing methods have been proposed to modify specific examples in LMs. However, existing training-based methods still suffer from sub-optimal locality, where irrelevant neighborhood examples can be adversely influenced. Model’s gradients are still struggling to identify the appropriate direction when updating the parameters. To address this issue, we find that directing the hidden state of the edit example towards spaces where semantics are sparse tends to help preserve the semantics of irrelevant neighborhood examples. Based on this hypothesis, we propose a novel metric, named SSS, to evaluate the degree of sparsity around a sentence embedding in the semantic space without any human or machine annotation. Subsequently, we incorporate SSS into the original loss function of the existing training-based methods to enhance locality. Experiments conducted on two datasets across various models demonstrate that SSS is effective in improving both locality and reasoning capability. 
@@ -10664,8 +10664,8 @@ Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models Sheng-LunWeiDepartment of computer science and informational engineering, National Taiwan University Cheng-KuangWuAppier - Hen-HsenHuangInstitute of Information Science, Academia Sinica - Hsin-HsiChenNational Taiwan University + Hen-HsenHuangInstitute of Information Science, Academia Sinica + Hsin-HsiChenNational Taiwan University 5598-5621 In this paper, we investigate the phenomena of “selection biases” in Large Language Models (LLMs), focusing on problems where models are tasked with choosing the optimal option from an ordered sequence. We delve into biases related to option order and token usage, which significantly impact LLMs’ decision-making processes. We also quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks. Furthermore, we propose mitigation strategies to enhance model performance. Our key contributions are threefold: 1) Precisely quantifying the influence of option order and token on LLMs, 2) Developing strategies to mitigate the impact of token and order sensitivity to enhance robustness, and 3) Offering a detailed analysis of sensitivity across models and tasks, which informs the creation of more stable and reliable LLM applications for selection problems. 
2024.findings-acl.333 @@ -10675,7 +10675,7 @@ <fixed-case>A</fixed-case>rabic<fixed-case>MMLU</fixed-case>: Assessing Massive Multitask Language Understanding in <fixed-case>A</fixed-case>rabic FajriKotoMohamed bin Zayed University of Artificial Intelligence - HaonanLi + HaonanLi SaraShatnawi JadDoughman AbdelrahmanSadallah @@ -10683,10 +10683,10 @@ KhalidAlmubarakPrince Sattam bin Abdulaziz University ZaidAlyafeai NehaSengupta - ShadyShehataMohamed bin Zayed University of Artificial Intelligence - NizarHabashNew York University Abu Dhabi - PreslavNakovMohamed bin Zayed University of Artificial Intelligence - TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + ShadyShehataMohamed bin Zayed University of Artificial Intelligence + NizarHabashNew York University Abu Dhabi + PreslavNakovMohamed bin Zayed University of Artificial Intelligence + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne 5622-5640 The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. 
Notably, BLOOMZ, mT0, LLama2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%. 2024.findings-acl.334 @@ -10696,10 +10696,10 @@ On the Relationship Between <fixed-case>RNN</fixed-case> Hidden-State Vectors and Semantic Structures EdiMuskardin - MartinTapplerTechnische Universität Wien + MartinTapplerTechnische Universität Wien IngoPillTechnische Universität Graz - BernhardAichernigTechnische Universität Graz - ThomasPockGraz University of Technology + BernhardAichernigTechnische Universität Graz + ThomasPockGraz University of Technology 5641-5658 We examine the assumption that hidden-state vectors of recurrent neural networks (RNNs) tend to form clusters of semantically similar vectors, which we dub the clustering hypothesis. While this hypothesis has been assumed in RNN analyses in recent years, its validity has not been studied thoroughly on modern RNN architectures. We first consider RNNs that were trained to recognize regular languages. This enables us to draw on perfect ground-truth automata in our evaluation, against which we can compare the RNN’s accuracy and the distribution of the hidden-state vectors. Then, we consider context-free languages to examine if RNN states form clusters for more expressive languages.For our analysis, we fit (generalized) linear models to classify RNN states into automata states and we apply different unsupervised clustering techniques. With a new ambiguity score, derived from information entropy, we measure how well an abstraction function maps the hidden state vectors to abstract clusters. Our evaluation supports the validity of the clustering hypothesis for regular languages, especially if RNNs are well-trained, i.e., clustering techniques succeed in finding clusters of similar state vectors. However, the clustering accuracy decreases substantially for context-free languages. 
This suggests that clustering is not a reliable abstraction technique for RNNs used in tasks like natural language processing. 2024.findings-acl.335 @@ -10710,11 +10710,11 @@ <fixed-case>XMC</fixed-case>-Agent : Dynamic Navigation over Scalable Hierarchical Index for Incremental Extreme Multi-label Classification YanjiangLiu TianyunZhong - YaojieLuInstitute of Software, Chinese Academy of Sciences + YaojieLuInstitute of Software, Chinese Academy of Sciences HongyuLinInstitute of Software, Chinese Academy of Sciences BenHe ShuhengZhouAnt Group - HuijiaZhu + HuijiaZhu WeiqiangWangAnt Group ZhongyiLiuAnt Group XianpeiHanInstitute of Software, CAS @@ -10741,11 +10741,11 @@ Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint ZhipengChen KunZhouRenmin University of China - XinZhaoRenmin University of China - JunchenWan + XinZhaoRenmin University of China + JunchenWan FuzhengZhang DiZhangKuaishou Technology - Ji-RongWenRenmin University of China + Ji-RongWenRenmin University of China 5694-5711 Reinforcement learning (RL) has been widely used in training large language models (LLMs) for preventing unexpected outputs, e.g., reducing harmfulness and errors. However, existing RL methods mainly adopt instance-level reward, which cannot provide fine-grained supervision for complex reasoning tasks. As a result, the RL training cannot be fully aware of the specific part or step that actually leads to the incorrectness in model response. To address it, we propose a new RL method named RLMEC that incorporates a generative model as the reward model, which is trained by the erroneous solution rewriting task under the minimum editing constraint, which can produce token-level supervision for RL training. Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing RL process. 
And these two objectives focus on the revision of the key tokens for the erroneous solution, reducing the effect of other unimportant tokens. Experiment results on 8 tasks have demonstrated the effectiveness of our approach. Our code and data will be publicly released. 2024.findings-acl.338 @@ -10755,7 +10755,7 @@ Definition generation for lexical semantic change detection MariiaFedorova - AndreyKutuzovUniversity of Oslo + AndreyKutuzovUniversity of Oslo YvesScherrerUniversity of Oslo 5712-5724 We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as ‘senses’, and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling. 
@@ -10767,8 +10767,8 @@ <fixed-case>M</fixed-case>u<fixed-case>T</fixed-case>ox: Universal <fixed-case>MU</fixed-case>ltilingual Audio-based <fixed-case>TOX</fixed-case>icity Dataset and Zero-shot Detector MartaCosta-jussàMeta MarianoMeglioliMeta - PierreAndrews - DavidDaleFAIR at Meta + PierreAndrews + DavidDaleFAIR at Meta PrangthipHansanti ElaheKalbassi AlexandreMourachkoResearch, Facebook @@ -10782,8 +10782,8 @@ Phased Instruction Fine-Tuning for Large Language Models - WeiPang - ChuanZhouPeking University + WeiPang + ChuanZhouPeking University Xiao-HuaZhou XiaojieWangBeijing University of Post and Telecommunication 5735-5748 @@ -10802,11 +10802,11 @@ XinhaoChen TuHu YangChen - YupeiRen + YupeiRen YadongZhang YouqiSong BinxuanLiu - ManLan + ManLan 5749-5765 Topic relevance of an essay demands that the composition adheres to a clear theme and aligns well with the essay prompt requirements, a critical aspect of essay quality evaluation. However, existing research of Automatic Essay Scoring (AES) for Chinese essays has overlooked topic relevance and lacks detailed feedback, while Automatic Essay Comment Generation (AECG) faces much complexity and difficulty. Additionally, current Large Language Models, including GPT-4, often make incorrect judgments and provide overly impractical feedback when evaluating topic relevance. This paper introduces TOREE (Topic Relevance Evaluation), a comprehensive dataset developed to assess topic relevance in Chinese primary and middle school students’ essays, which is beneficial for AES, AECG and other applications. Moreover, our proposed two-step method utilizes TOREE through a combination of Supervised Fine-tuning and Preference Learning. Experimental results demonstrate that TOREE is of high quality, and our method significantly enhances models’ performance on two designed tasks for topic relevance evaluation, improving both automatic and human evaluations across four diverse LLMs. 
 2024.findings-acl.342 @@ -10815,13 +10815,13 @@ Predicting the Unpredictable: Uncertainty-Aware Reasoning over Temporal Knowledge Graphs via Diffusion Process - YuxiangCai - QiaoLiuUESTC - YangleiGan - ChanglinLiUniversity of Electronic Science and Technology of China - XueyiLiu - RunLin - DaLuo + YuxiangCai + QiaoLiuUESTC + YangleiGan + ChanglinLiUniversity of Electronic Science and Technology of China + XueyiLiu + RunLin + DaLuo JiayeYangJiayeYang 5766-5778 Temporal Knowledge Graph (TKG) reasoning seeks to predict future incomplete facts leveraging historical data. While existing approaches have shown effectiveness in addressing the task through various perspectives, such as graph learning and logic rules, they are limited in capturing the indeterminacy in future events, particularly in the case of rare/unseen facts. To tackle the highlighted issues, we introduce a novel approach by conceptualizing TKG reasoning as a sequence denoising process for future facts, namely DiffuTKG. Concretely, we first encode the historical events as the conditional sequence. Then we gradually introduce Gaussian noise to corrupt target facts during the forward process and then employ a transformer-based conditional denoiser to restore them in the reverse phase. Moreover, we introduce an uncertainty regularization loss to mitigate the risk of prediction biases by favoring frequent scenarios over rare/unseen facts. Empirical results on four real-world datasets show that DiffuTKG outperforms state-of-the-art methods across multiple evaluation metrics. 
@@ -10845,12 +10845,12 @@ Controlled Text Generation for Large Language Model with Dynamic Attribute Graphs XunLiangRenmin University of China HanyuWang - ShichaoSong - MengtingHuNankai University + ShichaoSong + MengtingHuNankai University XunzhiWangNankai University - ZhiyuLi - FeiyuXiongInstitute for Advanced Algorithms Research, Shanghai - BoTang + ZhiyuLi + FeiyuXiongInstitute for Advanced Algorithms Research, Shanghai + BoTang 5797-5814 Controlled Text Generation (CTG) aims to produce texts that exhibit specific desired attributes. In this study, we introduce a pluggable CTG framework for Large Language Models (LLMs) named Dynamic Attribute Graphs-based controlled text generation (DATG). This framework utilizes an attribute scorer to evaluate the attributes of sentences generated by LLMs and constructs dynamic attribute graphs. DATG modulates the occurrence of key attribute words and key anti-attribute words, achieving effective attribute control without compromising the original capabilities of the model. We conduct experiments across four datasets in two tasks: toxicity mitigation and sentiment transformation, employing five LLMs as foundational models. Our findings highlight a remarkable enhancement in control accuracy, achieving a peak improvement of 19.29% over baseline methods in the most favorable task across four datasets. Additionally, we observe a significant decrease in perplexity, markedly improving text fluency. 2024.findings-acl.345 @@ -10859,10 +10859,10 @@ Coconut: Contextualized Commonsense Unified Transformers for Graph-Based Commonsense Augmentation of Language Models - Jun-HyungPark + Jun-HyungPark MingyuLeeKorea University JunhoKimKorea University - SangKeunLeeKorea University + SangKeunLeeKorea University 5815-5830 In this paper, we introduce COCONUT to effectively guide the contextualization of structured commonsense knowledge based on large language models. 
COCONUT employs a contextualized knowledge prompting scheme to gather high-quality contextualization examples from a large language model. These examples are subsequently distilled into small language models to enhance their contextualization capability. Extensive evaluations show that COCONUT considerably improves commonsense reasoning performance across diverse benchmarks, models, and settings, exhibiting its flexibility and universality in generating contextualized commonsense knowledge. Notably, COCONUT consistently outperforms the state-of-the-art technique by an average of 5.8%. 2024.findings-acl.346 @@ -10874,7 +10874,7 @@ DanielTamayo AitorGonzalez-Agirre JavierHernando - MartaVillegas + MartaVillegas 5831-5847 Recent research has explored methods for updating and modifying factual knowledge in large language models, often focusing on specific multi-layer perceptron blocks. This study expands on this work by examining the effectiveness of existing knowledge editing methods across languages and delving into the role of attention mechanisms in this process. Drawing from the insights gained, we propose Mass-Editing Memory with Attention in Transformers (MEMAT), a method that achieves significant improvements in all metrics while requiring minimal parameter modifications. MEMAT delivers a remarkable 10% increase in magnitude metrics, benefits languages not included in the training data and also demonstrates a high degree of portability. Our code and data are at https://github.com/dtamayo-nlp/MEMAT. 
2024.findings-acl.347 @@ -10883,12 +10883,12 @@ <fixed-case>B</fixed-case>io<fixed-case>M</fixed-case>istral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains - YanisLabrak - AdrienBazogeNantes Université + YanisLabrak + AdrienBazogeNantes Université EmmanuelMorin - Pierre-AntoineGourraudUniversité de Nantes + Pierre-AntoineGourraudUniversité de Nantes MickaelRouvierUniversité d’Avignon - RichardDufourNantes University + RichardDufourNantes University 5848-5864 Large Language Models (LLMs) have demonstrated remarkable versatility in recent years, offering potential applications across specialized domains such as healthcare and medicine. Despite the availability of various open-source LLMs tailored for health contexts, adapting general-purpose LLMs to the medical domain presents significant challenges.In this paper, we introduce BioMistral, an open-source LLM tailored for the biomedical domain, utilizing Mistral as its foundation model and further pre-trained on PubMed Central. We conduct a comprehensive evaluation of BioMistral on a benchmark comprising 10 established medical question-answering (QA) tasks in English. We also explore lightweight models obtained through quantization and model merging approaches. Our results demonstrate BioMistral’s superior performance compared to existing open-source medical models and its competitive edge against proprietary counterparts. Finally, to address the limited availability of data beyond English and to assess the multilingual generalization of medical LLMs, we automatically translated and evaluated this benchmark into 7 other languages. This marks the first large-scale multilingual evaluation of LLMs in the medical domain. Datasets, multilingual evaluation benchmarks, scripts, and all the models obtained during our experiments are freely released. 
2024.findings-acl.348 @@ -10901,9 +10901,9 @@ ZhaopengTuTencent AI Lab ChangChen YouliangYuanThe Chinese University of Hong Kong-Shenzhen - Jen-tseHuang + Jen-tseHuang WenxiangJiaoTencent AI Lab - MichaelLyuThe Chinese University of Hong Kong + MichaelLyuThe Chinese University of Hong Kong 5865-5877 Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose a simple and effective prompting method to improve the multilingual safety of ChatGPT by enhancing cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses by 42% for non-English queries. We will release all the data and results to facilitate future research on LLMs’ safety. 
2024.findings-acl.349 @@ -10915,7 +10915,7 @@ YuanZhang WanhongHuang YiFengNanjing University - ChuanyiLinanjing university + ChuanyiLinanjing university ZhiweiFeiFudan University, Harbin Institute of Technology, Dalian University of Technology, Shanghai Jiaotong University, Shandong University, Peking University, Zhejiang University, University of Science and Technology of China, Hunan University, Beijing Institute of Technology, University of the Chinese Academy of Sciences, Southeast University, Sichuan University, Monash University, Malaysia Campus, Tianjin University, Beijing University of Aeronautics and Astronautics, Wuhan University of Technology, Yale University, Technische Universität München, Wuhan University, nanjing university, Tsinghua University and Wuhan University JidongGeNanjing University BinLuonanjing university @@ -10930,7 +10930,7 @@ <fixed-case>CMDL</fixed-case>: A Large-Scale <fixed-case>C</fixed-case>hinese Multi-Defendant Legal Judgment Prediction Dataset WanhongHuang YiFengNanjing University - ChuanyiLinanjing university + ChuanyiLinanjing university HonghanWu JidongGeNanjing University VincentNgUniversity of Texas at Dallas @@ -10952,18 +10952,18 @@ <fixed-case>A</fixed-case>bstract <fixed-case>M</fixed-case>eaning <fixed-case>R</fixed-case>epresentation-Based Logic-Driven Data Augmentation for Logical Reasoning - QimingBao + QimingBao Alex YuxuanPeng ZhenyunDeng WanjunZhong - GaëlGendron + GaëlGendron TimothyPistotti NeşetTan NathanYoung YangChen YonghuaZhu PaulDenny - MichaelWitbrock + MichaelWitbrock JiamouLiu 5914-5934 Combining large language models with logical reasoning enhances their capacity to address problems in a robust and reliable manner. Nevertheless, the intricate nature of logical reasoning poses challenges when gathering reliable data from the web to build comprehensive training datasets, subsequently affecting performance on downstream tasks. 
To address this, we introduce a novel logic-driven data augmentation approach, AMR-LDA. AMR-LDA converts the original text into an Abstract Meaning Representation (AMR) graph, a structured semantic representation that encapsulates the logical structure of the sentence, upon which operations are performed to generate logically modified AMR graphs. The modified AMR graphs are subsequently converted back into text to create augmented data. Notably, our methodology is architecture-agnostic and enhances both generative large language models, such as GPT-3.5 and GPT-4, through prompt augmentation, and discriminative large language models through contrastive learning with logic-driven data augmentation. Empirical evidence underscores the efficacy of our proposed method with improvement in performance across seven downstream tasks, such as reading comprehension requiring logical reasoning, textual entailment, and natural language inference. Furthermore, our method leads on the ReClor leaderboard at https://eval.ai/web/challenges/challenge-page/503/leaderboard/1347. The source code and data are publicly available at https://github.com/Strong-AI-Lab/Logical-Equivalence-driven-AMR-Data-Augmentation-for-Representation-Learning. @@ -10985,7 +10985,7 @@ <fixed-case>V</fixed-case>i<fixed-case>H</fixed-case>ate<fixed-case>T</fixed-case>5: Enhancing Hate Speech Detection in <fixed-case>V</fixed-case>ietnamese With a Unified Text-to-Text Transformer Model - LuanThanh NguyenUniversity of Information Technology, Vietnam National University Ho Chi Minh City + LuanThanh NguyenUniversity of Information Technology, Vietnam National University Ho Chi Minh City 5948-5961 Recent advancements in hate speech detection (HSD) in Vietnamese have made significant progress, primarily attributed to the emergence of transformer-based pre-trained language models, particularly those built on the BERT architecture. 
However, the necessity for specialized fine-tuned models has resulted in the complexity and fragmentation of developing a multitasking HSD system. Moreover, most current methodologies focus on fine-tuning general pre-trained models, primarily trained on formal textual datasets like Wikipedia, which may not accurately capture human behavior on online platforms. In this research, we introduce ViHateT5, a T5-based model pre-trained on our proposed large-scale domain-specific dataset named VOZ-HSD. By harnessing the power of a text-to-text architecture, ViHateT5 can tackle multiple tasks using a unified model and achieve state-of-the-art performance across all standard HSD benchmarks in Vietnamese. Our experiments also underscore the significance of label distribution in pre-training data on model efficacy. We provide our experimental materials for research purposes, including the VOZ-HSD dataset, pre-trained checkpoint, the unified HSD-multitask ViHateT5 model, and related source code on GitHub publicly. 2024.findings-acl.355 @@ -11009,7 +11009,7 @@ HanxingDing YuexiangXieAlibaba Group QiCaoInstitute of Computing Technology, Chinese Academy of Sciences, China - FeiSunInstitute of Computing Technology, Chinese Academy of Sciences + FeiSunInstitute of Computing Technology, Chinese Academy of Sciences JinyangGao HuaweiShenInstitute of Computing Technology, Chinese Academy of Sciences BolinDingAlibaba Group @@ -11021,8 +11021,8 @@ Zero-shot Cross-lingual Alignment for Embedding Initialization - XiAi - ZhiyongHuangNUS School of Computing + XiAi + ZhiyongHuangNUS School of Computing 5997-6007 For multilingual training, we present CrossInit, an initialization method that initializes embeddings into similar geometrical structures across languages in an unsupervised manner. CrossInit leverages a common cognitive linguistic mechanism, Zipf’s law, which indicates that similar concepts across languages have similar word ranks or frequencies in their monolingual corpora. 
Instead of considering point-to-point alignments based on ranks, CrossInit considers the same span of consecutive ranks in each language as the Positive pairs for alignment, while others out of the span are used as Negative pairs. CrossInit then employs Contrastive Learning to iteratively refine randomly initialized embeddings for similar geometrical structures across languages. Our experiments on Unsupervised NMT, XNLI, and MLQA showed significant gains in low-resource and dissimilar languages after applying CrossInit. 2024.findings-acl.358 @@ -11045,10 +11045,10 @@ It takes two to borrow: a donor and a recipient. Who’s who? LiviuDinuUniversity of Bucharest AnaUbanUniversitatea Bucuresti - AncaDinu + AncaDinu Ioan-BogdanIordache SimonaGeorgescuUniversity of Bucharest - LaurentiuZoicasUniversity of Bucharest + LaurentiuZoicasUniversity of Bucharest 6023-6035 We address the open problem of automatically identifying the direction of lexical borrowing, given word pairs in the donor and recipient languages. We propose strong benchmarks for this task, by applying a set of machine learning models. We extract and publicly release a comprehensive borrowings dataset from the recent RoBoCoP cognates and borrowings database for five Romance languages. We experiment on this dataset with both graphic and phonetic representations and with different features, models and architectures. We interpret the results, in terms of F1 score, commenting on the influence of features and model choice, of the imbalanced data and of the inherent difficulty of the task for particular language pairs. We show that automatically determining the direction of borrowing is a feasible task, and propose additional directions for future work. 
2024.findings-acl.360 @@ -11058,7 +11058,7 @@ Advancing Post-<fixed-case>OCR</fixed-case> Correction: A Comparative Study of Synthetic Data ShuhaoGuan - DerekGreeneUniversity College Dublin + DerekGreeneUniversity College Dublin 6036-6047 This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages. 
2024.findings-acl.361 @@ -11067,11 +11067,11 @@ <fixed-case>G</fixed-case>eo<fixed-case>A</fixed-case>gent: To Empower <fixed-case>LLM</fixed-case>s using Geospatial Tools for Address Standardization - ChenghuaHuangFudan University + ChenghuaHuangFudan University ShisongChen ZhixuLi JianfengQuSoochow University - YanghuaXiaoFudan University + YanghuaXiaoFudan University JiaxinLiu ZhigangCheniFLYTEK Research 6048-6063 @@ -11083,7 +11083,7 @@ <fixed-case>HQP</fixed-case>: A Human-Annotated Dataset for Detecting Online Propaganda AbdurahmanMaarouf - DominikBärLudwig-Maximilians-Universität München + DominikBärLudwig-Maximilians-Universität München DominiqueGeisslerLudwig-Maximilians-Universität München StefanFeuerriegelLMU Munich 6064-6089 @@ -11107,7 +11107,7 @@ Exploring Spatial Schema Intuitions in Large Language and Vision Models - PhilippWickeLudwig-Maximilians-Universität München + PhilippWickeLudwig-Maximilians-Universität München LennartWachowiakKing’s College London, University of London 6102-6117 Despite the ubiquity of large language models (LLMs) in AI research, the question of embodiment in LLMs remains underexplored, distinguishing them from embodied systems in robotics where sensory perception directly informs physical action.Our investigation navigates the intriguing terrain of whether LLMs, despite their non-embodied nature, effectively capture implicit human intuitions about fundamental, spatial building blocks of language. We employ insights from spatial cognitive foundations developed through early sensorimotor experiences, guiding our exploration through the reproduction of three psycholinguistic experiments. Surprisingly, correlations between model outputs and human responses emerge, revealing adaptability without a tangible connection to embodied experiences. Notable distinctions include polarized language model responses and reduced correlations in vision language models. 
This research contributes to a nuanced understanding of the interplay between language, spatial experiences, and the computations made by large language models. Project Website: https://cisnlp.github.io/Spatial_Schemas/ @@ -11120,7 +11120,7 @@ YiboMiao HongchengGao HaoZhangUniversity of California, San Diego, Petuum, Inc and Carnegie Mellon University - ZhijieDengShanghai Jiaotong University + ZhijieDengShanghai Jiaotong University 6118-6130 The detection of machine-generated text, especially from large language models (LLMs), is crucial in preventing serious social problems resulting from their misuse. Some methods train dedicated detectors on specific datasets but fall short in generalizing to unseen test data, while other zero-shot ones often yield suboptimal performance. Although the recent DetectGPT has shown promising detection performance, it suffers from significant inefficiency issues, as detecting a single candidate requires querying the source LLM with hundreds of its perturbations. This paper aims to bridge this gap. Concretely, we propose to incorporate a Bayesian surrogate model, which allows us to select typical samples based on Bayesian uncertainty and interpolate scores from typical samples to other samples, to improve query efficiency. Empirical results demonstrate that our method significantly outperforms existing approaches under a low query budget. Notably, when detecting the text generated by LLaMA family models, our method with just 2 or 3 queries can outperform DetectGPT with 200 queries. 
2024.findings-acl.366 @@ -11129,7 +11129,7 @@ Decoding the Narratives: Analyzing Personal Drug Experiences Shared on <fixed-case>R</fixed-case>eddit - LaylaBouzoubaaDrexel University + LaylaBouzoubaaDrexel University ElhamAghakhani MaxSong QuangTrinh @@ -11145,7 +11145,7 @@ ShaoboCuiEPFL - EPF Lausanne YiyangFeng YisongMao - YifanHouDepartment of Computer Science, Swiss Federal Institute of Technology + YifanHouDepartment of Computer Science, Swiss Federal Institute of Technology BoiFaltings 6149-6174 Crafting an appealing heading is crucial for attracting readers and marketing work or products. A popular way is to summarize the main idea with a refined description and a memorable acronym. However, there lacks a systematic study and a formal benchmark including datasets and metrics. Motivated by this absence, we introduce LOgogram, a novel benchmark comprising 6,653 paper abstracts with corresponding descriptions and acronyms. To measure the quality of heading generation, we propose a set of evaluation metrics from three aspects: summarization, neology, and algorithm. Additionally, we explore three strategies for heading generation(generation ordering, tokenization of acronyms, and framework design) under various prevalent learning paradigms(supervised fine-tuning, in-context learning with Large Language Models(LLMs), and reinforcement learning) on our benchmark. Our experimental results indicate the difficulty in identifying a practice that excels across all summarization, neologistic, and algorithmic aspects. 
@@ -11156,9 +11156,9 @@ Understanding Fine-grained Distortions in Reports of Scientific Findings AmelieWuehrlUniversity of Stuttgart, Universität Stuttgart - DustinWrightUniversity of Copenhagen - RomanKlingerOtto-Friedrich Universität Bamberg - IsabelleAugensteinUniversity of Copenhagen + DustinWrightUniversity of Copenhagen + RomanKlingerOtto-Friedrich Universität Bamberg + IsabelleAugensteinUniversity of Copenhagen 6175-6191 Distorted science communication harms individuals and society as it can lead to unhealthy behavior change and decrease trust in scientific institutions. Given the rapidly increasing volume of science communication in recent years, a fine-grained understanding of how findings from scientific publications are reported to the general public, and methods to detect distortions from the original work automatically, are crucial. Prior work focused on individual aspects of distortions or worked with unpaired data. In this work, we make three foundational contributions towards addressing this problem: (1) annotating 1,600 instances of scientific findings from academic papers paired with corresponding findings as reported in news articles and tweets wrt. four characteristics: causality, certainty, generality and sensationalism; (2) establishing baselines for automatically detecting these characteristics; and (3) analyzing the prevalence of changes in these characteristics in both human-annotated and large-scale unlabeled data. Our results show that scientific findings frequently undergo subtle distortions when reported. Tweets distort findings more often than science news reports. Detecting fine-grained distortions automatically poses a challenging task. In our experiments, fine-tuned task-specific models consistently outperform few-shot LLM prompting. 
2024.findings-acl.369 @@ -11167,11 +11167,11 @@ <fixed-case>MM</fixed-case>-<fixed-case>SOC</fixed-case>: Benchmarking Multimodal Large Language Models in Social Media Platforms - YiqiaoJin + YiqiaoJin MinjeChoiGeorgia Institute of Technology - GauravVermaGeorgia Institute of Technology + GauravVermaGeorgia Institute of Technology JindongWangMicrosoft Research - SrijanKumarGeorgia Institute of Technology + SrijanKumarGeorgia Institute of Technology 6192-6210 Social media platforms are hubs for multimodal information exchange, encompassing text, images, and videos, making it challenging for machines to comprehend the information or emotions associated with interactions in online spaces. Multimodal Large Language Models (MLLMs) have emerged as a promising solution to address these challenges, yet struggle with accurately interpreting human emotions and complex contents like misinformation. This paper introduces MM-Soc, a comprehensive benchmark designed to evaluate MLLMs’ understanding of multimodal social media content. MM-Soc compiles prominent multimodal datasets and incorporates a novel large-scale YouTube tagging dataset, targeting a range of tasks from misinformation detection, hate speech detection, and social context generation. Through our exhaustive evaluation on ten size-variants of four open-source MLLMs, we have identified significant performance disparities, highlighting the need for advancements in models’ social understanding capabilities. Our analysis reveals that, in a zero-shot setting, various types of MLLMs generally exhibit difficulties in handling social media tasks. However, MLLMs demonstrate performance improvements post fine-tuning, suggesting potential pathways for improvement. 
2024.findings-acl.370 @@ -11182,7 +11182,7 @@ Instances Need More Care: Rewriting Prompts for Instances with <fixed-case>LLM</fixed-case>s in the Loop Yields Better Zero-Shot Performance SaurabhSrivastavaGeorge Mason University ChengyueHuang - WeiguoFanUniversity of Iowa + WeiguoFanUniversity of Iowa ZiyuYaoGeorge Mason University 6211-6232 Large language models (LLMs) have revolutionized zero-shot task performance, mitigating the need for task-specific annotations while enhancing task generalizability. Despite its advancements, current methods using trigger phrases such as “Let’s think step by step” remain limited. This study introduces PRomPTed, an approach that optimizes the zero-shot prompts for individual task instances following an innovative manner of “LLMs in the loop”.Our comprehensive evaluation across 13 datasets and 10 task types based on GPT-4 reveals that PRomPTed significantly outperforms both the naive zero-shot approaches and a strong baseline (i.e., “Output Refinement”) which refines the task output instead of the input prompt. Our experimental results also confirmed the generalization of this advantage to the relatively weaker GPT-3.5. Even more intriguingly, we found that leveraging GPT-3.5 to rewrite prompts for the stronger GPT-4 not only matches but occasionally exceeds the efficacy of using GPT-4 as the prompt rewriter. Our research thus presents a huge value in not only enhancing zero-shot LLM performance but also potentially enabling supervising LLMs with their weaker counterparts, a capability attracting much interest recently. Finally, our additional experiments confirm the generalization of the advantages to open-source LLMs such as Mistral 7B and Mixtral 8x7B. 
@@ -11192,8 +11192,8 @@ Benchmarking Retrieval-Augmented Generation for Medicine - GuangzhiXiong - QiaoJinNational Institutes of Health + GuangzhiXiong + QiaoJinNational Institutes of Health ZhiyongLuNational Institutes of Health AidongZhang 6233-6251 @@ -11207,9 +11207,9 @@ RuibinYuan HanfengLinBeijing Jiaotong University YiWang - ZeyueTianHong Kong University of Science and Technology + ZeyueTianHong Kong University of Science and Technology ShangdaWu - TianhaoShen + TianhaoShen GeZhang YuhangWu CongLiu @@ -11218,19 +11218,19 @@ ZiyangMa QinLiu TianyuZheng - YizhiLiUniversity of Manchester and University of Sheffield - YinghaoMaQueen Mary University of London + YizhiLiUniversity of Manchester and University of Sheffield + YinghaoMaQueen Mary University of London YimingLiang XiaoweiChi RuiboLiuGoogle DeepMind ZiliWang ChenghuaLinUniversity of Manchester - QifengLiuThe Hong Kong University of Science and Technology + QifengLiuThe Hong Kong University of Science and Technology TaoJiang WenhaoHuang WenhuChenUniversity of Waterloo and Google - JieFuHong Kong University of Science and Technology - EmmanouilBenetos + JieFuHong Kong University of Science and Technology + EmmanouilBenetos GusXiaNew York University RogerDannenbergCarnegie Mellon University WeiXueHong Kong University of Science and Technology @@ -11266,7 +11266,7 @@ Knowledge Graph-Enhanced Large Language Models via Path Selection HaochenLiu - SongWangUniversity of Virginia + SongWangUniversity of Virginia YaochenZhu YushunDong JundongLiUniversity of Virginia @@ -11278,12 +11278,12 @@ <fixed-case>OTTAWA</fixed-case>: Optimal <fixed-case>T</fixed-case>ranspor<fixed-case>T</fixed-case> Adaptive Word Aligner for Hallucination and Omission Translation Errors Detection - ChenyangHuang + ChenyangHuang AbbasGhaddarHuawei Technologies Ltd. IvanKobyzevHuawei Noah’s Ark Lab MehdiRezagholizadeh - OsmarZaianeUniversity of Alberta - BoxingChenHuawei Technologies Ltd. 
+ OsmarZaianeUniversity of Alberta + BoxingChenHuawei Technologies Ltd. 6322-6334 Recently, there has been considerable attention on detecting hallucinations and omissions in Machine Translation (MT) systems. The two dominant approaches to tackle this task involve analyzing the MT system’s internal states or relying on the output of external tools, such as sentence similarity or MT quality estimators. In this work, we introduce OTTAWA, a novel Optimal Transport (OT)-based word aligner specifically designed to enhance the detection of hallucinations and omissions in MT systems. Our approach explicitly models the missing alignments by introducing a “null” vector, for which we propose a novel one-side constrained OT setting to allow an adaptive null alignment. Our approach yields competitive results compared to state-of-the-art methods across 18 language pairs on the HalOmi benchmark. In addition, it shows promising features, such as the ability to distinguish between both error types and perform word-level detection without accessing the MT system’s internal states. 
2024.findings-acl.377 @@ -11294,7 +11294,7 @@ <fixed-case>ONSEP</fixed-case>: A Novel Online Neural-Symbolic Framework for Event Prediction Based on Large Language Model XuanqingYuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences WangtaoSun - JingweiLi + JingweiLi KangLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences ChengbaoLiuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences JieTan @@ -11320,7 +11320,7 @@ Too Big to Fail: Larger Language Models are Disproportionately Resilient to Induction of Dementia-Related Linguistic Anomalies - ChangyeLiUniversity of Washington + ChangyeLiUniversity of Washington ZhechengSheng TrevorCohenUniversity of Washington SergueiPakhomovUniversity of Minnesota - Twin Cities @@ -11345,7 +11345,7 @@ <fixed-case>TRAM</fixed-case>: Benchmarking Temporal Reasoning for Large Language Models YuqingWangStanford University - YunZhaoMeta Platforms, Inc + YunZhaoMeta Platforms, Inc 6389-6415 Reasoning about time is essential for understanding the nuances of events described in natural language. Previous research on this topic has been limited in scope, characterized by a lack of standardized benchmarks that would allow for consistent evaluations across different studies. In this paper, we introduce TRAM, a temporal reasoning benchmark composed of ten datasets, encompassing various temporal aspects of events such as order, arithmetic, frequency, and duration, designed to facilitate a comprehensive evaluation of the TeR capabilities of large language models (LLMs). We evaluate popular LLMs like GPT-4 and Llama2 in zero-shot and few-shot scenarios, and establish baselines with BERT-based and domain-specific models. Our findings indicate that the best-performing model lags significantly behind human performance. It is our aspiration that TRAM will spur further progress in enhancing the TeR capabilities of LLMs. 
2024.findings-acl.382 @@ -11371,7 +11371,7 @@ LazarMilikicEPFL - EPF Lausanne YiyangFeng MeteIsmayilzadaEPFL - EPF Lausanne - DebjitPaulEPFL - EPF Lausanne + DebjitPaulEPFL - EPF Lausanne AntoineBosselutSwiss Federal Institute of Technology Lausanne BoiFaltings 6433-6452 @@ -11398,7 +11398,7 @@ YanzhengXiang HanqiYan LinGuiKing’s College London, University of London - YulanHeKing’s College London, University of London + YulanHeKing’s College London, University of London 6467-6481 In-context learning has become a popular paradigm in natural language processing. However, its performance can be significantly influenced by the order of in-context demonstration examples. In this paper, we found that causal language models (CausalLMs) are more sensitive to this order compared to prefix language models (PrefixLMs). We attribute this phenomenon to the auto-regressive attention masks within CausalLMs, which restrict each token from accessing information from subsequent tokens. This results in different receptive fields for samples at different positions, thereby leading to representation disparities across positions. To tackle this challenge, we introduce an unsupervised fine-tuning method, termed the Information-Augmented and Consistency-Enhanced approach. This approach utilizes contrastive learning to align representations of in-context examples across different positions and introduces a consistency loss to ensure similar representations for inputs with different permutations. This enhances the model’s predictive consistency across permutations. Experimental results on five benchmarks suggest that our proposed method can reduce the sensitivity of CausalLMs to the order of in-context examples and exhibit robust generalizability, particularly when demonstrations are sourced from a candidate pool different from that used in the training phase, or when the number of in-context examples differs from what is used during training. 
2024.findings-acl.386 @@ -11408,8 +11408,8 @@ Perspective Taking through Generating Responses to Conflict Situations JoanPlepiRheinische Friedrich-Wilhelms Universität Bonn - CharlesWelchMcMaster University - LucieFlekRheinische Friedrich-Wilhelms Universität Bonn + CharlesWelchMcMaster University + LucieFlekRheinische Friedrich-Wilhelms Universität Bonn 6482-6497 Although language model performance across diverse tasks continues to improve, these models still struggle to understand and explain the beliefs of other people. This skill requires perspective-taking, the process of conceptualizing the point of view of another person. Perspective taking becomes challenging when the text reflects more personal and potentially more controversial beliefs.We explore this task through natural language generation of responses to conflict situations. We evaluate novel modifications to recent architectures for conditioning generation on an individual’s comments and self-disclosure statements. Our work extends the Social-Chem-101 corpus, using 95k judgements written by 6k authors from English Reddit data, for each of whom we obtained 20-500 self-disclosure statements. Our evaluation methodology borrows ideas from both personalized generation and theory of mind literature. Our proposed perspective-taking models outperform recent work, especially the twin encoder model conditioned on self-disclosures with high similarity to the conflict situation. 2024.findings-acl.387 @@ -11425,7 +11425,7 @@ ShengShenUniversity of California Berkeley GopalaAnumanchipalliUniversity of California, Berkeley MichaelMahoneyUniversity of California Berkeley - KurtKeutzerUniversity of California Berkeley + KurtKeutzerUniversity of California Berkeley AmirGholamiUniversity of California Berkeley 6498-6526 Pretrained large language models (LLMs) are currently state-of-the-art for solving the vast majority of natural language processing tasks. 
While many real-world applications still require fine-tuning to reach satisfactory levels of performance, many of them are in the low-data regime, making fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to enhance a small seed dataset by augmenting additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates and extracts data points that the model gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from incorrectly predicted data points by the LLM during training and reintegrates them into the dataset to focus on more challenging examples for the LLM. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular fine-tuning in the low-data regime using a Llama-2-7B student model. Our code is available at https://github.com/SqueezeAILab/LLM2LLM. 
@@ -11441,7 +11441,7 @@ SharonAdarAmazon MohitBansalUniversity of North Carolina at Chapel Hill JacobGoldbergerBar-Ilan University - RanLevyAmazon + RanLevyAmazon IdoDaganBar-Ilan University 6527-6548 Multi-document summarization (MDS) is a challenging task, often decomposed to subtasks of salience and redundancy detection, followed by text generation.In this context, alignment of corresponding sentences between a reference summary and its source documents has been leveraged to generate training data for some of the component tasks. Yet, this enabling alignment step has usually been applied heuristically on the sentence level on a limited number of subtasks.In this paper, we propose extending the summary-source alignment framework by (1) applying it at the more fine-grained proposition span level, (2) annotating alignment manually in a multi-document setup, and (3) revealing the great potential of summary-source alignments to yield several datasets for at least six different tasks. Specifically, for each of the tasks, we release a manually annotated test set that was derived automatically from the alignment annotation. We also release development and train sets in the same way, but from automatically derived alignments.Using the datasets, each task is demonstrated with baseline models and corresponding evaluation metrics to spur future research on this broad challenge. @@ -11484,7 +11484,7 @@ Text Simplification via Adaptive Teaching Seyed AliBahrainian JonathanDou - CarstenEickhoffEberhard-Karls-Universität Tübingen + CarstenEickhoffEberhard-Karls-Universität Tübingen 6574-6584 Text simplification is the process of rewriting a piece of text using simpler vocabulary and grammatical structure in order to make the text more accessible and understandable for a larger audience. In this paper, we introduce a new text simplification model based on the notion of adaptive teaching using a teacher network and a text generation network. 
We name this new model Simplification via Adaptive Teaching (SAT). Our proposed model sets a new state-of-the-art performance in terms of standard simplification metrics such as SARI and D-SARI with a significant improvement over the previous state of the art on the D-Wikipedia dataset and the Wiki-Doc benchmark dataset. Moreover, we conduct a human evaluation in terms of text simplicity, correctness, and fluency to substantiate SAT’s performance. 2024.findings-acl.392 @@ -11495,8 +11495,8 @@ A multi-level multi-label text classification dataset of 19th century Ottoman and <fixed-case>R</fixed-case>ussian literary and critical texts GokcenGokceogluMETU DevrimÇavuşoğlu - EmreAkbasMiddle East Technical University - ÖzenDolceroccaUniversity of Bologna + EmreAkbasMiddle East Technical University + ÖzenDolceroccaUniversity of Bologna 6585-6596 This paper introduces a multi-level, multi-label text classification dataset comprising over 3000 documents. The dataset features literary and critical texts from 19th-century Ottoman Turkish and Russian. It is the first study to apply large language models (LLMs) to this dataset, sourced from prominent literary periodicals of the era. The texts have been meticulously organized and labeled. This was done according to a taxonomic framework that takes into account both their structural and semantic attributes. Articles are categorized and tagged with bibliometric metadata by human experts. We present baseline classification results using a classical bag-of-words (BoW) naive Bayes model and three modern LLMs: multilingual BERT, Falcon, and Llama-v2. We found that in certain cases, Bag of Words (BoW) outperforms Large Language Models (LLMs), emphasizing the need for additional research, especially in low-resource language settings. This dataset is expected to be a valuable resource for researchers in natural language processing and machine learning, especially for historical and low-resource languages. 
The dataset is publicly available. 2024.findings-acl.393 @@ -11505,7 +11505,7 @@ It is Simple Sometimes: A Study On Improving Aspect-Based Sentiment Analysis Performance - LauraCabelloCopenhagen University + LauraCabelloCopenhagen University UchennaAkujuobiSony Research 6597-6610 Aspect-Based Sentiment Analysis (ABSA) involves extracting opinions from textual data about specific entities and their corresponding aspects through various complementary subtasks. Several prior research has focused on developing ad hoc designs of varying complexities for these subtasks. In this paper, we build upon the instruction tuned model proposed by Scaria et al. (2023), who present an instruction-based model with task descriptions followed by in-context examples on ABSA subtasks. We propose PFInstruct, an extension to this instruction learning paradigm by appending an NLP-related task prefix to the task description. This simple approach leads to improved performance across all tested SemEval subtasks, surpassing previous state-of-the-art (SOTA) on the ATE subtask (Rest14) by +3.28 F1-score, and on the AOOE subtask by an average of +5.43 F1-score across SemEval datasets. Furthermore, we explore the impact of the prefix-enhanced prompt quality on the ABSA subtasks and find that even a noisy prefix enhances model performance compared to the baseline. Our method also achieves competitive results on a biomedical domain dataset (ERSA). @@ -11515,10 +11515,10 @@ Whose Emotions and Moral Sentiments do Language Models Reflect? - ZihaoHe - SiyiGuo + ZihaoHe + SiyiGuo AshwinRao - KristinaLermanUniversity of Southern California and USC Information Sciences Institute + KristinaLermanUniversity of Southern California and USC Information Sciences Institute 6611-6631 Language models (LMs) are known to represent the perspectives of some social groups better than others, which may impact their performance, especially on subjective tasks such as content moderation and hate speech detection. 
To explore how LMs represent different perspectives, existing research focused on positional alignment, i.e., how closely the models mimic the opinions and stances of different groups, e.g., liberals or conservatives. However, human communication also encompasses emotional and moral dimensions. We define the problem of affective alignment, which measures how LMs’ emotional and moral tone represents those of different groups. By comparing the affect of responses generated by 36 LMs to the affect of Twitter messages written by two ideological groups, we observe significant misalignment of LMs with both ideological groups. This misalignment is larger than the partisan divide in the U.S. Even after steering the LMs towards specific ideological perspectives, the misalignment and liberal tendencies of the model persist, suggesting a systemic bias within LMs. 2024.findings-acl.395 @@ -11533,8 +11533,8 @@ JinlanFu QinyuanCheng JiashengYe - JunjieYe - XipengQiuFudan University + JunjieYe + XipengQiuFudan University XuanjingHuangFudan University 6632-6646 In the realm of Large Language Models (LLMs), users commonly employ diverse decoding strategies and adjust hyperparameters to control the generated text. However, a critical question emerges: Are LLMs conscious of the existence of these decoding strategies and capable of regulating themselves? The current decoding generation process often relies on empirical and heuristic manual adjustments to hyperparameters based on types of tasks and demands. However, this process is typically cumbersome, and the decoding hyperparameters may not always be optimal for each sample. To address the aforementioned challenges, we propose a novel text generation paradigm termed Hyperparameter Aware Generation (HAG). By leveraging hyperparameter-aware instruction tuning, the LLM autonomously determines the optimal decoding strategy and configs based on the input samples, enabling self-regulation. 
Our approach eliminates the need for extensive manual tuning, offering a more autonomous, self-regulating model behavior. Experimental results spanning six datasets across reasoning, creativity, translation, and mathematics tasks demonstrate that hyperparameter-aware instruction tuning empowers the LLMs to self-regulate the decoding strategy and hyperparameter. HAG extends the current paradigm in the text generation process, highlighting the feasibility of endowing the LLMs with self-regulated decoding strategies. @@ -11560,7 +11560,7 @@ Towards Uncertainty-Aware Language Agent JiuzhouHan - WrayBuntineVinUniversity + WrayBuntineVinUniversity EhsanShareghiMonash University and University of Cambridge 6662-6685 While Language Agents have achieved promising success by placing Large Language Models at the core of a more versatile design that dynamically interacts with the external world, the existing approaches neglect the notion of uncertainty during these interactions. We present the Uncertainty-Aware Language Agent (UALA), a framework that orchestrates the interaction between the agent and the external world using uncertainty quantification. Compared with other well-known counterparts like ReAct, our extensive experiments across 3 representative tasks (HotpotQA, StrategyQA, MMLU) and various LLM sizes demonstrate that UALA brings a significant improvement of performance, while having a substantially lower reliance on the external world (i.e., reduced number of tool calls and tokens). Our analyses provide various insights including the great potential of UALA compared with agent fine-tuning, and underscore the unreliability of verbalised confidence of LLMs as a proxy for uncertainty. 
@@ -11570,7 +11570,7 @@ Detection and Positive Reconstruction of Cognitive Distortion Sentences: <fixed-case>M</fixed-case>andarin Dataset and Evaluation - ShuyaLin + ShuyaLin YuxiongWang JonathanDong ShiguangNiTsinghua University, Tsinghua University @@ -11583,8 +11583,8 @@ <fixed-case>P</fixed-case>i<fixed-case>V</fixed-case>e: Prompting with Iterative Verification Improving Graph-based Generative Capability of <fixed-case>LLM</fixed-case>s JiuzhouHan - NigelCollierUniversity of Cambridge - WrayBuntineVinUniversity + NigelCollierUniversity of Cambridge + WrayBuntineVinUniversity EhsanShareghiMonash University and University of Cambridge 6702-6718 Large language models (LLMs) have shown great abilities of solving various natural language tasks in different domains. Due to the training objective of LLMs and their pre-training data, LLMs are not very well equipped for tasks involving structured data generation. We propose a framework, Prompting with Iterative Verification (PiVe), to improve graph-based generative capability of LLMs. We show how a small language model could be trained to act as a verifier module for the output of an LLM(i.e., ChatGPT, GPT-4), and to iteratively improve its performance via fine-grained corrective instructions. We also show how the verifier module could apply iterative corrections offline for a more cost-effective solution to the text-to-graph generation task. Experiments on three graph-based datasets show consistent improvement gained via PiVe. Additionally, we create GenWiki-HIQ and highlight that the verifier module can be used as a data augmentation tool to help improve the quality of automatically generated parallel text-graph datasets. 
@@ -11594,12 +11594,12 @@ Two-stage Generative Question Answering on Temporal Knowledge Graph Using Large Language Models - YifuGaoNational University of Defense Technology + YifuGaoNational University of Defense Technology LinboQiao ZhigangKanNational University of Defense Technology - ZhihuaWenNational University of Defence Technology - YongquanHe - DongshengLi + ZhihuaWenNational University of Defence Technology + YongquanHe + DongshengLi 6719-6734 Temporal knowledge graph question answering (TKGQA) poses a significant challenge task, due to the temporal constraints hidden in questions and the answers sought from dynamic structured knowledge. Although large language models (LLMs) have made considerable progress in their reasoning ability over structured data, their application to the TKGQA task is a relatively unexplored area. This paper first proposes a novel generative temporal knowledge graph question answering framework, GenTKGQA, which guides LLMs to answer temporal questions through two phases: Subgraph Retrieval and Answer Generation. First, we exploit LLM’s intrinsic knowledge to mine temporal constraints and structural links in the questions without extra training, thus narrowing down the subgraph search space in both temporal and structural dimensions. Next, we design virtual knowledge indicators to fuse the graph neural network signals of the subgraph and the text representations of the LLM in a non-shallow way, which helps the open-source LLM deeply understand the temporal order and structural dependencies among the retrieved facts through instruction tuning. Experimental results on two widely used datasets demonstrate the superiority of our model. 
2024.findings-acl.401 @@ -11611,7 +11611,7 @@ Syeda NahidaAkter SangwuLeeUniversity of Rochester YingshanChang - YonatanBiskMeta and Carnegie Mellon University + YonatanBiskMeta and Carnegie Mellon University EricNybergCarnegie Mellon University 6735-6752 Verifying a question’s validity before answering is crucial in real-world applications, where users may provide imperfect instructions. In this scenario, an ideal model should address the discrepancies in the query and convey them to the users rather than generating the best possible answer. Addressing this requirement, we introduce a new compositional visual question-answering dataset, VisReas, that consists of answerable and unanswerable visual queries formulated by traversing and perturbing commonalities and differences among objects, attributes, and relations. VisReas contains 2.07M semantically diverse queries generated automatically using Visual Genome scene graphs. The unique feature of this task, validating question answerability with respect to an image before answering, and the poor performance of state-of-the-art models inspired the design of a new modular baseline, Logic2Vision that reasons by producing and executing pseudocode without any external modules to generate the answer. Logic2Vision outperforms generative models in VisReas (+4.82% over LLaVA-1.5; +12.23% over InstructBLIP) and achieves a significant gain in performance against the classification models. 
@@ -11622,7 +11622,7 @@ A Unified Generative Framework for Bilingual Euphemism Detection and Identification YuxueHu - JunsongLi + JunsongLi TongguanWang DongyuSu GuixinSu @@ -11638,7 +11638,7 @@ GaoxiangCong YuankaiQiMacquarie University LiangLi - AminBeheshtiMacquarie University + AminBeheshtiMacquarie University ZhedongZhang AntonHengelUniversity of Adelaide Ming-HsuanYangGoogle and University of California at Merced @@ -11652,8 +11652,8 @@ <fixed-case>ETAS</fixed-case>: Zero-Shot Transformer Architecture Search via Network Trainability and Expressivity - JiechaoYangRenmin University of China - YongLiuRenmin University of China and Institute of information engineering, CAS + JiechaoYangRenmin University of China + YongLiuRenmin University of China and Institute of information engineering, CAS 6780-6795 Transformer Architecture Search (TAS) methods aim to automate searching for the optimal Transformer architecture configurations for a given task. However, they are impeded by the prohibitive cost of evaluating Transformer architectures. Recently, several Zero-Shot TAS methods have been proposed to mitigate this problem by utilizing zero-cost proxies to evaluate Transformer architectures without training. Unfortunately, they are limited to specific computer vision or natural language processing tasks. Nonetheless, most of them are developed based on empirical observations and lack theoretical guarantees. To solve this problem, we develop a new zero-cost proxy called NTSR that combines two theoretically-inspired indicators to measure the trainability and expressivity of Transformer networks separately. We then integrate it into an effective regularized evolution framework called ETAS to demonstrate its efficacy on various tasks. The results show that our proposed NTSR proxy can consistently achieve a higher correlation with the true performance of Transformer networks on both computer vision and natural language processing tasks. 
Further, it can significantly accelerate the search process for finding the best-performing Transformer architecture configurations. 2024.findings-acl.405 @@ -11665,8 +11665,8 @@ KaishuaiXuHong Kong Polytechnic University YiChengThe Hong Kong Polytechnic University WenjunHou - QiaoyuTanNew York University Shanghai - WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + QiaoyuTanNew York University Shanghai + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University 6796-6814 Medical dialogue systems have attracted significant attention for their potential to act as medical assistants. Enabling these medical systems to emulate clinicians’ diagnostic reasoning process has been the long-standing research focus. Previous studies rudimentarily realized the simulation of clinicians’ diagnostic process by fine-tuning language models on high-quality dialogue datasets. Nonetheless, they overly focus on the outcomes of the clinician’s reasoning process while ignoring their internal thought processes and alignment with clinician preferences. Our work aims to build a medical dialogue system that aligns with clinicians’ diagnostic reasoning processes. We propose a novel framework, Emulation, designed to generate an appropriate response that relies on abductive and deductive diagnostic reasoning analyses and aligns with clinician preferences through thought process modeling. Experimental results on two datasets confirm the efficacy of Emulation. Crucially, our framework furnishes clear explanations for the generated responses, enhancing its transparency in medical consultations. 
2024.findings-acl.406 @@ -11676,18 +11676,18 @@ <fixed-case>C</fixed-case>oncept<fixed-case>M</fixed-case>ath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models YananWu - JieLiuThe Chinese University of Hong Kong - XingyuanBuAlibaba Group + JieLiuThe Chinese University of Hong Kong + XingyuanBuAlibaba Group JiahengLiu ZhanhuiZhouShanghai Artificial Intelligence Laboratory - YuanxingZhang + YuanxingZhang ChenchenZhangBeijing University of Posts and Telecommunications ZhiqiBaiZhiqiBai HaibinChen - TiezhengGeAlibaba Group - WanliOuyangShanghai AI Lab + TiezhengGeAlibaba Group + WanliOuyangShanghai AI Lab WenboSu - BoZhengAlibaba Group + BoZhengAlibaba Group 6815-6839 This paper introduces ConceptMath, a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with an average accuracy, ConceptMath systemically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularity with concept-wise accuracies. Based on our ConcepthMath, we then evaluate a broad range of LLMs, and we observe existing LLMs, though achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones. Besides, we also introduce an efficient fine-tuning strategy to enhance the weaknesses of existing LLMs. Finally, we hope ConceptMath could guide the developers to understand the fine-grained mathematical abilities of their models and facilitate the growth of foundation models. Code is available at https://github.com/conceptmath/conceptmath. 
2024.findings-acl.407 @@ -11698,7 +11698,7 @@ <fixed-case>REI</fixed-case>nstruct: Building Instruction Data from Unlabeled Corpus ShuChen XinyanGuan - YaojieLuInstitute of Software, Chinese Academy of Sciences + YaojieLuInstitute of Software, Chinese Academy of Sciences HongyuLinInstitute of Software, Chinese Academy of Sciences XianpeiHanInstitute of Software, CAS LeSunInstitute of Software, Chinese Academy of Sciences @@ -11710,10 +11710,10 @@ Learning to Maximize Mutual Information for Chain-of-Thought Distillation - XinChenIntel Corp + XinChenIntel Corp HanxianHuang - YanjunGaoUniversity of Colorado Anschutz Medical Campus - YiWang + YanjunGaoUniversity of Colorado Anschutz Medical Campus + YiWang JishenZhaoUniversity of California, San Diego KeDingIntel 6857-6868 @@ -11727,7 +11727,7 @@ ZhishengLin HanFu ChenghaoLiuSalesForce.com - ZhuoLiZhejiang University + ZhuoLiZhejiang University JianlingSun 6869-6883 Parameter-efficient fine-tuning (PEFT) has emerged as an effective method for adapting pre-trained language models to various tasks efficiently. Recently, there has been a growing interest in transferring knowledge from one or multiple tasks to the downstream target task to achieve performance improvements. However, current approaches typically either train adapters on individual tasks or distill shared knowledge from source tasks, failing to fully exploit task-specific knowledge and the correlation between source and target tasks. To overcome these limitations, we propose PEMT, a novel parameter-efficient fine-tuning framework based on multi-task transfer learning. PEMT extends the mixture-of-experts (MoE) framework to capture the transferable knowledge as a weighted combination of adapters trained on source tasks. These weights are determined by a gated unit, measuring the correlation between the target and each source task using task description prompt vectors. 
To fully exploit the task-specific knowledge, we also propose the Task Sparsity Loss to improve the sparsity of the gated unit. We conduct experiments on a broad range of tasks over 17 datasets. The experimental results demonstrate our PEMT yields stable improvements over full fine-tuning, and state-of-the-art PEFT and knowledge transferring methods on various tasks. The results highlight the effectiveness of our method which is capable of sufficiently exploiting the knowledge and correlation features across multiple tasks. @@ -11739,14 +11739,14 @@ <fixed-case>M</fixed-case>ath<fixed-case>B</fixed-case>ench: Evaluating the Theory and Application Proficiency of <fixed-case>LLM</fixed-case>s with a Hierarchical Mathematics Benchmark HongweiLiu ZilongZheng - YuxuanQiao + YuxuanQiao HaodongDuanShanghai Artificial Intelligence Laboratory ZhiweiFeiFudan University, Harbin Institute of Technology, Dalian University of Technology, Shanghai Jiaotong University, Shandong University, Peking University, Zhejiang University, University of Science and Technology of China, Hunan University, Beijing Institute of Technology, University of the Chinese Academy of Sciences, Southeast University, Sichuan University, Monash University, Malaysia Campus, Tianjin University, Beijing University of Aeronautics and Astronautics, Wuhan University of Technology, Yale University, Technische Universität München, Wuhan University, nanjing university, Tsinghua University and Wuhan University FengzheZhou - WenweiZhangShanghai AI Laboratory + WenweiZhangShanghai AI Laboratory SongyangZhangShanghai AI Laboratory DahuaLinThe Chinese University of Hong Kong - KaiChenShanghai AI Laboratory + KaiChenShanghai AI Laboratory 6884-6915 Recent advancements in large language models (LLMs) have showcased significant improvements in mathematics. However, traditional math benchmarks like GSM8k offer a unidimensional perspective, which fall short in providing a holistic assessment of the LLMs’ math capabilities. 
To address this gap, we introduce MathBench, a new benchmark that rigorously assesses the mathematical capabilities of large language models. MathBench spans a wide range of mathematical disciplines, offering a detailed evaluation of both theoretical understanding and practical problem-solving skills. The benchmark progresses through five distinct stages, from basic arithmetic to college mathematics, and is structured to evaluate models at various depths of knowledge. Each stage includes theoretical questions and application problems, allowing us to measure a model’s mathematical proficiency and its ability to apply concepts in practical scenarios. MathBench aims to enhance the evaluation of LLMs’ mathematical abilities, providing a nuanced view of their knowledge understanding levels and problem solving skills in a bilingual context. 2024.findings-acl.411 @@ -11755,12 +11755,12 @@ Identifying Semantic Induction Heads to Understand In-Context Learning - JieRenShanghai Jiao Tong University + JieRenShanghai Jiao Tong University QipengGuoShanghai AI Laboratory HangYanAI lab - DongruiLiuShanghai Artificial Intelligence Laboratory + DongruiLiuShanghai Artificial Intelligence Laboratory QuanshiZhangShanghai Jiao Tong University - XipengQiuFudan University + XipengQiuFudan University DahuaLinThe Chinese University of Hong Kong 6916-6932 Although large language models (LLMs) have demonstrated remarkable performance, the lack of transparency in their inference logic raises concerns about their trustworthiness. To gain a better understanding of LLMs, we conduct a detailed analysis of the operations of attention heads and aim to better understand the in-context learning of LLMs. Specifically, we investigate whether attention heads encode two types of relationships between tokens present in natural languages: the syntactic dependency parsed from sentences and the relation within knowledge graphs. 
We find that certain attention heads exhibit a pattern where, when attending to subject tokens, they recall object tokens and increase the output logits of those object tokens. More crucially, the formulation of such semantic induction heads has a close correlation with the emergence of the in-context learning ability of language models. The study of semantic attention heads advances our understanding of the intricate operations of attention heads in transformers, and further provides new insights into the in-context learning of LLMs. @@ -11783,7 +11783,7 @@ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models JunfeiWu - QiangLiuInstitute of Automation, Chinese Academy of Sciences + QiangLiuInstitute of Automation, Chinese Academy of Sciences DingWang JinghaoZhang ShuWuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences @@ -11810,7 +11810,7 @@ XiChen SongyangZhangShanghai AI Laboratory QibingBai - KaiChenShanghai AI Laboratory + KaiChenShanghai AI Laboratory SatoshiNakamuraThe Chinese University of Hong Kong 6976-6987 We introduces ***LLaST***, a framework for building high-performance Large Language model based Speech-to-text Translation systems. We address the limitations of end-to-end speech translation (E2E ST) models by exploring model architecture design and optimization techniques tailored for LLMs. Our approach includes LLM-based speech translation architecture design, ASR-augmented training, multilingual data augmentation, and dual-LoRA optimization. Our approach demonstrates superior performance on the CoVoST-2 benchmark and showcases exceptional scaling capabilities powered by LLMs.We believe this effective method will serve as a strong baseline for speech translation and provide insights for futureimprovements of the LLM-based speech translation framework. 
@@ -11820,8 +11820,8 @@ Plan, Generate and Complicate: Improving Low-resource Dialogue State Tracking via Easy-to-Difficult Zero-shot Data Augmentation - MingGu - YanYangEast China Normal University + MingGu + YanYangEast China Normal University 6988-7005 Data augmentation methods have been a promising direction to improve the performance of small models for low-resource dialogue state tracking. However, traditional methods rely on pre-defined user goals and neglect the importance of data complexity in this task. In this paper, we propose EDZ-DA, an Easy-to-Difficult Zero-shot Data Augmentation framework for low-resource dialogue state tracking that utilizes large language models to automatically catch the relationships of different domains and then generate the dialogue data. We also complicate the dialogues based on the domain relation to enhance the model’s capability for co-reference slot tracking. Furthermore, we permute slot values to mitigate the influence of output orders and the problem of incomplete value generation. Experimental results illustrate the superiority of our proposed method compared to previous strong data augmentation baselines on MultiWOZ. 2024.findings-acl.417 @@ -11830,7 +11830,7 @@ <fixed-case>DM</fixed-case>o<fixed-case>ERM</fixed-case>: Recipes of Mixture-of-Experts for Effective Reward Modeling - ShanghaoranQuan + ShanghaoranQuan 7006-7028 The performance of the reward model (RM) is a critical factor in improving the effectiveness of the large language model (LLM) during alignment fine-tuning. There remain two challenges in RM training: 1) training the same RM using various categories of data may cause its generalization performance to suffer from multi-task disturbance, and 2) the human annotation consistency rate is generally only 60% to 75%, causing training data to contain a lot of noise. To tackle these two challenges, we introduced the idea of Mixture-of-Experts (MoE) into the field of RM for the first time. 
We propose the Double-Layer MoE RM (DMoERM). The outer layer MoE is a sparse model. After classifying an input into task categories, we route it to the corresponding inner layer task-specific model. The inner layer MoE is a dense model. We decompose the specific task into multiple capability dimensions and individually fine-tune a LoRA expert on each one. Their outputs are then synthesized by an MLP to compute the final rewards. To minimize costs, we call a public LLM API to obtain the capability preference labels. The validation on manually labeled datasets confirms that our model attains superior consistency with human preference and outstrips advanced generative approaches. Meanwhile, through BoN sampling and RL experiments, we demonstrate that our model outperforms state-of-the-art ensemble methods of RM and mitigates the overoptimization problem. Our code is available at: https://github.com/quanshr/DMoERM. 2024.findings-acl.418 @@ -11851,10 +11851,10 @@ Comments as Natural Logic Pivots: Improve Code Generation via Comment Perspective YijieChen YijinLiuWechat AI - FandongMengWeChat AI, Tencent Inc. + FandongMengWeChat AI, Tencent Inc. YufengChen JinanXuBeijing Jiaotong University - JieZhou + JieZhou 7040-7051 Code generation aims to understand the problem description and generate corresponding code snippets, where existing works generally decompose such complex tasks into intermediate steps by prompting strategies, such as Chain-of-Thought and its variants. While these studies have achieved some success, their effectiveness is highly dependent on the capabilities of advanced Large Language Models (LLMs) such as GPT-4, particularly in terms of API calls, which significantly limits their practical applicability. Consequently, how to enhance the code generation capabilities of small and medium-scale code LLMs without significantly increasing training costs is an appealing challenge. 
In this paper, we suggest that code comments are the natural logic pivot between natural language and code language and propose using comments to boost the code generation ability of code LLMs. Concretely, we propose MANGO (comMents As Natural loGic pivOts), including a comment contrastive training strategy and a corresponding logical comment decoding strategy. Experiments are performed on HumanEval and MBPP, utilizing StarCoder and WizardCoder as backbone models, and encompassing model parameter sizes between 3B and 7B. The results indicate that MANGO significantly improves the code pass rate based on the strong baselines. Meanwhile, the robustness of the logical comment decoding strategy is notably higher than the Chain-of-thoughts prompting. 2024.findings-acl.420 @@ -11863,15 +11863,15 @@ Cocktail: A Comprehensive Information Retrieval Benchmark with <fixed-case>LLM</fixed-case>-Generated Documents Integration - SunhaoDai - WeihaoLiuRenmin University of China - YuqiZhou - LiangPangInstitute of Computing Technology, Chinese Academy of Sciences + SunhaoDai + WeihaoLiuRenmin University of China + YuqiZhou + LiangPangInstitute of Computing Technology, Chinese Academy of Sciences RongjuRuanHuawei Technologies Ltd. - GangWangHuawei Technologies Ltd. + GangWangHuawei Technologies Ltd. ZhenhuaDong JunXuRenmin University of China - Ji-RongWenRenmin University of China + Ji-RongWenRenmin University of China 7052-7074 The proliferation of Large Language Models (LLMs) has led to an influx of AI-generated content (AIGC) on the internet, transforming the corpus of Information Retrieval (IR) systems from solely human-written to a coexistence with LLM-generated content. The impact of this surge in AIGC on IR systems remains an open question, with the primary challenge being the lack of a dedicated benchmark for researchers. In this paper, we introduce Cocktail, a comprehensive benchmark tailored for evaluating IR models in this mixed-sourced data landscape of the LLM era. 
Cocktail consists of 16 diverse datasets with mixed human-written and LLM-generated corpora across various text retrieval tasks and domains. Additionally, to avoid the potential bias from previously included dataset information in LLMs, we also introduce an up-to-date dataset, named NQ-UTD, with queries derived from recent events. Through conducting over 1,000 experiments to assess state-of-the-art retrieval models against the benchmarked datasets in Cocktail, we uncover a clear trade-off between ranking performance and source bias in neural retrieval models, highlighting the necessity for a balanced approach in designing future IR systems. We hope Cocktail can serve as a foundational resource for IR research in the LLM era, with all data and code publicly available at https://github.com/KID-22/Cocktail. 2024.findings-acl.421 @@ -11880,13 +11880,13 @@ Continual Dialogue State Tracking via Reason-of-Select Distillation - YujieFengHong Kong Polytechnic University + YujieFengHong Kong Polytechnic University BoLiu - XiaoyuDong - ZexinLuHong Kong Polytechnic University + XiaoyuDong + ZexinLuHong Kong Polytechnic University Li-MingZhanThe Hong Kong Polytechnic University Xiao-MingWuHong Kong Polytechnic University - AlbertLamUniversity of Hong Kong and Fano Labs + AlbertLamUniversity of Hong Kong and Fano Labs 7075-7087 An ideal dialogue system requires continuous skill acquisition and adaptation to new tasks while retaining prior knowledge. Dialogue State Tracking (DST), vital in these systems, often involves learning new services, confronting catastrophic forgetting and a critical capability loss termed the “Value Selection Quandary”. To address these challenges, we introduce the Reason-of-Select (RoS) distillation method by enhancing smaller models with a novel “meta-reasoning” capability. 
Meta-reasoning, employing an enhanced multi-domain perspective, combines fragments of meta-knowledge from domain-specific dialogues during continual learning, transcending traditional single-perspective reasoning. This domain bootstrapping process enhances the model’s ability to dissect intricate dialogues from multiple possible values, and its domain-agnostic property aligns data distribution across different domains, effectively mitigating forgetting. Besides, two novel improvements, “multi-value resolution” strategy and Semantic Contrastive Reasoning Selection method, significantly enhance RoS by generating DST-specific selection chains and mitigating hallucinations in teachers’ reasoning, ensuring effective and reliable knowledge transfer. Extensive experiments validate the exceptional performance and robust generalization capabilities of our method. 2024.findings-acl.422 @@ -11898,9 +11898,9 @@ YafuLiWestlake University ZhilinWang LeyangCui - WeiBiTencent AI Lab + WeiBiTencent AI Lab ShumingShiTencent AI Lab - YueZhangWestlake University + YueZhangWestlake University 7088-7107 AI-generated text detection has attracted increasing attention as powerful language models approach human-level generation. Limited work is devoted to detecting (partially) AI-paraphrased texts. However, AI paraphrasing is commonly employed in various application scenarios for text refinement and diversity. To this end, we propose a novel detection framework, paraphrased text span detection (PTD), aiming to identify paraphrased text spans within a text. Different from text-level detection, PTD takes in the full text and assigns each of the sentences with a score indicating the paraphrasing degree. We construct a dedicated dataset, PASTED, for paraphrased text span detection. Both in-distribution and out-of-distribution results demonstrate the effectiveness of PTD models in identifying AI-paraphrased text spans. 
Statistical and model analysis explains the crucial role of the surrounding context of the paraphrased text spans. Extensive experiments show that PTD models can generalize to versatile paraphrasing prompts as well as multiple paraphrased text spans. 2024.findings-acl.423 @@ -11910,8 +11910,8 @@ <fixed-case>S</fixed-case>o<fixed-case>FA</fixed-case>: Shielded On-the-fly Alignment via Priority Rule Following XinyuLu - BowenYuAlibaba Group - YaojieLuInstitute of Software, Chinese Academy of Sciences + BowenYuAlibaba Group + YaojieLuInstitute of Software, Chinese Academy of Sciences HongyuLinInstitute of Software, Chinese Academy of Sciences HaiyangYu LeSunInstitute of Software, Chinese Academy of Sciences @@ -11936,10 +11936,10 @@ Modeling Emotional Trajectories in Written Stories Utilizing Transformers and Weakly-Supervised Learning LukasChristUniversity of Augsburg, Universität Augsburg - ShahinAmiriparianTechnical University of Munich + ShahinAmiriparianTechnical University of Munich ManuelMillingUniversity of Augsburg - IlhanAslan - BjörnSchullerTechnische Universität München and Imperial College London + IlhanAslan + BjörnSchullerTechnische Universität München and Imperial College London 7144-7159 Telling stories is an integral part of human communication which can evoke emotions and influence the affective states of the audience. Automatically modeling emotional trajectories in stories has thus attracted considerable scholarly interest. However, as most existing works have been limited to unsupervised dictionary-based approaches, there is no benchmark for this task. We address this gap by introducing continuous valence and arousal labels for an existing dataset of children’s stories originally annotated with discrete emotion categories. We collect additional annotations for this data and map the categorical labels to the continuous valence and arousal space. 
For predicting the thus obtained emotionality signals, we fine-tune a DeBERTa model and improve upon this baseline via a weakly supervised learning approach. The best configuration achieves a Concordance Correlation Coefficient (CCC) of .8221 for valence and .7125 for arousal on the test set, demonstrating the efficacy of our proposed approach. A detailed analysis shows the extent to which the results vary depending on factors such as the author, the individual story, or the section within the story. In addition, we uncover the weaknesses of our approach by investigating examples that prove to be difficult to predict. 2024.findings-acl.426 @@ -11948,13 +11948,13 @@ <fixed-case>RAP</fixed-case>: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter - MengCaoMohamed bin Zayed University of Artificial Intelligence + MengCaoMohamed bin Zayed University of Artificial Intelligence HaoranTang JinfaHuang - PengJin - CanZhangTencent MediaLab + PengJin + CanZhangTencent MediaLab RuyangLiuPeking University - LongChenThe Hong Kong University of Science and Technology + LongChenThe Hong Kong University of Science and Technology XiaodanLiang LiYuanPeking University GeLiPeking University Shenzhen Graduate School @@ -11966,14 +11966,14 @@ Benchmarking and Improving Long-Text Translation with Large Language Models - LongyueWang + LongyueWang ZefengDu WenxiangJiaoTencent AI Lab ChenyangLyuMohamed bin Zayed University of Artificial Intelligence JianhuiPang LeyangCui KaiqiangSongTencent AI Lab - DerekWongUniversity of Macau + DerekWongUniversity of Macau ShumingShiTencent AI Lab ZhaopengTuTencent AI Lab 7175-7187 @@ -11984,11 +11984,11 @@ Personalized Topic Selection Model for Topic-Grounded Dialogue - ShixuanFan - WeiWeiHuazhong University of Science and Technology + ShixuanFan + WeiWeiHuazhong University of Science and Technology XiaofeiWen Xian-LingMaoBeijing Institute of Technology - JixiongChen + JixiongChen DangyangChenPingan Technology 7188-7202 Recently, the 
topic-grounded dialogue (TGD) system has become increasingly popular as its powerful capability to actively guide users to accomplish specific tasks through topic-guided conversations. Most existing works utilize side information (e.g. topics or personas) in isolation to enhance the topic selection ability. However, due to disregarding the noise within these auxiliary information sources and their mutual influence, current models tend to predict user-uninteresting and contextually irrelevant topics. To build user-engaging and coherent dialogue agent, we propose a personalized topic selection model for topic-grounded dialogue, named PETD, which takes account of the interaction of side information to selectively aggregate such information for more accurately predicting subsequent topics. Specifically, we evaluate the correlation between global topics and personas and selectively incorporate the global topics aligned with user personas. Furthermore, we propose a contrastive learning based persona selector to filter relevant personas under the constraint of lacking pertinent persona annotations. Throughout the selection and generation, diverse relevant side information is considered. Extensive experiments demonstrate that our proposed method can generate engaging and diverse responses, outperforming state-of-the-art baselines across various evaluation metrics. 
@@ -12001,10 +12001,10 @@ LvxueLi JiaqiChen XinyuLu - YaojieLuInstitute of Software, Chinese Academy of Sciences + YaojieLuInstitute of Software, Chinese Academy of Sciences HongyuLinInstitute of Software, Chinese Academy of Sciences ShuhengZhouAnt Group - HuijiaZhu + HuijiaZhu WeiqiangWangAnt Group ZhongyiLiuAnt Group XianpeiHanInstitute of Software, CAS @@ -12029,9 +12029,9 @@ <fixed-case>MS</fixed-case>2<fixed-case>SL</fixed-case>: Multimodal Spoken Data-Driven Continuous Sign Language Production JianMa - WenguanWangZhejiang University + WenguanWangZhejiang University YiYangZhejiang University - FengZhengSouthern University of Science and Technology + FengZhengSouthern University of Science and Technology 7241-7254 Sign language understanding has made significant strides; however, there is still no viable solution for generating sign sequences directlyfrom entire spoken content, e.g., text or speech. In this paper, we propose a unified framework for continuous sign language production, easing communication between sign and non-sign language users. In particular, a sequence diffusion model, utilizing embeddings extracted from text or speech, is crafted to generate sign predictions step by step. Moreover, by creating a joint embedding space for text, audio, and sign, we bind these modalities and leverage the semantic consistency among them to provide informative feedback for the model training. This embedding-consistency learning strategy minimizes the reliance on sign triplets and ensures continuous model refinement, evenwith a missing audio modality. Experiments on How2Sign and PHOENIX14T datasets demonstrate that our model achieves competitive performance in sign language production. 
2024.findings-acl.432 @@ -12044,9 +12044,9 @@ XintingHuangTencent AI Lab TingchenFu QintongLi - ShansanGong + ShansanGong LemaoLiuTencent - WeiBiTencent AI Lab + WeiBiTencent AI Lab LingpengKongDepartment of Computer Science, The University of Hong Kong 7255-7279 Multimodal reasoning stands as a pivotal capability for large vision-language models (LVLMs). The integration with Domain-Specific Languages (DSL), offering precise visual representations, equips these models with the opportunity to execute more accurate reasoning in complex and professional domains. However, the vanilla Chain-of-Thought (CoT) prompting method faces challenges in effectively leveraging the unique strengths of visual and DSL representations, primarily due to their differing reasoning mechanisms. Additionally, it often falls short in addressing critical steps in multi-step reasoning tasks. To mitigate these challenges, we introduce the Bi-Modal Behavioral Alignment (BBA) prompting method, designed to maximize the potential of DSL in augmenting complex multi-modal reasoning tasks. This method initiates by guiding LVLMs to create separate reasoning chains for visual and DSL representations. Subsequently, it aligns these chains by addressing any inconsistencies, thus achieving a cohesive integration of behaviors from different modalities. Our experiments demonstrate that BBA substantially improves the performance of GPT-4V(ision) on geometry problem solving (28.34% \to 34.22%), chess positional advantage prediction (42.08% \to 46.99%) and molecular property prediction (77.47% \to 83.52%). 
@@ -12056,7 +12056,7 @@ <fixed-case>P</fixed-case>artial<fixed-case>F</fixed-case>ormer: Modeling Part Instead of Whole for Machine Translation - TongZheng + TongZheng BeiLiMeituan HuiwenBaoNortheastern University JialeWang @@ -12071,12 +12071,12 @@ Self-Consistent Reasoning-based Aspect-Sentiment Quad Prediction with Extract-Then-Assign Strategy - JieyongKim + JieyongKim RyangHeoYonsei University YongsikSeoYonsei University - SeongKuKangUniversity of Illinois Urbana-Champaign - JinyoungYeoYonsei University - DonghaLeeYonsei University + SeongKuKangUniversity of Illinois Urbana-Champaign + JinyoungYeoYonsei University + DonghaLeeYonsei University 7295-7303 In the task of aspect sentiment quad prediction (ASQP), generative methods for predicting sentiment quads have shown promising results. However, they still suffer from imprecise predictions and limited interpretability, caused by data scarcity and inadequate modeling of the quadruplet composition process. In this paper, we propose Self-Consistent Reasoning-based Aspect sentiment quadruple Prediction (SCRAP), optimizing its model to generate reasonings and the corresponding sentiment quadruplets in sequence. SCRAP adopts the Extract-Then-Assign reasoning strategy, which closely mimics human cognition. In the end, SCRAP significantly improves the model’s ability to handle complex reasoning tasks and correctly predict quadruplets through consistency voting, resulting in enhanced interpretability and accuracy in ASQP. 2024.findings-acl.435 @@ -12088,7 +12088,7 @@ YihongDongPeking University KangchengLuoPeking University XueJiangPeking University - ZhiJinPeking University and Peking University + ZhiJinPeking University and Peking University GeLiPeking University Shenzhen Graduate School 7304-7323 Large language models (LLMs) have showcased remarkable potential across various tasks by conditioning on prompts. 
However, the quality of different human-written prompts leads to substantial discrepancies in LLMs’ performance, and improving prompts usually necessitates considerable human effort and expertise. To this end, this paper proposes Prompt with Actor-Critic Editing (PACE) for LLMs to enable automatic prompt editing. Drawing inspiration from the actor-critic algorithm in reinforcement learning, PACE leverages LLMs as the dual roles of actors and critics, conceptualizing prompt as a type of policy. PACE refines prompt, taking into account the feedback from both actors performing prompt and critics criticizing response. This process helps LLMs better align prompt to a specific task, thanks to real responses and thinking from LLMs. We conduct extensive experiments on 24 instruction induction tasks and 21 big-bench tasks. Experimental results indicate that PACE elevates the relative performance of medium/low-quality human-written prompts by up to 98%, which has comparable performance to high-quality human-written prompts. Moreover, PACE also exhibits notable efficacy for prompt generation. @@ -12099,10 +12099,10 @@ Penetrative <fixed-case>AI</fixed-case>: Making <fixed-case>LLM</fixed-case>s Comprehend the Physical World HuataoXuHong Kong University of Science and Technology - LiyingHanUniversity of California, Los Angeles + LiyingHanUniversity of California, Los Angeles QiruiYangDepartment of Computer Science and Engineering, Hong Kong University of Science and Technology MoLiThe Hong Kong University of Science and Technology and National Technological University - ManiSrivastavaAmazon and University of California, Los Angeles + ManiSrivastavaAmazon and University of California, Los Angeles 7324-7341 Recent developments in Large Language Models (LLMs) have demonstrated their remarkable capabilities across a range of tasks. 
Questions, however, persist about the nature of LLMs and their potential to integrate common-sense human knowledge when performing tasks involving information about the real physical world. This paper delves into these questions by exploring how LLMs can be extended to interact with and reason about the physical world through IoT sensors and actuators, a concept that we term “Penetrative AI”. The paper explores such an extension at two levels of LLMs’ ability to penetrate into the physical world via the processing of sensory signals. Our preliminary findings indicate that LLMs, with ChatGPT being the representative example in our exploration, have considerable and unique proficiency in employing the embedded world knowledge for interpreting IoT sensor data and reasoning over them about tasks in the physical realm. Not only this opens up new applications for LLMs beyond traditional text-based tasks, but also enables new ways of incorporating human knowledge in cyber-physical systems. 2024.findings-acl.437 @@ -12112,11 +12112,11 @@ The Impact of Demonstrations on Multilingual In-Context Learning: A Multidimensional Analysis MiaoranZhangSaarland University - VagrantGautamSaarland University + VagrantGautamSaarland University MingyangWang - JesujobaAlabiUniversität des Saarlandes + JesujobaAlabiUniversität des Saarlandes XiaoyuShenAmazon - DietrichKlakowSaarland University + DietrichKlakowSaarland University MariusMosbachMcGill University and Mila - Quebec Artificial Intelligence Institute 7342-7371 In-context learning is a popular inference strategy where large language models solve a task using only a few labeled demonstrations without needing any parameter updates. Although there have been extensive studies on English in-context learning, multilingual in-context learning remains under-explored, and we lack an in-depth understanding of the role of demonstrations in this context. 
To address this gap, we conduct a multidimensional analysis of multilingual in-context learning, experimenting with 5 models from different model families, 9 datasets covering classification and generation tasks, and 56 typologically diverse languages. Our results reveal that the effectiveness of demonstrations varies significantly across models, tasks, and languages. We also find that strong instruction-following models including Llama 2-Chat, GPT-3.5, and GPT-4 are largely insensitive to the quality of demonstrations. Instead, a carefully crafted template often eliminates the benefits of demonstrations for some tasks and languages altogether. These findings show that the importance of demonstrations might be overestimated. Our work highlights the need for granular evaluation across multiple axes towards a better understanding of in-context learning. @@ -12126,10 +12126,10 @@ Rich Semantic Knowledge Enhanced Large Language Models for Few-shot <fixed-case>C</fixed-case>hinese Spell Checking - MingDongCentral China Normal University + MingDongCentral China Normal University YujingChenCentral China Normal University MiaoZhang - HaoSun + HaoSun TingtingHeCentral China Normal University 7372-7383 2024.findings-acl.439 @@ -12138,7 +12138,7 @@ An Empirical Study of In-context Learning in <fixed-case>LLM</fixed-case>s for Machine Translation - PranjalChitaleMicrosoft Research + PranjalChitaleMicrosoft Research JayGalaMohamed bin Zayed University of Artificial Intelligence RajDabreNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology 7384-7406 @@ -12153,9 +12153,9 @@ BoleiMaLudwig-Maximilians-Universität München ChengzhiHu LeonWeber-GenzelLudwig-Maximilians-Universität München - PaulRöttgerBocconi University + PaulRöttgerBocconi University FraukeKreuterUniversity of Maryland, College Park - DirkHovyBocconi University + DirkHovyBocconi University BarbaraPlankLudwig-Maximilians-Universität 
München and IT University of Copenhagen 7407-7416 The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first-tokens may not consistently reflect the final response output, due to model’s diverse response styles such as starting with “Sure” or refusing to answer. Consequently, first-token evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation. 
@@ -12191,7 +12191,7 @@ A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning PanagiotisKaliosis - JohnPavlopoulosAthens University of Economics and Business + JohnPavlopoulosAthens University of Economics and Business FoivosCharalampakosAthens University of Economics and Business GeorgiosMoschovis IonAndroutsopoulosAthens University of Economics and Business @@ -12205,7 +12205,7 @@ HengyuanZhang YanruWu DaweiLi - SakYangUniversity of the Chinese Academy of Sciences + SakYangUniversity of the Chinese Academy of Sciences RuiZhaoQing Yuan Research Institute, Shanghai Jiao Tong University and SenseTime Research YongJiangTsinghua University FeiTanSensetime Research @@ -12231,8 +12231,8 @@ Light-<fixed-case>PEFT</fixed-case>: Lightening Parameter-Efficient Fine-Tuning via Early Pruning NaibinGu - PengFuInstitute of Information Engineering, Chinese Academy of Sciences - XiyuLiu + PengFuInstitute of Information Engineering, Chinese Academy of Sciences + XiyuLiu BowenShenUniversity of the Chinese Academy of Sciences ZhengLinInstitute of Information Engineering, Chinese Academy of Sciences WeipingWang @@ -12244,7 +12244,7 @@ Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into <fixed-case>G</fixed-case>erman - ManuelLardelli + ManuelLardelli GiuseppeAttanasioInstituto de Telecomunicações AnneLauscherUniversität Hamburg 7542-7550 @@ -12257,8 +12257,8 @@ Prompt Chaining or Stepwise Prompt? Refinement in Text Summarization ShichaoSunThe Hong Kong Polytechnic University RuifengYuan - ZiqiangCao - WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + ZiqiangCao + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University PengfeiLiu 7551-7558 2024.findings-acl.449 @@ -12269,8 +12269,8 @@ Trust in Internal or External Knowledge? Generative Multi-Modal Entity Linking with Knowledge Retriever XinweiLong JialiZeng - FandongMengWeChat AI, Tencent Inc. 
- JieZhou + FandongMengWeChat AI, Tencent Inc. + JieZhou BowenZhouTsinghua University 7559-7569 Multi-modal entity linking (MEL) is a challenging task that requires accurate prediction of entities within extensive search spaces, utilizing multi-modal contexts. Existing generative approaches struggle with the knowledge gap between visual entity information and the intrinsic parametric knowledge of LLMs. To address this knowledge gap, we introduce a novel approach called GELR, which incorporates a knowledge retriever to enhance visual entity information by leveraging external sources. Additionally, we devise a prioritization scheme that effectively handles noisy retrieval results and manages conflicts arising from the integration of external and internal knowledge. Moreover, we propose a noise-aware instruction tuning technique during training to finely adjust the model’s ability to leverage retrieved information effectively. Through extensive experiments conducted on three benchmarks, our approach showcases remarkable improvements, ranging from 3.0% to 6.5%, across all evaluation metrics compared to strong baselines. These results demonstrate the effectiveness and superiority of our proposed method in tackling the complexities of multi-modal entity linking. 
@@ -12281,7 +12281,7 @@ A Semantic Distance Metric Learning approach for Lexical Semantic Change Detection TaichiAidaTokyo Metropolitan University - DanushkaBollegalaAmazon and University of Liverpool + DanushkaBollegalaAmazon and University of Liverpool 7570-7584 Detecting temporal semantic changes of words is an important task for various NLP applications that must make time-sensitive predictions. Lexical Semantic Change Detection (SCD) task involves predicting whether a given target word, w, changes its meaning between two different text corpora, C_1 and C_2. For this purpose, we propose a supervised two-staged SCD method that uses existing Word-in-Context (WiC) datasets. In the first stage, for a target word w, we learn two sense-aware encoders that represent the meaning of w in a given sentence selected from a corpus. Next, in the second stage, we learn a sense-aware distance metric that compares the semantic representations of a target word across all of its occurrences in C_1 and C_2. Experimental results on multiple benchmark datasets for SCD show that our proposed method achieves strong performance in multiple languages. Additionally, our method achieves significant improvements on WiC benchmarks compared to a sense-aware encoder with conventional distance functions. 2024.findings-acl.451 @@ -12294,7 +12294,7 @@ HuajianZhang JianhaoYanWestlake University YongjingYin - YueZhangWestlake University + YueZhangWestlake University 7585-7606 Recent advances have made non-autoregressive (NAT) translation comparable to autoregressive methods (AT). However, their evaluation using BLEU has been shown to weakly correlate with human annotations. Limited research compares non-autoregressive translation and autoregressive translation comprehensively, leaving uncertainty about the true proximity of NAT to AT. To address this gap, we systematically evaluate four representative NAT methods across various dimensions, including human evaluation. 
Our empirical results demonstrate that despite narrowing the performance gap, state-of-the-art NAT still underperforms AT under more reliable evaluation metrics. Furthermore, we discover that explicitly modeling dependencies is crucial for generating natural language and generalizing to out-of-distribution sequences. 2024.findings-acl.452 @@ -12304,7 +12304,7 @@ From Zero to Hero: Cold-Start Anomaly Detection TalReissHebrew University of Jerusalem - GeorgeKourInternational Business Machines + GeorgeKourInternational Business Machines NaamaZwerdling AteretAnaby TavorInternational Business Machines YedidHoshenGoogle and Hebrew University of Jerusalem @@ -12316,11 +12316,11 @@ Large Language Models Fall Short: Understanding Complex Relationships in Detective Narratives RuncongZhao - QinglinZhuKing’s College London, University of London + QinglinZhuKing’s College London, University of London HainiuXu JiazhengLiKing’s College London, University of London YuxiangZhouKing’s College London - YulanHeKing’s College London, University of London + YulanHeKing’s College London, University of London LinGuiKing’s College London, University of London 7618-7638 Existing datasets for narrative understanding often fail to represent the complexity and uncertainty of relationships in real-life social scenarios. To address this gap, we introduce a new benchmark, Conan, designed for extracting and analysing intricate character relation graphs from detective narratives. Specifically, we designed hierarchical relationship categories and manually extracted and annotated role-oriented relationships from the perspectives of various characters, incorporating both public relationships known to most characters and secret ones known to only a few. Our experiments with advanced Large Language Models (LLMs) like GPT-3.5, GPT-4, and Llama2 reveal their limitations in inferencing complex relationships and handling longer narratives. 
The combination of the Conan dataset and our pipeline strategy is geared towards understanding the ability of LLMs to comprehend nuanced relational dynamics in narrative contexts. @@ -12330,9 +12330,9 @@ <fixed-case>D</fixed-case>istill<fixed-case>MIKE</fixed-case>: Editing Distillation of Massive In-Context Knowledge Editing in Large Language Models - ShanbaoQiao - XuebingLiu - Seung-HoonNaChonbuk National University + ShanbaoQiao + XuebingLiu + Seung-HoonNaChonbuk National University 7639-7654 Among the recently emerged knowledge editing methods, in-context knowledge editing (IKE) has shown respectable abilities on knowledge editing in terms of generalization and specificity. Noting the promising advantages but unexplored issues of IKE, we propose **DistillMIKE** as a novel extension of IKE, i.e., editing **distill**ation of “**M**assive” **I**n-context **K**nowledge **E**diting in large language models (LLMs), mainly consisting of two expansions; 1) *Massive in-context knowledge editing (MIKE)*, which extends IKE to a massive editing task, aiming to inject not a single edit but a set of massive edits to LLMs; To preserve specificity, our key novel extension is a “selective” retrieval augmentation, where the retrieval-augmented IKE is only applied to “in-scope” examples, whereas the unedited model without IKE is employed for “out-of-scope” ones. 2) *Editing distillation* of MIKE using low-rank adaptation (LoRA), which distills editing abilities of MIKE to parameters of LLMs in a manner of eliminating the need of lengthy in-context demonstrations, thus removing the computational overhead encountered at the inference time. Experimental results on the zsRE and CounterFact datasets demonstrate that MIKE shows state-of-the-art performance and DistillMIKE shows performance comparable to MIKE. Our code is available at https://github.com/JoveReCode/DistillMIKE.git. 
2024.findings-acl.455 @@ -12341,14 +12341,14 @@ Unlocking Efficiency in Large Language Model Inference: A Comprehensive Survey of Speculative Decoding - HemingXia + HemingXia ZheYangPeking University QingxiuDong PeiyiWang YongqiLiHong Kong Polytechnic University TaoGe TianyuLiu - WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University ZhifangSuiPeking University 7655-7671 To mitigate the high inference latency stemming from autoregressive decoding in Large Language Models (LLMs), Speculative Decoding has emerged as a novel decoding paradigm for LLM inference. In each decoding step, this method first drafts several future tokens efficiently and then verifies them in parallel. Unlike autoregressive decoding, Speculative Decoding facilitates the simultaneous decoding of multiple tokens per step, thereby accelerating inference. This paper presents a comprehensive overview and analysis of this promising decoding paradigm. We begin by providing a formal definition and formulation of Speculative Decoding. Then, we organize in-depth discussions on its key facets, such as drafter selection and verification strategies. Furthermore, we present a comparative analysis of leading methods under third-party testing environments. We aim for this work to serve as a catalyst for further research on Speculative Decoding, ultimately contributing to more efficient LLM inference. @@ -12358,8 +12358,8 @@ Hierarchy-aware Biased Bound Margin Loss Function for Hierarchical Text Classification - GibaegKim - SangHunImKorea University of Technology and Education + GibaegKim + SangHunImKorea University of Technology and Education Heung-SeonOhKorea University of Technology and Education 7672-7682 Hierarchical text classification (HTC) is a challenging problem with two key issues: utilizing structural information and mitigating label imbalance. 
Recently, the unit-based approach generating unit-based feature representations has outperformed the global approach focusing on a global feature representation. Nevertheless, unit-based models using BCE and ZLPR losses still face static thresholding and label imbalance challenges. Those challenges become more critical in large-scale hierarchies. This paper introduces a novel hierarchy-aware loss function for unit-based HTC models: Hierarchy-aware Biased Bound Margin (HBM) loss. HBM integrates learnable bounds, biases, and a margin to address static thresholding and mitigate label imbalance adaptively. Experimental results on benchmark datasets demonstrate the superior performance of HBM compared to competitive HTC models. @@ -12383,10 +12383,10 @@ <fixed-case>CICL</fixed-case>e: Conformal In-Context Learning for Largescale Multi-Class Food Risk Classification - KorbinianRandlStockholm University - JohnPavlopoulosAthens University of Economics and Business - AronHenrikssonStockholm University - TonyLindgrenDepratment of Computer and Systems Sciences + KorbinianRandlStockholm University + JohnPavlopoulosAthens University of Economics and Business + AronHenrikssonStockholm University + TonyLindgrenDepratment of Computer and Systems Sciences 7695-7715 Contaminated or adulterated food poses a substantial risk to human health. Given sets of labeled web texts for training, Machine Learning and Natural Language Processing can be applied to automatically detect such risks. We publish a dataset of 7,546 short texts describing public food recall announcements. Each text is manually labeled, on two granularity levels (coarse and fine), for food products and hazards that the recall corresponds to. We describe the dataset and benchmark naive, traditional, and Transformer models. Based on our analysis, Logistic Regression based on a TF-IDF representation outperforms RoBERTa and XLM-R on classes with low support. 
Finally, we discuss different prompting strategies and present an LLM-in-the-loop framework, based on Conformal Prediction, which boosts the performance of the base classifier while reducing energy consumption compared to normal prompting. 2024.findings-acl.459 @@ -12398,8 +12398,8 @@ RuikangLiu HaoliBaiHuawei Technologies Ltd. HaokunLin - YueningLi - HanGaoHuawei Technologies Ltd. + YueningLi + HanGaoHuawei Technologies Ltd. ZhengzhuoXu LuHouHuawei Technologies Ltd. JunYaoHuawei Technologies Ltd. @@ -12413,7 +12413,7 @@ Learning Adverbs with Spectral Mixture Kernels TomoeTaniguchiOchanomizu Women’s University - DaichiMochihashi + DaichiMochihashi IchiroKobayashiOchanomizu University 7742-7752 For humans and robots to collaborate more in the real world, robots need to understand human intentions from the different manner of their behaviors. In our study, we focus on the meaning of adverbs which describe human motions. We propose a topic model, Hierarchical Dirichlet Process-Spectral Mixture Latent Dirichlet Allocation, which concurrently learns the relationship between those human motions and those adverbs by capturing the frequency kernels that represent motion characteristics and the shared topics of adverbs that depict such motions. We trained the model on datasets we made from movies about “walking” and “dancing”, and found that our model outperforms representative neural network models in terms of perplexity score. We also demonstrate our model’s ability to determine the adverbs for a given motion and confirmed that the model predicts more appropriate adverbs. 
@@ -12429,10 +12429,10 @@ XiangtaoKong ZhigangZheng DaijiaTang - ChengmingLiShenzhen MSU-BIT University - XipingHuBeijing Institute of Technology + ChengmingLiShenzhen MSU-BIT University + XipingHuBeijing Institute of Technology RuifengXuHarbin Institute of Technology - ShiwenNiShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences + ShiwenNiShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences MinYangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences 7753-7774 The rapid development of Large Language Models (LLMs) has led to their increasing utilization in Chinese K-12 education. Despite the growing integration of LLMs and education, the absence of a dedicated benchmark for evaluating LLMs within this domain presents a pressing concern. Consequently, there is an urgent need for a comprehensive natural language processing benchmark to precisely assess the capabilities of various LLMs in Chinese K-12 education. In response, we introduce E-EVAL, the first comprehensive evaluation benchmark specifically tailored for Chinese K-12 education. E-EVAL comprises 4,351 multiple-choice questions spanning primary, middle, and high school levels, covering a diverse array of subjects. Through meticulous evaluation, we find that Chinese-dominant models often outperform English-dominant ones, with many exceeding GPT 4.0. However, most struggle with complex subjects like mathematics. Additionally, our analysis indicates that most Chinese-dominant LLMs do not achieve higher scores at the primary school level compared to the middle school level, highlighting the nuanced relationship between proficiency in higher-order and lower-order knowledge domains. Furthermore, experimental results highlight the effectiveness of the Chain of Thought (CoT) technique in scientific subjects and Few-shot prompting in liberal arts. 
Through E-EVAL, we aim to conduct a rigorous analysis delineating the strengths and limitations of LLMs in educational applications, thereby contributing significantly to the advancement of Chinese K-12 education and LLMs. @@ -12442,12 +12442,12 @@ <fixed-case>C</fixed-case>hart<fixed-case>A</fixed-case>ssistant: A Universal Chart Multimodal Language Model via Chart-to-Table Pre-training and Multitask Instruction Tuning - FanqingMeng + FanqingMeng WenqiShao QuanfengLuShanghai Jiaotong University and Nanjing university PengGaoshanghai ai lab KaipengZhangShanghai AI Laboratory - YuQiao + YuQiao PingLuoThe University of Hong Kong 7775-7803 Charts play a vital role in data visualization, understanding data patterns, and informed decision-making. However, their unique combination of graphical elements (e.g., bars, lines) and textual components (e.g., labels, legends) poses challenges for general-purpose multimodal models. While vision-language models trained on chart data excel in comprehension, they struggle with generalization. To address these challenges, we propose ChartAssistant, a chart-based vision-language model for universal chart comprehension and reasoning. ChartAssistant leverages ChartSFT, a comprehensive dataset covering diverse chart-related tasks with basic (e.g. bars and pies) and specialized (e.g. radars, and bubbles) chart types. It undergoes a two-stage training process, starting with pre-training on chart-to-table parsing to align chart and text, followed by multitask instruction-following fine-tuning. This approach enables ChartAssistant to achieve competitive performance across various chart tasks. Experimental results demonstrate significant performance gains over the state-of-the-art UniChart and ChartLlama methods, especially outperforming them on real-world chart data with zero-shot setting. The code and data are available at https://github.com/OpenGVLab/ChartAst. 
@@ -12488,8 +12488,8 @@ HaoxinLiu ZhiyuanZhao JindongWangMicrosoft Research - HarshavardhanKamarthiGeorgia Institute of Technology - B. AdityaPrakashGeorgia Institute of Technology + HarshavardhanKamarthiGeorgia Institute of Technology + B. AdityaPrakashGeorgia Institute of Technology 7832-7840 Time-series forecasting (TSF) finds broad applications in real-world scenarios. Prompting off-the-shelf Large Language Models (LLMs) demonstrates strong zero-shot TSF capabilities while preserving computational efficiency. However, existing prompting methods oversimplify TSF as language next-token predictions, overlooking its dynamic nature and lack of integration with state-of-the-art prompt strategies such as Chain-of-Thought. Thus, we propose LSTPrompt, a novel approach for prompting LLMs in zero-shot TSF tasks. LSTPrompt decomposes TSF into short-term and long-term forecasting sub-tasks, tailoring prompts to each. LSTPrompt guides LLMs to regularly reassess forecasting mechanisms to enhance adaptability. Extensive evaluations demonstrate consistently better performance of LSTPrompt than existing prompting methods, and competitive results compared to foundation TSF models. 
2024.findings-acl.466 @@ -12499,8 +12499,8 @@ Mitigating Boundary Ambiguity and Inherent Bias for Text Classification in the Era of Large Language Models ZhenyiLu - JieTian - WeiWeiHuazhong University of Science and Technology + JieTian + WeiWeiHuazhong University of Science and Technology XiaoyeQuShanghai Artificial Intelligence Laboratory YuChengThe Chinese University of Hong Kong WenfengXie @@ -12513,12 +12513,12 @@ <fixed-case>UOR</fixed-case>: Universal Backdoor Attacks on Pre-trained Language Models - WeiDu + WeiDu PeixuanLi - HaodongZhaoShanghai Jiaotong University + HaodongZhaoShanghai Jiaotong University TianjieJu GeRen - GongshenLiuShanghai Jiao Tong University + GongshenLiuShanghai Jiao Tong University 7865-7877 Task-agnostic and transferable backdoors implanted in pre-trained language models (PLMs) pose a severe security threat as they can be inherited to any downstream task. However, existing methods rely on manual selection of triggers and backdoor representations, hindering their effectiveness and universality across different PLMs or usage paradigms. In this paper, we propose a new backdoor attack method called UOR, which overcomes these limitations by turning manual selection into automatic optimization. Specifically, we design poisoned supervised contrastive learning, which can automatically learn more uniform and universal backdoor representations. This allows for more even coverage of the output space, thus hitting more labels in downstream tasks after fine-tuning. Furthermore, we utilize gradient search to select appropriate trigger words that can be adapted to different PLMs and vocabularies. Experiments show that UOR achieves better attack performance on various text classification tasks compared to manual methods. Moreover, we test on PLMs with different architectures, usage paradigms, and more challenging tasks, achieving higher scores for universality. 
2024.findings-acl.468 @@ -12527,9 +12527,9 @@ Language models emulate certain cognitive profiles: An investigation of how predictability measures interact with individual differences - PatrickHallerUniversity of Zurich + PatrickHallerUniversity of Zurich LenaBolligerUniversity of Zurich - LenaJägerUniversity of Zurich and Universität Potsdam + LenaJägerUniversity of Zurich and Universität Potsdam 7878-7892 To date, most investigations on surprisal and entropy effects in reading have been conducted on the group level, disregarding individual differences. In this work, we revisit the predictive power (PP) of different LMs’ surprisal and entropy measures on data of human reading times as a measure of processing effort by incorporating information of language users’ cognitive capacities. To do so, we assess the PP of surprisal and entropy estimated from generative language models (LMs) on reading data obtained from individuals who also completed a wide range of psychometric tests. Specifically, we investigate if modulating surprisal and entropy relative to cognitive scores increases prediction accuracy of reading times, and we examine whether LMs exhibit systematic biases in the prediction of reading times for cognitively high- or low-performing groups, revealing what type of psycholinguistic subjects a given LM emulates. Our study finds that in most cases, incorporating cognitive capacities increases predictive power of surprisal and entropy on reading times, and that generally, high performance in the psychometric tests is associated with lower sensitivity to predictability effects. Finally, our results suggest that the analyzed LMs emulate readers with lower verbal intelligence, suggesting that for a given target group (i.e., individuals with high verbal intelligence), these LMs provide less accurate predictability effect estimates. 2024.findings-acl.469 @@ -12539,7 +12539,7 @@ The State of Relation Extraction Data Quality: Is Bigger Always Better? 
EricaCaiDepartment of Computer Science, University of Massachusetts at Amherst - BrendanO’ConnorUniversity of Massachusetts, Amherst + BrendanO’ConnorUniversity of Massachusetts, Amherst 7893-7906 Relation extraction (RE) extracts structured tuples of relationships (e.g. friend, enemy) between entities (e.g. Sherlock Holmes, John Watson) from text, with exciting potential applications. Hundreds of RE papers have been published in recent years; do their evaluation practices inform these goals? We review recent surveys and a sample of recent RE methods papers, compiling 38 datasets currently being used. Unfortunately, many have frequent label errors, and ones with known problems continue to be used. Many datasets focus on producing labels for a large number of relation types, often through error-prone annotation methods (e.g. distant supervision or crowdsourcing), and many recent papers rely exclusively on such datasets. We draw attention to a promising alternative: datasets with a small number of relations, often in specific domains like chemistry, finance, or biomedicine, where it is possible to obtain high quality expert annotations; such data can more realistically evaluate RE performance. The research community should consider more often using such resources. 2024.findings-acl.470 @@ -12550,11 +12550,11 @@ <fixed-case>N</fixed-case>atural<fixed-case>C</fixed-case>ode<fixed-case>B</fixed-case>ench: Examining Coding Performance Mismatch on <fixed-case>H</fixed-case>uman<fixed-case>E</fixed-case>val and Natural User Queries ShudanZhang HanlinZhao - XiaoLiu + XiaoLiu QinkaiZheng ZehanQiTsinghua University XiaotaoGuZhipu AI - YuxiaoDongTsinghua University + YuxiaoDongTsinghua University JieTangTsinghua University, Tsinghua University 7907-7928 Large language models (LLMs) have manifested strong ability to generate codes for productive activities. 
However, current benchmarks for code synthesis, such as HumanEval, MBPP, and DS-1000, are predominantly oriented towards introductory tasks on algorithm and data science, insufficiently satisfying challenging requirements prevalent in real-world coding. To fill this gap, we propose NaturalCodeBench (NCB), a challenging code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. NCB comprises 402 high-quality problems in Python and Java, meticulously selected from natural user queries from online coding services, covering 6 different domains. Noting the extraordinary difficulty in creating testing cases for real-world queries, we also introduce a semi-automated pipeline to enhance the efficiency of test case construction. Comparing with manual solutions, it achieves an efficiency increase of more than 4 times. Our systematic experiments on 39 LLMs find that performance gaps on NCB between models with close HumanEval scores could still be significant, indicating a lack of focus on practical code synthesis scenarios or over-specified optimization on HumanEval. On the other hand, even the best-performing GPT-4 is still far from satisfying on NCB. The evaluation toolkit and development set are available at https://github.com/THUDM/NaturalCodeBench. 
@@ -12575,7 +12575,7 @@ Empowering cross-lingual abilities of instruction-tuned large language models by translation-following demonstrations - LeonardoRanaldiIdiap Research Institute + LeonardoRanaldiIdiap Research Institute GiuliaPucci AndreFreitasIdiap Research Institute and University of Manchester 7961-7973 @@ -12598,7 +12598,7 @@ Efficient <tex-math>k</tex-math>-Nearest-Neighbor Machine Translation with Dynamic Retrieval YanGao - ZhiweiCao + ZhiweiCao ZhongjianMiao BaosongYang ShiyuLiu @@ -12614,8 +12614,8 @@ Symmetric Dot-Product Attention for Efficient Training of <fixed-case>BERT</fixed-case> Language Models MartinCourtoisGerman Research Center for AI MalteOstendorffGerman Research Center for AI - LeonhardHennigGerman Research Center for AI - GeorgRehmHumboldt Universität Berlin and Deutsches Forschungszentrum für Künstliche Intelligenz + LeonhardHennigGerman Research Center for AI + GeorgRehmHumboldt Universität Berlin and Deutsches Forschungszentrum für Künstliche Intelligenz 8002-8011 Initially introduced as a machine translation model, the Transformer architecture has now become the foundation for modern deep learning architecture, with applications in a wide range of fields, from computer vision to natural language processing. Nowadays, to tackle increasingly more complex tasks, Transformer-based models are stretched to enormous sizes, requiring increasingly larger training datasets, and unsustainable amount of compute resources. The ubiquitous nature of the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research.In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise coefficient dot-product attention. 
When applied to the pre-training of BERT-like models, this new symmetric attention mechanism reaches a score of 79.36 on the GLUE benchmark against 78.74 for the traditional implementation, leads to a reduction of 6% in the number of trainable parameters, and reduces the number of training steps required before convergence by half. 2024.findings-acl.476 @@ -12627,7 +12627,7 @@ FanyouWuAmazon WeijieXu ChandanReddyVirginia Tech - SrinivasanSengameduAmazon + SrinivasanSengameduAmazon 8012-8026 In this study, we tackle the challenge of inadequate and costly training data that has hindered the development of conversational question answering (ConvQA) systems. Enterprises have a large corpus of diverse internal documents. Instead of relying on a searching engine, a more compelling approach for people to comprehend these documents is to create a dialogue system. In this paper, we propose a robust dialog synthesising method. We learn the segmentation of data for the dialog task instead of using segmenting at sentence boundaries. The synthetic dataset generated by our proposed method achieves superior quality when compared to WikiDialog, as assessed through machine and human evaluations. By employing our inpainted data for ConvQA retrieval system pre-training, we observed a notable improvement in performance across OR-QuAC benchmarks. 2024.findings-acl.477 @@ -12647,7 +12647,7 @@ Alignment-Based Decoding Policy for Low-Latency and Anticipation-Free Neural <fixed-case>J</fixed-case>apanese Input Method Editors ArminSarhangzadeh - TaroWatanabeNara Institute of Science and Technology, Japan + TaroWatanabeNara Institute of Science and Technology, Japan 8043-8054 Japanese input method editors (IMEs) are essential tools for inputting Japanese text using a limited set of characters such as the kana syllabary. 
However, despite their importance, the potential of newer attention-based encoder-decoder neural networks, such as Transformer, has not yet been fully explored for IMEs due to their high computational cost and low-quality intermediate output in simultaneous settings, leading to high latencies. In this work, we propose a simple decoding policy to enable the use of attention-based encoder-decoder networks for simultaneous kana-kanji conversion in the context of Japanese IMEs inspired by simultaneous machine translation (SimulMT). We demonstrate that simply decoding by explicitly considering the word boundaries achieves a fairly strong quality-latency trade-off, as it can be seen as equivalent to performing decoding on aligned prefixes and thus achieving an incremental anticipation-free conversion. We further show how such a policy can be applied in practice to achieve high-quality conversions with minimal computational overhead. Our experiments show that our approach can achieve a noticeably better quality-latency trade-off compared to the baselines, while also being a more practical approach due to its ability to directly handle streaming input. Our code is available at https://anonymous.4open.science/r/transformer_ime-D327. 2024.findings-acl.479 @@ -12658,12 +12658,12 @@ <fixed-case>EC</fixed-case>o<fixed-case>K</fixed-case>: Emotional Commonsense Knowledge Graph for Mining Emotional Gold ZhunhengWangNankai University XiaoyiLiu - MengtingHuNankai University + MengtingHuNankai University RuiYing MingJiangNankai University JianfengWu YalanXieNankai University - HangGaoTianjin University of Science and Technology + HangGaoTianjin University of Science and Technology RenhongCheng 8055-8074 The demand for understanding and expressing emotions in the field of natural language processing is growing rapidly. Knowledge graphs, as an important form of knowledge representation, have been widely utilized in various emotion-related tasks. 
However, existing knowledge graphs mainly focus on the representation and reasoning of general factual knowledge, while there are still significant deficiencies in the understanding and reasoning of emotional knowledge. In this work, we construct a comprehensive and accurate emotional commonsense knowledge graph, ECoK. We integrate cutting-edge theories from multiple disciplines such as psychology, cognitive science, and linguistics, and combine techniques such as large language models and natural language processing. By mining a large amount of text, dialogue, and sentiment analysis data, we construct rich emotional knowledge and establish the knowledge generation model COMET-ECoK. Experimental results show that ECoK contains high-quality emotional reasoning knowledge, and the performance of our knowledge generation model surpasses GPT-4-Turbo, which can help downstream tasks better understand and reason about emotions. Our data and code is available from https://github.com/ZornWang/ECoK. @@ -12674,8 +12674,8 @@ Deterministic Reversible Data Augmentation for Neural Machine Translation JiashuYao - HeyanHuangBeijing Institute of Technology - ZemingLiu + HeyanHuangBeijing Institute of Technology + ZemingLiu YuhangGuo 8075-8089 Data augmentation is an effective way to diversify corpora in machine translation, but previous methods may introduce semantic inconsistency between original and augmented data because of irreversible operations and random subword sampling procedures. To generate both symbolically diverse and semantically consistent augmentation data, we propose Deterministic Reversible Data Augmentation (DRDA), a simple but effective data augmentation method for neural machine translation. DRDA adopts deterministic segmentations and reversible operations to generate multi-granularity subword representations and pulls them closer together with multi-view techniques. 
With no extra corpora or model changes required, DRDA outperforms strong baselines on several translation tasks with a clear margin (up to 4.3 BLEU gain over Transformer) and exhibits good robustness in noisy, low-resource, and cross-domain datasets. @@ -12715,7 +12715,7 @@ Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks AditiMishra - SajjadurRahmanMegagon Labs + SajjadurRahmanMegagon Labs KushanMitra HannahKimMegagon Labs EstevamHruschkaMegagon Labs and Carnegie Mellon University @@ -12740,8 +12740,8 @@ Linear Cross-Lingual Mapping of Sentence Embeddings - OlegVasilyevPrimer Technologies - FumikaIsonoPrimer AI + OlegVasilyevPrimer Technologies + FumikaIsonoPrimer AI JohnBohannon 8163-8171 Semantics of a sentence is defined with much less ambiguity than semantics of a single word, and we assume that it should be better preserved by translation to another language. If multilingual sentence embeddings intend to represent sentence semantics, then the similarity between embeddings of any two sentences must be invariant with respect to translation. Based on this suggestion, we consider a simple linear cross-lingual mapping as a possible improvement of the multilingual embeddings. We also consider deviation from orthogonality conditions as a measure of deficiency of the embeddings. @@ -12775,8 +12775,8 @@ <fixed-case>BASS</fixed-case>: Batched Attention-optimized Speculative Sampling - HaifengQianAmazon - Sujan KumarGonugondlaAmazon + HaifengQianAmazon + Sujan KumarGonugondlaAmazon SungsooHaAmazon MingyueShangAmazon Sanjay KrishnaGoudaAmazon @@ -12795,7 +12795,7 @@ DekunWuUniversité de Montréal HaochenShi ZhiyuanSun - BangLiuUniversity of Montreal + BangLiuUniversity of Montreal 8225-8291 In this study, we explore the application of Large Language Models (LLMs) in Jubensha, a Chinese detective role-playing game and a novel area in Artificial Intelligence (AI) driven gaming. 
We introduce the first dataset specifically for Jubensha, including character scripts and game rules, to foster AI agent development in this complex narrative environment. Our work also presents a unique multi-agent interaction framework using LLMs, allowing AI agents to autonomously engage in Jubensha games. To evaluate the gaming performance of these AI agents, we developed novel methods measuring their mastery of case information and reasoning skills. Furthermore, we incorporated the latest advancements in prompting engineering to enhance the agents’ performance in information gathering, murderer identification, and logical reasoning. The experimental results validate the effectiveness of our proposed methods. This work aims to offer a novel perspective on understanding LLM capabilities and establish a new benchmark for evaluating large language model-based agents. 2024.findings-acl.490 @@ -12804,8 +12804,8 @@ It Is Not About What You Say, It Is About How You Say It: A Surprisingly Simple Approach for Improving Reading Comprehension - SagiShaier - LawrenceHunterUniversity of Colorado at Denver + SagiShaier + LawrenceHunterUniversity of Colorado at Denver KatharinaWenseJohannes-Gutenberg Universität Mainz, University of Colorado, Boulder and New York University 8292-8305 Natural language processing has seen rapid progress over the past decade. Due to the speed of developments, some practices get established without proper evaluation. Considering one such case and focusing on reading comprehension, we ask our first research question: 1) How does the order of inputs – i.e., question and context – affect model performance? Additionally, given recent advancements in input emphasis, we ask a second research question: 2) Does emphasizing either the question, the context, or both enhance performance? 
Experimenting with 9 large language models across 3 datasets, we find that presenting the context before the question improves model performance, with an accuracy increase of up to 31%. Furthermore, emphasizing the context yields superior results compared to question emphasis, and in general, emphasizing parts of the input is particularly effective for addressing questions that models lack the parametric knowledge to answer. Experimenting with both prompt-based and attention-based emphasis methods, we additionally find that the best method is surprisingly simple: it only requires concatenating a few tokens to the input and results in an ac- curacy improvement of up to 36%, allowing smaller models to outperform their significantly larger counterparts. @@ -12829,7 +12829,7 @@ XinyuWangUniversity of Warwick HainiuXu LinGuiKing’s College London, University of London - YulanHeKing’s College London, University of London + YulanHeKing’s College London, University of London 8324-8340 Task embedding, a meta-learning technique that captures task-specific information, has gained popularity, especially in areas such as multi-task learning, model editing, and interpretability. However, it faces challenges with the emergence of prompt-guided Large Language Models (LLMs) operating in a gradient-free manner. Existing task embedding methods rely on fine-tuned, task-specific language models, which hinders the adaptability of task embeddings across diverse models, especially prompt-based LLMs. To hardness the potential of task embeddings in the era of LLMs, we propose a framework for unified task embeddings (FUTE), harmonizing task embeddings from various models, including smaller language models and LLMs with varied prompts, within a single vector space. 
Such uniformity enables comparison and analysis of similarities amongst different models, broadening the scope and utility of existing task embedding methods in multi-model scenarios, while maintaining their performance comparable to architecture-specific methods. 2024.findings-acl.493 @@ -12841,7 +12841,7 @@ YinhongLiu YimaiFangApple DavidVandyke - NigelCollierUniversity of Cambridge + NigelCollierUniversity of Cambridge 8341-8356 In light of recent advances in large language models (LLMs), the expectations for the next generation of virtual assistants include enhanced naturalness and adaptability across diverse usage scenarios. However, the creation of high-quality annotated data for Task-Oriented Dialog (TOD) is recognized to be slow and costly. To address these challenges, we introduce Task-Oriented Automatic Dialogs (TOAD), a novel and scalable TOD dataset along with its automatic generation pipeline. The TOAD dataset simulates realistic app context interaction and provide a variety of system response style options. Two aspects of system response styles are considered, verbosity level and users’ expression mirroring. We benchmark TOAD on two response generation tasks, and the results show that modeling more verbose responses or responses without user expression mirroring is more challenging. 2024.findings-acl.494 @@ -12851,7 +12851,7 @@ Machine-Generated Text Localization ZhongpingZhang - WendaQin + WendaQin BryanPlummerBoston University 8357-8371 Machine-Generated Text (MGT) detection aims to identify a piece of text as machine or human written. Prior work has primarily formulated MGT detection as a binary classification task over an entire document, with limited work exploring cases where only part of a document is machine generated. This paper provides the first in-depth study of MGT that localizes the portions of a document that were machine generated. 
Thus, if a bad actor were to change a key portion of a news article to spread misinformation, whole document MGT detection may fail since the vast majority is human written, but our approach can succeed due to its granular approach. A key challenge in our MGT localization task is that short spans of text, *e.g.*, a single sentence, provides little information indicating if it is machine generated due to its short length. To address this, we leverage contextual information, where we predict whether multiple sentences are machine or human written at once. This enables our approach to identify changes in style or content to boost performance. A gain of 4-13% mean Average Precision (mAP) over prior work demonstrates the effectiveness of approach on five diverse datasets: GoodNews, VisualNews, WikiText, Essay, and WP. We release our implementation at https://github.com/Zhongping-Zhang/MGT_Localization. @@ -12861,8 +12861,8 @@ <fixed-case>B</fixed-case>ench<fixed-case>IE</fixed-case>^<fixed-case>FL</fixed-case>: A Manually Re-Annotated Fact-Based Open Information Extraction Benchmark - FabriceLamarche - PhilippeLanglaisUniversité de Montréal + FabriceLamarche + PhilippeLanglaisUniversité de Montréal 8372-8394 Open Information Extraction (OIE) is a field of natural language processing that aims to present textual information in a format that allows it to be organized, analyzed and reflected upon. Numerous OIE systems are developed, claiming ever-increasing performance, marking the need for objective benchmarks. BenchIE is the latest reference we know of. Despite being very well thought out, we noticed a number of issues we believe are limiting. Therefore, we propose BenchIE^FL, a new OIE benchmark which fully enforces the principles of BenchIE while containing fewer errors, omissions and shortcomings when candidate facts are matched towards reference ones. BenchIE^FL allows insightful conclusions to be drawn on the actual performance of OIE extractors. 
2024.findings-acl.496 @@ -12872,12 +12872,12 @@ <fixed-case>C</fixed-case>ausal<fixed-case>C</fixed-case>ite: A Causal Formulation of Paper Citations IshanAgrawal - ZhijingJin + ZhijingJin EhsanMokhtarianSwiss Federal Institute of Technology Lausanne SiyuanGuo YuenChenUniversity of Illinois at Urbana-Champaign MrinmayaSachanSwiss Federal Institute of Technology - BernhardSchölkopfELLIS Institute and Max Planck Institute for Intelligent Systems, Max-Planck Institute + BernhardSchölkopfELLIS Institute and Max Planck Institute for Intelligent Systems, Max-Planck Institute 8395-8410 Citation count of a paper is a commonly used proxy for evaluating the significance of a paper in the scientific community. Yet citation measures are widely criticized for failing to accurately reflect the true impact of a paper. Thus, we propose CausalCite, a new way to measure the significance of a paper by assessing the causal impact of the paper on its follow-up papers. CausalCite is based on a novel causal inference method, TextMatch, which adapts the traditional matching framework to high-dimensional text embeddings. TextMatch encodes each paper using text embeddings from large language models (LLMs), extracts similar samples by cosine similarity, and synthesizes a counterfactual sample as the weighted average of similar papers according to their similarity values. We demonstrate the effectiveness of CausalCite on various criteria, such as high correlation with paper impact as reported by scientific experts on a previous dataset of 1K papers, (test-of-time) awards for past papers, and its stability across various subfields of AI. We also provide a set of findings that can serve as suggested ways for future researchers to use our metric for a better understanding of the quality of a paper. Our code is available at https://github.com/causalNLP/causal-cite. 
2024.findings-acl.497 @@ -12917,7 +12917,7 @@ Multi-Label Classification for Implicit Discourse Relation Recognition WanqiuLong - N.Siddharth + N.Siddharth BonnieWebber 8437-8451 Discourse relations play a pivotal role in establishing coherence within textual content, uniting sentences and clauses into a cohesive narrative. The Penn Discourse Treebank (PDTB) stands as one of the most extensively utilized datasets in this domain. In PDTB-3, the annotators can assign multiple labels to an example, when they believe the simultaneous presence of multiple relations. Prior research in discourse relation recognition has treated these instances as separate examples during training, with a gold-standard prediction matching one of the labels considered correct at test time. However, this approach is inadequate, as it fails to account for the interdependence of labels in real-world contexts and to distinguish between cases where only one sense relation holds and cases where multiple relations hold simultaneously. In our work, we address this challenge by exploring various multi-label classification frameworks to handle implicit discourse relation recognition. We show that the methods for multi-label prediction don’t depress performance for single-label prediction. Additionally, we give comprehensive analysis of results and data. Our work contributes to advancing the understanding and application of discourse relations and provide a foundation for the future study. @@ -12928,10 +12928,10 @@ <fixed-case>S</fixed-case>tudent<fixed-case>E</fixed-case>val: A Benchmark of Student-Written Prompts for Large Language Models of Code Hannah McLeanBabe - SydneyNguyen - YangtianZi + SydneyNguyen + YangtianZi ArjunGuha - Molly QFeldman + Molly QFeldman Carolyn JaneAnderson 8452-8474 Code LLMs have the potential to make it easier for non-experts to understand and write code. 
However, current CodeLLM benchmarks rely on a single expert-written prompt per problem, making it hard to generalize their success to non-expert users. In this paper, we present a new natural-language-to-code benchmark of prompts written by a key population of non-experts: beginning programmers. StudentEval contains 1,749 prompts written by 80 students who have only completed one introductory Python course. StudentEval contains numerous non-expert prompts describing the same problem, enabling exploration of key factors in prompt success. We use StudentEval to evaluate 12 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. Our analysis of student prompting strategies reveals that nondeterministic LLM sampling can mislead students about the quality of their descriptions, a finding with key implications for Code LLMs in education. @@ -12953,9 +12953,9 @@ Generating Diverse and High-Quality Texts by Minimum <fixed-case>B</fixed-case>ayes Risk Decoding YuuJinnaiCyberAgent, Inc. - UkyoHondaCyberAgent, Inc. + UkyoHondaCyberAgent, Inc. TetsuroMorimuraCyberAgent, Inc. - PeinanZhangCyberAgent AI Lab + PeinanZhangCyberAgent AI Lab 8494-8525 One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse.Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among the decoding algorithms. However, existing algorithms proposed to generate diverse outputs are predominantly based on beam search or random sampling, thus their output quality is capped by these underlying decoding algorithms. 
In this paper, we investigate an alternative approach – we develop diversity-promoting decoding algorithms by enforcing diversity objectives to MBR decoding.We propose two variants of MBR; (i) Diverse MBR (DMBR) that adds a diversity penalty to the decoding objective and (ii) k-medoids MBR (KMBR) that reformulates the decoding task as a clustering problem.We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a language model with prompting. The experimental results show that the proposed method achieves a better trade-off than the diverse beam search and sampling algorithms overall. 2024.findings-acl.503 @@ -12999,9 +12999,9 @@ Bi-Chainer: Automated Large Language Models Reasoning with Bidirectional Chaining - ShuqiLiu - BoweiHe - LinqiSongCity University of Hong Kong + ShuqiLiu + BoweiHe + LinqiSongCity University of Hong Kong 8578-8598 Large Language Models (LLMs) have shown human-like reasoning abilities but still face challenges in solving complex logical problems. Existing unidirectional chaining methods, such as forward chaining and backward chaining, suffer from issues like low prediction accuracy and efficiency. To address these, we propose a bidirectional chaining method, Bi-Chainer, which dynamically switches to depth-first reasoning in the opposite reasoning direction when it encounters multiple branching options within the current direction. Thus, the intermediate reasoning results can be utilized as guidance to facilitate the reasoning process. We show that Bi-Chainer achieves sizable accuracy boots over unidirectional chaining frameworks on four challenging logical reasoning datasets. Moreover, Bi-Chainer enhances the accuracy of intermediate proof steps and reduces the average number of inference calls, resulting in more efficient and accurate reasoning. 
2024.findings-acl.507 @@ -13022,8 +13022,8 @@ Knowledge Context Modeling with Pre-trained Language Models for Contrastive Knowledge Graph Completion GuangqianYangUniversity of Science and Technology of China YiLiuState Key Laboratory of Communication Content Cognition - LeiZhangUniversity of Science and Technology of China - LichengZhang + LeiZhangUniversity of Science and Technology of China + LichengZhang HongtaoXieUniversity of Science and Technology of China ZhendongMaoUniversity of Science and Technology of China 8619-8630 @@ -13039,7 +13039,7 @@ XianLiAmazon JingboShangUniversity of California, San Diego HoangNguyen - PhilipYuUniversity of Illinois, Chicago + PhilipYuUniversity of Illinois, Chicago 8631-8643 Attribute value extraction involves identifying the value spans of predetermined attributes in product texts. This area of research has traditionally operated under a closed-world assumption, focusing on products from a static set of categories and their associated attributes. However, products in e-commerce stores are ever-increasing and evolving, calling for life-long learning. If continuously trained on the fast-increasing products and attributes, most existing solutions not only struggle for parameter efficiency but also endure foreseeable defects due to data contamination, catastrophic forgetting, etc. As a remedy, we propose and study a new task, which aims to effectively maintain a strong single model for many domains in a life-long learning fashion, without jeopardizing the model performance and parameter efficiency. We introduce factorization into the model and make it domain-aware by decoupling the modeling of product type and attribute, as a way to promote de-contamination and parameter efficiency while scaling up. Tuning the model with distillation prevents forgetting historical knowledge and enables continuous learning from emerging domains. 
Experiments on hundreds of domains showed that our model attains the near state-of-the-art performance with affordable parameter size, the least historical knowledge forgetting, and the greatest robustness against noises, whilst adding only a few parameters per domain when compared with competitive baselines. 2024.findings-acl.510 @@ -13050,7 +13050,7 @@ Exploring Domain Robust Lightweight Reward Models based on Router Mechanism HyukNamgoongChungnam National University JeesuJung - SangkeunJung + SangkeunJung YoonHyungRohElectronics and Telecommunications Research Institute 8644-8652 Recent advancements in large language models have heavily relied on the large reward model from reinforcement learning from human feedback for fine-tuning. However, the use of a single reward model across various domains may not always be optimal, often requiring retraining from scratch when new domain data is introduced. To address these challenges, we explore the utilization of small language models operating in a domain-specific manner based on router mechanisms. Our three approaches are: 1) utilize mixture of experts to form a single reward model by modularizing an internal router and experts, 2) employing external router to select the appropriate reward model from multiple domain-specific models, and 3) the framework reduces parameter size by loading reward models and router adapters onto a single small language model using adapters. Experimental validation underscores the effectiveness of our approach, demonstrating performance comparable to baseline methods while also reducing the total parameter size. 
@@ -13062,13 +13062,13 @@ Generalized Category Discovery with Large Language Models in the Loop WenbinAnXi’an Jiaotong University WenkaiShi - FengTianXi’an Jiaotong University + FengTianXi’an Jiaotong University HaonanLinXi’an Jiaotong University QianYingWang - YaqiangWuLenovo Research + YaqiangWuLenovo Research MingxiangCai LuyanWang - YanChenXi’an Jiaotong University + YanChenXi’an Jiaotong University HaipingZhuXi’an Jiaotong University PingChenUniversity of Massachusetts, Boston 8653-8665 @@ -13124,7 +13124,7 @@ LiangDing HaotongQinETHZ - ETH Zurich XiabinZhou - YifuDingBeihang University + YifuDingBeihang University XueboLiuHarbin Institute of Technolgy, Shenzhen MinZhangHarbin Institute of Technology, Shenzhen JinyangGuoBeijing University of Aeronautics and Astronautics @@ -13143,7 +13143,7 @@ YiLiuPeking University YuxiangWang ShuhuaiRen - LeiLiUniversity of Hong Kong + LeiLiUniversity of Hong Kong SishuoChenAlibaba Group XuSun LuHouHuawei Technologies Ltd. @@ -13156,7 +13156,7 @@ “Get Their Hands Dirty, Not Mine”: On Researcher-Annotator Collaboration and the Agency of Annotators ShengqiZhuCornell University - JeffreyRzeszotarskiCornell University + JeffreyRzeszotarskiCornell University 8773-8782 Annotation quality is often framed as post-hoc cleanup of annotator-caused issues. This position paper discusses whether, how, and why this narrative limits the scope of improving annotation. 
We call to consider annotation as a procedural collaboration, outlining three points in this direction: (1) An issue can be either annotator- or researcher-oriented, where one party is accountable and the other party may lack ability to fix it; (2) yet, they can co-occur or have similar consequences, and thus any specific problem we encounter may be a combination; (3) therefore, we need a new language to capture the nuance and holistically describe the full procedure to resolve these issues. To that end, we propose to study how agency is manifested in annotation and picture how this perspective benefits the community more broadly. 2024.findings-acl.518 @@ -13165,7 +13165,7 @@ Teaching Large Language Models an Unseen Language on the Fly - ChenZhangPeking University + ChenZhangPeking University XiaoLiuPeking University JiuhengLin YansongFengPeking University @@ -13181,7 +13181,7 @@ BaopuQiu LiangDing KanjianZhangSchools of Automation, Southeast University - TomKocmiMicrosoft + TomKocmiMicrosoft DachengTaoUniversity of Sydney 8801-8816 Generative large language models (LLMs), e.g., ChatGPT, have demonstrated remarkable proficiency across several NLP tasks, such as machine translation, text summarization. Recent research (Kocmi and Federmann, 2023) has shown that utilizing LLMs for assessing the quality of machine translation (MT) achieves state-of-the-art performance at the system level but performs poorly at the segment level. To further improve the performance of LLMs on MT quality assessment, we conduct an investigation into several prompting designs, and propose a new prompting method called Error Analysis Prompting (EAPrompt) by combining Chain-of-Thoughts (Wei et al., 2022) and Error Analysis (Lu et al., 2023). This technique emulates the commonly accepted human evaluation framework - Multidimensional Quality Metrics (MQM, Freitag et al., (2021)) and produces explainable and reliable MT evaluations at both the system and segment level.
Experimental Results from WMT22 metrics shared task validate the effectiveness of EAPrompt on various LLMs, with different structures. Further analysis confirms that EAPrompt effectively distinguishes major errors from minor ones, while also sharing a similar distribution of the number of errors with MQM. These findings highlight the potential of EAPrompt as a human-like evaluator prompting technique for MT evaluation. We will release our code and scripts to facilitate the community. @@ -13192,7 +13192,7 @@ <fixed-case>GAOKAO</fixed-case>-<fixed-case>MM</fixed-case>: A <fixed-case>C</fixed-case>hinese Human-Level Benchmark for Multimodal Models Evaluation YiZong - XipengQiuFudan University + XipengQiuFudan University 8817-8825 The Large Vision-Language Models (LVLMs) have demonstrated great abilities in image perception and language understanding. However, existing datasets either focus solely on primary perception abilities and commonsense knowledge, or have a low level of text comprehension difficulty, which are insufficient to reflect the comprehensive capabilities of LVLMs, particularly in terms of Chinese language proficiency. We propose GAOKAO-MM, a multimodal benchmark based on the Chinese College Entrance Examination (GAOKAO), comprising of 8 subjects and 12 types of images, such as diagrams, function graphs, maps and photos. GAOKAO-MM derives from native Chinese context and sets human-level requirements for the model’s abilities, including perception, understanding, knowledge and reasoning. We evaluate 10 LVLMs and find that the accuracies of all of them are lower than 50%, with GPT-4-Vision (48.1%), Qwen-VL-Plus (41.2%) and Gemini-Pro-Vision (35.1%) ranking in the top three positions. The results of our multi-dimension analysis indicate that LVLMs have moderate distance towards Artificial General Intelligence (AGI) and provide insights facilitating the development of multilingual LVLMs. 
The dataset and evaluation code are available through: https://github.com/OpenMOSS/GAOKAO-MM 2024.findings-acl.521 @@ -13205,7 +13205,7 @@ ChengyuWangAlibaba Group TingfengCao JunHuang - LianwenJinSouth China University of Technology + LianwenJinSouth China University of Technology 8826-8840 We present DiffChat, a novel method to align Large Language Models (LLMs) to “chat” with prompt-as-input Text-to-Image Synthesis (TIS) models (e.g., Stable Diffusion) for interactive image creation. Given a raw prompt/image and a user-specified instruction, DiffChat can effectively make appropriate modifications and generate the target prompt, which can be leveraged to create the target image of high quality. To achieve this, we first collect an instruction-following prompt engineering dataset named InstructPE for the supervised training of DiffChat. Next, we propose a reinforcement learning framework with the feedback of three core criteria for image creation, i.e., aesthetics, user preference and content integrity. It involves an action-space dynamic modification technique to obtain more relevant positive samples and harder negative samples during the off-policy sampling. Content integrity is also introduced into the value estimation function for further improvement of produced images. Our method can exhibit superior performance than baseline models and strong competitors based on both automatic and human evaluations, which fully demonstrates its effectiveness.
2024.findings-acl.522 @@ -13215,10 +13215,10 @@ Revisiting Parallel Context Windows: A Frustratingly Simple Alternative and Chain-of-Thought Deterioration KejuanYang - XiaoLiu + XiaoLiu KaiwenMen AohanZengTsinghua University, Tsinghua University - YuxiaoDongTsinghua University + YuxiaoDongTsinghua University JieTangTsinghua University, Tsinghua University 8841-8852 We identify two crucial limitations in the evaluation of recent parallel-integrated method Parallel Context Windows (PCW), which extends the maximum context lengths of language models, e.g., 2048 for LLaMA, by harnessing window-wise attention and positional embedding techniques. We first show that a simple yet strong baseline, weighted sum ensemble, is missing for the in-context few-shot classification. Moreover, on more challenging Chain-of-Thought (CoT) reasoning (e.g., HotpotQA), PCW would present unexpected deterioration regarding question miscomprehension and false inference. Based on our findings, we suggest that the existing PCW design may not guarantee sufficient improvement and practicality in handling lengthy documents in real-world applications. More community efforts on enabling language models’ long context understanding ability should be paid. @@ -13230,8 +13230,8 @@ Rationales for Answers to Simple Math Word Problems Confuse Large Language Models YidanZhang MingfengXueSichuan University - DayihengLiuAlibaba Group - ZhenanHeSichuan University + DayihengLiuAlibaba Group + ZhenanHeSichuan University 8853-8869 Recently, large language models (LLMs) have demonstrated breakthrough mathematical problem-solving capabilities in grade school math word problems (MWP). For example, on the MWP benchmark GSM8K, the accuracy of GPT-3.5-Turbo and MetaMath-70B reaches 80.80% and 82.30%, respectively. One question arises, does it mean that LLMs have truly mastered related mathematical problem-solving abilities? 
In this paper, by presenting two types of benchmarks, where MCGSM8K aims at selecting one correct solution from four solutions, while GSM8K-Judgement judges whether a solution to a given question is true or false, we demonstrate that the ability of most LLMs to evaluate the mathematical reasoning process of MWP is far from sufficient. To compensate for this issue, we propose hybrid supervised fine-tuning data from the training data of GSM8K, MCGSM8K, and GSM8K-Judgement, which significantly improves performance on the proposed reasoning process evaluation benchmarks. For example, fine-tuning improves the performance of LLaMA-2-13B from 33.51% to 70.89% on MCGSM8K. In conclusion, we experimentally demonstrate that most LLMs have limited ability to evaluate the mathematical reasoning process of MWP, which can be enhanced through fine-tuning. 2024.findings-acl.524 @@ -13258,8 +13258,8 @@ Towards Objectively Benchmarking Social Intelligence of Language Agents at the Action Level - ChenxuWangTsinghua University, Tsinghua University - BinDaiXiaoIce + ChenxuWangTsinghua University, Tsinghua University + BinDaiXiaoIce HuapingLiuTsinghua University, Tsinghua University BaoyuanWangXiaobing.ai 8885-8897 @@ -13270,7 +13270,7 @@ Semantic Role Labeling from <fixed-case>C</fixed-case>hinese Speech via End-to-End Learning - HuiyaoChen + HuiyaoChen XinxinLi MeishanZhangHarbin Institute of Technology (Shenzhen), China and Tianjin University, China MinZhangHarbin Institute of Technology, Shenzhen @@ -13283,8 +13283,8 @@ <fixed-case>MEEL</fixed-case>: Multi-Modal Event Evolution Learning ZhengweiTao - ZhiJinPeking University and Peking University - JunqiangHuangVIPSHOP + ZhiJinPeking University and Peking University + JunqiangHuangVIPSHOP XiancaiChen XiaoyingBai YifanZhang @@ -13299,7 +13299,7 @@ <fixed-case>LLM</fixed-case>-<fixed-case>REDIAL</fixed-case>: A Large-Scale Dataset for Conversational Recommender Systems Created from User Behaviors with <fixed-case>LLM</fixed-case>s 
TingtingLiangHangzhou Dianzi University ChenxinJin - LingzhiWangThe Chinese University of Hong Kong + LingzhiWangThe Chinese University of Hong Kong WenqiFan CongyingXiaSalesForce.com KaiChen @@ -13314,7 +13314,7 @@ Investigating Subtler Biases in <fixed-case>LLM</fixed-case>s: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models MahammedKamruzzamanUniversity of South Florida Md.ShovonRajshahi University of Engineering and Technology - GeneKimUniversity of South Florida + GeneKimUniversity of South Florida 8940-8965 LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks introducing LLM biases into consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less-studied but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs make between social groups and unrelated positive and negative attributes. Although these subtler biases are understudied they follow people as much as gender and ethnicity do. So, we want to see whether they also follow one with LLMs. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. We report the correlations that we find for 4 cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.
2024.findings-acl.530 @@ -13325,10 +13325,10 @@ <fixed-case>EVIT</fixed-case>: Event-Oriented Instruction Tuning for Event Reasoning ZhengweiTao XiancaiChen - ZhiJinPeking University and Peking University + ZhiJinPeking University and Peking University XiaoyingBai HaiyanZhaoPeking University - YiweiLou + YiweiLou 8966-8979 Events refer to specific occurrences, incidents, or happenings that take place under a particular background. Event reasoning aims to infer events according to certain relations and predict future events. The cutting-edge techniques for event reasoning play a crucial role in various natural language processing applications. Large language models (LLMs) have made significant advancements in event reasoning owing to their wealth of knowledge and reasoning capabilities. However, smaller instruction-tuned models currently in use do not consistently demonstrate exceptional proficiency in managing these tasks. This discrepancy arises from the absence of explicit modeling of events and the interconnections of them within their instruction data. Consequently, these models face challenges in comprehending event structures and semantics while struggling to bridge the gap between their interpretations and human understanding of events. Additionally, their limitations in grasping event relations lead to constrained event reasoning abilities to effectively deduce and incorporate pertinent event knowledge. In this paper, we propose Event-Oriented Instruction Tuning to train our large language model named EvIT specializing in event reasoning tasks. Specifically, we first propose a novel structure named event quadruple which contains the structure and semantics of events and is complete in the event representation. We then design event-relation learning based on the structures. We encapsulate the learning into the instruction-tuning formulation to better stimulate the event reasoning capacity of our model. 
To implement our training, we design a heuristic unsupervised method to mine event quadruple from a large-scale corpus. At last, we finetune a Llama model on our Event-Oriented Instruction Tuning. We conduct extensive experiments on event reasoning tasks on several datasets. Automatic and human evaluations demonstrate EvIT achieves competitive performances on event reasoning. 2024.findings-acl.531 @@ -13338,8 +13338,8 @@ <fixed-case>I</fixed-case>nstruct<fixed-case>CMP</fixed-case>: Length Control in Sentence Compression through Instruction-based Large Language Models Juseon-DoChungnam National University - JingunKwonChungnam National University - HidetakaKamigaitoNara Institute of Science and Technology + JingunKwonChungnam National University + HidetakaKamigaitoNara Institute of Science and Technology ManabuOkumuraTokyo Institute of Technology 8980-8996 Extractive summarization can produce faithful summaries but often requires additional constraints such as a desired summary length. Traditional sentence compression models do not typically consider the constraints because of their restricted model abilities, which require model modifications for coping with them. To bridge this gap, we propose Instruction-based Compression (InstructCMP), an approach to the sentence compression task that can consider the length constraint through instructions by leveraging the zero-shot task-solving abilities of Large Language Models (LLMs). For this purpose, we created new evaluation datasets by transforming traditional sentence compression datasets into an instruction format. By using the datasets, we first reveal that the current LLMs still face challenges in accurately controlling the length for a compressed text. To address this issue, we propose an approach named length priming, that incorporates additional length information into the instructions without external resources. 
While the length priming effectively works in a zero-shot setting, a training dataset with the instructions would further improve the ability of length control. Thus, we additionally created a training dataset in an instruction format to fine-tune the model on it. Experimental results and analysis show that applying the length priming significantly improves performances of InstructCMP in both zero-shot and fine-tuning settings without the need of any model modifications. @@ -13351,7 +13351,7 @@ <fixed-case>S</fixed-case>ym<fixed-case>T</fixed-case>ax: Symbiotic Relationship and Taxonomy Fusion for Effective Citation Recommendation KaranGoyalIndraprastha Institute of Information Technology, Delhi MayankGoel - VikramGoyalIndraprastha Institute of Information Technology, Delhi + VikramGoyalIndraprastha Institute of Information Technology, Delhi MukeshMohaniaIndraprastha Institute of Information Technology 8997-9008 Citing pertinent literature is pivotal to writing and reviewing a scientific document. Existing techniques mainly focus on the local context or the global context for recommending citations but fail to consider the actual human citation behaviour. We propose SymTax, a three-stage recommendation architecture that considers both the local and the global context, and additionally the taxonomical representations of query-candidate tuples and the Symbiosis prevailing amongst them. SymTax learns to embed the infused taxonomies in the hyperbolic space and uses hyperbolic separation as a latent feature to compute query-candidate similarity. We build a novel and large dataset ArSyTa containing 8.27 million citation contexts and describe the creation process in detail. We conduct extensive experiments and ablation studies to demonstrate the effectiveness and design choice of each module in our framework. 
Also, combinatorial analysis from our experiments shed light on the choice of language models (LMs) and fusion embedding, and the inclusion of section heading as a signal. Our proposed module that captures the symbiotic relationship solely leads to performance gains of 26.66% and 39.25% in Recall@5 w.r.t. SOTA on ACL-200 and RefSeer datasets, respectively. The complete framework yields a gain of 22.56% in Recall@5 wrt SOTA on our proposed dataset. The code and dataset are available at https://github.com/goyalkaraniit/SymTax. @@ -13362,8 +13362,8 @@ Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability YejunYoonSoongsil University - SeunghyunYoonAdobe Research - KunwooParkSoongsil University + SeunghyunYoonAdobe Research + KunwooParkSoongsil University 9009-9024 This paper addresses the critical challenge of assessing the representativeness of news thumbnail images, which often serve as the first visual engagement for readers when an article is disseminated on social media. We focus on whether a news image represents the actors discussed in the news text. To serve the challenge, we introduce NewsTT, a manually annotated dataset of 1000 news thumbnail images and text pairs. We found that the pretrained vision and language models, such as BLIP-2, struggle with this task. Since news subjects frequently involve named entities or proper nouns, the pretrained models could have a limited capability to match news actors’ visual and textual appearances. We hypothesize that learning to contrast news text with its counterfactual, of which named entities are replaced, can enhance the cross-modal matching ability of vision and language models. We propose CFT-CLIP, a contrastive learning framework that updates vision and language bi-encoders according to the hypothesis. We found that our simple method can boost the performance for assessing news thumbnail representativeness, supporting our assumption. 
Code and data can be accessed at https://github.com/ssu-humane/news-images-acl24. 2024.findings-acl.534 @@ -13372,7 +13372,7 @@ Towards Better Question Generation in <fixed-case>QA</fixed-case>-based Event Extraction - ZijinHong + ZijinHong JianLiuBeijing Jiaotong University 9025-9038 Event Extraction (EE) is an essential information extraction task that aims to extract event-related information from unstructured texts. The paradigm of this task has shifted from conventional classification-based methods to more contemporary question-answering-based (QA-based) approaches. However, in QA-based EE, the quality of the questions dramatically affects the extraction accuracy, and how to generate high-quality questions for QA-based EE remains a challenge. In this work, to tackle this challenge, we suggest four criteria to evaluate the quality of a question and propose a reinforcement learning method, RLQG, for QA-based EE that can generate generalizable, high-quality, and context-dependent questions and provides clear guidance to QA models. The extensive experiments conducted on ACE and RAMS datasets have strongly validated our approach’s effectiveness, which also demonstrates its robustness in scenarios with limited training data. The corresponding code of RLQG is released for further research. @@ -13383,11 +13383,11 @@ Budget-Constrained Tool Learning with Planning YuanhangZhengTsinghua University, Tsinghua University - PengLiTsinghua University - MingYan + PengLiTsinghua University + MingYan JiZhangAlibaba Group FeiHuangAlibaba Group - YangLiu + YangLiu 9039-9052 Despite intensive efforts devoted to tool learning, the problem of budget-constrained tool learning, which focuses on resolving user queries within a specific budget constraint, has been widely overlooked. This paper proposes a novel method for budget-constrained tool learning. Our approach involves creating a preferable plan under the budget constraint before utilizing the tools.
This plan outlines the feasible tools and the maximum number of times they can be employed, offering a comprehensive overview of the tool learning process for large language models. This allows them to allocate the budget from a broader perspective. To devise the plan without incurring significant extra costs, we suggest initially estimating the usefulness of the candidate tools based on past experience. Subsequently, we employ dynamic programming to formulate the plan. Experimental results demonstrate that our method can be integrated with various tool learning methods, significantly enhancing their effectiveness under strict budget constraints. 2024.findings-acl.536 @@ -13399,10 +13399,10 @@ HuayangLi SihengLi DengCaiTencent AI Lab - LongyueWang + LongyueWang LemaoLiuTencent - TaroWatanabeNara Institute of Science and Technology, Japan - YujiuYangGraduate School at Shenzhen,Tsinghua University + TaroWatanabeNara Institute of Science and Technology, Japan + YujiuYangGraduate School at Shenzhen,Tsinghua University ShumingShiTencent AI Lab 9053-9076 Large language models with instruction-following abilities have revolutionized the field of artificial intelligence. These models show exceptional generalizability to tackle various real-world tasks through their natural language interfaces. However, their performance heavily relies on high-quality exemplar data, which is often difficult to obtain. This challenge is further exacerbated when it comes to multimodal instruction following. We introduce TextBind, an almost annotation-free framework for empowering LLMs with multi-turn interleaved multimodal instruction-following capabilities. Our approach requires only image-caption pairs and generates multi-turn multimodal instruction-response conversations from a language model. To accommodate interleaved image-text inputs and outputs, we devise MIM, a language model-centric architecture that seamlessly integrates image encoder and decoder models. 
Extensive quantitative and qualitative experiments demonstrate that MIM trained on TextBind achieves remarkable generation capability in multimodal conversations compared to recent baselines. @@ -13416,7 +13416,7 @@ JunlongLi WeizheYuan RuifengYuan - WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University PengfeiLiu 9077-9096 2024.findings-acl.538 @@ -13425,8 +13425,8 @@ <fixed-case>C</fixed-case>o<fixed-case>C</fixed-case>o-Agent: A Comprehensive Cognitive <fixed-case>MLLM</fixed-case> Agent for Smartphone <fixed-case>GUI</fixed-case> Automation - XinbeiMa - ZhuoshengZhangShanghai Jiao Tong University + XinbeiMa + ZhuoshengZhangShanghai Jiao Tong University HaiZhaoShanghai Jiao Tong University 9097-9110 Multimodal large language models (MLLMs) have shown remarkable potential as human-like autonomous language agents to interact with real-world environments, especially for graphical user interface (GUI) automation. However, those GUI agents require comprehensive cognition including exhaustive perception and reliable action response. We propose a Comprehensive Cognitive LLM Agent, CoCo-Agent, with two novel approaches, comprehensive environment perception (CEP) and conditional action prediction (CAP), to systematically improve the GUI automation performance. First, CEP facilitates the GUI perception through different aspects and granularity, including screenshots and complementary detailed layouts for the visual channel and historical actions for the textual channel. Second, CAP decomposes the action prediction into sub-problems: determining the action type and then identifying the action target conditioned on the action type. With our technical design, our agent achieves state-of-the-art performance on AITW and META-GUI benchmarks, showing promising abilities in realistic scenarios. Code is available at https://github.com/xbmxb/CoCo-Agent.
@@ -13476,10 +13476,10 @@ <fixed-case>CTC</fixed-case>-based Non-autoregressive Textless Speech-to-Speech Translation - QingkaiFangInstitute of Computing Technology, Chinese Academy of Sciences + QingkaiFangInstitute of Computing Technology, Chinese Academy of Sciences ZhengruiMaInstitute of Computing Technology, Chinese Academy of Sciences YanZhou - MinZhangHarbin Institute of Technology + MinZhangHarbin Institute of Technology YangFengInstitute of Computing Technology, Chinese Academy of Sciences 9155-9161 Direct speech-to-speech translation (S2ST) has achieved impressive translation quality, but it often faces the challenge of slow decoding due to the considerable length of speech sequences. Recently, some research has turned to non-autoregressive (NAR) models to expedite decoding, yet the translation quality typically lags behind autoregressive (AR) models significantly. In this paper, we investigate the performance of CTC-based NAR models in S2ST, as these models have shown impressive results in machine translation. Experimental results demonstrate that by combining pretraining, knowledge distillation, and advanced NAR training techniques such as glancing training and non-monotonic latent alignments, CTC-based NAR models achieve translation quality comparable to the AR model, while preserving up to 26.81\times decoding speedup. @@ -13533,10 +13533,10 @@ <fixed-case>LCS</fixed-case>: A Language Converter Strategy for Zero-Shot Neural Machine Translation ZengkuiSun YijinLiuWechat AI - FandongMengWeChat AI, Tencent Inc. + FandongMengWeChat AI, Tencent Inc. JinanXuBeijing Jiaotong University YufengChen - JieZhou + JieZhou 9201-9214 Multilingual neural machine translation models generally distinguish translation directions by the language tag (LT) in front of the source or target sentences. However, current LT strategies cannot indicate the desired target language as expected on zero-shot translation, i.e., the off-target issue. 
Our analysis reveals that the indication of the target language is sensitive to the placement of the target LT. For example, when placing the target LT on the decoder side, the indication would rapidly degrade along with decoding steps, while placing the target LT on the encoder side would lead to copying or paraphrasing the source input. To address the above issues, we propose a simple yet effective strategy named Language Converter Strategy (LCS). By introducing the target language embedding into the top encoder layers, LCS mitigates confusion in the encoder and ensures stable language indication for the decoder. Experimental results on MultiUN, TED, and OPUS-100 datasets demonstrate that LCS could significantly mitigate the off-target issue, with language accuracy up to 95.28%, 96.21%, and 85.35% meanwhile outperforming the vanilla LT strategy by 3.07, 3,3, and 7.93 BLEU scores on zero-shot translation, respectively. 2024.findings-acl.547 @@ -13562,12 +13562,12 @@ JingweiYiUniversity of Science and Technology of China RuiYeShanghai Jiaotong University QisiChen - BinZhuMicrosoft Research + BinZhuMicrosoft Research SihengChenShanghai Jiao Tong University - DefuLianUniversity of Science and Technology of China - GuangzhongSunUniversity of Science and Technology of China - XingXieMicrosoft - FangzhaoWuMicrosoft + DefuLianUniversity of Science and Technology of China + GuangzhongSunUniversity of Science and Technology of China + XingXieMicrosoft + FangzhaoWuMicrosoft 9236-9260 Large language models (LLMs) possess immense capabilities but are susceptible to malicious exploitation. To mitigate the risk, safety alignment is employed to align LLMs with ethical standards. However, safety-aligned LLMs may remain vulnerable to carefully crafted jailbreak attacks, but these attacks often face high rejection rates and limited harmfulness. 
In this paper, we expose the vulnerabilities of safety alignment in open-access LLMs, which can significantly enhance the success rate and harmfulness of jailbreak attacks. Through reverse alignment, achieved by accessing model parameters, we show the feasibility of efficiently fine-tuning LLMs to undermine their inherent safeguards. We investigate two types of reverse alignment techniques: reverse supervised fine-tuning (RSFT) and reverse preference optimization (RPO). RSFT operates by supervising the fine-tuning of LLMs to reverse their inherent values. We also explore how to prepare data needed for RSFT. RPO optimizes LLMs to enhance their preference for harmful content, reversing the models’ safety alignment. Our extensive experiments reveal that open-access high-performance LLMs can be adeptly reverse-aligned to output harmful content, even in the absence of manually curated malicious datasets. Our research acts as a whistleblower for the community, emphasizing the need to pay more attention to safety of open-accessing LLMs. It also underscores the limitations of current safety alignment approaches and calls for research on robust safety alignment methods to counteract malicious fine-tuning attacks. 2024.findings-acl.549 @@ -13577,9 +13577,9 @@ <fixed-case>PEK</fixed-case>: A Parameter-Efficient Framework for Knowledge-Grounded Dialogue Generation PanYang - DandanSongBeijing Institute of Technology - ZhijingWuBeijing Institute of Technology - YanruZhou + DandanSongBeijing Institute of Technology + ZhijingWuBeijing Institute of Technology + YanruZhou 9261-9273 Pre-trained language models (PLMs) have shown great dialogue generation capability in different scenarios. However, the huge VRAM consumption when fine-tuning them is one of their drawbacks. PEFT approaches can significantly reduce the number of trainable parameters, which enables us to fine-tune larger dialogue generation models. 
However, the reduction in parameter quantity can diminish a PLM’s expressive capacity and affect the PLM’s learning from certain specific examples like knowledge-related conversations. Previous works have demonstrated that injecting external knowledge into dialogue generation models can improve the model’s performance in knowledge-related conversations. Nonetheless, these methods are designed for the scenario where most parameters of the entire framework are trainable. In this paper, we propose PEK, a parameter-efficient framework for knowledge-enhanced dialogue generation. It enables PLMs to leverage external knowledge documents and knowledge graphs to enhance its generation capabilities with an acceptable number of trainable parameters. Evaluation results on the Wizard of Wikipedia and CMU_DoG datasets show that our approach outperforms baseline methods on multiple evaluation metrics, which validates the effectiveness of our approach. 2024.findings-acl.550 @@ -13591,7 +13591,7 @@ LiwenZhengBeijing University of Posts and Telecommunications ChaozhuoLi XiZhangBeijing University of Posts and Telecommunications - Yu-MingShang + Yu-MingShang FeiranHuang HaoranJiaBeijing University of Posts and Telecommunications 9274-9281 @@ -13604,11 +13604,11 @@ Outdated Issue Aware Decoding for Factual Knowledge Editing ZengkuiSun YijinLiuWechat AI - JiaanWangSoochow University - FandongMengWeChat AI, Tencent Inc. + JiaanWangSoochow University + FandongMengWeChat AI, Tencent Inc. JinanXuBeijing Jiaotong University YufengChen - JieZhou + JieZhou 9282-9293 Recently, Knowledge Editing has received increasing attention, since it could update the specific knowledge from outdated ones in pretrained models without re-training. However, as pointed out by recent studies, existing related methods tend to merely memorize the superficial word composition of the edited knowledge, rather than truly learning and absorbing it. 
Consequently, on the reasoning questions, we discover that existing methods struggle to utilize the edited knowledge to reason the new answer, and tend to retain outdated responses, which are generated by the original models utilizing original knowledge. Nevertheless, the outdated responses are unexpected for the correct answers to reasoning questions, which we named as the outdated issue. To alleviate this issue, in this paper, we propose a simple yet effective decoding strategy, i.e., outDated ISsue aware deCOding (DISCO), to enhance the performance of edited models on reasoning questions. Specifically, we capture the difference in the probability distribution between the original and edited models. Further, we amplify the difference of the token prediction in the edited model to alleviate the outdated issue, and thus enhance the model performance w.r.t the edited knowledge. Experimental results suggest that applying DISCO could enhance edited models to reason, e.g., on reasoning questions, DISCO outperforms the prior SOTA method by 12.99 F1 scores, and reduces the ratio of the outdated issue to 5.78% on the zsRE dataset. 2024.findings-acl.552 @@ -13617,9 +13617,9 @@ Disentangling Dialect from Social Bias via Multitask Learning to Improve Fairness - MaximilianSpliethöverLeibniz University Hannover + MaximilianSpliethöverLeibniz University Hannover Sai NikhilMenon - HenningWachsmuthLeibniz Universität Hannover + HenningWachsmuthLeibniz Universität Hannover 9294-9313 Dialects introduce syntactic and lexical variations in language that occur in regional or social groups. Most NLP methods are not sensitive to such variations. This may lead to unfair behavior of the methods, conveying negative bias towards dialect speakers. While previous work has studied dialect-related fairness for aspects like hate speech, other aspects of biased language, such as lewdness, remain fully unexplored. 
To fill this gap, we investigate performance disparities between dialects in the detection of five aspects of biased language and how to mitigate them. To alleviate bias, we present a multitask learning approach that models dialect language as an auxiliary task to incorporate syntactic and lexical variations. In our experiments with African-American English dialect, we provide empirical evidence that complementing common learning approaches with dialect modeling improves their fairness. Furthermore, the results suggest that multitask learning achieves state-of-the-art performance and helps to detect properties of biased language more reliably. 2024.findings-acl.553 @@ -13628,10 +13628,10 @@ <fixed-case>DP</fixed-case>-<fixed-case>MLM</fixed-case>: Differentially Private Text Rewriting Using Masked Language Models - StephenMeisenbacher + StephenMeisenbacher MaulikChevliTechnische Universität München JurajVladikaTechnische Universität München - FlorianMatthesTechnische Universität München + FlorianMatthesTechnische Universität München 9314-9328 2024.findings-acl.554 meisenbacher-etal-2024-dp @@ -13651,10 +13651,10 @@ <fixed-case>EX</fixed-case>-<fixed-case>FEVER</fixed-case>: A Dataset for Multi-hop Explainable Fact Verification HuanhuanMa WeizhiXu - YifanWei + YifanWei LiujiChen - LiangWang - QiangLiuInstitute of Automation, Chinese Academy of Sciences + LiangWang + QiangLiuInstitute of Automation, Chinese Academy of Sciences ShuWuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences LiangWangCASIA 9340-9353 @@ -13665,14 +13665,14 @@ Agent-<fixed-case>FLAN</fixed-case>: Designing Data and Methods of Effective Agent Tuning for Large Language Models - ZehuiChen + ZehuiChen KuikunLiu QiuchenWang - WenweiZhangShanghai AI Laboratory + WenweiZhangShanghai AI Laboratory JiangningLiu DahuaLinThe Chinese University of Hong Kong - KaiChenShanghai AI Laboratory - FengZhaoUniversity of Science and Technology of China + KaiChenShanghai AI Laboratory + 
FengZhaoUniversity of Science and Technology of China 9354-9366 Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks, however, they are still far inferior to API-based models when acting as agents. How to integrate agent ability into general LLMs becomes a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. Based on the above findings, we propose Agent-FLAN to effectively Fine-tune LANguage models for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates the hallucination issues based on our established evaluation benchmark. Besides, it consistently improves the agent capability of LLMs when scaling model sizes while slightly enhancing the general capability of LLMs. The code and models are available at https://github.com/InternLM/Agent-FLAN.
2024.findings-acl.557 @@ -13683,15 +13683,15 @@ Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification EkaterinaFadeeva AleksandrRubashevskiiSkolkovo Institute of Science and Technology - ArtemShelmanovMohamed bin Zayed University of Artificial Intelligence - SergeyPetrakov - HaonanLi + ArtemShelmanovMohamed bin Zayed University of Artificial Intelligence + SergeyPetrakov + HaonanLi HamdyMubarak EvgeniiTsymbalovIndependent Researcher GlebKuzminArtificial Intelligence Research Institute and Institute for Systems Analysis of Russian Academy of Sciences AlexanderPanchenkoSkoltech - TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne - PreslavNakovMohamed bin Zayed University of Artificial Intelligence + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + PreslavNakovMohamed bin Zayed University of Artificial Intelligence MaximPanovMohamed bin Zayed University of Artificial Intelligence 9367-9385 Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. 
Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven different LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge. @@ -13701,14 +13701,14 @@ Deciphering the Impact of Pretraining Data on Large Language Models through Machine Unlearning - YangZhao + YangZhao LiDu - XiaoDing - KaiXiongHarbin Institute of Technology + XiaoDing + KaiXiongHarbin Institute of Technology ZhouhaoSun ShiJun TingLiuHarbin Institute of Technology - BingQinHarbin Institute of Technology + BingQinHarbin Institute of Technology 9386-9406 Through pretraining on a corpus with various sources, Large Language Models (LLMs) have gained impressive performance. However, the impact of each component of the pretraining corpus remains opaque. As a result, the organization of the pretraining corpus is still empirical and may deviate from the optimal. To address this issue, we systematically analyze the impact of 48 datasets from 5 major categories of pretraining data of LLMs and measure their impacts on LLMs using benchmarks about nine major categories of model capabilities. Our analyses provide empirical results about the contribution of multiple corpora on the performances of LLMs, along with their joint impact patterns, including complementary, orthogonal, and correlational relationships. We also identify a set of “high-impact data” such as Books that is significantly related to a set of model capabilities. 
These findings provide insights into the organization of data to support more efficient pretraining of LLMs. 2024.findings-acl.559 @@ -13745,8 +13745,8 @@ Description Boosting for Zero-Shot Entity and Relation Classification GabrielePiccoInternational Business Machines LeopoldFuchsDuale Hochschule Baden-Württemberg Stuttgart - MarcosMartínez GalindoInternational Business Machines - AlbertoPurpuraInternational Business Machines + MarcosMartínez GalindoInternational Business Machines + AlbertoPurpuraInternational Business Machines VanessaLópezInternational Business Machines HoangThanh LamInternational Business Machines 9441-9457 @@ -13758,10 +13758,10 @@ Domain-Aware <tex-math>k</tex-math>-Nearest-Neighbor Knowledge Distillation for Machine Translation ZhexuanWang - ShudongLiuUniversity of Macau + ShudongLiuUniversity of Macau XueboLiuHarbin Institute of Technolgy, Shenzhen - MiaoZhangHarbin Institute of Technology (Shenzhen) - DerekWongUniversity of Macau + MiaoZhangHarbin Institute of Technology (Shenzhen) + DerekWongUniversity of Macau MinZhangHarbin Institute of Technology, Shenzhen 9458-9469 kNN-MT has utilized neighborhood knowledge for auxiliary decoding, significantly improving translation performance. Subsequently, kNN-KD transitions the use of neighborhood knowledge from the decoding phase to the training phase, to address the temporal and spatial inefficiencies inherent in kNN-MT. However, kNN-KD transfers all the kNN knowledge arbitrarily, which has the potential to restrict the learning of student models. In this paper, we propose a novel domain-aware kNN-KD method, which filters out domain-relevant neighborhood knowledge for learning in the distillation process. Notably, this entire process exclusively utilizes the neighborhood knowledge of the original model, eliminating the need for establishing any additional datastores. 
Experiments on four domain translation tasks demonstrate that our method achieves state-of-the-art performance, realizing an average gain of 1.55 COMET and 1.42 BLEU scores, by further enhancing the translation of rare words. Source code can be accessed at https://github.com/wangzx1219/Dk-KD. @@ -13773,13 +13773,13 @@ Beyond Single-Event Extraction: Towards Efficient Document-Level Multi-Event Argument Extraction WanlongLiu LiZhouThe Chinese University of Hong Kong - DingYiZeng + DingYiZeng YichenXiao - ShaohuanChengUniversity of Electronic Science and Technology of China - ChenZhangNational University of Singapore + ShaohuanChengUniversity of Electronic Science and Technology of China + ChenZhangNational University of Singapore GrandeeLeeSingapore University of Social Sciences MaluZhangUniversity of Electronic Science and Technology of China - WenyuChen + WenyuChen 9470-9487 Recent mainstream event argument extraction methods process each event in isolation, resulting in inefficient inference and ignoring the correlations among multiple events. To address these limitations, here we propose a multiple-event argument extraction model DEEIA (Dependency-guided Encoding and Event-specific Information Aggregation), capable of extracting arguments from all events within a document simultaneously. The proposed DEEIA model employs a multi-event prompt mechanism, comprising DE and EIA modules. The DE module is designed to improve the correlation between prompts and their corresponding event contexts, whereas the EIA module provides event-specific information to improve contextual understanding. Extensive experiments show that our method achieves new state-of-the-art performance on four public datasets (RAMS, WikiEvents, MLEE, and ACE05), while significantly saving the inference time compared to the baselines. Further analyses demonstrate the effectiveness of the proposed modules. 
2024.findings-acl.564 @@ -13791,14 +13791,14 @@ Revisiting Interpolation Augmentation for Speech-to-Text Generation ChenXuHarbin Engineering University - JieWang + JieWang XiaoqianLiuNortheastern University QianDongByteDance ChunliangZhangNortheastern University TongXiaoNortheastern University JingBoZhuNortheastern University DapengMan - WuYang + WuYang 9488-9499 Speech-to-text (S2T) generation systems frequently face challenges in low-resource scenarios, primarily due to the lack of extensive labeled datasets. One emerging solution is constructing virtual training samples by interpolating inputs and labels, which has notably enhanced system generalization in other domains. Despite its potential, this technique’s application in S2T tasks has remained under-explored. In this paper, we delve into the utility of interpolation augmentation, guided by several pivotal questions. Our findings reveal that employing an appropriate strategy in interpolation augmentation significantly enhances performance across diverse tasks, architectures, and data scales, offering a promising avenue for more robust S2T systems in resource-constrained settings. 2024.findings-acl.565 @@ -13843,7 +13843,7 @@ Enhancing Cross Text-Molecule Learning by Self-Augmentation YinuoJiang - XiangZhuang + XiangZhuang KeyanDingZhejiang University QiangZhangZhejiang University HuajunChenZhejiang University @@ -13856,7 +13856,7 @@ <fixed-case>R</fixed-case>e<fixed-case>PALM</fixed-case>: Popular Quote Tweet Generation via Auto-Response Augmentation ErxinYuHong Kong Polytechnic University - JingLiThe Hong Kong Polytechnic University + JingLiThe Hong Kong Polytechnic University ChunpuXu 9566-9579 A quote tweet enables users to share others’ content while adding their own commentary. In order to enhance public engagement through quote tweets, we investigate the task of generating popular quote tweets. 
This task aims to produce quote tweets that garner higher popularity, as indicated by increased likes, replies, and retweets. Despite the impressive language generation capabilities of large language models (LLMs), there has been limited research on how LLMs can effectively learn the popularity of text to better engage the public. Therefore, we introduce a novel approach called Response-augmented Popularity-Aligned Language Model (RePALM), which aligns language generation with popularity by leveraging insights from augmented auto-responses provided by readers. We utilize the Proximal Policy Optimization framework with a dual-reward mechanism to jointly optimize for the popularity of the quote tweet and its consistency with the auto-responses. In our experiments, we collected two datasets consisting of quote tweets containing external links and those referencing others’ tweets. Extensive results demonstrate the superiority of RePALM over advanced language models that do not incorporate response augmentation. @@ -13880,7 +13880,7 @@ Do Pre-Trained Language Models Detect and Understand Semantic Underspecification? Ask the <fixed-case>DUST</fixed-case>! FrankWildenburg MichaelHannaUniversity of Amsterdam - SandroPezzelleUniversity of Amsterdam + SandroPezzelleUniversity of Amsterdam 9598-9613 In everyday language use, speakers frequently utter and interpret sentences that are semantically underspecified, namely, whose content is insufficient to fully convey their message or interpret them univocally. For example, to interpret the underspecified sentence “Don’t spend too much”, which leaves implicit what (not) to spend, additional linguistic context or outside knowledge is needed. In this work, we propose a novel Dataset of semantically Underspecified Sentences grouped by Type (DUST) and use it to study whether pre-trained language models (LMs) correctly identify and interpret underspecified sentences. 
We find that newer LMs are reasonably able to identify underspecified sentences when explicitly prompted. However, interpreting them correctly is much harder for any LMs. Our experiments show that when interpreting underspecified sentences, LMs exhibit little uncertainty, contrary to what theoretical accounts of underspecification would predict. Overall, our study reveals limitations in current models’ processing of sentence semantics and highlights the importance of using naturalistic data and communicative scenarios when evaluating LMs’ language capabilities. 2024.findings-acl.572 @@ -13892,7 +13892,7 @@ WenHuangUniversity of Science and Technology of China HongbinLiuDuke University MinxinGuoUniversity of Hong Kong - NeilGongDuke University + NeilGongDuke University 9614-9631 Visual hallucination (VH) means that a multi-modal LLM (MLLM) imagines incorrect details about an image in visual question answering. Existing studies find VH instances only in existing image datasets, which results in biased understanding of MLLMs’ performance under VH due to limited diversity of such VH instances. In this work, we propose a tool called VHTest to generate a diverse set of VH instances. Specifically, VHTest finds some initial VH instances in existing image datasets (e.g., COCO), generates a text description for each VH mode, and uses a text-to-image generative model (e.g., DALL-E-3) to generate VH images based on the text descriptions. We collect a benchmark dataset with 1,200 VH instances in 8 VH modes using VHTest. We find that existing MLLMs such as GPT-4, LLaVA-1.5, and MiniGPT-v2 hallucinate for a large fraction of the instances in our benchmark. Moreover, we find that fine-tuning an MLLM using our benchmark dataset reduces its likelihood to hallucinate without sacrificing its performance on other benchmarks. Our benchmarks are publicly available: https://github.com/wenhuang2000/VHTest. 
2024.findings-acl.573 @@ -13902,11 +13902,11 @@ <fixed-case>S</fixed-case>um<fixed-case>S</fixed-case>urvey: An Abstractive Dataset of Scientific Survey Papers for Long Document Summarization RanLiuInstitute of Information Engineering, Chinese Academy of Sciences and University of Chinese Academy of Sciences - MingLiuDeakin University + MingLiuDeakin University MinYuInstitute of Information Engineering, Chinese Academy of Sciences - HeZhangCNPIEC KEXIN LTD + HeZhangCNPIEC KEXIN LTD JianguoJiangInstitute of Information Engineering, Chinese Academy of Sciences - GangLiDeakin University + GangLiDeakin University WeiqingHuangInstitute of Information Engineering, Chinese Academy of Sciences 9632-9651 With the popularity of large language models (LLMs) and their ability to handle longer input documents, there is a growing need for high-quality long document summarization datasets. Although many models already support 16k input, current lengths of summarization datasets are inadequate, and salient information is not evenly distributed. To bridge these gaps, we collect a new summarization dataset called SumSurvey, consisting of more than 18k scientific survey papers. With an average document length exceeding 12k and a quarter exceeding 16k, as well as the uniformity metric outperforming current mainstream long document summarization datasets, SumSurvey brings new challenges and expectations to both fine-tuned models and LLMs. The informativeness of summaries and the models supporting the evaluation of long document summarization warrant further attention. Automatic and human evaluation results on this abstractive dataset confirm this view. Our dataset and code are available at https://github.com/Oswald1997/SumSurvey. 
@@ -13916,10 +13916,10 @@ Pushing the Limits of Low-Resource <fixed-case>NER</fixed-case> Using <fixed-case>LLM</fixed-case> Artificial Data Generation - JoanSantosoInstitut Sains dan Teknologi Terpadu Surabaya + JoanSantosoInstitut Sains dan Teknologi Terpadu Surabaya PatrickSutanto BillyCahyadiInstitut Sains dan Teknologi Terpadu Surabaya - EstherSetiawanInstitut Sains dan Teknologi Terpadu Surabaya + EstherSetiawanInstitut Sains dan Teknologi Terpadu Surabaya 9652-9667 Named Entity Recognition (NER) is an important task, but to achieve great performance, it is usually necessary to collect a large amount of labeled data, incurring high costs. In this paper, we propose using open-source Large Language Models (LLM) to generate NER data with only a few labeled examples, reducing the cost of human annotations. Our proposed method is very simple and can perform well using only a few labeled data points. Experimental results on diverse low-resource NER datasets show that our proposed data generation method can significantly improve the baseline. Additionally, our method can be used to augment datasets with class-imbalance problems and consistently improves model performance on macro-F1 metrics. 2024.findings-acl.575 @@ -13930,9 +13930,9 @@ Understanding and Patching Compositional Reasoning in <fixed-case>LLM</fixed-case>s ZhaoyiLiCity University of Hong Kong and University of Science and Technology of China GangweiJiangCity University of Hong Kong and University of Science and Technology of China - HongXieUniversity of Science and Technology of China - LinqiSongCity University of Hong Kong - DefuLianUniversity of Science and Technology of China + HongXieUniversity of Science and Technology of China + LinqiSongCity University of Hong Kong + DefuLianUniversity of Science and Technology of China YingWeiNanyang Technological University 9668-9688 LLMs have marked a revolutionary shift, yet they falter when faced with compositional reasoning tasks. 
Our research embarks on a quest to uncover the root causes of compositional reasoning failures of LLMs, uncovering that most of them stem from the improperly generated or leveraged implicit reasoning results. Inspired by our empirical findings, we resort to Logit Lens and an intervention experiment to dissect the inner hidden states of LLMs. This deep dive reveals that implicit reasoning results indeed surface within middle layers and play a causative role in shaping the final explicit reasoning results. Our exploration further locates multi-head self-attention (MHSA) modules within these layers, which emerge as the linchpins in accurate generation and leveraging of implicit reasoning results. Grounded on the above findings, we develop CREME, a lightweight method to patch errors in compositional reasoning via editing the located MHSA modules. Our empirical evidence stands testament to CREME’s effectiveness, paving the way for autonomously and continuously enhancing compositional reasoning capabilities in language models. @@ -13942,7 +13942,7 @@ Bilingual Rhetorical Structure Parsing with Large Parallel Annotations - ElenaChistovaFRC CSC RAS + ElenaChistovaFRC CSC RAS 9689-9706 Discourse parsing is a crucial task in natural language processing that aims to reveal the higher-level relations in a text. Despite growing interest in cross-lingual discourse parsing, challenges persist due to limited parallel data and inconsistencies in the Rhetorical Structure Theory (RST) application across languages and corpora. To address this, we introduce a parallel Russian annotation for the large and diverse English GUM RST corpus. Leveraging recent advances, our end-to-end RST parser achieves state-of-the-art results on both English and Russian corpora. It demonstrates effectiveness in both monolingual and bilingual settings, successfully transferring even with limited second-language annotation. 
To the best of our knowledge, this work is the first to evaluate the potential of cross-lingual end-to-end RST parsing on a manually annotated parallel corpus. 2024.findings-acl.577 @@ -13951,8 +13951,8 @@ <fixed-case>B</fixed-case>ook2<fixed-case>D</fixed-case>ial: Generating Teacher Student Interactions from Textbooks for Cost-Effective Development of Educational Chatbots - JunlingWangETHZ - ETH Zurich - JakubMacinaDepartment of Computer Science, ETHZ - ETH Zurich + JunlingWangETHZ - ETH Zurich + JakubMacinaDepartment of Computer Science, ETHZ - ETH Zurich NicoDaheimTechnische Universität Darmstadt SankalanPal Chowdhury MrinmayaSachanSwiss Federal Institute of Technology @@ -13966,8 +13966,8 @@ <fixed-case>SELP</fixed-case>: A Semantically-Driven Approach for Separated and Accurate Class Prototypes in Few-Shot Text Classification WenxinLiang TingyuZhangDalian University of Technology - HanLiuDalian University of Technology - FengZhangPeking University + HanLiuDalian University of Technology + FengZhangPeking University 9732-9741 2024.findings-acl.579 liang-etal-2024-selp @@ -13977,7 +13977,7 @@ Automated Focused Feedback Generation for Scientific Writing Assistance EricChamounUniversity of Cambridge MichaelSchlichtkrullQueen Mary, University of London - AndreasVlachosUniversity of Cambridge + AndreasVlachosUniversity of Cambridge 9742-9763 Scientific writing is a challenging task, particularly for novice researchers who often rely on feedback from experienced peers. Recent work has primarily focused on improving surface form and style rather than manuscript content. In this paper, we propose a novel task: automated focused feedback generation for scientific writing assistance. We present SWIF^2T: a Scientific WrIting Focused Feedback Tool. It is designed to generate specific, actionable and coherent comments, which identify weaknesses in a scientific paper and/or propose revisions to it. 
Our approach consists of four components - planner, investigator, reviewer and controller - leveraging multiple Large Language Models (LLMs) to implement them. We compile a dataset of 300 peer reviews citing weaknesses in scientific papers and conduct human evaluation. The results demonstrate the superiority in specificity, reading comprehension, and overall helpfulness of SWIF^2T’s feedback compared to other approaches. In our analysis, we also identified cases where automatically generated reviews were judged better than human ones, suggesting opportunities for integration of AI-generated feedback in scientific writing. 2024.findings-acl.580 @@ -13987,7 +13987,7 @@ <fixed-case>F</fixed-case>ast<fixed-case>GAS</fixed-case>: Fast Graph-based Annotation Selection for In-Context Learning ZihanChen - SongWangUniversity of Virginia + SongWangUniversity of Virginia CongShenUniversity of Virginia JundongLiUniversity of Virginia 9764-9780 @@ -14013,7 +14013,7 @@ Integrating Multi-scale Contextualized Information for Byte-based Neural Machine Translation - LanglinHuang + LanglinHuang YangFengInstitute of Computing Technology, Chinese Academy of Sciences 9794-9801 Subword tokenization is a common method for vocabulary building in Neural Machine Translation (NMT) models. However, increasingly complex tasks have revealed its disadvantages. First, a vocabulary cannot be modified once it is learned, making it hard to adapt to new words. Second, in multilingual translation, the imbalance in data volumes across different languages spreads to the vocabulary, exacerbating translations involving low-resource languages. While byte-based tokenization addresses these issues, byte-based models struggle with the low information density inherent in UTF-8 byte sequences. Previous works enhance token semantics through local contextualization but fail to select an appropriate contextualizing scope based on the input. 
Consequently, we propose the Multi-Scale Contextualization (MSC) method, which learns contextualized information of varying scales across different hidden state dimensions. It then leverages the attention module to dynamically integrate the multi-scale contextualized information. Experiments show that MSC significantly outperforms subword-based and other byte-based methods in both multilingual and out-of-domain scenarios. Code can be found in https://github.com/ictnlp/Multiscale-Contextualization. @@ -14024,9 +14024,9 @@ Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability Afra FeyzaAkyürekBoston University - EkinAkyürek + EkinAkyürek LeshemChoshenInternational Business Machines - DerryWijayaMonash University and Boston University + DerryWijayaMonash University and Boston University JacobAndreasMassachusetts Institute of Technology and Microsoft 9802-9818 While language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. As a consequence, current LMs also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate, yielding an efficient self-supervised procedure for improving LM factuality. Given a collection of seed documents, DCT prompts LMs to generate additional text implied by these documents, reason globally about the correctness of this generated text, and finally fine-tune on text inferred to be correct. Given seed documents from a trusted source, DCT provides a tool for supervised model updating; if seed documents are sampled from the LM itself, DCT enables fully unsupervised fine-tuning for improved coherence and accuracy. 
Across the CREAK, MQuAKE, and Reversal Curse datasets, supervised DCT improves LM fact verification and text generation accuracy by 3-26%; on CREAK, fully unsupervised DCT improves verification accuracy by 12%. These results show that LMs’ reasoning capabilities during inference can be leveraged during training to improve their reliability. @@ -14038,9 +14038,9 @@ Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion RuiqiLi RongjieHuangFAIR - YongqiWangZhejiang University + YongqiWangZhejiang University ZhiqingHong - ZhouZhaoZhejiang University and Zhejiang University + ZhouZhaoZhejiang University and Zhejiang University 9819-9831 Speech-to-singing voice conversion (STS) task always suffers from data scarcity, because it requires paired speech and singing data. Compounding this issue are the challenges of content-pitch alignment and the suboptimal quality of generated outputs, presenting significant hurdles in STS research. This paper presents SVPT, an STS approach boosted by a self-supervised singing voice pre-training model. We leverage spoken language model techniques to tackle the rhythm alignment problem and the in-context learning capability to achieve zero-shot conversion. We adopt discrete-unit random resampling and pitch corruption strategies, enabling training with unpaired singing data and thus mitigating the issue of data scarcity. SVPT also serves as an effective backbone for singing voice synthesis (SVS), offering insights into scaling up SVS models. Experimental results indicate that SVPT delivers notable improvements in both STS and SVS endeavors. Audio samples are available at https://speech2sing.github.io. 
2024.findings-acl.585 @@ -14063,7 +14063,7 @@ YanghaiZhangUniversity of Science and Technology of China YeLiuUniversity of Science and Technology of China ShiweiWuPeking University, Peking University and The Chinese University of Hong Kong - KaiZhang + KaiZhang XukaiLiuUniversity of Science and Technology of China QiLiuUniversity of Science and Technology of China EnhongChenUniversity of Science and Technology of China @@ -14079,7 +14079,7 @@ XinLiangUniversity of Central Florida JiaqiXueUniversity of Central Florida YanchengZhang - RuiXieUniversity of Central Florida + RuiXieUniversity of Central Florida MengxinZhengUniversity of Central Florida 9863-9875 It is imperative to ensure the stability of every prediction made by a language model; that is, a language’s prediction should remain consistent despite minor input variations, like word substitutions. In this paper, we investigate the problem of certifying a language model’s robustness against Universal Text Perturbations (UTPs), which have been widely used in universal adversarial attacks and backdoor attacks. Existing certified robustness based on random smoothing has shown considerable promise in certifying the input-specific text perturbations (ISTPs), operating under the assumption that any random alteration of a sample’s clean or adversarial words would negate the impact of sample-wise perturbations. However, with UTPs, masking only the adversarial words can eliminate the attack. A naive method is to simply increase the masking ratio and the likelihood of masking attack tokens, but it leads to a significant reduction in both certified accuracy and the certified radius due to input corruption by extensive masking. To solve this challenge, we introduce a novel approach, the superior prompt search method, designed to identify a superior prompt that maintains higher certified accuracy under extensive masking. 
Additionally, we theoretically motivate why ensembles are a particularly suitable choice as base prompts for random smoothing. The method is denoted by superior prompt ensembling technique. We also empirically confirm this technique, obtaining state-of-the-art results in multiple settings. These methodologies, for the first time, enable high certified accuracy against both UTPs and ISTPs. The source code of CR-UTP is available at https://github.com/UCF-ML-Research/CR-UTP. @@ -14090,8 +14090,8 @@ Recovering document annotations for sentence-level bitext RachelWicksJohns Hopkins University - MattPostMicrosoft and Johns Hopkins University - PhilippKoehnJohns Hopkins University + MattPostMicrosoft and Johns Hopkins University + PhilippKoehnJohns Hopkins University 9876-9890 In machine translation, historical models were incapable of handling longer contexts, so the lack of document-level datasets was less noticeable. Now, despite the emergence of long-sequence methods, we remain within a sentence-level paradigm and without data to adequately approach context-aware machine translation. Most large-scale datasets have been processed through a pipeline that discards document-level metadata. In this work, we reconstruct document-level information for three (ParaCrawl, News Commentary, and Europarl) large datasets in German, French, Spanish, Italian, Polish, and Portuguese (paired with English). We then introduce a document-level filtering technique as an alternative to traditional bitext filtering. We present this filtering with analysis to show that this method prefers context-consistent translations rather than those that may have been sentence-level machine translated. Last we train models on these longer contexts and demonstrate improvement in document-level translation without degradation of sentence-level translation. We release our dataset, ParaDocs, and resulting models as a resource to the community. 
2024.findings-acl.589 @@ -14100,11 +14100,11 @@ <fixed-case>M</fixed-case>eta<fixed-case>P</fixed-case>ro 2.0: Computational Metaphor Processing on the Effectiveness of Anomalous Language Modeling - RuiMao + RuiMao KaiHeNational University of Singapore ClaudiaOng - QianLiuUniversity of Auckland - ErikCambriaNanyang Technological University + QianLiuUniversity of Auckland + ErikCambriaNanyang Technological University 9891-9908 Metaphor interpretation is a difficult task in natural language understanding. The development of relevant techniques in this domain is slow, mostly because of the lack of large annotated datasets and effective pre-trained language models (PLMs) for metaphor learning. Thus, we propose a large annotated dataset and a PLM for the metaphor interpretation task. Our foundation model is based on a novel anomalous language modeling (ALM) method, which we benchmark with comparable PLM baselines on the new dataset, finding that it largely improves model performance on metaphor identification and interpretation. 2024.findings-acl.590 @@ -14116,9 +14116,9 @@ ShenzhiWangDepartment of Automation, Tsinghua University ChangLiu ZilongZhengBeijing Institute for General Artificial Intelligence - SiyuanQiBeijing Institute for General Artificial Intelligence + SiyuanQiBeijing Institute for General Artificial Intelligence ShuoChenBeijing Institute for General Artificial Intelligence - QisenYang + QisenYang AndrewZhao ChaofeiWangTsinghua University, Tsinghua University ShijiSongTsinghua University, Tsinghua University @@ -14132,7 +14132,7 @@ Direct Preference Optimization with an Offset AfraAminiETHZ - ETH Zurich - TimVieiraJohns Hopkins University + TimVieiraJohns Hopkins University RyanCotterellSwiss Federal Institute of Technology 9954-9972 Direct preference optimization (DPO) is a successful fine-tuning strategy for aligning large language models with human preferences without the need to train a reward model or employ reinforcement learning. 
DPO, as originally formulated, relies on binary preference data and fine-tunes a language model to increase the likelihood of a preferred response over a dispreferred response. However, not all preference pairs are equal. Sometimes, the preferred response is only slightly better than the dispreferred one. In other cases, the preference is much stronger. For instance, if a response contains harmful or toxic content, the annotator will have a strong preference for that response. In this paper, we propose a generalization of DPO, termed DPO with an offset (ODPO), that does not treat every preference pair equally during fine-tuning. Intuitively, ODPO requires the difference between the likelihood of the preferred and dispreferred response to be greater than an offset value. The offset is determined based on the extent to which one response is preferred over another. Our experiments on various tasks suggest that ODPO significantly outperforms DPO in aligning language models, especially when the number of preference pairs is limited. @@ -14142,16 +14142,16 @@ <fixed-case>T</fixed-case>rans<fixed-case>F</fixed-case>ace: Unit-Based Audio-Visual Speech Synthesizer for Talking Head Translation - XizeCheng + XizeCheng RongjieHuangFAIR - LinjunLiZhejiang University - ZehanWang - TaoJin + LinjunLiZhejiang University + ZehanWang + TaoJin AoxiongYinMicrosoft and Zhejiang University ChenFeiyang XinyuDuan BaoxingHuai - ZhouZhaoZhejiang University and Zhejiang University + ZhouZhaoZhejiang University and Zhejiang University 9973-9986 Direct speech-to-speech translation achieves high-quality results through the introduction of discrete units obtained from self-supervised learning. 
However, talking head translation, converting audio-visual speech (i.e., talking head video) from one language into another, still confronts several challenges compared to audio speech: (1) Existing methods invariably rely on cascading, synthesizing via both audio and text, resulting in delays and cascading errors. (2) Talking head translation has a limited set of reference frames. If the generated translation exceeds the length of the original speech, the video sequence needs to be supplemented by repeating frames, leading to jarring video transitions. In this work, we propose a model for talking head translation, TransFace, which can directly translate audio-visual speech into audio-visual speech in other languages. It consists of a speech-to-unit translation model to convert audio speech into discrete units and a unit-based audio-visual speech synthesizer, Unit2Lip, to re-synthesize synchronized audio-visual speech from discrete units in parallel. Furthermore, we introduce a Bounded Duration Predictor, ensuring isometric talking head translation and preventing duplicate reference frames. Experiments demonstrate that Unit2Lip significantly improves synchronization and boosts inference speed by a factor of 4.35 on LRS2. Additionally, TransFace achieves impressive BLEU scores of 61.93 and 47.55 for Es-En and Fr-En on LRS3-T and 100% isochronous translations. The samples are available at https://transface-demo.github.io . 2024.findings-acl.593 @@ -14160,7 +14160,7 @@ More than Minorities and Majorities: Understanding Multilateral Bias in Language Generation - JiaxuZhao + JiaxuZhao ZijingShi YitongLiHuawei Technologies Co., Ltd. 
YulongPeiEindhoven University of Technology @@ -14176,10 +14176,10 @@ Fair Federated Learning with Biased Vision-Language Models HuiminZeng - ZhenruiYue + ZhenruiYue YangZhangUniversity of Illinois at Urbana-Champaign - LanyuShang - DongWangUniversity of Illinois at Urbana-Champaign + LanyuShang + DongWangUniversity of Illinois at Urbana-Champaign 10002-10017 Existing literature that integrates CLIP into federated learning (FL) largely ignores the inherent group unfairness within CLIP and its ethical implications on FL applications. Furthermore, such CLIP bias may be amplified in FL, due to the unique issue of data heterogeneity across clients. However, in identity-sensitive FL applications, model fairness (i.e., group fairness) is imperative for model development. Therefore, this work explores a critical question ignored by the existing literature: how can we build a fair FL framework using biased pre-trained VLMs (e.g., CLIP)? To address this problem, we propose a fairness-aware adaptation framework tailored for VLM (e.g., CLIP) in the context of FL, named Fair Federated Deep Visual Prompting or FF-DVP. As implied by its name, FF-DVP trains a fair FL model with fairness-aware deep visual prompting (DVP). Moreover, FF-DVP incorporates modality-fused classification heads to learn client-specific knowledge and fairness constraints. These modules explicitly address a unique bias in FL, namely the bias triggered by data heterogeneity. We show that FF-DVP can be readily extended to prevailing parameter-efficient fine-tuning methods (e.g., adapter or LoRA) for debiasing. To the best of our knowledge, FF-DVP is the first to leverage biased VLMs for building fair FL frameworks. Extensive results on human face attribute recognition (FAR) applications suggest that FF-DVP effectively improves model fairness and training convergence, outperforming state-of-the-art baselines.
2024.findings-acl.595 @@ -14228,8 +14228,8 @@ JieHe YuhuaKe GuangyaoZhuWaseda University - VictorGutierrez BasultoCardiff University - JeffPanUniversity of Edinburgh, University of Edinburgh + VictorGutierrez BasultoCardiff University + JeffPanUniversity of Edinburgh, University of Edinburgh 10057-10084 Multimodal Large Language Models (MLLMs) fine-tuned with multimodal instruction-following data have demonstrated formidable capabilities in multimodal tasks. However, fine-tuning all parameters of MLLMs has become challenging due to the rapid growth of the overall model’s parameters. To address this issue, we study Parameter-Efficient Fine-Tuning (PEFT) methods for MLLMs. We aim to identify effective methods for enhancing performance in scenarios where only a limited number of parameters are trained. This paper conducts empirical studies that employ four widely used PEFT methods to fine-tune the LLM component of open-source MLLMs. We present a comprehensive analysis that encompasses various aspects, including the impact of PEFT methods on various models, parameters and location of PEFT module, fine-tuning data scale, model stability based on PEFT method, MLLM’s generalization, and hallucination. We evaluated four PEFT methods on seven datasets from two different categories, unseen and seen datasets. Across all experiments, we show that the adapter is the best-performing PEFT method in various aspects. At the same time, fine-tuning the connector layers leads to improved performance in most MLLMs. 
2024.findings-acl.598 @@ -14239,8 +14239,8 @@ <fixed-case>PARADISE</fixed-case>: Evaluating Implicit Planning Skills of Language Models with Procedural Warnings and Tips Dataset ArdaUzunoğluJohns Hopkins University - AbdulfattahSafaKoç University - Gözde GülŞahinKoç University + AbdulfattahSafaKoç University + Gözde GülŞahinKoç University 10085-10102 Recently, there has been growing interest within the community regarding whether large language models are capable of planning or executing plans. However, most prior studies use LLMs to generate high-level plans for simplified scenarios lacking linguistic complexity and domain diversity, limiting analysis of their planning abilities. These setups constrain evaluation methods (e.g., predefined action space), architectural choices (e.g., only generative models), and overlook the linguistic nuances essential for realistic analysis. To tackle this, we present PARADISE, an abductive reasoning task using Q&A format on practical procedural text sourced from wikiHow. It involves tip and warning inference tasks directly associated with goals, excluding intermediary steps, with the aim of testing the ability of the models to infer implicit knowledge of the plan solely from the given goal. Our experiments, utilizing fine-tuned language models and zero-shot prompting, reveal the effectiveness of task-specific small models over large language models in most scenarios. Despite advancements, all models fall short of human performance. Notably, our analysis uncovers intriguing insights, such as variations in model behavior with dropped keywords, struggles of BERT-family and GPT-4 with physical and abstract goals, and the proposed tasks offering valuable prior knowledge for other unseen procedural tasks. The PARADISE dataset and associated resources are publicly available for further research exploration with https://anonymous.4open.science/r/paradise-53BD/README.md. 
2024.findings-acl.599 @@ -14249,12 +14249,12 @@ <fixed-case>TURNA</fixed-case>: A <fixed-case>T</fixed-case>urkish Encoder-Decoder Language Model for Enhanced Understanding and Generation - GökçeUludoğan + GökçeUludoğan ZeynepBalalBoğaziçi University FurkanAkkurtBoğaziçi University - MeliksahTurkerBogazici University + MeliksahTurkerBogazici University OnurGungorBoğaziçi University - SusanÜsküdarlıBoğaziçi University + SusanÜsküdarlıBoğaziçi University 10103-10117 The recent advances in natural language processing have predominantly favored well-resourced English-centric models, resulting in a significant gap with low-resource languages. In this work, we introduce TURNA, a language model developed for the low-resource language Turkish and capable of both natural language understanding and generation tasks. TURNA is pretrained with an encoder-decoder architecture based on the unified framework UL2 with a diverse corpus that we specifically curated for this purpose. We evaluated TURNA with three generation tasks and five understanding tasks for Turkish. The results show that TURNA outperforms several multilingual models in both understanding and generation tasks and competes with monolingual Turkish models in understanding tasks. 2024.findings-acl.600 @@ -14268,7 +14268,7 @@ ShuichiroShimizu ZhengdongYangKyoto University, Kyoto University YihangLi - ChenhuiChuKyoto University + ChenhuiChuKyoto University SadaoKurohashiKyoto University 10118-10126 Emotion plays a crucial role in human conversation. This paper underscores the significance of considering emotion in speech translation. We present the MELD-ST dataset for the emotion-aware speech translation task, comprising English-to-Japanese and English-to-German language pairs. Each language pair includes about 10,000 utterances annotated with emotion labels from the MELD dataset.
Baseline experiments using the SeamlessM4T model on the dataset indicate that fine-tuning with emotion labels can enhance translation performance in some settings, highlighting the need for further research in emotion-aware speech translation systems. @@ -14291,7 +14291,7 @@ Chain-of-Quizzes: Pedagogy-inspired Example Selection in In-Context-Learning YiquanWuZhejiang University AnlaiZhouZhejiang University - YuhangLiuZhejiang University + YuhangLiuZhejiang University YifeiLiu AdamJatowt WeimingLuZhejiang University @@ -14306,7 +14306,7 @@ It’s Not Easy Being Wrong: Large Language Models Struggle with Process of Elimination Reasoning NishantBalepur - ShramayPalta + ShramayPalta RachelRudingerUniversity of Maryland, College Park 10143-10166 Chain-of-thought (COT) prompting can help large language models (LLMs) reason toward correct answers, but its efficacy in reasoning toward incorrect answers is unexplored. This process of elimination (PoE), when used with COT, can enhance self-consistency, interpretability, and tasks such as medical diagnoses of exclusion. Thus, we propose PoE with COT, where LLMs must reason toward incorrect options on multiple-choice questions. We evaluate the ability of GPT-3.5, LLaMA-2, and Falcon to perform PoE with COT on a total of four commonsense and scientific reasoning datasets. We find that the strategy of PoE always underperforms the strategy of choosing the correct answer. The agreement of these strategies is also lower than the self-consistency of each strategy. To study these issues further, we conduct error analyses and give suggestions for future work. 
@@ -14316,13 +14316,13 @@ From Discrimination to Generation: Low-Resource Intent Detection with Language Model Instruction Tuning - FengZhangPeking University + FengZhangPeking University WeiChen FeiDingPeking University MengGao TengjiaoWang JiahuiYao - JiabinZheng + JiabinZheng 10167-10183 Intent detection aims to identify user goals from utterances, and is a ubiquitous step towards the satisfaction of user desired needs in many interaction systems. As dynamic and varied intents arise, models that are capable of identifying new intents promptly are required. However, existing studies usually fine-tune discriminative models on the specific defined intent classes, precluding them from being directly adopted to new intent domains. In this paper, we introduce a generative pre-trained intent model that can recognize new intents from different domains in low-resource scenarios. We reformulate intent detection into a generation task and design descriptive and regularized instructions to guide the model effectively to detect new intents in open domains with no parameter updates. To validate the proposed method, we introduce a new intent detection benchmark, including the Meta-Intent Dataset and three types of representative evaluation settings. We conduct extensive experiments which demonstrate that our method outperforms a range of strong baselines that need further fine-tuning or domain-specific samples.
2024.findings-acl.605 @@ -14331,7 +14331,7 @@ Efficient Continual Pre-training for Building Domain Specific Large Language Models - YongXieAmazon + YongXieAmazon KaranAggarwalAmazon and University of Minnesota, Minneapolis AitzazAhmadAmazon 10184-10201 @@ -14342,7 +14342,7 @@ Distantly-Supervised Joint Extraction with Noise-Robust Learning - YufeiLiUniversity of California, Riverside + YufeiLiUniversity of California, Riverside XiaoYuStellar Cyber YanghongGuo YanchiLiuNEC-Labs @@ -14373,8 +14373,8 @@ YiQiuGuo YuchenYang YaZhangShanghai Jiao Tong University - YuWangShanghai Jiao Tong University - YanfengWangShanghai Jiao Tong University + YuWangShanghai Jiao Tong University + YanfengWangShanghai Jiao Tong University 10231-10241 Structured data offers an efficient means of organizing information. Existing text-serialization based methods for processing structured data using large language models (LLMs) are not designed to explicitly capture the heterogeneity of structured data. Such methods are suboptimal for LLMs to process structured data, and may lead to large input token size and poor robustness to input perturbation. In this paper, we propose a novel framework called DictLLM, which is an efficient and effective framework for the modeling of medical lab reports to deal with the report-assisted diagnosis generation task. DictLLM introduces 1) group positional encoding to maintain the permutation invariance, 2) hierarchical attention bias to capture the inductive bias of structured data, and 3) an optimal transport alignment layer to align the embeddings generated by the dict encoder with the LLM, producing a list of fixed-length virtual tokens. We conduct experiments with multiple LLM models on a large-scale real-world medical lab report dataset for automatic diagnosis generation. The results show that our proposed framework outperforms the baseline methods and few-shot GPT-4 in terms of both Rouge-L and Knowledge F1 score.
We also conduct multiple experiments and analyze the scalability and robustness of our proposed framework, demonstrating the superiority of our method in modeling the heterogeneous structure of medical dictionaries data. 2024.findings-acl.609 @@ -14383,10 +14383,10 @@ imap<fixed-case>S</fixed-case>core: Medical Fact Evaluation Made Easy - HuiminWangJarvis Research Center, Tencent YouTu Lab + HuiminWangJarvis Research Center, Tencent YouTu Lab YutianZhaoTencent AI Lab - XianWuTencent - YefengZheng + XianWuTencent + YefengZheng 10242-10257 Automatic evaluation of natural language generation (NLG) tasks has gained extensive research interests, since it can rapidly assess the performance of large language models (LLMs). However, automatic NLG evaluation struggles with medical QA because it fails to focus on the crucial correctness of medical facts throughout the generated text. To address this, this paper introduces a new data structure, imap, designed to capture key information in questions and answers, enabling evaluators to focus on essential details. The imap comprises three components: Query, Constraint, and Inform, each of which is in the form of term-value pairs to represent medical facts in a structural manner. We then introduce imapScore, which compares the corresponding medical term-value pairs in the imap to score generated texts. We utilize GPT-4 to extract imap from questions, human-annotated answers, and generated responses. To mitigate the diversity in medical terminology for fair term-value pairs comparison, we use a medical knowledge graph to assist GPT-4 in determining matches. To compare imapScore with existing NLG metrics, we establish a new benchmark dataset. The experimental results show that imapScore consistently outperforms state-of-the-art metrics, demonstrating an average improvement of 79.8% in correlation with human scores. 
Furthermore, incorporating imap into n-gram, embedding, and LLM metrics boosts the base versions, increasing correlation with human scores by averages of 89.9%, 81.7%, and 32.6%, respectively. 2024.findings-acl.610 @@ -14412,7 +14412,7 @@ Debiasing Large Language Models with Structured Knowledge CongdaMaTokyo Institute of Technology, Tokyo Institute of Technology TianyuZhaoSakana AI - ManabuOkumuraTokyo Institute of Technology, Tokyo Institute of Technology + ManabuOkumuraTokyo Institute of Technology, Tokyo Institute of Technology 10274-10287 Due to biases inherently present in data for pre-training, current pre-trained Large Language Models (LLMs) also ubiquitously manifest the same phenomena. Since the bias influences the output from the LLMs across various tasks, the widespread deployment of the LLMs is hampered. We propose a simple method that utilizes structured knowledge to alleviate this issue, aiming to reduce the bias embedded within the LLMs and ensuring they have an encompassing perspective when used in applications. Experimental results indicated that our method has good debiasing ability when applied to both existing autoregressive and masked language models. Additionally, it could ensure that the performances of LLMs on downstream tasks remain uncompromised. Our method outperforms state-of-the-art (SOTA) baselines in the debiasing ability. Importantly, our method obviates the need for training from scratch, thus offering enhanced scalability and cost-effectiveness.
2024.findings-acl.612 @@ -14428,7 +14428,7 @@ FanYinUniversity of California, Los Angeles AramGalstyanInformation Sciences Institute, University of Southern California, University of Southern California, University of Southern California and Amazon Alexa WenpengYinPennsylvania State University - MuhaoChenUniversity of California, Davis and University of Southern California + MuhaoChenUniversity of California, Davis and University of Southern California 10288-10302 Instruction tuning has been used as a promising approach to improve the performance of large language models (LLMs) on unseen tasks. However, current LLMs exhibit limited robustness to unseen instructions, generating inconsistent outputs when the same instruction is phrased with slightly varied forms or language styles. This behavior indicates LLMs’ lack of robustness to textual variations and generalizability to unseen instructions, potentially leading to trustworthiness issues. Accordingly, we propose Contrastive Instruction Tuning, which maximizes the similarity between the hidden representations of semantically equivalent instruction-instance pairs while minimizing the similarity between semantically different ones. To facilitate this approach, we augment the existing FLAN collection by paraphrasing task instructions. Experiments on the PromptBench benchmark show that CoIN consistently improves LLMs’ robustness to unseen instructions with variations across character, word, sentence, and semantic levels by an average of +2.5% in accuracy. 
2024.findings-acl.613 @@ -14451,9 +14451,9 @@ Refining and Synthesis: A Simple yet Effective Data Augmentation Framework for Cross-Domain Aspect-based Sentiment Analysis - HainingWang + HainingWang KangHe - BoboLiWuhan University + BoboLiWuhan University LeiChen FeiLiWuhan University XuHan @@ -14468,9 +14468,9 @@ Codec-<fixed-case>SUPERB</fixed-case>: An In-Depth Analysis of Sound Codec Models HaibinWu - Ho-LamChungNational Taiwan University + Ho-LamChungNational Taiwan University Yi-ChengLin - Yuan-KueiWuNational Taiwan University + Yuan-KueiWuNational Taiwan University XuanjunChen Yu-ChiPai Hsiu-HsuanWang @@ -14488,8 +14488,8 @@ SirryChen ShuoFeng LiangSongsong - Chen-ChenZong - JingLiThe Hong Kong Polytechnic University + Chen-ChenZong + JingLiThe Hong Kong Polytechnic University PijiLiNanjing University of Aeronautics and Astronautics 10349-10360 Social media bot detection is increasingly crucial with the rise of social media platforms. Existing methods predominantly construct social networks as graph and utilize graph neural networks (GNNs) for bot detection. However, most of these methods focus on how to improve the performance of GNNs while neglecting the community structure within social networks. Moreover, GNNs based methods still face problems such as poor model generalization due to the relatively small scale of the dataset and over-smoothness caused by information propagation mechanism. To address these problems, we propose the Community-Aware Heterogeneous Graph Contrastive Learning framework (i.e., CACL), which constructs social network as heterogeneous graph with multiple node types and edge types, and then utilizes community-aware module to mine both hard positive samples and hard negative samples for supervised graph contrastive learning with adaptive graph enhancement algorithms. 
Extensive experiments demonstrate that our framework addresses the previously mentioned challenges and outperforms competitive baselines on three social media bot benchmarks. @@ -14501,8 +14501,8 @@ Are Machines Better at Complex Reasoning? Unveiling Human-Machine Inference Gaps in Entailment Verification SoumyaSanyal TianyiXiao - JiachengLiuAllen Institute for Artificial Intelligence and Paul G. Allen School of Computer Science and Engineering, University of Washington - WenyaWangNanyang Technological University + JiachengLiuAllen Institute for Artificial Intelligence and Paul G. Allen School of Computer Science and Engineering, University of Washington + WenyaWangNanyang Technological University XiangRen 10361-10386 Making inferences in text comprehension to understand the meaning is essential in language processing. This work studies the entailment verification (EV) problem of complex, multi-sentence premises requiring a system to make multiple inferences implicitly. Modern applications of EV in detecting inconsistent model-generated rationales require complex multi-hop reasoning. However, current textual inference datasets mostly contain short-sentence premises that partially focus on this. To address this, we compile an EV benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales) containing multi-sentence premises. On benchmarking humans and LLMs, we find that LLMs are better than humans in multi-hop reasoning across extended contexts, while humans perform better in simple deductive reasoning tasks. We also finetune a Flan-T5 model for EV using two training objectives to obtain a strong open-source model that outperforms GPT-3.5 and rivals GPT-4. Finally, we use our finetuned model to filter out inconsistent model-generated rationales in self-consistency decoding, resulting in a 6% accuracy improvement on average across three MCQ datasets. 
@@ -14514,7 +14514,7 @@ <fixed-case>C</fixed-case>hart<fixed-case>I</fixed-case>nstruct: Instruction Tuning for Chart Comprehension and Reasoning AhmedMasryYork University MehradShahmohammadi - Md RizwanParvezQatar Computing Research Institute and Bosch + Md RizwanParvezQatar Computing Research Institute and Bosch EnamulHoqueYork University ShafiqJotySalesForce.com and Nanyang Technological University 10387-10409 @@ -14545,8 +14545,8 @@ MengLiPeking University AasishPappuMeta AI BarlasOguzMeta - MuhammadAbdul-MageedUniversity of British Columbia - LaksLakshmananUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia + LaksLakshmananUniversity of British Columbia RaghuramanKrishnamoorthiFacebook VikasChandraMeta 10424-10443 @@ -14558,7 +14558,7 @@ <fixed-case>S</fixed-case>hared<fixed-case>C</fixed-case>on: Implicit Hate Speech Detection using Shared Semantics HyeseonAhnYonsei University - YoungwookKimKT Corporation + YoungwookKimKT Corporation JunginKim Yo-SubHanYonsei University 10444-10455 @@ -14592,11 +14592,11 @@ Generalization-Enhanced Code Vulnerability Detection via Multi-Task Instruction Fine-Tuning - XiaohuDu + XiaohuDu MingWenHuazhong University of Science and Technology JiahaoZhu - ZifanXie - BinJiNational University of Singapore + ZifanXie + BinJiNational University of Singapore HuijunLiu XuanhuaShiHuazhong University of Science and Technology HaiJinHuazhong University of Science and Technology @@ -14610,9 +14610,9 @@ <fixed-case>PPTSER</fixed-case>: A Plug-and-Play Tag-guided Method for Few-shot Semantic Entity Recognition on Visually-rich Documents WenhuiLiao JiapengWang - ZeningLinSouth China University of Technology + ZeningLinSouth China University of Technology LongfeiXiongKingsoft Office - LianwenJinSouth China University of Technology + LianwenJinSouth China University of Technology 10522-10539 Visually-rich document information extraction (VIE) is a vital aspect of document understanding, wherein Semantic Entity 
Recognition (SER) plays a significant role. However, few-shot SER on visually-rich documents remains relatively unexplored despite its considerable potential for practical applications. To address this issue, we propose a simple yet effective Plug-and-Play Tag-guided method for few-shot Semantic Entity Recognition (PPTSER) on visually-rich documents. PPTSER is built upon off-the-shelf multi-modal pre-trained models. It leverages the semantics of the tags to guide the SER task, reformulating SER into entity typing and span detection, handling both tasks simultaneously via cross-attention. Experimental results illustrate that PPTSER outperforms existing fine-tuning and few-shot methods, especially in low-data regimes. With full training data, PPTSER achieves comparable or superior performance to fine-tuning baseline. For instance, on the FUNSD benchmark, our method improves the performance of LayoutLMv3-base in 1-shot, 3-shot and 5-shot scenarios by 15.61%, 2.13%, and 2.01%, respectively. Overall, PPTSER demonstrates promising generalizability, effectiveness, and plug-and-play nature for few-shot SER on visually-rich documents. The codes will be available at [https://github.com/whlscut/PPTSER](https://github.com/whlscut/PPTSER). 2024.findings-acl.626 @@ -14622,8 +14622,8 @@ <fixed-case>LLM</fixed-case> Performance Predictors are good initializers for Architecture Search GaneshJawaharGoogle DeepMind - MuhammadAbdul-MageedUniversity of British Columbia - LaksLakshmananUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia + LaksLakshmananUniversity of British Columbia DujianDingComputing Science, University of British Columbia 10540-10560 In this work, we utilize Large Language Models (LLMs) for a novel use case: constructing Performance Predictors (PP) that estimate the performance of specific deep neural network architectures on downstream tasks. 
We create PP prompts for LLMs, comprising (i) role descriptions, (ii) instructions for the LLM, (iii) hyperparameter definitions, and (iv) demonstrations presenting sample architectures with efficiency metrics and ‘training from scratch’ performance. In machine translation (MT) tasks, GPT-4 with our PP prompts (LLM-PP) achieves a SoTA mean absolute error and a slight degradation in rank correlation coefficient compared to baseline predictors. Additionally, we demonstrate that predictions from LLM-PP can be distilled to a compact regression model (LLM-Distill-PP), which surprisingly retains much of the performance of LLM-PP. This presents a cost-effective alternative for resource-intensive performance estimation. Specifically, for Neural Architecture Search (NAS), we introduce a Hybrid-Search algorithm (HS-NAS) employing LLM-Distill-PP for the initial search stages and reverting to the baseline predictor later. HS-NAS performs similarly to SoTA NAS, reducing search hours by approximately 50%, and in some cases, improving latency, GFLOPs, and model size. The code can be found at: https://github.com/UBC-NLP/llmas. @@ -14637,7 +14637,7 @@ DeXinKongSuzhou University SuxianZhao XingyuLi - GuohongFu + GuohongFu 10561-10573 Dialogue discourse parsing (DDP) aims to capture the relations between utterances in the dialogue. In everyday real-world scenarios, dialogues are typically multi-modal and cover open-domain topics. However, most existing widely used benchmark datasets for DDP contain only textual modality and are domain-specific. This makes it challenging to accurately and comprehensively understand the dialogue without multi-modal clues, and prevents them from capturing the discourse structures of the more prevalent daily conversations. This paper proposes MODDP, the first multi-modal Chinese discourse parsing dataset derived from open-domain daily dialogues, consisting 864 dialogues and 18,114 utterances, accompanied by 12.7 hours of video clips. 
We present a simple yet effective benchmark approach for multi-modal DDP. Through extensive experiments, we present several benchmark results based on MODDP. The significant improvement in performance from introducing multi-modalities into the original textual unimodal DDP model demonstrates the necessity of integrating multi-modalities into DDP. 2024.findings-acl.628 @@ -14646,13 +14646,13 @@ <fixed-case>C</fixed-case>hinese <fixed-case>M</fixed-case>ental<fixed-case>BERT</fixed-case>: Domain-Adaptive Pre-training on Social Media for <fixed-case>C</fixed-case>hinese Mental Health Text Analysis - WeiZhai + WeiZhai HongzhiQi - QingZhao - JianqiangLiBeijing University of Technology - ZiqiWang - HanWang - BingYang + QingZhao + JianqiangLiBeijing University of Technology + ZiqiWang + HanWang + BingYang GuanghuiFu 10574-10585 In the current environment, psychological issues are prevalent and widespread, with social media serving as a key outlet for individuals to share their feelings. This results in the generation of vast quantities of data daily, where negative emotions have the potential to precipitate crisis situations. There is a recognized need for models capable of efficient analysis. While pre-trained language models have demonstrated their effectiveness broadly, there’s a noticeable gap in pre-trained models tailored for specialized domains like psychology. To address this, we have collected a huge dataset from Chinese social media platforms and enriched it with publicly available datasets to create a comprehensive database encompassing 3.36 million text entries. To enhance the model’s applicability to psychological text analysis, we integrated psychological lexicons into the pre-training masking mechanism. Building on an existing Chinese language model, we performed adaptive training to develop a model specialized for the psychological domain. 
We evaluated our model’s performance across six public datasets, where it demonstrated improvements compared to eight other models. Additionally, in the qualitative comparison experiment, our model provided psychologically relevant predictions given the masked sentences. Due to concerns regarding data privacy, the dataset will not be made publicly available. However, we have made the pre-trained models and codes publicly accessible to the community via: https://github.com/zwzzzQAQ/Chinese-MentalBERT. @@ -14663,12 +14663,12 @@ Beyond One-Preference-Fits-All Alignment: Multi-Objective Direct Preference Optimization ZhanhuiZhouShanghai Artificial Intelligence Laboratory - JieLiuThe Chinese University of Hong Kong + JieLiuThe Chinese University of Hong Kong JingShaoShanghai AI Laboratory XiangyuYueThe Chinese University of Hong Kong ChaoYang - WanliOuyangShanghai AI Lab - YuQiao + WanliOuyangShanghai AI Lab + YuQiao 10586-10613 A single language model, even when aligned with labelers through reinforcement learning from human feedback (RLHF), may not suit all human preferences. Recent approaches therefore prefer customization, gathering multi-dimensional feedback, and creating distinct reward models for each dimension.Different language models are then optimized for various preferences using multi-objective RLHF (MORLHF) with varying reward weights.However, RL fine-tuning is unstable and resource-heavy, especially with diverse and usually conflicting objectives.In this paper, we present Multi-Objective Direct Preference Optimization (MODPO), an RL-free extension of Direct Preference Optimization (DPO) for multiple alignment objectives.Essentially, MODPO folds language modeling directly into reward modeling, training language models as implicit collective reward models that combine all objectives with specific weights. 
MODPO theoretically yields the same optimal solutions as MORLHF but is practically more stable and efficient.Empirical results in safety alignment and long-form question answering show that MODPO matches or outperforms existing methods, producing a Pareto front of language models catering to diverse preferences with three times less computational resources compared to MORLHF.Code is available at https://github.com/ZHZisZZ/modpo. 2024.findings-acl.630 @@ -14695,7 +14695,7 @@ WenqiangLeiSichuan University DingnanJin JiaLiu - Tat-SengChuaNational University of Singapore + Tat-SengChuaNational University of Singapore 10633-10649 Equipping a conversational search engine with strategies regarding when to ask clarification questions is becoming increasingly important across various domains. Attributing to the context understanding capability of LLMs and their access to domain-specific sources of knowledge, LLM-based clarification strategies feature rapid transfer to various domains in a post-hoc manner.However, they still struggle to deliver promising performance on unseen domains, struggling to achieve effective domain transferability.We take the first step to investigate this issue and existing methods tend to produce one-size-fits-all strategies across diverse domains, limiting their search effectiveness.In response, we introduce a novel method, called STYLE,to achieve effective domain transferability.Our experimental results indicate that STYLE bears strong domain transferability, resulting in an average search performance improvement of 10% on four unseen domains. 
2024.findings-acl.632 @@ -14704,16 +14704,16 @@ Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions - XumingHuThe Hong Kong University of Science and Technology (Guangzhou) and Hong Kong University of Science and Technology + XumingHuThe Hong Kong University of Science and Technology (Guangzhou) and Hong Kong University of Science and Technology XiaochuanLi - JunzheChen + JunzheChen YinghuiLi YangningLiTsinghua University, Tsinghua University XiaoguangLi YashengWang - QunLiuHuawei Noah’s Ark Lab + QunLiuHuawei Noah’s Ark Lab LijieWenSchool of Software, Tsinghua University - PhilipYuUniversity of Illinois, Chicago + PhilipYuUniversity of Illinois, Chicago ZhijiangGuoUniversity of Cambridge 10650-10671 Generative search engines have the potential to transform how people seek information online, but generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in the realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment. The dataset and code will be publicly available. 
@@ -14739,8 +14739,8 @@ YuhanChen SendongZhao HaochunWang - GongZhangGongZhang - BingQinHarbin Institute of Technology + GongZhangGongZhang + BingQinHarbin Institute of Technology TingLiuHarbin Institute of Technology 10686-10697 Chain-of-Thought (CoT) serves as a critical emerging ability in LLMs, especially when it comes to logical reasoning. Attempts have been made to induce such ability in small models as well by distilling from the data with CoT generated by Large Language Models (LLMs). However, existing methods often simply generate and incorporate more data from LLMs and fail to note the importance of efficiently utilizing existing CoT data. We here propose a new training paradigm AS-ES (Abstractive Segments - Extractive Segments) learning, which exploits the inherent information in CoT for iterative generation. Experiments show that our methods surpass the direct seq2seq training on CoT-extensive tasks like MWP and PET summarization, without data augmentation or altering the model itself. Furthermore, we explore the reason behind the inefficiency of small models in learning CoT and provide an explanation of why AS-ES learning works, giving insights into the underlying mechanism of CoT. @@ -14752,7 +14752,7 @@ <fixed-case>II</fixed-case>-<fixed-case>MMR</fixed-case>: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering JihyungKilOhio State University, Columbus FaridehTavazoee - DongyeopKangUniversity of Minnesota + DongyeopKangUniversity of Minnesota Joo-KyungKimAmazon AGI 10698-10709 Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have merely focused on assessing the model’s overall accuracy without evaluating it on different reasoning cases. 
Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. In specific, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using two novel language promptings: (i) answer prediction-guided CoT prompt, or (ii) knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most of their VQA questions are easy to answer, simply demanding “single-hop” reasoning, whereas only a few questions require “multi-hop” reasoning. Moreover, while the recent V&L model struggles with such complex multi-hop reasoning questions even using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings. 
@@ -14763,13 +14763,13 @@ <fixed-case>TAME</fixed-case>-<fixed-case>RD</fixed-case>: Text Assisted Replication of Image Multi-Adjustments for Reverse Designing PoojaGuhanUniversity of Maryland, College Park - UttaranBhattacharyaAdobe Systems + UttaranBhattacharyaAdobe Systems SomdebSarkhelAdobe Research VahidAzizi XiangChenAdobe Systems - SaayanMitraAdobe Research - AniketBeraPurdue University and University of Maryland, College Park - DineshManochaUniversity of Maryland, College Park + SaayanMitraAdobe Research + AniketBeraPurdue University and University of Maryland, College Park + DineshManochaUniversity of Maryland, College Park 10710-10727 Given a source and its edited version performed based on human instructions in natural language, how do we extract the underlying edit operations, to automatically replicate similar edits on other images? This is the problem of reverse designing, and we present TAME-RD, a model to solve this problem. TAME-RD automatically learns from the complex interplay of image editing operations and the natural language instructions to learn fully specified edit operations. It predicts both the underlying image edit operations as discrete categories and their corresponding parameter values in the continuous space.We accomplish this by mapping together the contextual information from the natural language text and the structural differences between the corresponding source and edited images using the concept of pre-post effect. We demonstrate the efficiency of our network through quantitative evaluations on multiple datasets. We observe improvements of 6-10% on various accuracy metrics and 1.01X-4X on the RMSE score and the concordance correlation coefficient for the corresponding parameter values on the benchmark GIER dataset. 
We also introduce I-MAD, a new two-part dataset: I-MAD-Dense, a collection of approximately 100K source and edited images, together with automatically generated text instructions and annotated edit operations, and I-MAD-Pro, consisting of about 1.6K source and edited images, together with text instructions and annotated edit operations provided by professional editors. On our dataset, we observe absolute improvements of 1-10% on the accuracy metrics and 1.14X–5X on the RMSE score. 2024.findings-acl.637 @@ -14779,11 +14779,11 @@ Batch-<fixed-case>ICL</fixed-case>: Effective, Efficient, and Order-Agnostic In-Context Learning KaiyiZhangRenmin University of China - AngLv - YuhanChenXiaomi Corporation + AngLv + YuhanChenXiaomi Corporation HansenHa TaoXu - RuiYanRenmin University of China + RuiYanRenmin University of China 10728-10739 In this paper, by treating in-context learning (ICL) as a meta-optimization process, we explain why LLMs are sensitive to the order of ICL examples. This understanding leads us to the development of Batch-ICL, an effective, efficient, and order-agnostic inference algorithm for ICL. Differing from the standard N-shot learning approach, Batch-ICL employs N separate 1-shot forward computations and aggregates the resulting meta-gradients. These aggregated meta-gradients are then applied to the forward computation of a zero-shot query to generate the final prediction. This batch processing approach renders the LLM agnostic to the order of ICL examples. Through extensive experiments and analysis, we demonstrate that Batch-ICL consistently outperforms most permutations of ICL examples. In some cases, it even exceeds the performance of the best order for standard ICL, all while reducing the computational resources required. Furthermore, we develop a novel variant of Batch-ICL featuring multiple “epochs” of meta-optimization. This variant implicitly explores permutations of ICL examples, further enhancing ICL performance. 
2024.findings-acl.638 @@ -14794,7 +14794,7 @@ <fixed-case>I</fixed-case>ndic<fixed-case>V</fixed-case>oices: Towards building an Inclusive Multilingual Speech Dataset for <fixed-case>I</fixed-case>ndian Languages TahirJaved JankiNawale - EldhoGeorgeIndian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology + EldhoGeorgeIndian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology SakshiJoshi KaushalBhogale DeovratMehendaleDepartment of Computer Science, Indian Institute of Technology, Madras, Indian Institute of Technology, Madras @@ -14824,7 +14824,7 @@ KaiwenZhou KwonjoonLeeHonda Research Institute USA TeruhisaMisuHonda Research Institute USA, Inc. - XinWangUniversity of California, Santa Cruz + XinWangUniversity of California, Santa Cruz 10783-10795 In our work, we explore the synergistic capabilities of pre-trained vision-and-language models (VLMs) and large language models (LLMs) on visual commonsense reasoning (VCR) problems. We find that VLMs and LLMs-based decision pipelines are good at different kinds of VCR problems. Pre-trained VLMs exhibit strong performance for problems involving understanding the literal visual content, which we noted as visual commonsense understanding (VCU). For problems where the goal is to infer conclusions beyond image content, which we noted as visual commonsense inference (VCI), VLMs face difficulties, while LLMs, given sufficient visual evidence, can use commonsense to infer the answer well. We empirically validate this by letting LLMs classify VCR problems into these two categories and show the significant difference between VLM and LLM with image caption decision pipelines on two subproblems. Moreover, we identify a challenge with VLMs’ passive perception, which may miss crucial context information, leading to incorrect reasoning by LLMs. 
Based on these, we suggest a collaborative approach, named ViCor, where pre-trained LLMs serve as problem classifiers to analyze the problem category, then either use VLMs to answer the question directly or actively instruct VLMs to concentrate on and gather relevant visual elements to support potential commonsense inferences. We evaluate our framework on two VCR benchmark datasets and outperform all other methods without in-domain fine-tuning. 2024.findings-acl.640 @@ -14833,7 +14833,7 @@ Decomposition for Enhancing Attention: Improving <fixed-case>LLM</fixed-case>-based Text-to-<fixed-case>SQL</fixed-case> through Workflow Paradigm - YuanzhenXieTencent + YuanzhenXieTencent XinzhouJin TaoXie MatrixmxlinMatrixmxlin @@ -14842,7 +14842,7 @@ ChengLei ChengxiangZhuo BoHu - ZangLiTencent + ZangLiTencent 10796-10816 In-context learning of large-language models (LLMs) has achieved remarkable success in the field of natural language processing, while extensive case studies reveal that the single-step chain-of-thought prompting approach faces challenges such as attention diffusion and inadequate performance in complex tasks like text-to-SQL. To improve the contextual learning capabilities of LLMs in text-to-SQL, a workflow paradigm method is proposed, aiming to enhance the attention and problem-solving scope of LLMs through decomposition. Specifically, the information determination module for eliminating redundant information and the brand-new prompt structure based on problem classification greatly enhance the model’s attention. Additionally, the inclusion of self-correction and active learning modules greatly expands the problem-solving scope of LLMs, hence improving the upper limit of LLM-based approaches. Extensive experiments conducted on three datasets demonstrate that our approach outperforms other methods by a significant margin. 
About 2-3 percentage point improvements compared to the existing baseline on the Spider Dev, Spider-Realistic, and Bird Dev datasets and new SOTA results on the Spider Test dataset are achieved. Our code is available on GitHub: https://github.com/FlyingFeather/DEA-SQL. 2024.findings-acl.641 @@ -14851,12 +14851,12 @@ Unveiling Opinion Evolution via Prompting and Diffusion for Short Video Fake News Detection - LinlinZongDalian University of Technology + LinlinZongDalian University of Technology JiahuiZhou WenminLinDalian University of Technology XinyueLiuDalian University of Technology XianchaoZhangDalian University of Technology - BoXuDalian University of Technology + BoXuDalian University of Technology 10817-10826 Short video fake news detection is crucial for combating the spread of misinformation. Current detection methods tend to aggregate features from individual modalities into multimodal features, overlooking the implicit opinions and the evolving nature of opinions across modalities. In this paper, we mine implicit opinions within short video news and promote the evolution of both explicit and implicit opinions across all modalities. Specifically, we design a prompt template to mine implicit opinions regarding the credibility of news from the textual component of videos. Additionally, we employ a diffusion model that encourages the interplay among diverse modal opinions, including those extracted through our implicit opinion prompts. Experimental results on a publicly available dataset for short video fake news detection demonstrate the superiority of our model over state-of-the-art methods. 
2024.findings-acl.642 @@ -14865,7 +14865,7 @@ i<fixed-case>S</fixed-case>ign: A Benchmark for <fixed-case>I</fixed-case>ndian <fixed-case>S</fixed-case>ign <fixed-case>L</fixed-case>anguage Processing - AbhinavJoshiIndian Institute of Technology, Kanpur + AbhinavJoshiIndian Institute of Technology, Kanpur RomitMohanty MounikaKanakanti AndeshaManglaIndian Sign Language Research and Training Centre @@ -14882,8 +14882,8 @@ Data Contamination Calibration for Black-box <fixed-case>LLM</fixed-case>s WentaoYeZhejiang University JiaqiHu - LiyaoLiZhejiang University - HaoboWangZhejiang University + LiyaoLiZhejiang University + HaoboWangZhejiang University GangChen JunboZhaoZhejiang University 10845-10861 @@ -14895,7 +14895,7 @@ Truth-Aware Context Selection: Mitigating Hallucinations of Large Language Models Being Misled by Untruthful Contexts TianYu - ShaoleiZhang + ShaoleiZhang YangFengInstitute of Computing Technology, Chinese Academy of Sciences 10862-10884 Although Large Language Models (LLMs) have demonstrated impressive text generation capabilities, they are easily misled by untruthful contexts provided by users or knowledge augmentation tools, leading to hallucinations. To alleviate LLMs from being misled by untruthful context and take advantage of knowledge augmentation, we propose Truth-Aware Context Selection (TACS), a lightweight method to adaptively recognize and mask untruthful context from the inputs. TACS begins by performing truth detection on the input context, leveraging the parameterized knowledge within the LLM. Subsequently, it constructs a corresponding attention mask based on the truthfulness of each position, selecting the truthful context and discarding the untruthful context. 
Additionally, we introduce a new evaluation metric, Disturbance Adaption Rate, to further study the LLMs’ ability to accept truthful information and resist untruthful information.Experimental results indicate that TACS can effectively filter untruthful context and significantly improve the overall quality of LLMs’ responses when presented with misleading information. @@ -14908,7 +14908,7 @@ MenglongCui JiangcunDu ShaolinZhuTianjin University - DeyiXiongTianjin University + DeyiXiongTianjin University 10885-10897 Large language models (LLMs) exhibit outstanding performance in machine translation via in-context learning. In contrast to sentence-level translation, document-level translation (DOCMT) by LLMs based on in-context learning faces two major challenges: firstly, document translations generated by LLMs are often incoherent; secondly, the length of demonstration for in-context learning is usually limited. To address these issues, we propose a Context-Aware Prompting method (CAP), which enables LLMs to generate more accurate, cohesive, and coherent translations via in-context learning. CAP takes into account multi-level attention, selects the most relevant sentences to the current one as context, and then generates a summary from these collected sentences. Subsequently, sentences most similar to the summary are retrieved from the datastore as demonstrations, which effectively guide LLMs in generating cohesive and coherent translations. We conduct extensive experiments across various DOCMT tasks, and the results demonstrate the effectiveness of our approach, particularly in zero pronoun translation (ZPT) and literary translation tasks. 
2024.findings-acl.646 @@ -14933,7 +14933,7 @@ <fixed-case>RECOST</fixed-case>: External Knowledge Guided Data-efficient Instruction Tuning QiZhang YimingZhang - HaoboWangZhejiang University + HaoboWangZhejiang University JunboZhaoZhejiang University 10911-10921 In the current landscape of large language models (LLMs), the process of instruction tuning serves as an essential step. Considering the high computing power overhead, data-efficient instruction tuning was proposed to reduce the training data size in this process, aiming at selecting high-quality instructional data. Nevertheless, we argue that most current data-efficient instruction-tuning methods are highly dependent on the quality of the original instruction-tuning dataset. When it comes to datasets synthesized by LLMs, a common scenario in this field, dirty samples will even be selected with a higher probability than other samples. To address these challenges, we utilized external knowledge (relevant examples or paragraphs) to evaluate those samples synthesized by LLMs with an in-context-based relative predictive entropy. Based on the new metric, we proposed a framework, dubbed as RECOST, which integrates external-knowledge-base re-ranking and diversity-consistent sampling into a single pipeline. Through extensive experiments on several synthetic datasets (Alpaca and Alpaca-gpt4), we demonstrate the effectiveness of our method and achieve even better results with only 1% of the full dataset. @@ -14944,7 +14944,7 @@ Understanding Cross-Lingual <fixed-case>A</fixed-case>lignment—<fixed-case>A</fixed-case> Survey KatharinaHämmerlCIS, LMU Munich - JindřichLibovickýCharles University Prague + JindřichLibovickýCharles University Prague AlexanderFraserTechnical University of Munich 10922-10943 Cross-lingual alignment, the meaningful similarity of representations across languages in multilingual language models, has been an active field of research in recent years. 
We survey the literature of techniques to improve cross-lingual alignment, providing a taxonomy of methods and summarising insights from throughout the field. We present different understandings of cross-lingual alignment and their limitations. We provide a qualitative summary of results from a number of surveyed papers. Finally, we discuss how these insights may be applied not only to encoder models, where this topic has been heavily studied, but also to encoder-decoder or even decoder-only models, and argue that an effective trade-off between language-neutral and language-specific information is key. @@ -14956,7 +14956,7 @@ Mitigate Negative Transfer with Similarity Heuristic Lifelong Prompt Tuning ChenyuanWuUniversity of Science and Technology of China GangweiJiangCity University of Hong Kong and University of Science and Technology of China - DefuLianUniversity of Science and Technology of China + DefuLianUniversity of Science and Technology of China 10944-10959 Lifelong prompt tuning has significantly advanced parameter-efficient lifelong learning with its efficiency and minimal storage demands on various tasks.Our empirical studies, however, highlights certain transferability constraints in the current methodologies: a universal algorithm that guarantees consistent positive transfer across all tasks is currently unattainable, especially when dealing dissimilar tasks that may engender negative transfer.Identifying the misalignment between algorithm selection and task specificity as the primary cause of negative transfer, we present the Similarity Heuristic Lifelong Prompt Tuning (SHLPT) framework. This innovative strategy partitions tasks into two distinct subsets by harnessing a learnable similarity metric, thereby facilitating fruitful transfer from tasks regardless of their similarity or dissimilarity. Additionally, SHLPT incorporates a parameter pool to combat catastrophic forgetting effectively. 
Our experiments shows that SHLPT outperforms state-of-the-art techniques in lifelong learning benchmarks and demonstrates robustness against negative transfer in diverse task sequences. 2024.findings-acl.650 @@ -14969,11 +14969,11 @@ ZonghanYangDepartment of Computer Science and Technology, Tsinghua University ZhenheZhang QingyuanHu - PengLiTsinghua University - MingYan + PengLiTsinghua University + MingYan JiZhangAlibaba Group FeiHuangAlibaba Group - YangLiu + YangLiu 10960-10977 While Large language models (LLMs) have demonstrated considerable capabilities across various natural language tasks, they often fall short of the performance achieved by domain-specific state-of-the-art models. One potential approach to enhance domain-specific capabilities of LLMs involves fine-tuning them using corresponding datasets. However, this method can be both resource and time-intensive, and not applicable to closed-source commercial LLMs. In this paper, we propose Preference Adaptation for Enhancing Domain-specific Abilities of LLMs (PANDA), a method designed to augment the domain-specific capabilities of LLMs by leveraging insights from the response preference of expert models without requiring fine-tuning. Our experimental results reveal that PANDA significantly enhances the domain-specific ability of LLMs on text classification and interactive decision tasks. Moreover, LLM with PANDA even outperforms the expert model that being learned on 4 tasks of ScienceWorld. This finding highlights the potential of exploring tuning-free approaches to achieve weak-to-strong generalization. 
2024.findings-acl.651 @@ -14984,10 +14984,10 @@ Developing <fixed-case>PUGG</fixed-case> for <fixed-case>P</fixed-case>olish: A Modern Approach to <fixed-case>KBQA</fixed-case>, <fixed-case>MRC</fixed-case>, and <fixed-case>IR</fixed-case> Dataset Construction AlbertSawczynWroclaw University of Science and Technology KatsiarynaViarenich - KonradWojtasik - AleksandraDomogałaTechnical University of Wroclaw - MarcinOleksy - MaciejPiaseckiWroclaw University of Science and Technology + KonradWojtasik + AleksandraDomogałaTechnical University of Wroclaw + MarcinOleksy + MaciejPiaseckiWroclaw University of Science and Technology TomaszKajdanowiczWroclaw University of Science and Technology 10978-10996 Advancements in AI and natural language processing have revolutionized machine-human language interactions, with question answering (QA) systems playing a pivotal role. The knowledge base question answering (KBQA) task, utilizing structured knowledge graphs (KG), allows for handling extensive knowledge-intensive questions. However, a significant gap exists in KBQA datasets, especially for low-resource languages. Many existing construction pipelines for these datasets are outdated and inefficient in human labor, and modern assisting tools like Large Language Models (LLM) are not utilized to reduce the workload. To address this, we have designed and implemented a modern, semi-automated approach for creating datasets, encompassing tasks such as KBQA, Machine Reading Comprehension (MRC), and Information Retrieval (IR), tailored explicitly for low-resource environments. We executed this pipeline and introduced the PUGG dataset, the first Polish KBQA dataset, and novel datasets for MRC and IR. Additionally, we provide a comprehensive implementation, insightful findings, detailed statistics, and evaluation of baseline models. 
@@ -14997,12 +14997,12 @@ Knowledge-to-<fixed-case>SQL</fixed-case>: Enhancing <fixed-case>SQL</fixed-case> Generation with Data Expert <fixed-case>LLM</fixed-case> - ZijinHong + ZijinHong ZhengYuan - HaoChen - QinggangZhang + HaoChen + QinggangZhang FeiranHuang - XiaoHuangThe Hong Kong Polytechnic University + XiaoHuangThe Hong Kong Polytechnic University 10997-11008 Generating accurate SQL queries for user questions (text-to-SQL) has been a long-standing challenge since it requires a deep understanding of both the user’s question and the corresponding database schema in order to retrieve the desired content accurately. Existing methods rely on the comprehensive capability of large language models (LLMs) to generate the SQL. However, some necessary knowledge is not explicitly included in the database schema and user question or has been learned by LLMs. Thus, the generated SQL of the knowledge-insufficient questions may be inaccurate, negatively influencing the text-to-SQL models’ performance and robustness. To address this challenge, we propose the Knowledge-to-SQL framework, which employs tailored Data Expert LLM (DELLM) to provide helpful knowledge for all text-to-SQL models. Specifically, we introduce the detailed implementation of DELLM regarding table reading and the basic fine-tuning process. We further propose a Preference Learning via Database Feedback (PLDBF) strategy, refining the DELLM to generate more helpful knowledge for LLMs. Extensive experiments verify that DELLM can enhance the state-of-the-art approaches for text-to-SQL tasks. The corresponding code of DELLM is released for further research. 
2024.findings-acl.653 @@ -15011,10 +15011,10 @@ Centroid-Based Efficient Minimum <fixed-case>B</fixed-case>ayes Risk Decoding - HiroyukiDeguchiNara Institute of Science and Technology, Japan and National Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology + HiroyukiDeguchiNara Institute of Science and Technology, Japan and National Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology YusukeSakaiNara Institute of Science and Technology, Japan - HidetakaKamigaitoDivision of Information Science, Nara Institute of Science and Technology - TaroWatanabeNara Institute of Science and Technology, Japan + HidetakaKamigaitoDivision of Information Science, Nara Institute of Science and Technology + TaroWatanabeNara Institute of Science and Technology, Japan HidekiTanakaNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology MasaoUtiyamaNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology 11009-11018 @@ -15042,9 +15042,9 @@ Exploiting Positional Bias for Query-Agnostic Generative Content in Search - AndrewParry - SeanMacAvaneyUniversity of Glasgow - DebasisGangulyUniversity of Glasgow + AndrewParry + SeanMacAvaneyUniversity of Glasgow + DebasisGangulyUniversity of Glasgow 11030-11047 In recent years, research shows that neural ranking models (NRMs) substantially outperform their lexical counterparts in text retrieval. In traditional search pipelines, a combination of features leads to well-defined behaviour. However, as neural approaches become increasingly prevalent as the final scoring component of engines or as standalone systems, their robustness to malicious text and, more generally, semantic perturbation needs to be better understood. 
We posit that the transformer attention mechanism can induce exploitable defects in search models through sensitivity to token position within a sequence, leading to an attack that could generalise beyond a single query or topic. We demonstrate such defects by showing that non-relevant text–such as promotional content–can be easily injected into a document without adversely affecting its position in search results. Unlike previous gradient-based attacks, we demonstrate the existence of these biases in a query-agnostic fashion. In doing so, without the knowledge of topicality, we can still reduce the negative effects of non-relevant content injection by controlling injection position. Our experiments are conducted with simulated on-topic promotional text automatically generated by prompting LLMs with topical context from target documents. We find that contextualisation of a non-relevant text further reduces negative effects whilst likely circumventing existing content filtering mechanisms. In contrast, lexical models are found to be more resilient to such content injection attacks. We then investigate a simple yet effective compensation for the weaknesses of the NRMs in search, validating our hypotheses regarding transformer bias. 2024.findings-acl.656 @@ -15054,9 +15054,9 @@ <fixed-case>ICC</fixed-case> : Quantifying Image Caption Concreteness for Multimodal Dataset Curation MoranYanukaTel Aviv University - MorrisAlperTel Aviv University + MorrisAlperTel Aviv University HadarAverbuch-ElorTel Aviv University and Cornell University - RajaGiryesTel Aviv University + RajaGiryesTel Aviv University 11048-11064 Web-scale training on paired text-image data is becoming increasingly central to multimodal learning, but is challenged by the highly noisy nature of datasets in the wild. Standard data filtering approaches succeed in removing mismatched text-image pairs, but permit semantically related but highly abstract or subjective text. 
These approaches lack the fine-grained ability to isolate the most concrete samples that provide the strongest signal for learning in a noisy dataset. In this work, we propose a new metric, Image Caption Concreteness (ICC), that evaluates caption text without an image reference to measure its concreteness and relevancy for use in multimodal learning. Our unsupervised approach leverages strong foundation models for measuring visual-semantic information loss in multimodal representations. We demonstrate that this strongly correlates with human evaluation of concreteness in both single-word and caption-level texts. Moreover, we show that curation using ICC complements existing approaches: It succeeds in selecting the highest quality samples from multimodal web-scale datasets to allow for efficient training in resource-constrained settings. 2024.findings-acl.657 @@ -15069,9 +15069,9 @@ RuiWang RuixuanXiao JunboZhaoZhejiang University - XiaoDing + XiaoDing GangChen - HaoboWangZhejiang University + HaoboWangZhejiang University 11065-11082 Within the evolving landscape of deep learning, the dilemma of data quantity and quality has been a long-standing problem. The recent advent of Large Language Models (LLMs) offers a data-centric solution to alleviate the limitations of real-world data with synthetic data generation. However, current investigations into this field lack a unified framework and mostly stay on the surface. Therefore, this paper provides an organization of relevant studies based on a generic workflow of synthetic data generation. By doing so, we highlight the gaps within existing research and outline prospective avenues for future study. This work aims to shepherd the academic and industrial communities towards deeper, more methodical inquiries into the capabilities and applications of LLMs-driven synthetic data generation. 2024.findings-acl.658 @@ -15082,7 +15082,7 @@ When is a Language Process a Language Model? 
LiDuJohns Hopkins University HoldenLeeJohns Hopkins University - JasonEisnerMicrosoft and Johns Hopkins University + JasonEisnerMicrosoft and Johns Hopkins University RyanCotterellSwiss Federal Institute of Technology 11083-11094 A language model may be viewed as a \Sigma-valued stochastic process for some alphabet \Sigma.However, in some pathological situations, such a stochastic process may “leak” probability mass onto the set of infinite strings and hence is not equivalent to the conventional view of a language model as a distribution over ordinary (finite) strings.Such ill-behaved language processes are referred to as *non-tight* in the literature.In this work, we study conditions of tightness through the lens of stochastic processes.In particular, by regarding the symbol as marking a stopping time and using results from martingale theory, we give characterizations of tightness that generalize our previous work [(Du et al. 2023)](https://arxiv.org/abs/2212.10502). @@ -15093,7 +15093,7 @@ Accelerating Multilingual Language Model for Excessively Tokenized Languages JiminHongKrafton.Inc and Korea Advanced Institute of Science and Technology - GibbeumLeeKRAFTON and KAIST + GibbeumLeeKRAFTON and KAIST JaewoongChoKRAFTON 11095-11111 Recent advancements in large language models (LLMs) have remarkably enhanced performances on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often overly fragment a text into character or Unicode-level tokens in non-Roman alphabetic languages, leading to inefficient text generation.We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach involves employing a new language model head with a vocabulary set tailored to a specific target language for a pre-trained LLM. 
This is followed by fine-tuning the new head while incorporating a verification step to ensure the model’s performance is preserved.We show that this targeted fine-tuning, while freezing other model parameters, effectively reduces token fragmentation for the target language. Our extensive experiments demonstrate that the proposed framework increases the generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks. @@ -15117,9 +15117,9 @@ YongqiLiHong Kong Polytechnic University ZhenZhang WenjieWangNational University of Singapore - LiqiangNieHarbin Institute of Technology (Shenzhen) - WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University - Tat-SengChuaNational University of Singapore + LiqiangNieHarbin Institute of Technology (Shenzhen) + WenjieLiThe Hong Kong Polytechnic University, The Hong Kong Polytechnic University + Tat-SengChuaNational University of Singapore 11119-11129 Generative retrieval is a promising new paradigm in text retrieval that generates identifier strings of relevant passages as the retrieval target. This paradigm leverages powerful generative language models, distinct from traditional sparse or dense retrieval methods. In this work, we identify a viable direction to further enhance generative retrieval via distillation and propose a feasible framework, named DGR. DGR utilizes sophisticated ranking models, such as the cross-encoder, in a teacher role to supply a passage rank list, which captures the varying relevance degrees of passages instead of binary hard labels; subsequently, DGR employs a specially designed distilled RankNet loss to optimize the generative retrieval model, considering the passage rank order provided by the teacher model as labels. This framework only requires an additional distillation step to enhance current generative retrieval systems and does not add any burden to the inference stage. 
We conduct experiments on four public datasets, and the results indicate that DGR achieves state-of-the-art performance among the generative retrieval methods. Additionally, DGR demonstrates exceptional robustness and generalizability with various teacher models and distillation losses. 2024.findings-acl.662 @@ -15145,10 +15145,10 @@ HaoWangGoogle ShihaoLiang YujiaQin - PengLiTsinghua University - ZhiyuanLiuTsinghua University + PengLiTsinghua University + ZhiyuanLiuTsinghua University MaosongSun - YangLiu + YangLiu 11143-11156 Large Language Models (LLMs) have witnessed remarkable advancements in recent years, prompting the exploration of tool learning, which integrates LLMs with external tools to address diverse real-world challenges. Assessing the capability of LLMs to utilise tools necessitates large-scale and stable benchmarks. However, previous works relied on either hand-crafted online tools with limited scale, or large-scale real online APIs suffering from instability of API status. To address this problem, we introduce StableToolBench, a benchmark evolving from ToolBench, proposing a virtual API server and stable evaluation system. The virtual API server contains a caching system and API simulators which are complementary to alleviate the change in API status. Meanwhile, the stable evaluation system designs solvable pass and win rates using GPT-4 as the automatic evaluator to eliminate the randomness during evaluation. Experimental results demonstrate the stability of StableToolBench, and further discuss the effectiveness of API simulators, the caching system, and the evaluator system. 
2024.findings-acl.664 @@ -15159,12 +15159,12 @@ Both Matter: Enhancing the Emotional Intelligence of Large Language Models without Compromising the General Intelligence WeixiangZhaoHarbin Institute of Technology ZhuojunLi - ShilongWangHarbin Institute of Technology + ShilongWangHarbin Institute of Technology YangWang YulinHu YanyanZhaoHarbin Institute of Technology ChenWeixiaomi - BingQinHarbin Institute of Technology + BingQinHarbin Institute of Technology 11157-11176 Emotional Intelligence (EI), consisting of emotion perception, emotion cognition and emotion expression, plays the critical roles in improving user interaction experience for the current large language model (LLM) based conversational general AI assistants. Previous works mainly focus on raising the emotion perception ability of them via naive fine-tuning on EI-related classification or regression tasks. However, this leads to the incomplete enhancement of EI and catastrophic forgetting of the general intelligence (GI). To this end, we first introduce EiBench, a large-scale collection of EI-related tasks in the text-to-text format with task instructions that covers all three aspects of EI, which lays a solid foundation for the comprehensive EI enhancement of LLMs. Then a novel Modular Emotional Intelligence enhancement method (**MoEI**), consisting of Modular Parameter Expansion and intra-inter modulation, is proposed to comprehensively enhance the EI of LLMs without compromise their GI. Extensive experiments on two representative LLM-based assistants, Flan-T5 and LLaMA-2-Chat, demonstrate the effectiveness of MoEI to improving EI while maintain GI. 
2024.findings-acl.665 @@ -15177,8 +15177,8 @@ MinwooKim SeunghoKim JunghwanKimselectstar - SeunghyunWonSeoul National University Bundang Hospital - HwaranLeeNAVER AI Lab + SeunghyunWonSeoul National University Bundang Hospital + HwaranLeeNAVER AI Lab EdwardChoiKorea Advanced Institute of Science and Technology 11177-11213 To reliably deploy Large Language Models (LLMs) in a specific country, they must possess an understanding of the nation’s culture and basic knowledge. To this end, we introduce National Alignment, which measures the alignment between an LLM and a targeted country from two aspects: social value alignment and common knowledge alignment. We constructed KorNAT, the first benchmark that measures national alignment between LLMs and South Korea. KorNat contains 4K and 6K multiple-choice questions for social value and common knowledge, respectively. To attain an appropriately aligned ground truth in the social value dataset, we conducted a large-scale public survey with 6,174 South Koreans. For common knowledge, we created the data based on the South Korea text books and GED exams. Our dataset creation process is meticulously designed based on statistical sampling theory, and we also introduce metrics to measure national alignment, including three variations of social value alignment. We tested seven LLMs and found that only few models passed our reference score, indicating there exists room for improvement. Our dataset has received government approval following an assessment by a government-affiliated organization dedicated to evaluating dataset quality. 
@@ -15191,7 +15191,7 @@ PranabSahoo AyushSingh SriparnaSahaIndian Institute of Technology Patna, India - AmanChadhaAmazon + AmanChadhaAmazon SamratMondal 11214-11226 The mining of adverse drug events (ADEs) is pivotal in pharmacovigilance, enhancing patient safety by identifying potential risks associated with medications, facilitating early detection of adverse events, and guiding regulatory decision-making. Traditional ADE detection methods are reliable but slow, not easily adaptable to large-scale operations, and offer limited information. With the exponential increase in data sources like social media content, biomedical literature, and Electronic Medical Records (EMR), extracting relevant ADE-related information from these unstructured texts is imperative. Previous ADE mining studies have focused on text-based methodologies, overlooking visual cues, limiting contextual comprehension, and hindering accurate interpretation. To address this gap, we present a MultiModal Adverse Drug Event (MMADE) detection dataset, merging ADE-related textual information with visual aids. Additionally, we introduce a framework that leverages the capabilities of LLMs and VLMs for ADE detection by generating detailed descriptions of medical images depicting ADEs, aiding healthcare professionals in visually identifying adverse events. Using our MMADE dataset, we showcase the significance of integrating visual cues from images to enhance overall performance. This approach holds promise for patient safety, ADE awareness, and healthcare accessibility, paving the way for further exploration in personalized healthcare. @@ -15203,7 +15203,7 @@ Space Decomposition for Sentence Embedding WuttikornPonwitayaratVidyasirimedhi Institute of Science and Technology PeeratLimkonchotiwat - EkapolChuangsuwanichChulalongkorn University + EkapolChuangsuwanichChulalongkorn University SaranaNutanong 11227-11239 Determining sentence pair similarity is crucial for various NLP tasks. 
A common technique to address this is typically evaluated on a continuous semantic textual similarity scale from 0 to 5. However, based on a linguistic observation in STS annotation guidelines, we found that the score in the range [4,5] indicates an upper-range sample, while the rest are lower-range samples. This necessitates a new approach to treating the upper-range and lower-range classes separately. In this paper, we introduce a novel embedding space decomposition method called MixSP utilizing a Mixture of Specialized Projectors, designed to distinguish and rank upper-range and lower-range samples accurately. The experimental results demonstrate that MixSP decreased the overlap representation between upper-range and lower-range classes significantly while outperforming competitors on STS and zero-shot benchmarks. @@ -15213,7 +15213,7 @@ Don’t Augment, Rewrite? Assessing Abusive Language Detection with Synthetic Data - CamillaCasulaUniversity of Trento and Fondazione Bruno Kessler + CamillaCasulaUniversity of Trento and Fondazione Bruno Kessler ElisaLeonardelliFondazione Bruno Kessler SaraTonelli 11240-11247 @@ -15225,7 +15225,7 @@ Improving Low-Resource Machine Translation for Formosan Languages Using Bilingual Lexical Resources FrancisZhengThe University of Tokyo, The University of Tokyo - EdisonMarrese-TaylorThe Univesity of Tokyo and AIST, National Institute of Advanced Industrial Science and Technology + EdisonMarrese-TaylorThe Univesity of Tokyo and AIST, National Institute of Advanced Industrial Science and Technology YutakaMatsuoThe University of Tokyo and The University of Tokyo 11248-11259 This paper investigates how machine translation for low-resource languages can be improved by incorporating information from bilingual lexicons during the training process for mainly translation between Mandarin and Formosan languages, which are all moribund or critically endangered, and we also show that our techniques work for translation between Spanish and 
Nahuatl, a language pair consisting of languages from completely different language families. About 70% of the approximately 7,000 languages of the world have data in the form of lexicons, a valuable resource for improving low-resource language translation. We collect a dataset of parallel data and bilingual lexicons between Mandarin and 16 different Formosan languages and examine mainly three different approaches: (1) simply using lexical data as additional parallel data, (2) generating pseudo-parallel sentence data to use during training by replacing words in the original parallel sentence data using the lexicon, and (3) a combination of (1) and (2). All three approaches give us gains in both Bleu scores and chrF scores, and we found that (3) provided the most gains, followed by (1) and then (2), which we observed for both translation between Mandarin and the Formosan languages and Spanish-Nahuatl. With technique (3), we saw an average increase of 5.55 in Bleu scores and 10.33 in chrF scores. @@ -15237,14 +15237,14 @@ <fixed-case>CMMLU</fixed-case>: Measuring massive multitask language understanding in <fixed-case>C</fixed-case>hinese - HaonanLi + HaonanLi YixuanZhang FajriKotoMohamed bin Zayed University of Artificial Intelligence - YifeiYang + YifeiYang HaiZhaoShanghai Jiao Tong University YeyunGong NanDuanMicrosoft Research Asia - TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne + TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne 11260-11285 As the capabilities of large language models (LLMs) continue to advance, evaluating their performance is becoming more important and more challenging. This paper aims to address this issue for Mandarin Chinese in the form of CMMLU, a comprehensive Chinese benchmark that covers various subjects, including natural sciences, social sciences, engineering, and the humanities. 
We conduct a thorough evaluation of more than 20 contemporary multilingual and Chinese LLMs, assessing their performance across different subjects and settings. The results reveal that most existing LLMs struggle to achieve an accuracy of even 60%, which is the pass mark for Chinese exams. This highlights that there is substantial room for improvement in the capabilities of LLMs. Additionally, we conduct extensive experiments to identify factors impacting the models’ performance and propose directions for enhancing LLMs. CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models for Chinese. 2024.findings-acl.671 @@ -15270,8 +15270,8 @@ WenjieWangNational University of Singapore MoxinLi JunrongGuoUniversity of Science and Technology of China - YangZhang - FuliFengUniversity of Science and Technology of China + YangZhang + FuliFengUniversity of Science and Technology of China 11316-11360 The rapid advancement of Large Language Models (LLMs) in the realm of mathematical reasoning necessitates comprehensive evaluations to gauge progress and inspire future directions. Existing assessments predominantly focus on problem-solving from the examinee perspective, overlooking a dual perspective of examiner regarding error identification and correction.From the examiner perspective, we define four evaluation tasks for error identification and correction along with a new dataset with annotated error types and steps. We also design diverse prompts to thoroughly evaluate eleven representative LLMs. Our principal findings indicate that GPT-4 outperforms all models, while open-source model LLaMA-2-7B demonstrates comparable abilities to closed-source models GPT-3.5 and Gemini Pro.Notably, calculation error proves the most challenging error type. Moreover, prompting LLMs with the error types can improve the average correction accuracy by 47.9%. 
These results reveal potential directions for developing the mathematical reasoning abilities of LLMs.Our code and dataset is available on https://github.com/LittleCirc1e/EIC. 2024.findings-acl.673 @@ -15281,7 +15281,7 @@ Less is <fixed-case>KEN</fixed-case>: a Universal and Simple Non-Parametric Pruning Algorithm for Large Language Models MicheleMastromatteiCampus Bio-Medico University of Rome - Fabio MassimoZanzottoUniversity of Rome Tor Vergata + Fabio MassimoZanzottoUniversity of Rome Tor Vergata 11361-11374 2024.findings-acl.674 mastromattei-zanzotto-2024-less @@ -15289,8 +15289,8 @@ When Do <fixed-case>LLM</fixed-case>s Need Retrieval Augmentation? Mitigating <fixed-case>LLM</fixed-case>s’ Overconfidence Helps Retrieval Augmentation - ShiyuNiInstitute of Computing Technology, Chinese Academy of Sciences - KepingBiChinese Academy of Sciences + ShiyuNiInstitute of Computing Technology, Chinese Academy of Sciences + KepingBiChinese Academy of Sciences JiafengGuoInstitute of Computing Technolgy, Chinese Academy of Sciences XueqiCheng, Chinese Academy of Sciences 11375-11388 @@ -15324,7 +15324,7 @@ JingangWangMeituan XunliangCai DongyanZhaoPeking University - RuiYanRenmin University of China + RuiYanRenmin University of China 11404-11415 Speculative decoding has emerged as a promising technique to accelerate the inference of Large Language Models (LLMs) by employing a small language model to draft a hypothesis sequence, which is then validated by the LLM. The effectiveness of this approach heavily relies on the balance between performance and efficiency of the draft model. In our research, we focus on enhancing the proportion of draft tokens that are accepted to the final output by generating multiple hypotheses instead of just one. This allows the LLM more options to choose from and select the longest sequence that meets its standards. 
Our analysis reveals that hypotheses produced by the draft model share many common token sequences, suggesting a potential for optimizing computation. Leveraging this observation, we introduce an innovative approach utilizing a directed acyclic graph (DAG) to manage the drafted hypotheses. This structure enables us to efficiently predict and merge recurring token sequences, vastly reducing the computational demands of the draft model. We term this approach Graph-structured Speculative Decoding (GSD). We apply GSD across a range of LLMs, including a 70-billion parameter LLaMA-2 model, and observe a remarkable speedup of 1.70\times to 1.94 \times, significantly surpassing standard speculative decoding. 2024.findings-acl.677 @@ -15335,7 +15335,7 @@ Duwak: Dual Watermarks in Large Language Models ChaoyiZhu JeroenGaljaard - Pin-YuChenInternational Business Machines + Pin-YuChenInternational Business Machines LydiaChenDelft University of Technology 11416-11436 As large language models (LLM) are increasingly used for text generation tasks, it is critical to audit their usages, govern their applications, and mitigate their potential harms. Existing watermark techniques are shown effective in embedding single human-imperceptible and machine-detectable patterns without significantly affecting generated text quality and semantics. However, the efficiency in detecting watermarks, i.e., the minimum number of tokens required to assert detection with significance and robustness against post-editing, is still debatable. In this paper, we propose, Duwak, to fundamentally enhance the efficiency and quality of watermarking by embedding dual secret patterns in both token probability distribution and sampling schemes. To mitigate expression degradation caused by biasing toward certain tokens, we design a contrastive search to watermark the sampling scheme, which minimizes the token repetition and enhances the diversity. 
We theoretically explain the interdependency of the two watermarks within Duwak. We evaluate Duwak extensively on Llama2 and Vicuna under various post-editing attacks, against four state-of-the-art watermarking techniques and combinations of them. Our results show that Duwak marked text achieves the highest watermarked text quality at the lowest required token count for detection, up to 70% tokens less than existing approaches, especially under post paraphrasing. @@ -15346,9 +15346,9 @@ <fixed-case>C</fixed-case>ode<fixed-case>A</fixed-case>ttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion QibingRenShanghai Jiaotong University - ChangGao + ChangGao JingShaoShanghai AI Laboratory - JunchiYanShanghai Jiao Tong University + JunchiYanShanghai Jiao Tong University XinTanEast China Normal University WaiLamThe Chinese University of Hong Kong LizhuangMaDept. of Computer Sci. & Eng., Shanghai Jiao Tong University @@ -15362,10 +15362,10 @@ Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training QingyanGuo RuiWangMicrosoft - JunliangGuoMicrosoft - XuTan - JiangBianMicrosoft - YujiuYangGraduate School at Shenzhen,Tsinghua University + JunliangGuoMicrosoft + XuTan + JiangBianMicrosoft + YujiuYangGraduate School at Shenzhen,Tsinghua University 11453-11464 While large language models (LLMs) have achieved impressive performance across diverse tasks, recent studies showcase that causal LLMs suffer from the “reversal curse”. It is a typical example that the model knows “A’s father is B”, but is unable to reason “B’s child is A”. This limitation poses a challenge to the advancement of artificial general intelligence (AGI), as it suggests a gap in the models’ ability to comprehend and apply bidirectional reasoning. 
In this paper, we first conduct substantial evaluation and identify that the root cause of the reversal curse lies in the different word order between the training and inference stage, namely, the poor ability of causal language models to predict antecedent words within the training data. Accordingly, permutation on the training data is considered as a potential solution, since this can make the model predict antecedent words or tokens. However, previous permutation methods may disrupt complete phrases or entities, thereby posing challenges for the model to comprehend and learn from training data. To address this issue, we propose Semantic-aware Permutation Training (SPT), which addresses this issue by segmenting the training sentences into semantic units (i.e., entities or phrases) with an assistant language model and permuting these units before feeding into the model. Extensive experiments demonstrate that SPT effectively mitigates the reversal curse since the performance on reversed questions approximates that on the forward ones, and significantly advances the performance of existing works. 2024.findings-acl.680 @@ -15401,11 +15401,11 @@ <fixed-case>TRAP</fixed-case>: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification - MartinGubriParameter Lab + MartinGubriParameter Lab DennisUlmer - HwaranLeeNAVER AI Lab + HwaranLeeNAVER AI Lab SangdooYunNAVER - Seong JoonOhParameter Lab and Eberhard-Karls-Universität Tübingen + Seong JoonOhParameter Lab and Eberhard-Karls-Universität Tübingen 11496-11517 Large Language Model (LLM) services and models often come with legal rules on *who* can use them and *how* they must use them. Assessing the compliance of the released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV). 
The goal is to determine whether a third-party application uses a certain LLM through its chat function. We propose a method called Targeted Random Adversarial Prompt (TRAP) that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to get a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLMs with over 95% true positive rate at under 0.2% false positive rate even after a single interaction. TRAP remains effective even if the LLM has minor changes that do not significantly alter the original function. 2024.findings-acl.683 @@ -15442,7 +15442,7 @@ SitipornSae Lim CanUdomcharoenchaikitVidyasirimedhi Institute of Science and Technology (VISTEC) PeeratLimkonchotiwat - EkapolChuangsuwanichChulalongkorn University + EkapolChuangsuwanichChulalongkorn University SaranaNutanong 11548-11563 NLU models have achieved promising results on standard benchmarks. Despite state-of-the-art accuracy, analysis reveals that many models make predictions using annotation bias rather than the properties we intend the model to learn. Consequently, these models perform poorly on out-of-distribution datasets. Recent advances in bias mitigation show that annotation bias can be alleviated through fine-tuning debiasing objectives. In this paper, we apply causal mediation analysis to gauge how much each model component mediates annotation biases. Using the knowledge from the causal analysis, we improve the model’s robustness against annotation bias through two bias mitigation methods: causal-grounded masking and gradient unlearning. Causal analysis reveals that biases concentrated in specific components, even after employing other training-time debiasing techniques. Manipulating these components by masking out neurons’ activations or updating specific weight blocks both demonstrably improve robustness against annotation artifacts. 
@@ -15452,8 +15452,8 @@ Perturbed examples reveal invariances shared by language models - RuchitRawalMPI-SWS - MariyaTonevaMax Planck Institute for Software Systems + RuchitRawalMPI-SWS + MariyaTonevaMax Planck Institute for Software Systems 11564-11584 The rapid growth in natural language processing (NLP) research has led to numerous new models, outpacing our understanding of how they compare to established ones. One major reason for this difficulty is saturating benchmarks, which may not well reflect differences in model performance in the wild. In this work, we introduce a novel framework to compare two NLP models by revealing their shared invariance to interpretable input perturbations targeting a specific linguistic capability. Via experiments on models from the same and different architecture families, this framework offers insights about how changes in models (e.g., distillation, size increase) affect linguistic capabilities. Furthermore, our framework enables evaluation of invariances between commercial black-box models (e.g., InstructGPT family) and models that are better understood (e.g., GPT-2). Across experiments, we observe that large language models share many invariances encoded by models of various sizes, whereas the invariances by large models are only shared by other large models. Possessing a wide variety of invariances may be key to the recent successes of large language models, and our framework can shed light on the types of invariances retained or emerging in new models. We make the code publicly available. 
2024.findings-acl.687 @@ -15477,11 +15477,11 @@ Discourse Structure-Aware Prefix for Generation-Based End-to-End Argumentation Mining - YangSun + YangSun GuanrongChen - CaihuaYang + CaihuaYang JianzhuBaoHarbin Institute of Technology - BinLiang + BinLiang XiZeng MinYangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences RuifengXuHarbin Institute of Technology @@ -15493,10 +15493,10 @@ Poor-Supervised Evaluation for <fixed-case>S</fixed-case>uper<fixed-case>LLM</fixed-case> via Mutual Consistency - PeiwenYuan + PeiwenYuan ShaoxiongFeng YiweiLi - XinglinWang + XinglinWang BoyuanPan HedaWang YaoHu @@ -15510,9 +15510,9 @@ Addressing Entity Translation Problem via Translation Difficulty and Context Diversity TianLiang - XingWangTencent AI Lab - MingmingYangTencent AI Lab - YujiuYangGraduate School at Shenzhen,Tsinghua University + XingWangTencent AI Lab + MingmingYangTencent AI Lab + YujiuYangGraduate School at Shenzhen,Tsinghua University ShumingShiTencent AI Lab ZhaopengTuTencent AI Lab 11628-11638 @@ -15541,8 +15541,8 @@ YijinLiuWechat AI XianfengZeng ChenzeShaoTencent Inc - FandongMengWeChat AI, Tencent Inc. - JieZhou + FandongMengWeChat AI, Tencent Inc. + JieZhou 11652-11663 Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization, through instruction fine-tuning. The fine-tuning data is generally sequentially concatenated from a specific task instruction, an input sentence, and the corresponding response. Considering the locality modeled by the self-attention mechanism of LLMs, these models face the risk of instruction forgetting when generating responses for long input sentences. To mitigate this issue, we propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences. 
Theoretical analysis suggests that our straightforward method can alter the model’s learning focus, thereby emphasizing the training of instruction-following capabilities. Concurrently, experimental results demonstrate that our approach consistently outperforms traditional settings across various model scales (1B / 7B / 13B) and different sequence generation tasks (translation and summarization), without any additional data or annotation costs. Notably, our method significantly improves the zero-shot performance on conditional sequence generation, e.g., up to 9.7 BLEU points on WMT zero-shot translation tasks. Further analysis reveals that our method can significantly improve the tranditional model’s instruction following ability by 1x over traditional approch. 2024.findings-acl.693 @@ -15552,11 +15552,11 @@ <fixed-case>XM</fixed-case>o<fixed-case>E</fixed-case>: Sparse Models with Fine-grained and Adaptive Expert Selection YuanhangYang - ShiyiQi - WenchaoGuTechnische Universität München + ShiyiQi + WenchaoGuTechnische Universität München ChaozhengWang CuiyunGaoHarbin Institute of Technology - ZenglinXuFudan University + ZenglinXuFudan University 11664-11674 Sparse models, including sparse Mixture-of-Experts (MoE) models, have emerged as an effective approach for scaling Transformer models. However, they often suffer from computational inefficiency since a significant number of parameters are unnecessarily involved in computations by multiplying values by zero or low activation values. To address this issue, we present XMoE, a novel MoE designed to enhance both the efficacy and efficiency of sparse MoE models. XMoE leverages small experts and a threshold-based router to enable tokens to selectively engage only essential parameters. Our extensive experiments on language modeling and machine translation tasks demonstrate that enhances model performance and can decrease the computation load at MoE layers by over 50% without sacrificing performance. 
Furthermore, we present the versatility of by applying it to dense models, enabling sparse computation during inference. We provide a comprehensive analysis and make our code available at https://anonymous.4open.science/r/XMoE. 2024.findings-acl.694 @@ -15567,8 +15567,8 @@ <fixed-case>B</fixed-case>ranch<fixed-case>N</fixed-case>orm: Robustly Scaling Extremely Deep Transformers YijinLiuWechat AI XianfengZeng - FandongMengWeChat AI, Tencent Inc. - JieZhou + FandongMengWeChat AI, Tencent Inc. + JieZhou 11675-11687 Recently, DeepNorm scales Transformers into extremely deep (i.e., 1000 layers) and reveals the promising potential of deep scaling. To stabilize the training of deep models, DeepNorm attempts to constrain the model update to a constant value. Although applying such a constraint can benefit the early stage of model training, it may lead to undertrained models during the whole training procedure. In this paper, we propose BranchNorm, which dynamically rescales the non-residual branch of Transformer in accordance with the training period. BranchNorm not only theoretically stabilizes the training with smooth gradient norms at the early stage, but also encourages better convergence in the subsequent training stage. Experimental results on multiple translation tasks demonstrate that BranchNorm achieves a better trade-off between training stability and converge performance. 2024.findings-acl.695 @@ -15577,8 +15577,8 @@ <fixed-case>M</fixed-case>us<fixed-case>TQ</fixed-case>: A Temporal Knowledge Graph Question Answering Dataset for Multi-Step Temporal Reasoning - TingyiZhang - JiaanWangSoochow University + TingyiZhang + JiaanWangSoochow University ZhixuLi JianfengQuSoochow University AnLiuSuzhou University @@ -15593,10 +15593,10 @@ Deal, or no deal (or who knows)? 
Forecasting Uncertainty in Conversations using Large Language Models AnthonySiciliaNortheastern University - HyunwooKimAllen Institute for Artificial Intelligence + HyunwooKimAllen Institute for Artificial Intelligence KhyathiChandu MaliheAlikhaniNortheastern University - JackHesselSamaya AI + JackHesselSamaya AI 11700-11726 Effective interlocutors account for the uncertain goals, beliefs, and emotions of others. But even the best human conversationalist cannot perfectly anticipate the trajectory of a dialogue. How well can language models represent inherent uncertainty in conversations? We propose FortUne Dial, an expansion of the long-standing “conversation forecasting” task: instead of just accuracy, evaluation is conducted with uncertainty-aware metrics, effectively enabling abstention on individual instances. We study two ways in which language models potentially represent outcome uncertainty (internally, using scores and directly, using tokens) and propose fine-tuning strategies to improve calibration of both representations. Experiments on eight difficult negotiation corpora demonstrate that our proposed fine-tuning strategies (a traditional supervision strategy and an off-policy reinforcement learning strategy) can calibrate smaller open-source models to compete with pre-trained models 10x their size. 2024.findings-acl.697 @@ -15612,7 +15612,7 @@ ShuyangYu YifeiGuo Sim KuanGohXiamen University Malaysia - Ho-KinTangHarbin Institute of Technology + Ho-KinTangHarbin Institute of Technology 11727-11742 Fine-tuning pre-trained language models, particularly large language models, demands extensive computing resources and can result in varying performance outcomes across different domains and datasets. This paper examines the approach of integrating multiple models from diverse training scenarios into a unified model. This unified model excels across various data domains and exhibits the ability to generalize well on out-of-domain data. 
We propose a knowledge fusion method named Evolver, inspired by evolutionary algorithms, which does not need further training or additional training data. Specifically, our method involves aggregating the weights of different language models into a population and subsequently generating offspring models through mutation and crossover operations. These offspring models are then evaluated against their parents, allowing for the preservation of those models that show enhanced performance on development datasets. Importantly, our model evolving strategy can be seamlessly integrated with existing model merging frameworks, offering a versatile tool for model enhancement. Experimental results on mainstream language models (i.e., encoder-only, decoder-only, encoder-decoder) reveal that Evolver outperforms previous state-of-the-art models by large margins. 2024.findings-acl.698 @@ -15622,10 +15622,10 @@ <fixed-case>S</fixed-case>ca<fixed-case>L</fixed-case>earn: Simple and Highly Parameter-Efficient Task Transfer by Learning to Scale MarkusFrohmannJohannes Kepler Universität Linz - CarolinHoltermannUniversität Hamburg - ShahedMasoudian + CarolinHoltermannUniversität Hamburg + ShahedMasoudian AnneLauscherUniversität Hamburg - NavidRekabsazThomson Reuters + NavidRekabsazThomson Reuters 11743-11776 Multi-task learning (MTL) has shown considerable practical benefits, particularly when using language models (LMs). While this is commonly achieved by learning tasks under a joint optimization procedure, some methods, such as AdapterFusion, divide the problem into two stages: (i) task learning, where knowledge specific to a task is encapsulated within sets of parameters (e.g., adapters), and (ii) transfer, where this already learned knowledge is leveraged for a target task. This separation of concerns provides numerous benefits (e.g., promoting reusability). However, current two stage MTL introduces a substantial number of additional parameters. 
We address this issue by leveraging the usefulness of linearly scaling the output representations of source adapters for transfer learning. We introduce ScaLearn, a simple and highly parameter-efficient two-stage MTL method that capitalizes on the knowledge of the source tasks by learning a minimal set of scaling parameters that enable effective transfer to a target task. Our experiments on three benchmarks (GLUE, SuperGLUE, and HumSet) and two encoder LMs show that ScaLearn consistently outperforms strong baselines with a small number of transfer parameters (~0.35% of those of AdapterFusion). Remarkably, we observe that ScaLearn maintains its strong abilities even when further reducing parameters, achieving competitive results with only 8 transfer parameters per target task. Our proposed approach thus demonstrates the power of simple scaling as a promise for more efficient task transfer. Our code is available at https://github.com/CPJKU/ScaLearn. 2024.findings-acl.699 @@ -15646,15 +15646,15 @@ <fixed-case>M</fixed-case>at<fixed-case>P</fixed-case>lot<fixed-case>A</fixed-case>gent: Method and Evaluation for <fixed-case>LLM</fixed-case>-Based Agentic Scientific Data Visualization ZhiyuYang ZihanZhouXiamen University - ShuoWang + ShuoWang XinCong XuHanTsinghua University, Tsinghua University YukunYan ZhenghaoLiuNortheastern University - ZhixingTanZhongguancun Laboratory + ZhixingTanZhongguancun Laboratory PengyuanLiuBeijing Language and Culture University DongYu - ZhiyuanLiuTsinghua University + ZhiyuanLiuTsinghua University XiaodongShiXiamen University, Tsinghua University MaosongSun 11789-11804 @@ -15688,7 +15688,7 @@ TingtingCui XiaoqingChengZhengzhou University LiutaoLiutao - DeyiXiongTianjin University + DeyiXiongTianjin University 11817-11837 What a large language model (LLM) would respond in ethically relevant context? In this paper, we curate a large benchmark CMoralEval for morality evaluation of Chinese LLMs. 
The data sources of CMoralEval are two-fold: 1) a Chinese TV program discussing Chinese moral norms with stories from the society and 2) a collection of Chinese moral anomies from various newspapers and academic papers on morality. With these sources, we aim to create a moral evaluation dataset characterized by diversity and authenticity. We develop a morality taxonomy and a set of fundamental moral principles that are not only rooted in traditional Chinese culture but also consistent with contemporary societal norms. To facilitate efficient construction and annotation of instances in CMoralEval, we establish a platform with AI-assisted instance generation to streamline the annotation process. These help us curate CMoralEval that encompasses both explicit moral scenarios (14,964 instances) and moral dilemma scenarios (15,424 instances), each with instances from different data sources. We conduct extensive experiments with CMoralEval to examine a variety of Chinese LLMs. Experiment results demonstrate that CMoralEval is a challenging benchmark for Chinese LLMs. 2024.findings-acl.703 @@ -15709,8 +15709,8 @@ Investigating the Impact of Model Instability on Explanations and Uncertainty - SaraMarjanovic - IsabelleAugensteinUniversity of Copenhagen + SaraMarjanovic + IsabelleAugensteinUniversity of Copenhagen ChristinaLiomaUniversity of Copenhagen 11854-11879 Explainable AI methods facilitate the understanding of model behaviour, yet, small, imperceptible perturbations to inputs can vastly distort explanations. As these explanations are typically evaluated holistically, before model deployment, it is difficult to assess when a particular explanation is trustworthy. Some studies have tried to create confidence estimators for explanations, but none have investigated an existing link between uncertainty and explanation quality. We artificially simulate epistemic uncertainty in text input by introducing noise at inference time. 
In this large-scale empirical study, we insert different levels of noise perturbations and measure the effect on the output of pre-trained language models and different uncertainty metrics. Realistic perturbations have minimal effect on performance and explanations, yet masking has a drastic effect. We find that high uncertainty doesn’t necessarily imply low explanation plausibility; the correlation between the two metrics can be moderately positive when noise is exposed during the training process. This suggests that noise-augmented models may be better at identifying salient tokens when uncertain. Furthermore, when predictive and epistemic uncertainty measures are over-confident, the robustness of a saliency map to perturbation can indicate model stability issues. Integrated Gradients shows the overall greatest robustness to perturbation, while still showing model-specific patterns in performance; however, this phenomenon is limited to smaller Transformer-based language models. @@ -15738,9 +15738,9 @@ MicheleMarchi IreneMondella HuiyuanLaiUniversity of Groningen - FeliceDell’OrlettaIstituto di Linguistica Computazionale “A. Zampolli” (ILC) + FeliceDell’OrlettaIstituto di Linguistica Computazionale “A. Zampolli” (ILC) MalvinaNissimUniversity of Groningen - MarcoGueriniFondazione Bruno Kessler + MarcoGueriniFondazione Bruno Kessler 11892-11907 Automatic methods for generating and gathering linguistic data have proven effective for fine-tuning Language Models (LMs) in languages less resourced than English. Still, while there has been emphasis on data quantity, less attention has been given to its quality. In this work, we investigate the impact of human intervention on machine-generated data when fine-tuning dialogical models. 
In particular, we study (1) whether post-edited dialogues exhibit higher perceived quality compared to the originals that were automatically generated; (2) whether fine-tuning with post-edited dialogues results in noticeable differences in the generated outputs; and (3) whether post-edited dialogues influence the outcomes when considering the parameter size of the LMs. To this end we created HED-IT, a large-scale dataset where machine-generated dialogues are paired with the version post-edited by humans. Using both the edited and unedited portions of HED-IT, we fine-tuned three different sizes of an LM. Results from both human and automatic evaluation show that the different quality of training data is clearly perceived and it has an impact also on the models trained on such data. Additionally, our findings indicate that larger models are less sensitive to data quality, whereas this has a crucial impact on smaller models. These results enhance our comprehension of the impact of human intervention on training data in the development of high-quality LMs. 2024.findings-acl.707 @@ -15773,7 +15773,7 @@ SuwonShonASAPP Hung-yiLeeNational Taiwan University KarenLivescuToyota Technological Institute at Chicago - ShinjiWatanabeCarnegie Mellon University + ShinjiWatanabeCarnegie Mellon University 11923-11938 The Spoken Language Understanding Evaluation (SLUE) suite of benchmark tasks was recently introduced to address the need for openresources and benchmarking of complex spoken language understanding (SLU) tasks, including both classification and sequence generation tasks, on natural speech. The benchmark has demonstrated preliminary success in using pre-trained speech foundation models (SFM) for these SLU tasks. However, the community still lacks a fine-grained understanding of the comparative utility of different SFMs. 
Inspired by this, we ask: which SFMs offer the most benefits for these complex SLU tasks, and what is the most effective approach for incorporating these SFMs? To answer this, we perform an extensive evaluation of multiple supervised and self-supervised SFMs using several evaluation protocols: (i) frozen SFMs with a lightweight prediction head, (ii) frozen SFMs with a complex prediction head, and (iii) fine-tuned SFMs with a lightweight prediction head. Although the supervised SFMs are pre-trained on much more speech recognition data (with labels), they do not always outperform self-supervised SFMs; the latter tend to perform at least as well as, and sometimes better than, supervised SFMs, especially on the sequence generation tasks in SLUE. While there is no universally optimal way of incorporating SFMs, the complex prediction head gives the best performance for most tasks, although it increases the inference time. We also introduce an open-source toolkit and performance leaderboard, SLUE-PERB, for these tasks and modeling strategies. 2024.findings-acl.709 @@ -15784,8 +15784,8 @@ Towards Multiple References Era – Addressing Data Leakage and Limited Reference Diversity in Machine Translation Evaluation XianfengZeng YijinLiuWechat AI - FandongMengWeChat AI, Tencent Inc. - JieZhou + FandongMengWeChat AI, Tencent Inc. + JieZhou 11939-11951 Recent research has shown a weak correlation between n-gram-based metrics and human evaluations in machine translation task, particularly when evaluating large language models (LLMs). Additionally, the data leakage risk in LLMs may cause an overestimation problem when evaluating LLMs on downstream tasks. In this work, we identify the limited diversity of references as the primary cause for the inferior performance of n-gram-based metrics and the overestimation problem. 
To address this issue, we propose to utilize multiple references generated by LLMs, coupled with an effective selection strategy focused on accuracy and diversity, to improve the alignment between automatic metrics and human evaluations. We validate our approach on the WMT22 Metrics benchmark with 4 languages and observe a maximum accuracy gain of 9.5% in F200spBLEU, which makes it on par with computationally expensive neural-based metrics. We also show that using multi-reference with n-gram-based metrics significantly alleviates the overestimation problem when evaluating LLMs with data leakage. Further analysis explores the factors that affect the quality of generated references, offering insights into data synthesis by LLMs. 2024.findings-acl.710 @@ -15799,8 +15799,8 @@ Øistein E.AndersenComputer Laboratory ShivaTaslimipoorUniversity of Cambridge HelenYannakoudakisComputer Laboratory, University of Cambridge and King’s College London - ZhengYuanKing’s College London, University of London - ChristopherBryantComputer Laboratory + ZhengYuanKing’s College London, University of London + ChristopherBryantComputer Laboratory MarekReiImperial College London PaulaButteryUniversity of Cambridge 11952-11967 @@ -15811,8 +15811,8 @@ <fixed-case>BATS</fixed-case>: <fixed-case>B</fixed-case>enchm<fixed-case>A</fixed-case>rking Text Simplicity 🦇 - ChristinKreutzTechnische Hochschule Mittelhessen - FabianHaakFachhochschule Köln + ChristinKreutzTechnische Hochschule Mittelhessen + FabianHaakFachhochschule Köln BjörnEngelmann PhilippSchaerTH Köln - University of Applied Sciences 11968-11989 @@ -15824,11 +15824,11 @@ <fixed-case>A</fixed-case>ustro<fixed-case>T</fixed-case>ox: A Dataset for Target-Based <fixed-case>A</fixed-case>ustrian <fixed-case>G</fixed-case>erman Offensive Language Detection PiaPachingerTechnische Universität Wien - JanisGoldzycher + JanisGoldzycher AnnaPlanitzer WojciechKusaAllegro - AllanHanburyComplexity Science Hub and Technische Universität Wien - 
JuliaNeidhardtTechnische Universität Wien + AllanHanburyComplexity Science Hub and Technische Universität Wien + JuliaNeidhardtTechnische Universität Wien 11990-12001 Model interpretability in toxicity detection greatly profits from token-level annotations. However, currently, such annotations are only available in English. We introduce a dataset annotated for offensive language detection sourced from a news forum, notable for its incorporation of the Austrian German dialect, comprising 4,562 user comments. In addition to binary offensiveness classification, we identify spans within each comment constituting vulgar language or representing targets of offensive statements. We evaluate fine-tuned Transformer models as well as large language models in a zero- and few-shot fashion. The results indicate that while fine-tuned models excel in detecting linguistic peculiarities such as vulgar dialect, large language models demonstrate superior performance in detecting offensiveness in AustroTox. 2024.findings-acl.713 @@ -15838,7 +15838,7 @@ Discovering influential text using convolutional neural networks MeganAyers - LukeSanford + LukeSanford MargaretRobertsUniversity of California, San Diego EddieYangUniversity of California, San Diego 12002-12027 @@ -15849,13 +15849,13 @@ <fixed-case>LC</fixed-case>4<fixed-case>EE</fixed-case>: <fixed-case>LLM</fixed-case>s as Good Corrector for Event Extraction - MengnaZhu - KaishengZeng - JibingWuJibingWu - LihuaLiuNational University of Defense Technology + MengnaZhu + KaishengZeng + JibingWuJibingWu + LihuaLiuNational University of Defense Technology HongbinHuangNational University of Defense Technology - LeiHouTsinghua University, Tsinghua University - JuanziLi + LeiHouTsinghua University, Tsinghua University + JuanziLi 12028-12038 Event extraction (EE) is a critical task in natural language processing, yet deploying a practical EE system remains challenging. 
On one hand, powerful large language models (LLMs) currently show poor performance because EE task is more complex than other tasks. On the other hand, state-of-the-art (SOTA) small language models (SLMs) for EE tasks are typically developed through fine-tuning, lack flexibility, and have considerable room for improvement. We propose an approach, **L**LMs-as-**C**orrector for **E**vent **E**xtraction (**LC4EE**), aiming to leverage the superior extraction capability of SLMs and the instruction-following ability of LLMs to construct a robust and highly available EE system. By utilizing LLMs to identify and correct errors of SLMs predictions based on automatically generated feedback information, EE performances can be improved significantly. Experimental results on the representative datasets ACE2005 and MAVEN-Arg for Event Detection (ED) and EE tasks validated the effectiveness of our method. 2024.findings-acl.715 @@ -15867,7 +15867,7 @@ YihongDongPeking University XueJiangPeking University HuanyuLiu - ZhiJinPeking University and Peking University + ZhiJinPeking University and Peking University BinGuBeijing Institute of Control Engineering MengfeiYangChina Academy of Space Technology GeLiPeking University Shenzhen Graduate School @@ -15890,9 +15890,9 @@ <fixed-case>A</fixed-case>ncient <fixed-case>C</fixed-case>hinese Glyph Identification Powered by Radical Semantics YangChiJilin University - FaustoGiunchiglia + FaustoGiunchiglia ChuntaoLiJilin University - HaoXuJilin University + HaoXuJilin University 12065-12074 The ancestor of Chinese character – the ancient characters from about 1300 BC to 200 BC are not fixed in their writing glyphs. At the same or different points in time, one character can possess multiple glyphs that are different in shapes or radicals. Nearly half of ancient glyphs have not been deciphered yet. 
This paper proposes an innovative task of ancient Chinese glyph identification, which aims at inferring the Chinese character label for the unknown ancient Chinese glyphs which are not in the training set based on the image and radical information. Specifically, we construct a Chinese glyph knowledge graph (CGKG) associating glyphs in different historical periods according to the radical semantics, and propose a multimodal Chinese glyph identification framework (MCGI) fusing the visual, textual, and the graph data. The experiment is designed on a real Chinese glyph dataset spanning over 1000 years, it demonstrates the effectiveness of our method, and reports the potentials of each modality on this task. It provides a preliminary reference for the automatic ancient Chinese character deciphering at the glyph level. 2024.findings-acl.718 @@ -15904,7 +15904,7 @@ SettaluriSravanthiIndian Institute of Technology Bombay, Indian Institute of Technology, Bombay MeetDoshi PavanTankala - RudraMurthyIBM India Ltd + RudraMurthyIBM India Ltd RajDabreNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology PushpakBhattacharyyaIndian Institute of Technology, Bombay, Dhirubhai Ambani Institute Of Information and Communication Technology 12075-12097 @@ -15915,8 +15915,8 @@ <fixed-case>E</fixed-case>mo<fixed-case>T</fixed-case>rans<fixed-case>KG</fixed-case>: An Innovative Emotion Knowledge Graph to Reveal Emotion Transformation - HuanZhaoHunan University - XupengZhaHunan University + HuanZhaoHunan University + XupengZhaHunan University ZixingZhangHunan University 12098-12110 This paper introduces EmoTransKG, an innovative Emotion Knowledge Graph (EKG) that establishes connections and transformations between emotions across diverse open-textual events. 
Compared to existing EKGs, which primarily focus on linking emotion keywords to related terms or on assigning sentiment dimension ratings to emotion words by humans, EmoTransKG aims to represent the general knowledge involved in emotion transformation. Specifically, in conversations, successive emotions expressed by a single speaker are temporally considered as the head and tail entities, with open-text utterances (events) occurring between them representing the relation. To explore the knowledge of emotion transformations described in EmoTransKG, we develop a Transformer-based translational model called EmoTransNet, which predictively trains tail entities by interpreting the relation as an operation that transforms the source emotion into the target emotion. Particularly, our designed EmoTransNet serves as a plug-in module that seamlessly integrates with any conversational emotion recognition (CER) models for emotion retrofitting. Experimental results on two CER datasets demonstrate that the incorporation of EmoTransNet with baseline models results in substantial improvements, and the qualitative visualization of entities and relations clearly clarify their unique roles in emotion transformations. These experiments confirm the quality and effectiveness of EmoTransKG. @@ -15927,9 +15927,9 @@ How Vocabulary Sharing Facilitates Multilingualism in <fixed-case>LL</fixed-case>a<fixed-case>MA</fixed-case>? FeiYuan - ShuaiYuan + ShuaiYuan ZhiyongWuShanghai Artificial Intelligence Laboratory - LeiLiSchool of Computer Science, Carnegie Mellon University + LeiLiSchool of Computer Science, Carnegie Mellon University 12111-12130 Large Language Models (LLMs), often show strong performance on English tasks, while exhibiting limitations on other languages. What is an LLM’s multilingual capability when it is trained only on certain languages? The underlying mechanism remains unclear. 
This study endeavors to examine the multilingual capability of LLMs from the vocabulary sharing perspective by conducting an exhaustive analysis across 101 languages. Through the investigation of the performance gap before and after embedding fine-tuning, we discovered four distinct quadrants. By delving into each quadrant we provide actionable and efficient guidelines for tuning these languages. Extensive experiments reveal that existing LLMs possess multilingual capabilities that surpass our expectations, and we can significantly improve the multilingual performance of LLMs based on these attributes of each quadrant . 2024.findings-acl.721 @@ -15940,9 +15940,9 @@ Prefix Text as a Yarn: Eliciting Non-<fixed-case>E</fixed-case>nglish Alignment in Foundation Language Model RunzheZhanUniversity of Macau XinyiYang - DerekWongUniversity of Macau + DerekWongUniversity of Macau LidiaChao - YueZhangWestlake University + YueZhangWestlake University 12131-12145 While supervised fine-tuning (SFT) has been a straightforward approach for tailoring the output of foundation large language model (LLM) to specific preferences, concerns have been raised about the depth of this alignment, with some critiques suggesting it is merely “superficial”. We critically examine this hypothesis within the scope of cross-lingual generation tasks, proposing that the effectiveness of SFT may be constrained by its reliance on prior tokens to guide cross-lingual generation. Based on this crucial insight, and in response to the challenges posed by the costly and limited availability of non-English data for SFT, we introduce a novel training-free alignment method named PreTTY, which employs minimal task-related prior tokens to bridge the foundation LLM and the SFT LLM, achieving comparable performance without training. Experiments on machine translation and part-of-speech tagging across seven languages demonstrate the efficacy of PreTTY in cross-lingual settings. 
Remarkably, by initiating the decoding process with only one or two prior tokens, foundation LLMs can attain up to 98% of the performance metrics of their SFT counterparts. This method presents a cost-effective alternative to traditional SFT and advances the democratization of multilingual LLMs. 2024.findings-acl.722 @@ -15952,12 +15952,12 @@ Dual Prompt Tuning based Contrastive Learning for Hierarchical Text Classification SishiXiongChina Telecom - YuZhao - JieZhang + YuZhao + JieZhang LiMengxiang ZhongjiangHe XuelongLiNorthwestern Polytechnical University - ShuangyongSong + ShuangyongSong 12146-12158 Hierarchical text classification aims at categorizing texts into a multi-tiered tree-structured hierarchy of labels. Existing methods pay more attention to capture hierarchy-aware text feature by exploiting explicit parent-child relationships, while interactions between peer labels are rarely taken into account, resulting in severe label confusion within each layer. In this work, we propose a novel Dual Prompt Tuning (DPT) method, which emphasizes identifying discrimination among peer labels by performing contrastive learning on each hierarchical layer. We design an innovative hand-crafted prompt containing slots for both positive and negative label predictions to cooperate with contrastive learning. In addition, we introduce a label hierarchy self-sensing auxiliary task to ensure cross-layer label consistency. Extensive experiments demonstrate that DPT achieves significant improvements and outperforms the current state-of-the-art methods on BGC and RCV1-V2 benchmark datasets. 
2024.findings-acl.723 @@ -15967,8 +15967,8 @@ Probing the Emergence of Cross-lingual Alignment during <fixed-case>LLM</fixed-case> Training HetongWang - PasqualeMinerviniUniversity of Edinburgh, University of Edinburgh - EdoardoPontiUniversity of Edinburgh + PasqualeMinerviniUniversity of Edinburgh, University of Edinburgh + EdoardoPontiUniversity of Edinburgh 12159-12173 Multilingual Large Language Models (LLMs) achieve remarkable levels of zero-shot cross-lingual transfer performance. We speculate that this is predicated on their ability to align languages without explicit supervision from parallel sentences. While representations of translationally equivalent sentences in different languages are known to be similar after convergence, however, it remains unclear how such cross-lingual alignment emerges during pre-training of LLMs. Our study leverages intrinsic probing techniques, which identify which subsets of neurons encode linguistic features, to correlate the degree of cross-lingual neuron overlap with the zero-shot cross-lingual transfer performance for a given model. In particular, we rely on checkpoints of BLOOM, a multilingual autoregressive LLM, across different training steps and model scales. We observe a high correlation between neuron overlap and downstream performance, which supports our hypothesis on the conditions leading to effective cross-lingual transfer. Interestingly, we also detect a degradation of both implicit alignment and multilingual abilities in certain phases of the pre-training process, providing new insights into the multilingual pretraining dynamics. 
2024.findings-acl.724 @@ -15979,7 +15979,7 @@ <fixed-case>STSPL</fixed-case>-<fixed-case>SSC</fixed-case>: Semi-Supervised Few-Shot Short Text Clustering with Semantic text similarity Optimized Pseudo-Labels WenhuaNieNational Yang Ming Chiao Tung University LinDeng - Chang-BoLiu + Chang-BoLiu JialingWeiJialingWei RuitongHan HaoranZheng @@ -15997,7 +15997,7 @@ WeiLiuxiaomi JianLuan BinWangAI Lab, Xiaomi Inc. - DeyiXiongTianjin University + DeyiXiongTianjin University 12186-12215 Increasing the number of parameters in large language models (LLMs) usually improves performance in downstream tasks but raises compute and memory costs, making deployment difficult in resource-limited settings. Quantization techniques, which reduce the bits needed for model weights or activations with minimal performance loss, have become popular due to the rise of LLMs. However, most quantization studies use pre-trained LLMs, and the impact of quantization on instruction-tuned LLMs and the relationship between perplexity and benchmark performance of quantized LLMs are not well understood. Evaluation of quantized LLMs is often limited to language modeling and a few classification tasks, leaving their performance on other benchmarks unclear. To address these gaps, we propose a structured evaluation framework consisting of three critical dimensions: (1) knowledge & capacity, (2) alignment, and (3) efficiency, and conduct extensive experiments across ten diverse benchmarks. Our experimental results indicate that LLMs with 4-bit quantization can retain performance comparable to their non-quantized counterparts, and perplexity can serve as a proxy metric for quantized LLMs on most benchmarks. Furthermore, quantized LLMs with larger parameter scales can outperform smaller LLMs. Despite the memory savings achieved through quantization, it can also slow down the inference speed of LLMs. 
Consequently, substantial engineering efforts and hardware support are imperative to achieve a balanced optimization of decoding speed and memory consumption in the context of quantized LLMs. 2024.findings-acl.726 @@ -16059,10 +16059,10 @@ Decomposing Argumentative Essay Generation via Dialectical Planning of Complex Reasoning YuhangHe JianzhuBaoHarbin Institute of Technology - YangSun - BinLiang + YangSun + BinLiang MinYang - BingQinHarbin Institute of Technology + BingQinHarbin Institute of Technology RuifengXuHarbin Institute of Technology 12305-12322 Argumentative Essay Generation (AEG) is a challenging task in computational argumentation, where detailed logical reasoning and effective rhetorical skills are essential. Previous methods on argument generation typically involve planning prior to generation. However, the planning strategies in these methods overlook the exploration of the logical reasoning process. Inspired by argument structure-related theories, we propose an argumentative planning strategy for prompting large language models (LLMs) to generate high-quality essays. This strategy comprises two stages: (1) Sketch planning, which creates a rough outline of the essay, and (2) Dialectical planning, which refines the outline through critical self-reflection. Such a planning strategy enables LLMs to write argumentative essays that are more logical, diverse, and persuasive. Furthermore, due to the scarcity of existing AEG datasets, we construct three new datasets. These datasets are from two domains: exam essays and news editorials, covering both Chinese and English. Automatic and manual evaluation on four datasets show that our method can generate more dialectical and persuasive essays with higher diversity compared to several strong baselines. 
@@ -16074,7 +16074,7 @@ Large Language Models are Few-Shot Training Example Generators: A Case Study in Fallacy Recognition TariqAlhindi SmarandaMuresanAmazon and Columbia University - PreslavNakovMohamed bin Zayed University of Artificial Intelligence + PreslavNakovMohamed bin Zayed University of Artificial Intelligence 12323-12334 Recognizing fallacies is crucial for ensuring the quality and validity of arguments across various domains. However, computational fallacy recognition faces challenges due to the diverse genres, domains, and types of fallacies found in datasets. This leads to a highly multi-class, and even multi-label, setup with substantial class imbalance. In this study, we aim to enhance existing models for fallacy recognition by incorporating additional context and by leveraging large language models to generate synthetic data, thus increasing the representation of the infrequent classes. We experiment with GPT3.5 to generate synthetic examples and we examine the impact of prompt settings for this. Moreover, we explore zero-shot and few-shot scenarios to evaluate the effectiveness of using the generated examples for training smaller models within a unified fallacy recognition framework. Furthermore, we analyze the overlap between the synthetic data and existing fallacy datasets. Finally, we investigate the usefulness of providing supplementary context for detecting fallacy types that need such context, e.g., diversion fallacies. Our evaluation results demonstrate consistent improvements across fallacy types, datasets, and generators. The code and the synthetic datasets are all publicly available. 
2024.findings-acl.732 @@ -16085,7 +16085,7 @@ Concept-aware Data Construction Improves In-context Learning of Language Models MichalŠtefánik MarekKadlčíkMasaryk University - PetrSojkaFaculty of Informatics, Masaryk University + PetrSojkaFaculty of Informatics, Masaryk University 12335-12352 Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs’ ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings.In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilise new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learners are much more effective in in-context learning a majority of unseen tasks compared to traditional instruction tuning, and fare comparably also to previous in-context learners trained in large-scale multitask learning requiring magnitudes of more training data. 
2024.findings-acl.733 @@ -16094,9 +16094,9 @@ Beyond Text: Leveraging Multi-Task Learning and Cognitive Appraisal Theory for Post-Purchase Intention Analysis - GerardYeo + GerardYeo ShazFurniturewala - KokilJaidkaNational University of Singapore + KokilJaidkaNational University of Singapore 12353-12360 Supervised machine-learning models for predicting user behavior offer a challenging classification problem with lower average prediction performance scores than other text classification tasks. This study evaluates multi-task learning frameworks grounded in Cognitive Appraisal Theory to predict user behavior as a function of users’ self-expression and psychological attributes. Our experiments show that users’ language and traits improve predictions above and beyond models predicting only from text. Our findings highlight the importance of integrating psychological constructs into NLP to enhance the understanding and prediction of user actions. We close with a discussion of the implications for future applications of large language models for computational psychology. 
2024.findings-acl.734 @@ -16107,7 +16107,7 @@ Non-Autoregressive Machine Translation as Constrained <fixed-case>HMM</fixed-case> HaoranLi ZhanmingJieSalesforce Research - WeiLuSingapore University of Technology and Design + WeiLuSingapore University of Technology and Design 12361-12372 In non-autoregressive translation (NAT), directed acyclic Transformers (DAT) have demonstrated their ability to achieve comparable performance to the autoregressive Transformers. In this paper, we first show that DAT is essentially a fully connected left-to-right Hidden Markov Model (HMM), with the source and target sequences being observations and the token positions being latent states. Even though generative models like HMM do not suffer from label bias in traditional task settings (e.g., sequence labeling), we argue here that the left-to-right HMM in NAT may still encounter this issue due to the missing observations at the inference stage. To combat label bias, we propose two constrained HMMs: 1) Adaptive Window HMM, which explicitly balances the number of outgoing transitions at different states; 2) Bi-directional HMM, i.e., a combination of left-to-right and right-to-left HMMs, whose uni-directional components can implicitly regularize each other’s biases via shared parameters. Experimental results on WMT’14 EnDe and WMT’17 ZhEn demonstrate that our methods can achieve better or comparable performance to the original DAT using various decoding methods. We also demonstrate that our methods effectively reduce the impact of label bias. 
2024.findings-acl.735 @@ -16116,13 +16116,13 @@ Multi-modal Stance Detection: New Datasets and Model - BinLiang - AngLi - JingqianZhaoHarbin Institute of Technology + BinLiang + AngLi + JingqianZhaoHarbin Institute of Technology LinGuiKing’s College London, University of London MinYangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences YueYuNational University of Defense Technology and PengCheng Lab - Kam-FaiWongThe Chinese University of Hong Kong + Kam-FaiWongThe Chinese University of Hong Kong RuifengXuHarbin Institute of Technology 12373-12387 Stance detection is a challenging task that aims to identify public opinion from social media platforms with respect to specific targets. Previous work on stance detection largely focused on pure texts. In this paper, we study multi-modal stance detection for tweets consisting of texts and images, which are prevalent in today’s fast-growing social media platforms where people often post multi-modal messages. To this end, we create five new multi-modal stance detection datasets of different domains based on Twitter, in which each example consists of a text and an image. In addition, we propose a simple yet effective Targeted Multi-modal Prompt Tuning framework (TMPT), where target information is leveraged to learn multi-modal stance features from textual and visual modalities. Experimental results on our five benchmark datasets show that the proposed TMPT achieves state-of-the-art performance in multi-modal stance detection. 
@@ -16132,9 +16132,9 @@ Enhanced Language Model Truthfulness with Learnable Intervention and Uncertainty Expression - FarimaFatahi BayatUniversity of Michigan - Ann Arbor + FarimaFatahi BayatUniversity of Michigan - Ann Arbor XinLiuUniversity of Michigan - Ann Arbor - H.JagadishUniversity of Michigan - Ann Arbor + H.JagadishUniversity of Michigan - Ann Arbor LuWangNortheastern University, Northeastern University and University of Michigan 12388-12400 Large language models (LLMs) can generate long-form and coherent text, yet they often hallucinate facts, which undermines their reliability. To mitigate this issue, inference-time methods steer LLM representations toward the “truthful directions” previously learned for truth elicitation. However, applying these truthful directions with the same intensity fails to generalize across different query contexts. We propose LITO, a Learnable Intervention method for Truthfulness Optimization that automatically identifies the optimal intervention intensity tailored to each specific context. LITO explores a sequence of model generations based on increasing levels of intervention intensities. It selects the most accurate response or refuses to answer when the predictions are highly uncertain. Experiments on multiple LLMs and question-answering datasets demonstrate that LITO improves truthfulness while preserving task accuracy. The adaptive nature of LITO counters the limitations of one-size-fits-all intervention methods, maximizing truthfulness by reflecting the model’s internal knowledge only when it is confident. Our code is available at https://github.com/launchnlp/LITO. 
@@ -16144,12 +16144,12 @@ <fixed-case>MM</fixed-case>-<fixed-case>LLM</fixed-case>s: Recent Advances in <fixed-case>M</fixed-case>ulti<fixed-case>M</fixed-case>odal Large Language Models - DuzhenZhang + DuzhenZhang YahanYuKyoto University, Kyoto University JiahuaDong ChenxingLi DanSu - ChenhuiChuKyoto University + ChenhuiChuKyoto University DongYuTencent AI Lab 12401-12430 In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a [real-time tracking website](https://mm-llms.github.io/) for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain. 
@@ -16159,25 +16159,25 @@ <fixed-case>CIF</fixed-case>-Bench: A <fixed-case>C</fixed-case>hinese Instruction-Following Benchmark for Evaluating the Generalizability of Large Language Models - YizhiLiUniversity of Manchester and University of Sheffield + YizhiLiUniversity of Manchester and University of Sheffield GeZhang XingweiQuHong Kong University of Science and Technology JialiLiNational University of Singapore ZhaoqunLi NoahWang - HaoLi + HaoLi RuibinYuan - YinghaoMaQueen Mary University of London + YinghaoMaQueen Mary University of London KaiZhang WangchunshuZhouAIWaves Inc. YimingLiang LeiZhang - LeiMaPeking University and Beijing Academy of Artifical Intelligence + LeiMaPeking University and Beijing Academy of Artifical Intelligence JiajunZhangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences ZuowenLiBeijing Foreign Studies University WenhaoHuang ChenghuaLinUniversity of Manchester - JieFuHong Kong University of Science and Technology + JieFuHong Kong University of Science and Technology 12431-12446 The advancement of large language models (LLMs) has enhanced the ability to generalize across a wide range of unseen natural language processing (NLP) tasks through instruction-following.Yet, their effectiveness often diminishes in low-resource languages like Chinese, exacerbated by biased evaluations from data leakage, casting doubt on their true generalizability to new linguistic territories. In response, we introduce the Chinese Instruction-Following Benchmark (**CIF-Bench**), designed to evaluate the zero-shot generalizability of LLMs to the Chinese language. CIF-Bench comprises 150 tasks and 15,000 input-output pairs, developed by native speakers to test complex reasoning and Chinese cultural nuances across 20 categories. 
To mitigate data contamination, we release only half of the dataset publicly, with the remainder kept private, and introduce diversified instructions to minimize score variance, totaling 45,000 data instances.Our evaluation of 28 selected LLMs reveals a noticeable performance gap, with the best model scoring only 52.9%, highlighting the limitations of LLMs in less familiar language and task contexts.This work not only uncovers the current limitations of LLMs in handling Chinese language tasks but also sets a new standard for future LLM generalizability research, pushing towards the development of more adaptable, culturally informed, and linguistically diverse models. 2024.findings-acl.739 @@ -16190,7 +16190,7 @@ FlorianStrubDeepMind RahmaChaabouniGoogle PaulMichelDeepMind - EmmanuelDupouxEHESS + EmmanuelDupouxEHESS OlivierPietquinCohere and Earth Species Project 12447-12472 While reinforcement learning (RL) has been proven essential for tuning large language models (LLMs), it can lead to reward over-optimization (ROO). Existing approaches address ROO by adding KL regularization, requiring computationally expensive hyperparameter tuning. Additionally, KL regularization focuses solely on regularizing the language policy, neglecting a potential source of regularization: the reward function itself. Inspired by demonstration-guided RL, we here introduce the Reward Calibration from Demonstration (RCfD), which leverages human demonstrations and a reward model to recalibrate the reward objective. Formally, given a prompt, the RCfD objective minimizes the distance between the demonstrations’ and LLM’s rewards rather than directly maximizing the reward function. This objective shift avoids incentivizing the LLM to exploit the reward model and promotes more natural and diverse language generation.We show the effectiveness of RCfD in three RL language tasks, where it achieves comparable performance to carefully tuned baselines while mitigating ROO. 
@@ -16202,8 +16202,8 @@ Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss WeiHeUniversity of Sheffield MarcoIdiartUniversidade Federal do Rio Grande do Sul - CarolinaScartonUniversity of Sheffield - AlineVillavicencioUniversity of Exeter and University of Sheffield + CarolinaScartonUniversity of Sheffield + AlineVillavicencioUniversity of Exeter and University of Sheffield 12473-12485 Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP). This is partly because these expressions do not derive their meanings solely from their constituent words, but also due to the scarcity of relevant data resources, and their impact on the performance of downstream tasks such as machine translation and simplification. In this paper we propose an approach to model idiomaticity effectively using a triplet loss that incorporates the asymmetric contribution of components words to an idiomatic meaning for training language models by using adaptive contrastive learning and resampling miners to build an idiomatic-aware learning objective. Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics. 2024.findings-acl.741 @@ -16216,7 +16216,7 @@ HangYanAI lab QipengGuoShanghai AI Laboratory HaijunLv - XipengQiuFudan University + XipengQiuFudan University 12486-12502 Large language models have achieved remarkable success, but their extensive parameter size necessitates substantial memory for training, thereby setting a high threshold. While the recently proposed low-memory optimization (LOMO) reduces memory footprint, its optimization technique, akin to stochastic gradient descent, is sensitive to hyper-parameters and exhibits suboptimal convergence, failing to match the performance of the prevailing optimizer for large language models, AdamW. 
Through analysis of the Adam optimizer, we found that, compared to momentum, the adaptive learning rate is more critical for bridging the gap. Building on this insight, we introduce the low-memory optimization with adaptive learning rate (AdaLomo), which offers an adaptive learning rate for each parameter and exhibits superior convergence performance compared to LOMO theoretically. To maintain memory efficiency, we employ non-negative matrix factorization for the second-order moment estimation. Additionally, we suggest the use of a grouped update normalization to stabilize convergence. Our experiments with instruction-tuning and further pre-training demonstrate that AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models. The code is accessible at https://github.com/OpenLMLab/LOMO. 2024.findings-acl.742 @@ -16227,7 +16227,7 @@ Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks WenyueHuaRutgers University, New Brunswick JiangGuo - MingwenDong + MingwenDong HenghuiZhuAmazon PatrickNgAmazon ZhiguoWang @@ -16243,7 +16243,7 @@ TaliaTseriotou XeniaMiscouridouUniversity of Cyprus and Imperial College London AdamTsakalidisCedefop and Alan Turing Institute - MariaLiakataQueen Mary University London + MariaLiakataQueen Mary University London 12526-12537 Through the rise of social media platforms, longitudinal language modelling has received much attention over the latest years, especially in downstream tasks such as mental health monitoring of individuals where modelling linguistic content in a temporal fashion is crucial. A key limitation in existing work is how to effectively model temporal sequences within Transformer-based language models. 
In this work we address this challenge by introducing a novel approach for predicting ‘Moments of Change’ (MoC) in the mood of online users, by simultaneously considering user linguistic and time-aware context. A Hawkes process-inspired transformation layer is applied over the proposed architecture to model the influence of time on users’ posts – capturing both their immediate and historical dynamics. We perform experiments on the two existing datasets for the MoC task and showcase clear performance gains when leveraging the proposed layer. Our ablation study reveals the importance of considering temporal dynamics in detecting subtle and rare mood changes. Our results indicate that considering linguistic and temporal information in a hierarchical manner provide valuable insights into the temporal dynamics of modelling user generated content over time, with applications in mental health monitoring. 2024.findings-acl.744 @@ -16259,7 +16259,7 @@ ShanshanGuo JianhuaHanHuawei Technologies Ltd. HangXuHuawei Noah‘s Ark Lab - ShikuiMaDataa Robotics + ShikuiMaDataa Robotics XiaodanLiang 12538-12559 Understanding and following natural language instructions while navigating through complex, real-world environments poses a significant challenge for general-purpose robots. These environments often include obstacles and pedestrians, making it essential for autonomous agents to possess the capability of self-corrected planning to adjust their actions based on feedback from the surroundings. However, the majority of existing vision-and-language navigation (VLN) methods primarily operate in less realistic simulator settings and do not incorporate environmental feedback into their decision-making processes. 
To address this gap, we introduce a novel zero-shot framework called CorNav, utilizing a large language model for decision-making and comprising two key components: 1) incorporating environmental feedback for refining future plans and adjusting its actions, and 2) multiple domain experts for parsing instructions, scene understanding, and refining predicted actions. In addition to the framework, we develop a 3D simulator that renders realistic scenarios using Unreal Engine 5. To evaluate the effectiveness and generalization of navigation agents in a zero-shot multi-task setting, we create a benchmark called NavBench. Our empirical study involves deploying 7 baselines across four tasks, i.e., goal-conditioned navigation given a specific object category, goal-conditioned navigation given simple instructions, finding abstract objects based on high-level instructions, and step-by-step instruction following. Extensive experiments demonstrate that CorNav consistently outperforms all baselines by a significant margin across all tasks. On average, CorNav achieves a success rate of 28.1%, surpassing the best baseline’s performance of 20.5%. 
@@ -16270,18 +16270,18 @@ <fixed-case>S</fixed-case>ci<fixed-case>MMIR</fixed-case>: Benchmarking Scientific Multi-modal Information Retrieval SiweiWuNanjing University of Science and Technology - YizhiLiUniversity of Manchester and University of Sheffield + YizhiLiUniversity of Manchester and University of Sheffield KangZhu GeZhang YimingLiang KaijingMa ChenghaoXiao HaoranZhang - BohaoYangUniversity of Manchester + BohaoYangUniversity of Manchester WenhuChenUniversity of Waterloo and Google WenhaoHuang NouraAl MoubayedDurham University - JieFuHong Kong University of Science and Technology + JieFuHong Kong University of Science and Technology ChenghuaLinUniversity of Manchester 12560-12574 Multi-modal information retrieval (MMIR) is a rapidly evolving field where significant progress has been made through advanced representation learning and cross-modality alignment research, particularly in image-text pairing.However, current benchmarks for evaluating MMIR performance on image-text pairings overlook the scientific domain, which has a notable gap with the generic data since the caption of scientific charts and tables usually describes the analysis of experimental results or scientific principles in contrast to human activity or scenery depicted in generic images.To bridge this gap, we develop a scientific domain-specific MMIR benchmark (SciMMIR) by leveraging open-access research paper corpora to extract data relevant to the scientific domain. This benchmark comprises 530K meticulously curated image-text pairs, extracted from figures and tables with detailed captions from scientific documents.We further annotate the image-text pairs with a two-level subset-subcategory hierarchy to facilitate a more comprehensive evaluation of the baselines. 
We conduct zero-shot and fine-tuned evaluations on prominent multi-modal image-captioning and visual language models, such as CLIP, BLIP, and BLIP-2. Our findings offer critical insights for MMIR in the scientific domain, including the impact of pre-training and fine-tuning settings and the effects of different visual and textual encoders. @@ -16292,7 +16292,7 @@ Diving Deep into the Motion Representation of Video-Text Models ChinmayaDevarajUniversity of Maryland, College Park - CorneliaFermullerUniversity of Maryland, College Park + CorneliaFermullerUniversity of Maryland, College Park YiannisAloimonosUniversity of Maryland, College Park 12575-12584 Videos are more informative than images because they capture the dynamics of the scene. By representing motion in videos, we can capture dynamic activities. In this work, we introduce GPT-4 generated motion descriptions that capture fine-grained motion descriptions of activities and apply them to three action datasets. We evaluated several video-text models on the task of retrieval of motion descriptions. We found that they fall far behind human expert performance on two action datasets, raising the question of whether video-text models understand motion in videos. To address it, we introduce a method of improving motion understanding in video-text models by utilizing motion descriptions. This method proves to be effective on two action datasets for the motion description retrieval task. The results draw attention to the need for quality captions involving fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline in understanding fine-grained motion during video-text retrieval. 
@@ -16317,7 +16317,7 @@ AnirudhSomSRI International KaranSikkaSRI International HelenGentSRI International - AjayDivakaranSRI International + AjayDivakaranSRI International AndreasKathol DimitraVergyri 12612-12627 @@ -16331,7 +16331,7 @@ KhiemPhiState University of New York at Stony Brook NoushinSalek Faramarzi, State University of New York at Stony Brook ChenluWangState University of New York at Stony Brook - RitwikBanerjeeState University of New York, Stony Brook + RitwikBanerjeeState University of New York, Stony Brook 12628-12643 Whataboutism, a potent tool for disrupting narratives and sowing distrust, remains under-explored in quantitative NLP research. Moreover, past work has not distinguished its use as a strategy for misinformation and propaganda from its use as a tool for pragmatic and semantic framing. We introduce new datasets from Twitter/X and YouTube, revealing overlaps as well as distinctions between whataboutism, propaganda, and the tu quoque fallacy. Furthermore, drawing on recent work in linguistic semantics, we differentiate the ‘what about’ lexical construct from whataboutism. Our experiments bring to light unique challenges in its accurate detection, prompting the introduction of a novel method using attention weights for negative sample mining. We report significant improvements of 4% and 10% over previous state-of-the-art methods in our Twitter and YouTube collections, respectively. 2024.findings-acl.750 @@ -16360,8 +16360,8 @@ <fixed-case>LLM</fixed-case>s as Narcissistic Evaluators: When Ego Inflates Evaluation Scores - YiqiLiuUniversity of Manchester - NafiseMoosaviUniversity of Sheffield + YiqiLiuUniversity of Manchester + NafiseMoosaviUniversity of Sheffield ChenghuaLinUniversity of Manchester 12688-12701 Automatic evaluation of generated textual content presents an ongoing challenge within the field of NLP. 
Given the impressive capabilities of modern language models (LMs) across diverse NLP tasks, there is a growing trend to employ these models in creating innovative evaluation metrics for automated assessment of generation tasks. This paper investigates a pivotal question: Do language model-driven evaluation metrics inherently exhibit bias favoring texts generated by the same underlying language model? Specifically, we assess whether prominent LM-based evaluation metrics (e.g. BARTScore, T5Score, and GPTScore) demonstrate a favorable bias toward their respective underlying LMs in the context of summarization tasks. Our findings unveil a latent bias, particularly pronounced when such evaluation metrics are used in a reference-free manner without leveraging gold summaries. These results underscore that assessments provided by generative evaluation models can be influenced by factors beyond the inherent text quality, highlighting the necessity of developing more reliable evaluation protocols in the future. @@ -16389,7 +16389,7 @@ NemikaTyagiArizona State University Md NayemUddinArizona State University NeerajVarshney - ChittaBaralArizona State University + ChittaBaralArizona State University 12717-12733 This study explores the sycophantic tendencies of Large Language Models (LLMs), where these models tend to provide answers that match what users want to hear, even if they are not entirely correct. The motivation behind this exploration stems from the common behavior observed in individuals searching the internet for facts with partial or misleading knowledge. Similar to using web search engines, users may recall fragments of misleading keywords and submit them to an LLM, hoping for a comprehensive response. Our empirical analysis of several LLMs shows the potential danger of these models amplifying misinformation when presented with misleading keywords. 
Additionally, we thoroughly assess four existing hallucination mitigation strategies to reduce LLMs sycophantic behavior. Our experiments demonstrate the effectiveness of these strategies for generating factually correct statements. Furthermore, our analyses delve into knowledge-probing experiments on factual keywords and different categories of sycophancy mitigation. 2024.findings-acl.755 @@ -16411,7 +16411,7 @@ Choose Your Transformer: Improved Transferability Estimation of Transformer Models on Classification Tasks LukasGarbaciauskas - MaxPlonerHumboldt Universität Berlin + MaxPlonerHumboldt Universität Berlin AlanAkbikHumboldt Universität Berlin 12752-12768 There currently exists a multitude of pre-trained transformer language models (LMs) that are readily available. From a practical perspective, this raises the question of which pre-trained LM will perform best if fine-tuned for a specific downstream NLP task. However, exhaustively fine-tuning all available LMs to determine the best-fitting model is computationally infeasible. To address this problem, we present an approach that inexpensively estimates a ranking of the expected performance of a given set of candidate LMs for a given task. Following a layer-wise representation analysis, we extend existing approaches such as H-score and LogME by aggregating representations across all layers of the transformer model. We present an extensive analysis of 20 transformer LMs, 6 downstream NLP tasks, and various estimators (linear probing, kNN, H-score, and LogME). Our evaluation finds that averaging the layer representations significantly improves the Pearson correlation coefficient between the true model ranks and the estimate, increasing from 0.58 to 0.86 for LogME and from 0.65 to 0.88 for H-score. 
@@ -16442,7 +16442,7 @@ LeslyMiculicichGoogle NanyunPengUniversity of California, Los Angeles Chen-YuLeeGoogle - TomasPfisterGoogle + TomasPfisterGoogle 12782-12803 Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, by either feeding LMs with raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. Larger LM responses that closely align with the smaller LMs’ output, which relies exclusively on cited documents, are verified. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant performance gains of 1.5% to 7% absolute average without any required model fine-tuning. 2024.findings-acl.759 @@ -16484,10 +16484,10 @@ <fixed-case>O</fixed-case>pen<fixed-case>C</fixed-case>ode<fixed-case>I</fixed-case>nterpreter: Integrating Code Generation with Execution and Refinement TianyuZheng GeZhang - TianhaoShen + TianhaoShen XuelingLiu Bill YuchenLin - JieFuHong Kong University of Science and Technology + JieFuHong Kong University of Science and Technology WenhuChenUniversity of Waterloo and Google XiangYueCarnegie Mellon University 12834-12859 @@ -16513,13 +16513,13 @@ ZaidAlyafeai KhalidAlmubarak AhmedAshraf - DeemaAlnuhait + DeemaAlnuhait SaiedAlshahrani Gubran A. 
Q.Abdulrahman GamilAhmed QaisGawah ZeadSaleh - MustafaGhaleb + MustafaGhaleb YousefAli Maged S.Al-shaibani 12878-12901 @@ -16539,7 +16539,7 @@ DaveVan VeenStanford University TanBui StevenTruongVinbrain JSC and Toronto University - CurtisLanglotzStanford University + CurtisLanglotzStanford University 12902-12915 In order to enable extraction of structured clinical data from unstructured radiology reports, we introduce RadGraph-XL, a large-scale, expert-annotated dataset for clinical entity and relation extraction. RadGraph-XL consists of 2,300 radiology reports, which are annotated with over 410,000 entities and relations by board-certified radiologists. Whereas previous approaches focus solely on chest X-rays, RadGraph-XL includes data from four anatomy-modality pairs - chest CT, abdomen/pelvis CT, brain MR, and chest X-rays. Then, in order to automate structured information extraction, we use RadGraph-XL to train transformer-based models for clinical entity and relation extraction. Our evaluations include comprehensive ablation studies as well as an expert reader study that evaluates trained models on out-of-domain data. Results demonstrate that our model surpasses the performance of previous methods by up to 52% and notably outperforms GPT-4 in this domain. We release RadGraph-XL as well as our trained model to foster further innovation and research in structured clinical information extraction. 
2024.findings-acl.765 @@ -16560,11 +16560,11 @@ Selective “Selective Prediction”: Reducing Unnecessary Abstention in Vision-Language Reasoning TejasSrinivasanUniversity of Southern California - JackHesselSamaya AI + JackHesselSamaya AI TanmayGuptaAllen Institute for Artificial Intelligence Bill YuchenLin YejinChoiDepartment of Computer Science, University of Washington - JesseThomasonUniversity of Southern California and Amazon + JesseThomasonUniversity of Southern California and Amazon KhyathiChandu 12935-12948 Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm to reduce the over-abstention of a selective vision-language system without increasing the error rate of the system’s predictions. When the VLM makes a low-confidence prediction, instead of abstaining ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM, collects high-confidence evidences, and if enough evidence confirms the prediction the system makes a prediction instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR. 
@@ -16575,7 +16575,7 @@ Language Model Priors and Data Augmentation Strategies for Low-resource Machine Translation: A Case Study Using <fixed-case>F</fixed-case>innish to <fixed-case>N</fixed-case>orthern <fixed-case>S</fixed-case>ámi JonneSäleväBrandeis University - ConstantineLignosBrandeis University + ConstantineLignosBrandeis University 12949-12956 We investigate ways of using monolingual data in both the source and target languages for improving low-resource machine translation. As a case study, we experiment with translation from Finnish to Northern Sámi.Our experiments show that while conventional backtranslation remains a strong contender, using synthetic target-side data when training backtranslation models can be helpful as well.We also show that monolingual data can be used to train a language model which can act as a regularizer without any augmentation of parallel data. 2024.findings-acl.768 @@ -16596,9 +16596,9 @@ <fixed-case>KIWI</fixed-case>: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions FangyuanXuUniversity of Texas at Austin and University of Texas at Austin KyleLoAllen Institute for Artificial Intelligence - LucaSoldainiAllen Institute for Artificial Intelligence + LucaSoldainiAllen Institute for Artificial Intelligence BaileyKuehl - EunsolChoiUniversity of Texas, Austin + EunsolChoiUniversity of Texas, Austin DavidWaddenAllen Institute for Artificial Intelligence 12969-12990 Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents. In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer. To evaluate the capabilities of current LLMs on this task, we construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain. 
Given a research question, an initial model-generated answer and a set of relevant papers, an expert annotator iteratively issues instructions for the model to revise and improve its answer. We collect 1,260 interaction turns from 234 interaction sessions with three state-of-the-art LLMs. Each turn includes a user instruction, a model response, and a human evaluation of the model response. Through a detailed analysis of the collected responses, we find that all models struggle to incorporate new information into an existing answer, and to perform precise and unambiguous edits. Further, we find that models struggle to judge whether their outputs successfully followed user instructions, with accuracy at least 10 points short of human agreement. Our findings indicate that KIWI will be a valuable resource to measure progress and improve LLMs’ instruction-following capabilities for knowledge intensive writing tasks. @@ -16608,7 +16608,7 @@ <fixed-case>XL</fixed-case>-<fixed-case>H</fixed-case>ead<fixed-case>T</fixed-case>ags: Leveraging Multimodal Retrieval Augmentation for the Multilingual Generation of News Headlines and Tags - Faisal TarequeShohan + Faisal TarequeShohan Mir TafseerNayeem SamsulIslam Abu UbaidaAkash @@ -16623,9 +16623,9 @@ <fixed-case>I</fixed-case>n<fixed-case>F</fixed-case>o<fixed-case>B</fixed-case>ench: Evaluating Instruction Following Ability in Large Language Models YiweiQin KaiqiangSongTencent AI Lab - YebowenHuUniversity of Central Florida + YebowenHuUniversity of Central Florida WenlinYaoTencent AI Lab - SangwooChoCapital One + SangwooChoCapital One XiaoyangWangTencent AI Lab XuanshengWu FeiLiuEmory University @@ -16654,7 +16654,7 @@ GaganBhatia El Moatez BillahNagoudiUniversity of British Columbia HasanCavusogluSauder School of Business - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 13064-13087 We introduce FinTral, a suite of state-of-the-art multimodal large language models (LLMs) 
built upon the Mistral-7b model and tailored for financial analysis. FinTral integrates textual, numerical, tabular, and image data. We enhance FinTral with domain-specific pretraining, instruction fine-tuning, and RLAIF training by exploiting a large collection of textual and visual datasets we curate for this work. We also introduce an extensive benchmark featuring nine tasks and 25 datasets for evaluation, including hallucinations in the financial domain. Our FinTral model trained with direct preference optimization employing advanced Tools and Retrieval methods, dubbed FinTral-DPO-T&R, demonstrates an exceptional zero-shot performance. It outperforms ChatGPT-3.5 in all tasks and surpasses GPT-4 in five out of nine tasks, marking a significant advancement in AI-driven financial technology. We also demonstrate that FinTral has the potential to excel in real-time analysis and decision-making in diverse financial contexts. 2024.findings-acl.774 @@ -16672,8 +16672,8 @@ ChuangGan LiangyanGuiUIUC Yu-XiongWangSchool of Computer Science, Carnegie Mellon University and Department of Computer Science, University of Illinois Urbana-Champaign - YimingYangSchool of Computer Science, Carnegie Mellon University - KurtKeutzerUniversity of California Berkeley + YimingYangSchool of Computer Science, Carnegie Mellon University + KurtKeutzerUniversity of California Berkeley TrevorDarrellElectrical Engineering & Computer Science Department 13088-13110 Large Multimodal Models (LMM) are built across modalities and the misalignment between two modalities can result in “hallucination”, generating textual outputs that are not grounded by the multimodal information in context. 
To address the multimodal misalignment issue, we adapt the Reinforcement Learning from Human Feedback (RLHF) from the text domain to the vision-language alignment, where human annotators are asked to compare two responses and pinpoint the more hallucinated one, and the vision-language model is trained to maximize the simulated human rewards. We propose a new alignment algorithm called Factually Augmented RLHF that augments the reward model with additional factual information such as image captions and ground-truth multi-choice options, which alleviates the reward hacking phenomenon in RLHF and further improves the performance. We also enhance the GPT-4-generated training data (for vision instruction tuning) with previously available human-written image-text pairs to improve the general capabilities of our model. To evaluate the proposed approach in real-world scenarios, we develop a new evaluation benchmark MMHAL-BENCH with a special focus on penalizing hallucinations. As the first LMM trained with RLHF, our approach achieves remarkable improvement on the LLaVA-Bench dataset with the 96% performance level of the text-only GPT-4 (while previous best methods can only achieve the 87% level), and an improvement of 60% on MMHAL-BENCH over other baselines. @@ -16686,7 +16686,7 @@ NeerajVarshney PavelDolin AgastyaSeth - ChittaBaralArizona State University + ChittaBaralArizona State University 13111-13128 As Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications, their safety concerns become critical areas of NLP research. This has resulted in the development of various LLM defense strategies. Unfortunately, despite the shared goal of improving the safety of LLMs, the evaluation suites across various research works are disjoint and lack diverse inputs to ensure accurate and precise evaluation estimates. Furthermore, the important factor of ‘over-defensiveness’ on the safe inputs has largely remained overlooked. 
Addressing these limitations, this paper presents a systematic evaluation, comparison, and analysis of various LLM defense strategies over both ‘safety’ and ‘over-defensiveness’. To this end, we compile a large and diverse collection of safe and unsafe prompts, design precise evaluation methodology, and study the efficacy of various LLM defense strategies on multiple state-of-the-art LLMs. Our work reveals a number of crucial findings that we believe will pave the way and also facilitate further research in the critical area of improving the safety of LLMs. 2024.findings-acl.776 @@ -16707,12 +16707,12 @@ <tex-math>360^\circ</tex-math><fixed-case>REA</fixed-case>: Towards A Reusable Experience Accumulation with <tex-math>360^\circ</tex-math> Assessment for Multi-Agent System - ShenGaoUniversity of Electronic Science and Technology of China + ShenGaoUniversity of Electronic Science and Technology of China HaoLi ZhengliangShi ChengruiHuang - QuanTu - ShuoShang + QuanTu + ShuoShang ZhiliangTianNational University of Defense Technology MinlieHuangTsinghua University, Tsinghua University 13149-13162 @@ -16724,7 +16724,7 @@ Extracting Polymer Nanocomposite Samples from Full-Length Documents GhazalKhalighinejadDepartment of Computer Science, Duke University DefneCirci - L.Brinson + L.Brinson BhuwanDhingraDuke University 13163-13175 This paper investigates the use of large language models (LLMs) for extracting sample lists of polymer nanocomposites (PNCs) from full-length materials science research papers. The challenge lies in the complex nature of PNC samples, which have numerous attributes scattered throughout the text. 
The complexity of annotating detailed information on PNCs limits the availability of data, making conventional document-level relation extraction techniques impractical due to the challenge in creating comprehensive named entity span annotations.To address this, we introduce a new benchmark and an evaluation technique for this task and explore different prompting strategies in a zero-shot manner. We also incorporate self-consistency to improve the performance. Our findings show that even advanced LLMs struggle to extract all of the samples from an article. Finally, we analyze the errors encountered in this process, categorizing them into three main challenges, and discuss potential strategies for future research to overcome them. @@ -16753,7 +16753,7 @@ Toucan: Many-to-Many Translation for 150 <fixed-case>A</fixed-case>frican Language Pairs AbdelRahimElmadanyUniversity of British Columbia IfeAdebara - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 13189-13206 We address a notable gap in Natural Language Processing (NLP) by introducing a collection of resources designed to improve Machine Translation (MT) for low-resource languages, with a specific focus on African languages. First, We introduce two language models (LMs), Cheetah-1.2B and Cheetah-3.7B, with 1.2 billion and 3.7 billion parameters respectively. Next, we finetune the aforementioned models to create Toucan, an Afrocentric machine translation model designed to support 156 African language pairs. To evaluate Toucan, we carefully develop an extensive machine translation benchmark, dubbed Afro-Lingu-MT, tailored for evaluating machine translation. Toucan significantly outperforms other models, showcasing its remarkable performance on MT for African languages. Finally, we train a new model, spBLEU-1K, to enhance translation evaluation metrics, covering 1K languages, including African languages. 
This work aims to advance the field of NLP, fostering cross-cultural understanding and knowledge exchange, particularly in regions with limited language resources such as Africa. 2024.findings-acl.781 @@ -16768,7 +16768,7 @@ YoshinoriMaedaSony Group Corporation KeiichiYamadaSony Group Corporation HiromiWakakiSony Group Corporation - JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego 13207-13219 We consider the task of building a dialogue system that can motivate users to adopt positive lifestyle changes, Motivational Interviewing (MI). Addressing such a task requires a system that could infer how to motivate the user effectively. We propose DIIR, a framework that is capable of learning and applying conversation strategies in the form of natural language inductive rules from expert demonstrations. Automatic and human evaluation on instruction-following large language models show natural language strategies descriptions discovered by DIIR can improve active listening skills, reduce unsolicited advice, and promote more collaborative and less authoritative conversations, outperforming in-context demonstrations that are over 50 times longer. 
2024.findings-acl.782 @@ -16779,7 +16779,7 @@ Evaluating Structural Generalization in Neural Machine Translation RyomaKumon DaikiMatsuoka - HitomiYanakathe University of Tokyo + HitomiYanakathe University of Tokyo 13220-13239 Compositional generalization refers to the ability to generalize to novel combinations of previously observed words and syntactic structures.Since it is regarded as a desired property of neural models, recent work has assessed compositional generalization in machine translation as well as semantic parsing.However, previous evaluations with machine translation have focused mostly on lexical generalization (i.e., generalization to unseen combinations of known words).Thus, it remains unclear to what extent models can translate sentences that require structural generalization (i.e., generalization to different sorts of syntactic structures).To address this question, we construct SGET, a machine translation dataset covering various types of compositional generalization with control of words and sentence structures.We evaluate neural machine translation models on SGET and show that they struggle more in structural generalization than in lexical generalization.We also find different performance trends in semantic parsing and machine translation, which indicates the importance of evaluations across various tasks. 2024.findings-acl.783 @@ -16811,9 +16811,9 @@ Improving Machine Translation with Large Language Models: A Preliminary Study with Cooperative Decoding JialiZeng - FandongMengWeChat AI, Tencent Inc. + FandongMengWeChat AI, Tencent Inc. 
YongjingYin - JieZhou + JieZhou 13275-13288 Contemporary translation engines based on the encoder-decoder framework have made significant strides in development.However, the emergence of Large Language Models (LLMs) has disrupted their position by presenting the potential for achieving superior translation quality.To uncover the circumstances in which LLMs excel and explore how their strengths can be harnessed to enhance translation quality,we first conduct a comprehensive analysis to assess the strengths and limitations of various commercial NMT systems and MT-oriented LLMs. Our findings indicate that neither NMT nor MT-oriented LLMs alone can effectively address all the translation issues, but MT-oriented LLMs show promise as a complementary solution to NMT systems.Building upon these insights, we propose Cooperative Decoding (CoDec), which treats NMT systems as a pretranslation model and MT-oriented LLMs as a supplemental solution to handle complex scenarios beyond the capability of NMT alone.Experimental results on the WMT22 test sets and a newly collected test set WebCrawl demonstrate the effectiveness and efficiency of CoDec, highlighting its potential as a robust solution for combining NMT systems with MT-oriented LLMs in the field of machine translation. 
2024.findings-acl.786 @@ -16859,14 +16859,14 @@ <fixed-case>S</fixed-case>ec<fixed-case>F</fixed-case>ormer: Fast and Accurate Privacy-Preserving Inference for Transformer Models via <fixed-case>SMPC</fixed-case> - JinglongLuo + JinglongLuo YehongZhangPeng Cheng Laboratory - ZhuoZhangHarbin Institute of Technology + ZhuoZhangHarbin Institute of Technology JiaqiZhangPengCheng Laboratory XinMu HuiWang YueYuNational University of Defense Technology and PengCheng Lab - ZenglinXuFudan University + ZenglinXuFudan University 13333-13348 2024.findings-acl.790 luo-etal-2024-secformer @@ -16886,12 +16886,12 @@ History-Aware Conversational Dense Retrieval - FengranMo + FengranMo ChenQu KelongMao - TianyuZhu - ZhanSu - KaiyuHuangBeijing Jiaotong University + TianyuZhu + ZhanSu + KaiyuHuangBeijing Jiaotong University Jian-YunNieUniversity of Montreal 13366-13378 Conversational search facilitates complex information retrieval by enabling multi-turn interactions between users and the system. Supporting such interactions requires a comprehensive understanding of the conversational inputs to formulate a good search query based on historical information. In particular, the search query should include the relevant information from the previous conversation turns.However, current approaches for conversational dense retrieval primarily rely on fine-tuning a pre-trained ad-hoc retriever using the whole conversational search session, which can be lengthy and noisy. 
Moreover, existing approaches are limited by the amount of manual supervision signals in the existing datasets.To address the aforementioned issues, we propose a **H**istory-**A**ware **Conv**ersational **D**ense **R**etrieval (HAConvDR) system, which incorporates two ideas: context-denoised query reformulation and automatic mining of supervision signals based on the actual impact of historical turns.Experiments on two public conversational search datasets demonstrate the improved history modeling capability of HAConvDR, in particular for long conversations with topic shifts. @@ -16902,11 +16902,11 @@ Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models YikaiZhang - QianyuHeFudan University + QianyuHeFudan University XintaoWang SiyuYuan JiaqingLiangFudan University - YanghuaXiaoFudan University + YanghuaXiaoFudan University 13379-13389 Multi-Modal Knowledge Graphs (MMKGs) have proven valuable for various downstream tasks. However, scaling them up is challenging because building large-scale MMKGs often introduces mismatched images (i.e., noise). Most entities in KGs belong to the long tail, meaning there are few images of them available online. This scarcity makes it difficult to determine whether a found image matches the entity. To address this, we draw on the Triangle of Reference Theory and suggest enhancing vision-language models with concept guidance. Specifically, we introduce COG, a two-stage framework with COncept-Guided vision-language models. The framework comprises a Concept Integration module, which effectively identifies image-text pairs of long-tailed entities, and an Evidence Fusion module, which offers explainability and enables human verification. To demonstrate the effectiveness of COG, we create a dataset of 25k image-text pairs of long-tailed entities. 
Our comprehensive experiments show that COG not only improves the accuracy of recognizing long-tailed image-text pairs compared to baselines but also offers flexibility and explainability. 2024.findings-acl.793 @@ -16915,10 +16915,10 @@ <fixed-case>Z</fixed-case>ero<fixed-case>S</fixed-case>tance: Leveraging <fixed-case>C</fixed-case>hat<fixed-case>GPT</fixed-case> for Open-Domain Stance Detection via Dataset Generation - ChenyeZhao - YingjieLiWestlake University + ChenyeZhao + YingjieLiWestlake University CorneliaCarageaUniversity of Illinois, Chicago - YueZhangWestlake University + YueZhangWestlake University 13390-13405 Zero-shot stance detection that aims to detect the stance (typically against, favor, or neutral) towards unseen targets has attracted considerable attention. However, most previous studies only focus on targets from a single or limited text domains (e.g., financial domain), and thus zero-shot models cannot generalize well to unseen targets of diverse domains (e.g., political domain). In this paper, we consider a more realistic task, i.e., open-domain stance detection, which aims at training a model that is able to generalize well to unseen targets across multiple domains of interest. Particularly, we propose a novel dataset generation method ZeroStance, which leverages ChatGPT to construct a synthetic open-domain dataset CHATStance that covers a wide range of domains. We then train an open-domain model on our synthetic dataset after proper data filtering. Extensive results indicate that our model, when trained on this synthetic dataset, shows superior generalization to unseen targets of diverse domains over baselines on most benchmarks. Our method requires only a task description in the form of a prompt and is much more cost-effective and data-efficient than previous methods. We will release our code and data to facilitate future research. 
2024.findings-acl.794 @@ -16928,7 +16928,7 @@ Boosting Zero-Shot Crosslingual Performance using <fixed-case>LLM</fixed-case>-Based Augmentations with Effective Data Selection BarahFazili - AshishAgrawal + AshishAgrawal PreethiJyothiIndian Institute of Technology Bombay 13406-13422 Large language models (LLMs) are very proficient text generators. We leverage this capability of LLMs to generate task-specific data via zero-shot prompting and promote cross-lingual transfer for low-resource target languages. Given task-specific data in a source language and a teacher model trained on this data, we propose using this teacher to label LLM generations and employ a set of simple data selection strategies that use the teacher’s label probabilities. Our data selection strategies help us identify a representative subset of diverse generations that help boost zero-shot accuracies while being efficient, in comparison to using all the LLM generations (without any subset selection). We also highlight other important design choices that affect cross-lingual performance such as the use of translations of source data and what labels are best to use for the LLM generations. We observe significant performance gains across sentiment analysis and natural language inference tasks (of up to a maximum of 7.13 absolute points and 1.5 absolute points on average) across a number of target languages (Hindi, Marathi, Urdu, Swahili) and domains. 
@@ -16938,11 +16938,11 @@ Reinforcement Tuning for Detecting Stances and Debunking Rumors Jointly with Large Language Models - RuichaoYang + RuichaoYang WeiGaoSingapore Management University JingMaHong Kong Baptist University - HongzhanLinHong Kong Baptist University - BoWangSchool of Artificial Intelligence, Jilin University + HongzhanLinHong Kong Baptist University + BoWangSchool of Artificial Intelligence, Jilin University 13423-13439 Learning multi-task models for jointly detecting stance and verifying rumors poses challenges due to the need for training data of stance at post level and rumor veracity at claim level, which are difficult to obtain. To address this issue, we leverage large language models (LLMs) as the foundation annotators for the joint stance detection (SD) and rumor verification (RV) tasks, dubbed as JSDRV. We introduce a novel reinforcement tuning framework to enhance the joint predictive capabilities of LLM-based SD and RV components. Specifically, we devise a policy for selecting LLM-annotated data at the two levels, employing a hybrid reward mechanism to choose high-quality labels for effective LLM fine-tuning on both tasks. Results demonstrate that JSDRV improves the capabilities of LLMs in the joint tasks, not only outperforming state-of-the-art methods but also generalizing to non-LLMs accommodated as task models. 2024.findings-acl.796 @@ -16953,7 +16953,7 @@ Exploring the Potential of Dense Information in Multimodal Alignment ZhiyuanFan ZhihongChenStanford University - BenyouWangThe Chinese University of Hong Kong, Shenzhen + BenyouWangThe Chinese University of Hong Kong, Shenzhen 13440-13451 Despite the success of data augmentation in improving CLIP model, existing methods that utilize LLM or SAM to enrich the information in captions still suffer from several limitations, including insufficient detail and excessive hallucinations, ultimately resulting in compromised alignment and masking the true potential of dense information. 
This can lead to erroneous conclusions about CLIP’s ability to handle rich data, impeding the development of more effective models. To address the limitations of existing methods, we introduce a novel pipeline that generates highly detailed, factually accurate captions for images, which facilitates in-depth analysis of the potential for dense information in multimodal alignment. Contrary to previous findings, our investigation revealed that lengthening captions boosts performance across diverse benchmarks, even surpassing the effectiveness of meticulously crafted hard negative samples. Building on these insights, DELIP is introduced, demonstrably enhancing both foundational multimodal alignment and compositional reasoning abilities. Finally, we explore strategies to expand the context window of the text encoder, unlocking the potential of richer data for CLIP and paving the way for advancements in leveraging dense information for multimodal alignment. 2024.findings-acl.797 @@ -16975,7 +16975,7 @@ <fixed-case>I</fixed-case>nstruct<fixed-case>E</fixed-case>val: Instruction-Tuned Text Evaluator from Human Preference WenhaoWu - WeiLiInstitute of Computing Technology, Chinese Academy of Sciences + WeiLiInstitute of Computing Technology, Chinese Academy of Sciences XinyanXiaoBaidu JiachenLiuBaidu Inc. SujianLiPeking University @@ -16989,7 +16989,7 @@ A Curious Case of Searching for the Correlation between Training Data and Adversarial Robustness of Transformer Textual Models DangCuong DungLeVinUniversity - ThaiLeIndiana University + ThaiLeIndiana University 13475-13491 Existing works have shown that fine-tuned textual transformer models achieve state-of-the-art prediction performances but are also vulnerable to adversarial text perturbations. Traditional adversarial evaluation is often done only after fine-tuning the models and ignoring the training data. In this paper, we want to prove that there is also a strong correlation between training data and model robustness. 
To this end, we extract 13 different features representing a wide range of input fine-tuning corpora properties and use them to predict the adversarial robustness of the fine-tuned models. Focusing mostly on encoder-only transformer models BERT and RoBERTa with additional results for BART, ELECTRA and GPT2, we provide diverse evidence to support our argument. First, empirical analyses show that (a) extracted features can be used with a lightweight classifier such as Random Forest to effectively predict the attack success rate and (b) features with the most influence on the model robustness have a clear correlation with the robustness. Second, our framework can be used as a fast and effective additional tool for robustness evaluation since it (a) saves 30x-193x runtime compared to the traditional technique, (b) is transferable across models, (c) can be used under adversarial training, and (d) robust to statistical randomness. Our code is publicly available at https://github.com/CaptainCuong/RobustText_ACL2024. 2024.findings-acl.800 @@ -16998,12 +16998,12 @@ <fixed-case>I</fixed-case>nstruct<fixed-case>G</fixed-case>raph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment - JianingWang + JianingWang JundaWu - YupengHouUniversity of California, San Diego - YaoLiuEast China Normal University - MingGao - JulianMcAuleyUniversity of California, San Diego, University of California, San Diego + YupengHouUniversity of California, San Diego + YaoLiuEast China Normal University + MingGao + JulianMcAuleyUniversity of California, San Diego, University of California, San Diego 13492-13510 Do current large language models (LLMs) better solve graph reasoning and generation tasks with parameter updates? In this paper, we propose InstructGraph, a framework that empowers LLMs with the abilities of graph reasoning and generation by instruction tuning and preference alignment. 
Specifically, we first propose a structured format verbalizer to unify all graph data into a universal code-like format, which can simply represent the graph without any external graph-specific encoders. Furthermore, a graph instruction tuning stage is introduced to guide LLMs in solving graph reasoning and generation tasks. Finally, we identify potential hallucination problems in graph tasks and sample negative instances for preference alignment, the target of which is to enhance the output’s reliability of the model. Extensive experiments across multiple graph-centric tasks exhibit that InstructGraph can achieve the best performance and outperform GPT-4 and LLaMA2 by more than 13% and 38%, respectively. 2024.findings-acl.801 @@ -17027,13 +17027,13 @@ Competition-Level Problems are Effective <fixed-case>LLM</fixed-case> Evaluators YimingHuang ZhenghaoLin - XiaoLiuMicrosoft Research Asia + XiaoLiuMicrosoft Research Asia YeyunGong ShuaiLuMicrosoft FangyuLei YaoboLiang YelongShenMicrosoft - ChenLinXiamen University + ChenLinXiamen University NanDuanMicrosoft Research Asia WeizhuChenMicrosoft GenAI 13526-13544 @@ -17044,12 +17044,12 @@ Large Language Models for Automated Open-domain Scientific Hypotheses Discovery - ZonglinYang + ZonglinYang XinyaDuUniversity of Texas at Dallas - JunxianLiNanyang Technological University + JunxianLiNanyang Technological University JieZheng SoujanyaPoriaSingapore University of Technology and Design - ErikCambriaNanyang Technological University + ErikCambriaNanyang Technological University 13545-13565 Hypothetical induction is recognized as the main reasoning type when scientists make observations about the world and try to propose hypotheses to explain those observations. 
Past research on hypothetical induction is under a constrained setting: (1) the observation annotations in the dataset are carefully manually handpicked sentences (resulting in a close-domain setting); and (2) the ground truth hypotheses are mostly commonsense knowledge, making the task less challenging. In this work, we tackle these problems by proposing the first dataset for social science academic hypotheses discovery, with the final goal to create systems that automatically generate valid, novel, and helpful scientific hypotheses, given only a pile of raw web corpus. Unlike previous settings, the new dataset requires (1) using open-domain data (raw web corpus) as observations; and (2) proposing hypotheses even new to humanity. A multi-module framework is developed for the task, including three different feedback mechanisms to boost performance, which exhibits superior performance in terms of both GPT-4 based and expert-based evaluation. To the best of our knowledge, this is the first work showing that LLMs are able to generate novel (“not existing in literature”) and valid (“reflecting reality”) scientific hypotheses. 2024.findings-acl.804 @@ -17068,11 +17068,11 @@ Training a Better <fixed-case>C</fixed-case>hinese Spelling Correction Model via Prior-knowledge Guided Teacher - ChiWei - ShaobinHuang - RongshengLiHarbin Engineering University - NaiyuYan - RuiWang + ChiWei + ShaobinHuang + RongshengLiHarbin Engineering University + NaiyuYan + RuiWang 13578-13589 Recent advancements in Chinese Spelling Correction (CSC) predominantly leverage pre-trained language models (PLMs). However, a notable challenge with fine-tuned PLM-based CSC models is their tendency to over-correct, leading to poor generalization for error patterns outside the standard distribution. To address this, we developed a teacher network guided by prior knowledge for distillation learning of CSC models.
Unlike traditional teacher networks, which depend on task-related pre-training, our method infuses task-related prior information into the teacher network, offering guidance beyond mere labels to the student network. This strategy significantly enhances the CSC model’s language modeling capabilities, crucial for minimizing over-correction. Importantly, our approach is model-independent and the teacher network does not require task-related pre-training, making it broadly applicable for enhancing various PLM-based CSC models with minimal additional computational resources. Extensive experiments on widely used benchmarks demonstrate that our method achieves new state-of-the-art results. Additionally, we explored the potential of generalizing our method to other non-autoregressive text-generation tasks. 2024.findings-acl.806 @@ -17081,14 +17081,14 @@ The Revolution of Multimodal Large Language Models: A Survey - DavideCaffagni - FedericoCocchiUniversity of Pisa - LucaBarsellotti - NicholasMoratelli - SaraSarto - LorenzoBaraldi + DavideCaffagni + FedericoCocchiUniversity of Pisa + LucaBarsellotti + NicholasMoratelli + SaraSarto + LorenzoBaraldi LorenzoBaraldiUniversità degli Studi di Modena e Reggio Emilia - MarcellaCorniaUniversity of Modena and Reggio Emilia + MarcellaCorniaUniversity of Modena and Reggio Emilia RitaCucchiaraUniversità di Modena e Reggio Emilia 13590-13618 Connecting text and visual modalities plays an essential role in generative intelligence. For this reason, inspired by the success of large language models, significant research efforts are being devoted to the development of Multimodal Large Language Models (MLLMs). These models can seamlessly integrate visual and textual modalities, while providing a dialogue-based interface and instruction-following capabilities. In this paper, we provide a comprehensive review of recent visual-based MLLMs, analyzing their architectural choices, multimodal alignment strategies, and training techniques. 
We also conduct a detailed analysis of these models across a wide range of tasks, including visual grounding, image generation and editing, visual understanding, and domain-specific applications. Additionally, we compile and describe training datasets and evaluation benchmarks, conducting comparisons among existing models in terms of performance and computational requirements. Overall, this survey offers a comprehensive overview of the current state of the art, laying the groundwork for future MLLMs. @@ -17098,7 +17098,7 @@ <fixed-case>OOP</fixed-case>: Object-Oriented Programming Evaluation Benchmark for Large Language Models - ShuaiWang + ShuaiWang LiangDing LiShenSun Yat-Sen University YongLuoWuhan University @@ -17113,15 +17113,15 @@ Code Needs Comments: Enhancing Code <fixed-case>LLM</fixed-case>s with Comment Augmentation DeminSongShanghai AI Laboratory - HonglinGuoFudan University + HonglinGuoFudan University YunhuaZhou ShuhaoXing YudongWangShanghai AI Laboratory ZifanSongTongji University - WenweiZhangShanghai AI Laboratory + WenweiZhangShanghai AI Laboratory QipengGuoShanghai AI Laboratory HangYanAI lab - XipengQiuFudan University + XipengQiuFudan University DahuaLinThe Chinese University of Hong Kong 13640-13656 The programming skill is one crucial ability for Large Language Models (LLMs), necessitating a deep understanding of programming languages (PLs) and their correlation with natural languages (NLs). We examine the impact of pre-training data on code-focused LLMs’ performance by assessing the comment density as a measure of PL-NL alignment. Given the scarcity of code-comment aligned data in pre-training corpora, we introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language. 
We conducted experiments on three code-focused LLMs and observed consistent improvements in performance on two widely-used programming skill benchmarks. Notably, the model trained on the augmented data outperformed both the model used for generating comments and the model further trained on the data without augmentation. @@ -17133,8 +17133,8 @@ Efficient Domain Adaptation for Non-Autoregressive Machine Translation WangJieYou PeiGuo - JuntaoLiSoochow University, China - KehaiChenHarbin Institute of Technology (Shenzhen) + JuntaoLiSoochow University, China + KehaiChenHarbin Institute of Technology (Shenzhen) MinZhangHarbin Institute of Technology, Shenzhen 13657-13670 Domain adaptation remains a challenge in the realm of Neural Machine Translation (NMT), even in the era of large language models (LLMs). Existing non-parametric approaches like nearest neighbor machine translation have made small Autoregressive Translation (AT) models achieve efficient domain generalization and adaptation without updating parameters, but leaving the Non-Autoregressive Translation (NAT) counterparts under-explored. To fill this blank, we introduce Bi-kNN, an innovative and efficient domain adaptation approach for NAT models that tailors a k-nearest-neighbor algorithm for NAT. Specifically, we introduce an effective datastore construction and correlated updating strategies to conform the parallel nature of NAT. Additionally, we train a meta-network that seamlessly integrates the NN distribution with the NMT distribution robustly during the iterative decoding process of NAT. Our experimental results across four benchmark datasets demonstrate that our Bi-kNN not only achieves significant improvements over the Base-NAT model (7.8 BLEU on average) but also exhibits enhanced efficiency. 
@@ -17146,8 +17146,8 @@ Exploring Reversal Mathematical Reasoning Ability for Large Language Models PeiGuo WangJieYou - JuntaoLiSoochow University, China + JuntaoLiSoochow University, China - YanBowen + YanBowen MinZhangHarbin Institute of Technology, Shenzhen 13671-13685 Large language models (LLMs) have presented remarkable capabilities in the wide range of natural language understanding and reasoning tasks. Despite their success, a few works indicate that LLMs suffer from the “reversal curse”, in which LLMs can’t employ the inverted structure “B is A” when they are trained based on “A is B”. To explore the effect of the “reversal curse” for LLMs on complex mathematical reasoning tasks, we present two reversal datasets upon GSM8K and MathQA and verify that LLMs also struggle to solve reversal mathematical problems. We analyze the potential reason and attribute it to the insufficient modeling of the relationship between reasoning steps caused by the left-to-right objective. Consequently, based on the characteristics of multi-step reasoning, we design a novel training method to improve the general and reversal reasoning abilities. Finally, we conduct experiments on four mathematical datasets, and the results demonstrate that our method significantly improves the general reasoning capacities and alleviates the reversal problem. Our datasets and codes are available at https://github.com/AllForward/ReversalMath. @@ -17157,10 +17157,10 @@ A Unified Joint Approach with Topological Context Learning and Rule Augmentation for Knowledge Graph Completion - JingtaoGuo + JingtaoGuo ChunxiaZhangSchool of Computer Science and Technology, Beijing Institute of Technology LingxiLi - XiaojunXue + XiaojunXue ZhendongNiuBeijing Institute of Technology 13686-13696 Knowledge graph completion (KGC) task is to infer the missing knowledge in the knowledge graph based on known factual triples. However, present KGC approaches still face the following two challenges.
Those methods perform simple linear update on relation representation, and only local neighborhood information is aggregated, which makes it difficult to capture logic semantic between relations and global topological context information. To tackle the above challenges, we propose a unified joint approach with Topological Context learning and Rule Augmentation (TCRA) for KGC. The TCRA framework consists of an entity topological context learning mechanism based on dual-branch hierarchical graph attention network, and a relation rule context learning mechanism based on Rule-Transformer and rule-to-relation aggregator. The former mechanism encodes the topological structure features of entities, aggregates the local neighborhood topological context information of entities on the three levels (entity, relation and triple), and build clusters of global head or tail entities related to the same relation. It can capture the local and global topological context information of entities related to the same relation. The latter mechanism introduces chain-like Horn rules as the context information of relations, and encodes the logical semantic of relations to enrich the relation representation. Experimental performances on three benchmark datasets FB15k-237, WN18RR and Kinship indicate the effectiveness and superiority of our proposed approach. The codes are publicly available. 
@@ -17174,7 +17174,7 @@ MohitIyyerUniversity of Massachusetts Amherst XuezhiWangGoogle NoahConstant - JerryWeiAnthropic and Stanford University + JerryWeiAnthropic and Stanford University JasonWeiOpenAI ChrisTar Yun-HsuanSungGoogle @@ -17191,7 +17191,7 @@ <fixed-case>ROSE</fixed-case> Doesn’t Do That: Boosting the Safety of Instruction-Tuned Large Language Models with Reverse Prompt Contrastive Decoding QihuangZhong LiangDing - JuhuaLiuWuhan University + JuhuaLiuWuhan University BoDuWuhan University DachengTaoUniversity of Sydney 13721-13736 @@ -17206,7 +17206,7 @@ JingpingLiuEast China University of Science and Technology SihangJiangFudan University HaiyunJiangSUN YAT-SEN UNIVERSITY - YanghuaXiaoFudan University + YanghuaXiaoFudan University JiaqingLiangFudan University ZujieLiangAnt Group FengWei @@ -17224,9 +17224,9 @@ YingqianMin KunZhouRenmin University of China DaweiGaoAlibaba Group - XinZhaoRenmin University of China + XinZhaoRenmin University of China HeHuRenmin University of China, Renmin University of China - YaliangLiAlibaba Group + YaliangLiAlibaba Group 13748-13761 Recently, multi-task instruction tuning has been utilized to improve sentence representation learning (SRL). It enables SRL models to generate task-specific representations with the guidance of task instruction, thus exhibiting strong generalization ability on unseen tasks. However, these methods mostly neglect the potential interference problems across different tasks and instances, which may affect the training of the model. To address this issue, we propose a data curriculum method, namely **Data-CUBE**, that arranges the order of all the multi-task data for training, to minimize the interference risks from two aspects. At the task level, we aim to find the optimal task order to minimize the total cross-task interference risk and formulate this problem as the traveling salesman problem, which is further solved by a specially designed simulated annealing algorithm.
At the instance level, we propose a measurement method to quantify the difficulty of all instances per task, and then arrange instances in an easy-to-difficult order for training. Experimental results show that our approach can boost the performance of state-of-the-art methods. Our code and data will be publicly released. 2024.findings-acl.816 @@ -17236,8 +17236,8 @@ Combating Label Sparsity in Short Text Topic Modeling via Nearest Neighbor Augmentation YangLin - XinyuMa - XinGaoPeking University + XinyuMa + XinGaoPeking University RuiqingLi YashaWang XuChu @@ -17251,7 +17251,7 @@ <fixed-case>R</fixed-case>efute<fixed-case>B</fixed-case>ench: Evaluating Refuting Instruction-Following for Large Language Models JianhaoYanWestlake University YunLuowestlake university - YueZhangWestlake University + YueZhangWestlake University 13775-13791 The application scope of large language models (LLMs) is increasingly expanding. In practical use, users might provide feedback based on the model’s output, hoping for a responsive model that can complete responses according to their feedback. Whether the model can appropriately respond to users’ refuting feedback and consistently follow through with execution has not been thoroughly analyzed. In light of this, this paper proposes a comprehensive benchmark, RefuteBench, covering tasks such as question answering, machine translation, and email writing. The evaluation aims to assess whether models can positively accept feedback in form of refuting instructions and whether they can consistently adhere to user demands throughout the conversation. We conduct evaluations on numerous LLMs and find that LLMs are stubborn, i.e. exhibit inclination to their internal knowledge, often failing to comply with user feedback. Additionally, as the length of the conversation increases, models gradually forget the user’s stated feedback and roll back to their own responses.
We further propose a recall-and-repeat prompts as a simple and effective way to enhance the model’s responsiveness to feedback. 2024.findings-acl.818 @@ -17271,9 +17271,9 @@ Argument-Based Sentiment Analysis on Forward-Looking Statements Chin-YiLinNational Taiwan University - Chung-ChiChenAIST, National Institute of Advanced Industrial Science and Technology - Hen-HsenHuangInstitute of Information Science, Academia Sinica - Hsin-HsiChenNational Taiwan University + Chung-ChiChenAIST, National Institute of Advanced Industrial Science and Technology + Hen-HsenHuangInstitute of Information Science, Academia Sinica + Hsin-HsiChenNational Taiwan University 13804-13815 This paper introduces a novel approach to analyzing the forward-looking statements in equity research reports by integrating argument mining with sentiment analysis. Recognizing the limitations of traditional models in capturing the nuances of future-oriented analysis, we propose a refined categorization of argument units into claims, premises, and scenarios, coupled with a unique sentiment analysis framework. Furthermore, we incorporate a temporal dimension to categorize the anticipated impact duration of market events. To facilitate this study, we present the Equity Argument Mining and Sentiment Analysis (Equity-AMSA) dataset. Our research investigates the extent to which detailed domain-specific annotations can be provided, the necessity of fine-grained human annotations in the era of large language models, and whether our proposed framework can improve performance in downstream tasks over traditional methods. Experimental results reveal the significance of manual annotations, especially for scenario identification and sentiment analysis. The study concludes that our annotation scheme and dataset contribute to a deeper understanding of forward-looking statements in equity research reports. 
2024.findings-acl.820 @@ -17283,9 +17283,9 @@ Paying More Attention to Source Context: Mitigating Unfaithful Translations from Large Language Model HongbinZhang - KehaiChenHarbin Institute of Technology (Shenzhen) + KehaiChenHarbin Institute of Technology (Shenzhen) XuefengBai - YangXiang + YangXiang MinZhangHarbin Institute of Technology, Shenzhen 13816-13836 Large language models (LLMs) have showcased their remarkable capabilities to handle various downstream tasks, including multilingual machine translation ability. Despite their impressive performance, decoder-only LLMs lack an explicit alignment between source and target contexts, leading to translation that may not faithfully represent the original content. To address this, we propose three learning strategies to encourage LLMs to pay more attention to the source context during translation: 1) adjusting attention weights on the source context by adaptive attention re-weighting; 2) suppressing the irrelevant target prefix using contrastive decoding; 3) avoiding excessive reliance on the target prefix through target-constrained tuning. To verify the effectiveness of our model, we curate a new dataset specifically focusing on unfaithful translations generated by LLMs. Experimental results on both human-collected and general test sets verify the effectiveness of our model across multiple language pairs. Further human evaluation demonstrates the efficacy of our method in reducing hallucinatory translation and improving the fidelity of translations. @@ -17325,7 +17325,7 @@ MeishanZhangHarbin Institute of Technology (Shenzhen), China and Tianjin University, China XueboLiuHarbin Institute of Technolgy, Shenzhen ZhaocongLi - DerekWongUniversity of Macau + DerekWongUniversity of Macau MinZhangHarbin Institute of Technology, Shenzhen 13868-13881 Tuning-based large language models for machine translation (aka large translation model, LTM) have demonstrated significant performance in the field of machine translation. 
Despite their success, these models often face difficulties in leveraging demonstrations to further improve their performance. To tackle this challenge, we introduce a novel approach that integrates demonstration-aware training and inference strategies within the framework of tuning-based LTMs, hereby referred to as demonstration-aware LTMs. During training, we enrich the model’s learning process by incorporating both sentence- and document-level demonstrations derived from its original training dataset. During inference, the model synergizes its own contextual translations with retrieved high-quality demonstrations, leading to more precise and contextually appropriate outputs. Empirical results reveal that our demonstration-aware LTM not only mitigates the negative impacts traditionally associated with demonstrations but also secures substantial improvements in translation accuracy, particularly in domain-specific and document-level translation tasks. Source code and scripts are freely available at https://github.com/ChenLi0620/Demo-Aware-LLM-MT. 
@@ -17338,7 +17338,7 @@ DohyeonLeeSeoul National University JongyoonKimSeoul National University Seung-wonHwangSeoul National University - JoonsukParkUniversity of Richmond + JoonsukParkUniversity of Richmond 13882-13893 Pre-trained language models (PLMs) exhibit promise in retrieval tasks but struggle with out-of-domain data due to distribution shifts. Addressing this, generative domain adaptation (DA), known as GPL, tackles distribution shifts by generating pseudo queries and labels to train models for predicting query-document relationships in new domains. However, it overlooks the domain distribution, causing the model to struggle with aligning the distribution in the target domain. We, therefore, propose a Distribution-Aware Domain Adaptation (DADA) to guide the model to consider the domain distribution knowledge at the level of both a single document and the corpus, which is referred to as observation-level feedback and domain-level feedback, respectively. Our method effectively adapts the model to the target domain and expands document representation to unseen gold query terms using domain and observation feedback, as demonstrated by empirical results on the BEIR benchmark. 2024.findings-acl.825 @@ -17363,11 +17363,11 @@ FedericoRanaldiUniversity of Roma “Tor Vergata” Elena SofiaRuzzettiUniversità degli Studi di Roma Tor Vergata DarioOnorati“La Sapienza” University of Rome - LeonardoRanaldiIdiap Research Institute + LeonardoRanaldiIdiap Research Institute CristinaGiannone AndreaFavalliAlmawave RanieroRomagnoliUniversity of Roma “La Sapienza” - Fabio MassimoZanzottoUniversity of Rome Tor Vergata + Fabio MassimoZanzottoUniversity of Rome Tor Vergata 13909-13920 Understanding textual description to generate code seems to be an achieved capability of instruction-following Large Language Models (LLMs) in zero-shot scenario.
However, there is a severe possibility that this translation ability may be influenced by having seen target textual descriptions and the related code. This effect is known as Data Contamination. In this study, we investigate the impact of Data Contamination on the performance of GPT-3.5 in the Text-to-SQL code-generating tasks. Hence, we introduce a novel method to detect Data Contamination in GPTs and examine GPT-3.5’s Text-to-SQL performances using the known Spider Dataset and our new unfamiliar dataset Termite. Furthermore, we analyze GPT-3.5’s efficacy on databases with modified information via an adversarial table disconnection (ATD) approach, complicating Text-to-SQL tasks by removing structural pieces of information from the database. Our results indicate a significant performance drop in GPT-3.5 on the unfamiliar Termite dataset, even with ATD modifications, highlighting the effect of Data Contamination on LLMs in Text-to-SQL translation tasks. 2024.findings-acl.827 @@ -17379,7 +17379,7 @@ MubasharaAkhtar NikeshSubedi, University of Utah VivekGuptaUniversity of Pennsylvania, United States - SaharTahmasebiTIB – Leibniz Information Centre for Science and Technology + SaharTahmasebiTIB – Leibniz Information Centre for Science and Technology OanaCocarascuKing’s College London ElenaSimperlKing’s College London 13921-13937 @@ -17391,8 +17391,8 @@ Real World Conversational Entity Linking Requires More Than Zero-Shots MohannaHoveyda - ArjenVriesInstitute for Computing and Information Sciences, Radboud University Nijmegen, Radboud University - FaeghehHasibiRadboud University + ArjenVriesInstitute for Computing and Information Sciences, Radboud University Nijmegen, Radboud University + FaeghehHasibiRadboud University Maartende RijkeUniversity of Amsterdam 13938-13946 Entity linking (EL) in conversations faces notable challenges in practical applications, primarily due to scarcity of entity-annotated conversational datasets and sparse knowledge bases (KB) containing
domain-specific, long-tail entities. We designed targeted evaluation scenarios to measure the efficacy of EL models under resource constraints. Our evaluation employs two KBs: Fandom, exemplifying real-world EL complexities, and the widely used Wikipedia. First, we assess EL models’ ability to generalize to a new unfamiliar KB using Fandom and a novel zero-shot conversational entity linking dataset that we curated based on Reddit discussions on Fandom entities. We then evaluate the adaptability of EL models to conversational settings without prior training. Our results indicate that current zero-shot EL models falter when introduced to new, domain-specific KBs without prior training, significantly dropping in performance. Our findings reveal that previous evaluation approaches fall short of capturing real-world complexities for zero-shot EL, highlighting the necessity for new approaches to design and assess conversational EL models to adapt to limited resources. The evaluation framework and dataset proposed are tailored to facilitate this research.
@@ -17402,16 +17402,16 @@ <fixed-case>CP</fixed-case>sy<fixed-case>C</fixed-case>oun: A Report-based Multi-turn Dialogue Reconstruction and Evaluation Framework for <fixed-case>C</fixed-case>hinese Psychological Counseling - ChenhaoZhangShanghai Artificial Intelligence Laboratory, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences and Huazhong University of Science and Technology + ChenhaoZhangShanghai Artificial Intelligence Laboratory, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences and Huazhong University of Science and Technology RenhaoLiUniversity of Macau - MinghuanTanShenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences + MinghuanTanShenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences MinYangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences - JingweiZhu - DiYang + JingweiZhu + DiYang JiahaoZhao GuanchengYe - ChengmingLiShenzhen MSU-BIT University - XipingHuBeijing Institute of Technology + ChengmingLiShenzhen MSU-BIT University + XipingHuBeijing Institute of Technology 13947-13966 Using large language models (LLMs) to assist psychological counseling is a significant but challenging task at present. Attempts have been made on improving empathetic conversations or acting as effective assistants in the treatment with LLMs. However, the existing datasets lack consulting knowledge, resulting in LLMs lacking professional consulting competence. Moreover, how to automatically evaluate multi-turn dialogues within the counseling process remains an understudied area. To bridge the gap, we propose CPsyCoun, a report-based multi-turn dialogue reconstruction and evaluation framework for Chinese psychological counseling. 
To fully exploit psychological counseling reports, a two-phase approach is devised to construct high-quality dialogues while a comprehensive evaluation benchmark is developed for the effective automatic evaluation of multi-turn psychological consultations. Competitive experimental results demonstrate the effectiveness of our proposed framework in psychological counseling. We open-source the datasets and model for future research. 2024.findings-acl.830 @@ -17422,9 +17422,9 @@ Tox-<fixed-case>BART</fixed-case>: Leveraging Toxicity Attributes for Explanation Generation of Implicit Hate Speech NeemeshYadavIndraprastha Institute of Information Technology, Delhi SarahMasudIndraprastha Institute of Information Technology Delhi (IIIT-Delhi) - VikramGoyalIndraprastha Institute of Information Technology, Delhi + VikramGoyalIndraprastha Institute of Information Technology, Delhi Md ShadAkhtarIndraprastha Institute of Information Technology, Delhi - TanmoyChakrabortyIndian Institute of Technology, Delhi + TanmoyChakrabortyIndian Institute of Technology, Delhi 13967-13983 Employing language models to generate explanations for an incoming implicit hate post is an active area of research. The explanation is intended to make explicit the underlying stereotype and aid content moderators. The training often combines top-k relevant knowledge graph (KG) tuples to provide world knowledge and improve performance on standard metrics. Interestingly, our study presents conflicting evidence for the role of the quality of KG tuples in generating implicit explanations. Consequently, simpler models incorporating external toxicity signals outperform KG-infused models. Compared to the KG-based setup, we observe a comparable performance for SBIC (LatentHatred) datasets with a performance variation of +0.44 (+0.49), +1.83 (-1.56), and -4.59 (+0.77) in BLEU, ROUGE-L, and BERTScore. 
Further human evaluation and error analysis reveal that our proposed setup produces more precise explanations than zero-shot GPT-3.5, highlighting the intricate nature of the task. 2024.findings-acl.831 @@ -17436,9 +17436,9 @@ JamesEnouen HootanNakhost SaynaEbrahimiGoogle - SercanArikGoogle - YanLiuUniversity of Southern California - TomasPfisterGoogle + SercanArikGoogle + YanLiuUniversity of Southern California + TomasPfisterGoogle 13984-14011 Large language models (LLMs) have attracted great interest in many real-world applications; however, their “black-box” nature necessitates scalable and faithful explanations. Shapley values have matured as an explainability method for deep learning, but extending them to LLMs is difficult due to long input contexts and autoregressive output generation. We introduce TextGenSHAP, an efficient post-hoc explanation method incorporating LLM-specific techniques, which leads to significant runtime improvements: token-level explanations in minutes not hours, and document-level explanations within seconds. We demonstrate how such explanations can improve end-to-end performance of retrieval augmented generation by localizing important words within long documents and reranking passages collected by retrieval systems. On various open-domain question answering benchmarks, we show TextGenSHAP improves the retrieval recall and prediction accuracy significantly. 2024.findings-acl.832 @@ -17452,7 +17452,7 @@ ZhaoyeFei HangYanAI lab DahuaLinThe Chinese University of Hong Kong - XipengQiuFudan University + XipengQiuFudan University 14012-14023 Data plays a fundamental role in the training of Large Language Models (LLMs). While attention has been paid to the collection and composition of datasets, determining the data sampling strategy in training remains an open question. Most LLMs are trained with a simple strategy, random sampling. However, this sampling strategy ignores the unbalanced nature of training data distribution, which can be sub-optimal. 
In this paper, we propose ClusterClip Sampling to balance the text distribution of training data for better model training. Specifically, ClusterClip Sampling utilizes data clustering to reflect the data distribution of the training set and balances the common samples and rare samples during training based on the cluster results. A repetition clip operation is introduced to mitigate the overfitting issue led by samples from certain clusters. Extensive experiments validate the effectiveness of ClusterClip Sampling, which outperforms random sampling and other cluster-based sampling variants under various training datasets and large language models. 2024.findings-acl.833 @@ -17478,11 +17478,11 @@ Unsupervised Sign Language Translation and Generation ZhengshengGuo - ZhiweiHeShanghai Jiao Tong University + ZhiweiHeShanghai Jiao Tong University WenxiangJiaoTencent AI Lab - XingWangTencent AI Lab - RuiWangShanghai Jiao Tong University - KehaiChenHarbin Institute of Technology (Shenzhen) + XingWangTencent AI Lab + RuiWangShanghai Jiao Tong University + KehaiChenHarbin Institute of Technology (Shenzhen) ZhaopengTuTencent AI Lab YongXu MinZhangHarbin Institute of Technology, Shenzhen @@ -17494,13 +17494,13 @@ Mitigating Data Scarcity in Semantic Parsing across Languages with the Multilingual Semantic Layer and its Dataset - Abelardo CarlosMartinez LorenzoUniversity of Roma “La Sapienza” - Pere-LluísHuguet Cabot + Abelardo CarlosMartinez LorenzoUniversity of Roma “La Sapienza” + Pere-LluísHuguet Cabot KarimGhonimUniversity of Roma “La Sapienza” - LuXuUniversity of Roma “La Sapienza” + LuXuUniversity of Roma “La Sapienza” Hee-SooChoi AlberteFernández-Castro - RobertoNavigliSapienza University of Rome + RobertoNavigliSapienza University of Rome 14056-14080 Data scarcity is a prevalent challenge in the era of Large Language Models (LLMs). The insatiable hunger of LLMs for large corpora becomes even more pronounced when dealing with non-English and low-resource languages. 
The issue is particularly exacerbated in Semantic Parsing (SP), i.e. the task of converting text into a formal representation. The complexity of semantic formalisms makes training human annotators and subsequent data annotation unfeasible on a large scale, especially across languages. To mitigate this, we first introduce the Multilingual Semantic Layer (MSL), a conceptual evolution of previous formalisms, which decouples from disambiguation and external inventories and simplifies the task. MSL provides the necessary tools to encode the meaning across languages, paving the way for developing a high-quality semantic parsing dataset across different languages in a semi-automatic strategy. Subsequently, we manually refine a portion of this dataset and fine-tune GPT-3.5 to propagate these refinements across the dataset. Then, we manually annotate 1,100 sentences in eleven languages, including low-resource ones. Finally, we assess our dataset’s quality, showcasing the performance gap reduction across languages in Semantic Parsing. 2024.findings-acl.836 @@ -17511,11 +17511,11 @@ Efficient Sparse Attention needs Adaptive Token Release ChaoranZhang LixinZouSchool of Cyber Science and Engineering, Wuhan University - DanLuoLehigh University + DanLuoLehigh University XiangyangLuoState Key Lab of Mathematical Engineering and Advanced Computing ZihaoLiWuhan University MinTangMonash University - ChenliangLi + ChenliangLi 14081-14094 2024.findings-acl.837 zhang-etal-2024-efficient @@ -17533,7 +17533,7 @@ WeihuaPeng DuyuTangTencent AI Lab DandanTu - BingQinHarbin Institute of Technology + BingQinHarbin Institute of Technology 14095-14113 Despite the impressive performance on information-seeking tasks, large language models (LLMs) still struggle with hallucinations. Attributed LLMs, which augment generated text with in-line citations, demonstrate potential in mitigating hallucinations and improving verifiability. 
However, current approaches suffer from suboptimal citation quality due to their reliance on in-context learning. Furthermore, the practice of merely citing document identifiers complicates the process for users to pinpoint specific supporting evidence. In this work, we introduce FRONT, a training framework that teaches LLMs to generate Fine-grained grounded citations. By initially grounding fine-grained supporting quotes, which then guide the generation process, these quotes not only provide supervision signals to improve citation quality but also serve as fine-grained attributions. Experiments on the ALCE benchmark demonstrate the efficacy of FRONT in generating superior grounded responses and highly supportive citations. With LLaMA-2-7B, the framework significantly outperforms all the baselines, achieving an average of 14.21% improvement in citation quality across all datasets, even surpassing ChatGPT. 2024.findings-acl.838 @@ -17543,9 +17543,9 @@ <fixed-case>R</fixed-case>e<fixed-case>L</fixed-case>i<fixed-case>K</fixed-case>: Retrieve and <fixed-case>L</fixed-case>in<fixed-case>K</fixed-case>, Fast and Accurate Entity Linking and Relation Extraction on an Academic Budget RiccardoOrlando - Pere-LluísHuguet Cabot + Pere-LluísHuguet Cabot EdoardoBarbaUniversity of Roma “La Sapienza” - RobertoNavigliSapienza University of Rome + RobertoNavigliSapienza University of Rome 14114-14132 Entity Linking (EL) and Relation Extraction (RE) are fundamental tasks in Natural Language Processing, serving as critical components in a wide range of applications. In this paper, we propose ReLiK, a Retriever-Reader architecture for both EL and RE, where, given an input text, the Retriever module undertakes the identification of candidate entities or relations that could potentially appear within the text. Subsequently, the Reader module is tasked to discern the pertinent retrieved entities or relations and establish their alignment with the corresponding textual spans. 
Notably, we put forward an innovative input representation that incorporates the candidate entities or relations alongside the text, making it possible to link entities or extract relations in a single forward pass and to fully leverage pre-trained language models contextualization capabilities, in contrast with previous Retriever-Reader-based methods, which require a forward pass for each candidate. Our formulation of EL and RE achieves state-of-the-art performance in both in-domain and out-of-domain benchmarks while using academic budget training and with up to 40x inference speed compared to competitors. Finally, we show how our architecture can be used seamlessly for Information Extraction (cIE), i.e. EL + RE, and setting a new state of the art by employing a shared Reader that simultaneously extracts entities and relations. 2024.findings-acl.839 @@ -17570,7 +17570,7 @@ <fixed-case>FENICE</fixed-case>: Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction AlessandroScirè KarimGhonimUniversity of Roma “La Sapienza” - RobertoNavigliSapienza University of Rome + RobertoNavigliSapienza University of Rome 14148-14161 Recent advancements in text summarization, particularly with the advent of Large Language Models (LLMs), have shown remarkable performance. However, a notable challenge persists as a substantial number of automatically-generated summaries exhibit factual inconsistencies, such as hallucinations. In response to this issue, various approaches for the evaluation of consistency for summarization have emerged. Yet, these newly-introduced metrics face several limitations, including lack of interpretability, focus on short document summaries (e.g., news articles), and computational impracticality, especially for LLM-based metrics. 
To address these shortcomings, we propose Factuality Evaluation of summarization based on Natural language Inference and Claim Extraction (FENICE), a more interpretable and efficient factuality-oriented metric. FENICE leverages an NLI-based alignment between information in the source document and a set of atomic facts, referred to as claims, extracted from the summary. Our metric sets a new state of the art on AGGREFACT, the de-facto benchmark for factuality evaluation. Moreover, we extend our evaluation to a more challenging setting by conducting a human annotation process of long-form summarization. In the hope of fostering research in summarization factuality evaluation, we release the code of our metric and our factuality annotations of long-form summarization at https://github.com/Babelscape/FENICE. 2024.findings-acl.841 @@ -17580,11 +17580,11 @@ Self-Para-Consistency: Improving Reasoning Tasks at Low Cost for Large Language Models WenqingChenSUN YAT-SEN UNIVERSITY - WeichengWang + WeichengWang ZhixuanChuAnt Group - KuiRen - ZibinZhengSUN YAT-SEN UNIVERSITY - ZhichaoLu + KuiRen + ZibinZhengSUN YAT-SEN UNIVERSITY + ZhichaoLu 14162-14167 Recently, the self-consistency decoding strategy has shown the ability to improve performance for complex reasoning tasks with large language models (LLMs). However, the costs may be high because the sampling process of the strategy generates some low-probability text, resulting in low-quality reasoning paths. As a consequence, it requires a relatively large sampling number to obtain good aggregation performance. In this paper, we propose an alternative strategy, self-para-consistency. It first generates multiple paraphrases for each test question, then generates reasoning paths for the original and all the paraphrased questions based on greedy decoding, and finally selects the most consistent answer. 
Since all the candidate paths have relatively high probabilities, the sampling number could be much smaller than the self-consistency strategy. Extensive experiments on complex reasoning datasets demonstrate the effectiveness of our method in reducing the sampling number. 2024.findings-acl.842 @@ -17593,7 +17593,7 @@ Looking Right is Sometimes Right: Investigating the Capabilities of Decoder-only <fixed-case>LLM</fixed-case>s for Sequence Labeling - DavidDukić + DavidDukić JanŠnajder 14168-14181 Pre-trained language models based on masked language modeling (MLM) excel in natural language understanding (NLU) tasks. While fine-tuned MLM-based encoders consistently outperform causal language modeling decoders of comparable size, recent decoder-only large language models (LLMs) perform on par with smaller MLM-based encoders. Although their performance improves with scale, LLMs fall short of achieving state-of-the-art results in information extraction (IE) tasks, many of which are formulated as sequence labeling (SL). We hypothesize that LLMs’ poor SL performance stems from causal masking, which prevents the model from attending to tokens on the right of the current token. Yet, how exactly and to what extent LLMs’ performance on SL can be improved remains unclear. We explore techniques for improving the SL performance of open LLMs on IE tasks by applying layer-wise removal of the causal mask (CM) during LLM fine-tuning. This approach yields performance gains competitive with state-of-the-art SL models, matching or outperforming the results of CM removal from all blocks. Our findings hold for diverse SL tasks, demonstrating that open LLMs with layer-dependent CM removal outperform strong MLM-based encoders and even instruction-tuned LLMs. 
@@ -17604,8 +17604,8 @@ m<fixed-case>CSQA</fixed-case>: Multilingual Commonsense Reasoning Dataset with Unified Creation Strategy by Language Models and Humans YusukeSakaiNara Institute of Science and Technology, Japan - HidetakaKamigaitoDivision of Information Science, Nara Institute of Science and Technology - TaroWatanabeNara Institute of Science and Technology, Japan + HidetakaKamigaitoDivision of Information Science, Nara Institute of Science and Technology + TaroWatanabeNara Institute of Science and Technology, Japan 14182-14214 It is very challenging to curate a dataset for language-specific knowledge and common sense in order to evaluate natural language understanding capabilities of language models. Due to the limitation in the availability of annotators, most current multilingual datasets are created through translation, which cannot evaluate such language-specific aspects. Therefore, we propose Multilingual CommonsenseQA (mCSQA) based on the construction process of CSQA but leveraging language models for a more efficient construction, e.g., by asking LM to generate questions/answers, refine answers and verify QAs followed by reduced human efforts for verification. Constructed dataset is a benchmark for cross-lingual language-transfer capabilities of multilingual LMs, and experimental results showed high language-transfer capabilities for questions that LMs could easily solve, but lower transfer capabilities for questions requiring deep knowledge or commonsense. This highlights the necessity of language-specific datasets for evaluation and training. Finally, our method demonstrated that multilingual LMs could create QA including language-specific knowledge, significantly reducing the dataset creation cost compared to manual creation. The datasets are available at https://huggingface.co/datasets/yusuke1997/mCSQA. 
2024.findings-acl.844 @@ -17632,8 +17632,8 @@ YiSu YunpengTai YixinJiSoochow University - JuntaoLiSoochow University, China - YanBowen + JuntaoLiSoochow University, China + YanBowen MinZhangHarbin Institute of Technology, Shenzhen 14232-14244 Large Language Models (LLMs) have demonstrated an impressive capability known as In-context Learning (ICL), which enables them to acquire knowledge from textual demonstrations without the need for parameter updates.However, many studies have highlighted that the model’s performance is sensitive to the choice of demonstrations, presenting a significant challenge for practical applications where we lack prior knowledge of user queries.Consequently, we need to construct an extensive demonstration pool and incorporate external databases to assist the model, leading to considerable time and financial costs.In light of this, some recent research has shifted focus towards zero-shot ICL, aiming to reduce the model’s reliance on external information by leveraging their inherent generative capabilities. Despite the effectiveness of these approaches, the content generated by the model may be unreliable, and the generation process is time-consuming.To address these issues, we propose Demonstration Augmentation for In-context Learning (DAIL), which employs the model’s previously predicted historical samples as demonstrations for subsequent ones.DAIL brings no additional inference cost and does not rely on the model’s generative capabilities.Our experiments reveal that DAIL can significantly improve the model’s performance over direct zero-shot inference and can even outperform few-shot ICL without any external information. @@ -17643,9 +17643,9 @@ Pushing the Limits of Zero-shot End-to-End Speech Translation - IoannisTsiamas + IoannisTsiamas Gerard I.Gállego - José A. R.Fonollosa + José A. 
R.Fonollosa Marta R.Costa-jussà 14245-14267 Data scarcity and the modality gap between the speech and text modalities are two major obstacles of end-to-end Speech Translation (ST) systems, thus hindering their performance. Prior work has attempted to mitigate these challenges by leveraging external MT data and optimizing distance metrics that bring closer the speech-text representations. However, achieving competitive results typically requires some ST data. For this reason, we introduce ZeroSwot, a method for zero-shot ST that bridges the modality gap without any paired ST data. Leveraging a novel CTC compression and Optimal Transport, we train a speech encoder using only ASR data, to align with the representation space of a massively multilingual MT model. The speech encoder seamlessly integrates with the MT model at inference, enabling direct translation from speech to text, across all languages supported by the MT model. Our experiments show that we can effectively close the modality gap without ST data, while our results on MuST-C and CoVoST demonstrate our method’s superiority over not only previous zero-shot models, but also supervised ones, achieving state-of-the-art results. 
@@ -17656,7 +17656,7 @@ <fixed-case>NUMC</fixed-case>o<fixed-case>T</fixed-case>: Numerals and Units of Measurement in Chain-of-Thought Reasoning using Large Language Models AnchengXu - MinghuanTanShenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences + MinghuanTanShenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences LeiWangSalesForce MinYangShenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Chinese Academy of Sciences RuifengXuHarbin Institute of Technology @@ -17668,9 +17668,9 @@ On The Persona-based Summarization of Domain-Specific Documents - AnkanMullick - SombitBoseIndian Institute of Technology Kharagpur, - RounakSaha + AnkanMullick + SombitBoseIndian Institute of Technology Kharagpur, + RounakSaha AyanBhowmickMerlyn Mind Inc. PawanGoyalIIT Kharagpur NiloyGangulyIndian Institute of Technology Kharagpur, @@ -17695,11 +17695,11 @@ Word Sense Linking: Disambiguating Outside the Sandbox - Andrei StefanBejgu + Andrei StefanBejgu EdoardoBarba LuigiProcopio AlberteFernández-Castro - RobertoNavigli + RobertoNavigli 14332-14347 Word Sense Disambiguation (WSD) is the task of associating a word in a given context with its most suitable meaning among a set of possible candidates. While the task has recently witnessed renewed interest, with systems achieving performances above the estimated inter-annotator agreement, at the time of writing it still struggles to find downstream applications. We argue that one of the reasons behind this is the difficulty of applying WSD to plain text. Indeed, in the standard formulation, models work under the assumptions that a) all the spans to disambiguate have already been identified, and b) all the possible candidate senses of each span are provided, both of which are requirements that are far from trivial. 
In this work, we present a new task called Word Sense Linking (WSL) where, given an input text and a reference sense inventory, systems have to both identify which spans to disambiguate and then link them to their most suitable meaning.We put forward a transformer-based architecture for the task and thoroughly evaluate both its performance and those of state-of-the-art WSD systems scaled to WSL, iteratively relaxing the assumptions of WSD. We hope that our work will foster easier integration of lexical semantics into downstream applications. 2024.findings-acl.851 @@ -17733,10 +17733,10 @@ Unsupervised Real-Time Hallucination Detection based on the Internal States of Large Language Models WeihangSu ChangyueWang - QingyaoAiTsinghua University, Tsinghua University + QingyaoAiTsinghua University, Tsinghua University YiranHu - ZhijingWuBeijing Institute of Technology - YujiaZhouTsinghua University, Tsinghua University + ZhijingWuBeijing Institute of Technology + YujiaZhouTsinghua University, Tsinghua University YiqunLiuTsinghua University 14379-14391 Hallucinations in large language models (LLMs) refer to the phenomenon of LLMs producing responses that are coherent yet factually inaccurate. This issue undermines the effectiveness of LLMs in practical applications, necessitating research into detecting and mitigating hallucinations of LLMs. Previous studies have mainly concentrated on post-processing techniques for hallucination detection, which tend to be computationally intensive and limited in effectiveness due to their separation from the LLM’s inference process. To overcome these limitations, we introduce MIND, an unsupervised training framework that leverages the internal states of LLMs for real-time hallucination detection without requiring manual annotations. Additionally, we present HELM, a new benchmark for evaluating hallucination detection across multiple LLMs, featuring diverse LLM outputs and the internal states of LLMs during their inference process. 
Our experiments demonstrate that MIND outperforms existing state-of-the-art methods in hallucination detection. @@ -17761,7 +17761,7 @@ NguyenHung-QuangVinUniversity SauravManchandaAmazon MinlongPengBaidu - Kok-SengWongVinUniversity + Kok-SengWongVinUniversity KhoaDoanVinUniversity 14403-14421 Despite outstanding performance in a variety of Natural Language Processing (NLP) tasks, recent studies have revealed that NLP models are vulnerable to adversarial attacks that slightly perturb the input to cause the models to misbehave. Several attacks can even compromise the model without requiring access to the model architecture or model parameters (i.e., a blackbox setting), and thus are detrimental to existing NLP applications. To perform these attacks, the adversary queries the victim model many times to determine the most important parts in an input text and transform. In this work, we propose a lightweight and attack-agnostic defense whose main goal is to perplex the process of generating an adversarial example in these query-based black-box attacks; that is to fool the textual fooler. This defense, named AdvFooler, works by randomizing the latent representation of the input at inference time. Different from existing defenses, AdvFooler does not necessitate additional computational overhead during training nor does it rely on assumptions about the potential adversarial perturbation set while having a negligible impact on the model’s accuracy. Our theoretical and empirical analyses highlight the significance of robustness resulting from confusing the adversary via randomizing the latent space, as well as the impact of randomization on clean accuracy. Finally, we empirically demonstrate near state-of-the-art robustness of AdvFooler against representative adversarial attacks on two benchmark datasets. 
@@ -17783,8 +17783,8 @@ <fixed-case>FOCUS</fixed-case>: Forging Originality through Contrastive Use in Self-Plagiarism for Language Models KaixinLan - TaoFangUniversity of Macau - DerekWongUniversity of Macau + TaoFangUniversity of Macau + DerekWongUniversity of Macau YaboXu LidiaChao CeciliaZhaoUniversity of Macau, New York University and Ohio State University, Columbus @@ -17797,7 +17797,7 @@ Amanda: Adaptively Modality-Balanced Domain Adaptation for Multimodal Emotion Recognition XinxinZhang - JunSun + JunSun SiminHong TaihaoLiZhejiang Lab 14448-14458 @@ -17809,8 +17809,8 @@ <fixed-case>M</fixed-case>ed<fixed-case>REQAL</fixed-case>: Examining Medical Knowledge Recall of Large Language Models via Question Answering JurajVladikaTechnische Universität München - PhillipSchneider - FlorianMatthesTechnische Universität München + PhillipSchneider + FlorianMatthesTechnische Universität München 14459-14469 In recent years, Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora. They can leverage this knowledge for downstream tasks like question answering (QA), even in complex areas involving health topics. Considering their high potential for facilitating clinical work in the future, understanding the quality of encoded medical knowledge and its recall in LLMs is an important step forward. In this study, we examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews – studies synthesizing evidence-based answers for specific medical questions. Through experiments on the new MedREQAL dataset, comprising question-answer pairs extracted from rigorous systematic reviews, we assess six LLMs, such as GPT and Mixtral, analyzing their classification and generation performance. Our experimental insights into LLM performance on the novel biomedical QA dataset reveal the still challenging nature of this task. 
2024.findings-acl.860 @@ -17823,9 +17823,9 @@ WassaySajjad MukeetRazaLahore University of Management Sciences EmaanAbbas - Abdul HameedAzeemiLahore University of Management Sciences + Abdul HameedAzeemiLahore University of Management Sciences Ihsan AyyubQaziLahore University of Management Sciences - Agha AliRazaLahore University of Management Sciences + Agha AliRazaLahore University of Management Sciences 14470-14480 Deepfakes, particularly in the auditory domain, have become a significant threat, necessitating the development of robust countermeasures. This paper addresses the escalating challenges posed by deepfake attacks on Automatic Speaker Verification (ASV) systems. We present a novel Urdu deepfake audio dataset for deepfake detection, focusing on two spoofing attacks – Tacotron and VITS TTS. The dataset construction involves careful consideration of phonemic cover and balance and comparison with existing corpora like PRUS and PronouncUR. Evaluation with AASIST-L model shows EERs of 0.495 and 0.524 for VITS TTS and Tacotron-generated audios, respectively, with variability across speakers. Further, this research implements a detailed human evaluation, incorporating a user study to gauge whether people are able to discern deepfake audios from real (bonafide) audios. The ROC curve analysis shows an area under the curve (AUC) of 0.63, indicating that individuals demonstrate a limited ability to detect deepfakes (approximately 1 in 3 fake audio samples are regarded as real). Our work contributes a valuable resource for training deepfake detection models in low-resource languages like Urdu, addressing the critical gap in existing datasets. The dataset is publicly available at: https://github.com/CSALT-LUMS/urdu-deepfake-dataset. 
2024.findings-acl.861 @@ -17847,7 +17847,7 @@ MeishanZhangHarbin Institute of Technology (Shenzhen), China and Tianjin University, China HaoFeiNational University of Singapore BinWang - ShengqiongWu + ShengqiongWu YixinCaoFudan University FeiLiWuhan University MinZhangHarbin Institute of Technology, Shenzhen @@ -17861,11 +17861,11 @@ Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data YandaLi ChiZhangWestlake University - GangYuTencent - WanqiYang + GangYuTencent + WanqiYang ZhibinWangTencent LightAI Lab BinFu - GuoshengLinNanyang Technological University + GuoshengLinNanyang Technological University ChunhuaShenZhejiang University LingChenUniversity of Technology Sydney YunchaoWeiBeijing Jiaotong University @@ -17878,12 +17878,12 @@ Modeling Overregularization in Children with Small Language Models AkariHaga - SakuSugawaraNational Institute of Informatics + SakuSugawaraNational Institute of Informatics AkiyoFukatsuTokyo University, Tokyo Institute of Technology MiyuOba HirokiOuchiNAIST - TaroWatanabeNara Institute of Science and Technology, Japan - YoheiOsekiUniversity of Tokyo + TaroWatanabeNara Institute of Science and Technology, Japan + YoheiOsekiUniversity of Tokyo 14532-14550 The imitation of the children’s language acquisition process has been explored to make language models (LMs) more efficient.In particular, errors caused by children’s regularization (so-called overregularization, e.g., using wroted for the past tense of write) have been widely studied to reveal the mechanisms of language acquisition. Existing research has analyzed regularization in language acquisition only by modeling word inflection directly, which is unnatural in light of human language acquisition. In this paper, we hypothesize that language models that imitate the errors children make during language acquisition have a learning process more similar to humans. 
To verify this hypothesis, we analyzed the learning curve and error preferences of verb inflections in small-scale LMs using acceptability judgments. We analyze the differences in results by model architecture, data, and tokenization. Our model shows child-like U-shaped learning curves clearly for certain verbs, but the preferences for types of overgeneralization did not fully match the observations in children. 2024.findings-acl.865 @@ -17892,7 +17892,7 @@ Fantastic Semantics and Where to Find Them: Investigating Which Layers of Generative <fixed-case>LLM</fixed-case>s Reflect Lexical Semantics - ZhuLiu + ZhuLiu CunliangKong YingLiuTsinghua University, Tsinghua University MaosongSun @@ -17904,9 +17904,9 @@ Harnessing Large Language Models as Post-hoc Correctors - ZhiqiangZhongAarhus University - KuangyuZhouMicrosoft - DavideMottinAarhus University + ZhiqiangZhongAarhus University + KuangyuZhouMicrosoft + DavideMottinAarhus University 14559-14574 As Machine Learning (ML) models grow in size and demand higher-quality training data, the expenses associated with re-training and fine-tuning these models are escalating rapidly. Inspired by recent impressive achievements of Large Language Models (LLMs) in different fields, this paper delves into the question: can LLMs efficiently improve an ML’s performance at a minimal cost? We show that, through our proposed training-free framework LLMCorr, an LLM can work as a post-hoc corrector to propose corrections for the predictions of an arbitrary ML model. In particular, we form a contextual knowledge database by incorporating the dataset’s label information and the ML model’s predictions on the validation dataset. Leveraging the in-context learning capability of LLMs, we ask the LLM to summarise the instances in which the ML model makes mistakes and the correlation between primary predictions and true labels. Following this, the LLM can transfer its acquired knowledge to suggest corrections for the ML model’s predictions. 
Our experimental results on text analysis and the challenging molecular predictions show that LLMCorr improves the performance of a number of models by up to 39%. 2024.findings-acl.867 @@ -17915,11 +17915,11 @@ Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on <fixed-case>LLM</fixed-case> - JingcongLiangFudan University + JingcongLiangFudan University RongYeByteDance MengHan RuofeiLai - XinyuZhangHuawei Technologies Ltd. + XinyuZhangHuawei Technologies Ltd. XuanjingHuangFudan University ZhongyuWeiFudan University 14575-14595 @@ -17930,12 +17930,12 @@ <fixed-case>C</fixed-case>ycle<fixed-case>A</fixed-case>lign: Iterative Distillation from Black-box <fixed-case>LLM</fixed-case> to White-box Models for Better Human Alignment - JixiangHongRenmin University of China - QuanTu + JixiangHongRenmin University of China + QuanTu ChangyuChenRenmin University of China GaoXing JiZhangAlibaba Group - RuiYanRenmin University of China + RuiYanRenmin University of China 14596-14609 Language models trained on large-scale corpus often generate harmful responses that are harmful and contrary to human values. A prevalent approach for human alignment is reinforcement learning from human feedback (RLHF), utilizing algorithms such as proximal policy optimization (PPO). However, these methods are often characterized by complexity, instability, and substantial resource consumption. Considering that existing large language models (LLMs) like ChatGPT are already relatively well-aligned and cost-friendly, researchers propose to align the language model with human preferences from AI feedback. Nevertheless, the common practices, that unidirectionally distill the responses, are constrained by the inherent capability of LLMs. To address it, we introduce CycleAlign, a framework that distills alignment capabilities from the parameter-invisible LLMs (black-box) to the parameter-visible models (white-box) in an iterative manner. 
CycleAlign iteratively improves both the white-box and black-box models by integrating static and dynamic in-context learning and a belief alignment method. Empirical results illustrate that the model fine-tuned by CycleAlign remarkably exceeds existing methods, and achieves the state-of-the-art performance in alignment with human value. 2024.findings-acl.869 @@ -17944,9 +17944,9 @@ Towards a new research agenda for multimodal enterprise document understanding: What are we missing? - ArminehNourbakhshSchool of Computer Science, Carnegie Mellon University and J.P. Morgan Chase + ArminehNourbakhshSchool of Computer Science, Carnegie Mellon University and J.P. Morgan Chase SameenaShahJ.P. Morgan Chase - CarolynRoseSchool of Computer Science, Carnegie Mellon University + CarolynRoseSchool of Computer Science, Carnegie Mellon University 14610-14622 The field of multimodal document understanding has produced a suite of models that have achieved stellar performance across several tasks, even coming close to human performance on certain benchmarks. Nevertheless, the application of these models to real-world enterprise datasets remains constrained by a number of limitations. In this position paper, we discuss these limitations in the context of three key aspects of research: dataset curation, model development, and evaluation on downstream tasks. By analyzing 14 datasets and 7 SotA models, we identify major gaps in their utility in the context of a real-world scenario. We demonstrate how each limitation impedes the widespread use of SotA models in enterprise settings, and present a set of research challenges that are motivated by these limitations. Lastly, we propose a research agenda that is aimed at driving the field towards higher impact in enterprise applications. 
2024.findings-acl.870 @@ -17956,11 +17956,11 @@ <fixed-case>CAUSE</fixed-case>: Counterfactual Assessment of User Satisfaction Estimation in Task-Oriented Dialogue Systems AminAbolghasemi - ZhaochunRenLeiden University + ZhaochunRenLeiden University ArianAskari - MohammadAliannejadiUniversity of Amsterdam + MohammadAliannejadiUniversity of Amsterdam Maartende RijkeUniversity of Amsterdam - SuzanVerberneUniversiteit Leiden + SuzanVerberneUniversiteit Leiden 14623-14635 An important unexplored aspect in previous work on user satisfaction estimation for Task-Oriented Dialogue (TOD) systems is their evaluation in terms of robustness for the identification of user dissatisfaction: current benchmarks for user satisfaction estimation in TOD systems are highly skewed towards dialogues for which the user is satisfied. The effect of having a more balanced set of satisfaction labels on performance is unknown. However, balancing the data with more dissatisfactory dialogue samples requires further data collection and human annotation, which is costly and time-consuming. In this work, we leverage large language models (LLMs) and unlock their ability to generate satisfaction-aware counterfactual dialogues to augment the set of original dialogues of a test collection. We gather human annotations to ensure the reliability of the generated samples. We evaluate two open-source LLMs as user satisfaction estimators on our augmented collection against state-of-the-art fine-tuned models. Our experiments show that when used as few-shot user satisfaction estimators, open-source LLMs show higher robustness to the increase in the number of dissatisfaction labels in the test collection than the fine-tuned state-of-the-art models. Our results shed light on the need for data augmentation approaches for user satisfaction estimation in TOD systems. We release our aligned counterfactual dialogues, which are curated by human annotation, to facilitate further research on this topic. 
2024.findings-acl.871 @@ -17969,11 +17969,11 @@ Measuring Retrieval Complexity in Question Answering Systems - MatteoGabburo + MatteoGabburo Nicolaas PaulJedema SiddhantGarg - Leonardo F. R.Ribeiro - AlessandroMoschitti + Leonardo F. R.Ribeiro + AlessandroMoschitti 14636-14650 In this paper, we investigate which questions are challenging for retrieval-based Question Answering (QA). We (i) propose retrieval complexity (RC), a novel metric conditioned on the completeness of retrieved documents, which measures the difficulty of answering questions, and (ii) propose an unsupervised pipeline to measure RC given an arbitrary retrieval system. Our proposed pipeline measures RC more accurately than alternative estimators, including LLMs, on six challenging QA benchmarks. Further investigation reveals that RC scores strongly correlate with both QA performance and expert judgment across five of the six studied benchmarks, indicating that RC is an effective measure of question difficulty. Subsequent categorization of high-RC questions shows that they span a broad set of question shapes, including multi-hop, compositional, and temporal QA, indicating that RC scores can categorize a new subset of complex questions. Our system can also have a major impact on retrieval-based systems by helping to identify more challenging questions on existing datasets. 2024.findings-acl.872 @@ -17985,9 +17985,9 @@ JiayuSongQueen Mary, University of London JennyChimQueen Mary University London AdamTsakalidisCedefop and Alan Turing Institute - JuliaIveQueen Mary, University of London + JuliaIveQueen Mary, University of London DanaAtzil-SlonimBar-Ilan University - MariaLiakataQueen Mary University London + MariaLiakataQueen Mary University London 14651-14672 We introduce a hybrid abstractive summarisation approach combining hierarchical VAEs with LLMs to produce clinically meaningful summaries from social media user timelines, appropriate for mental health monitoring. 
The summaries combine two different narrative points of view: (a) clinical insights in third person, generated by feeding into an LLM clinical expert-guided prompts, and importantly, (b) a temporally sensitive abstractive summary of the user’s timeline in first person, generated by a novel hierarchical variational autoencoder, TH-VAE. We assess the generated summaries via automatic evaluation against expert summaries and via human evaluation with clinical experts, showing that timeline summarisation by TH-VAE results in more factual and logically coherent summaries rich in clinical utility and superior to LLM-only approaches in capturing changes over time. 2024.findings-acl.873 @@ -17999,9 +17999,9 @@ <fixed-case>PIXAR</fixed-case>: Auto-Regressive Language Modeling in Pixel Space YintaoTai - XiyangLiao - AlessandroSugliaHeriot-Watt University - AntonioVergariUniversity of Edinburgh, University of Edinburgh + XiyangLiao + AlessandroSugliaHeriot-Watt University + AntonioVergariUniversity of Edinburgh, University of Edinburgh 14673-14695 Recent work showed the possibility of building open-vocabulary large language models (LLMs) that directly operate on pixel representations. These models are implemented as autoencoders that reconstruct masked patches of rendered text. However, these pixel-based LLMs are limited to discriminative tasks (e.g., classification) and, similar to BERT, cannot be used to generate text. Therefore, they cannot be used for generative tasks such as free-form question answering. In this work, we introduce PIXAR, the first pixel-based autoregressive LLM that performs text generation. Consisting of only a decoder, PIXAR can perform free-form generative tasks while keeping the number of parameters on par with previous encoder-decoder models. Furthermore, we highlight the challenges of generating text as non-noisy images and show this is due to using a maximum likelihood objective. 
To overcome this problem, we propose an adversarial pretraining stage that improves the readability and accuracy of PIXAR by 8.1 on LAMBADA and 8.5 on bAbI, making it comparable to GPT-2 on text generation tasks. This paves the way to build open-vocabulary LLMs that operate on perceptual input only and calls into question the necessity of the usual symbolic input representation, i.e., text as (sub)tokens. 2024.findings-acl.874 @@ -18013,12 +18013,12 @@ DaMa LuChenShanghai Jiaotong University PengyuWang - HongshenXuShanghai Jiaotong University + HongshenXuShanghai Jiaotong University HanqiLi LiangtaiSun SuZhu ShuaiFan - KaiYuShanghai Jiao Tong University + KaiYuShanghai Jiao Tong University 14696-14707 Large language models (LLMs) have demonstrated proficiency across various natural language processing (NLP) tasks but often require additional training, such as continual pre-training and supervised fine-tuning. However, the costs associated with this, primarily due to their large parameter count, remain high. This paper proposes leveraging sparsity in pre-trained LLMs to expedite this training process. By observing sparsity in activated neurons during forward iterations, we identify the potential for computational speed-ups by excluding inactive neurons. We address associated challenges by extending existing neuron importance evaluation metrics and introducing a ladder omission rate scheduler. Our experiments on Llama-2 demonstrate that Sparsity-Accelerated Training (SAT) achieves comparable or superior performance to standard training while significantly accelerating the process. Specifically, SAT achieves a 45% throughput improvement in continual pre-training and saves 38% training time in supervised fine-tuning. It offers a simple, hardware-agnostic, and easily deployable framework for additional LLM training. 2024.findings-acl.875 @@ -18039,8 +18039,8 @@ Do Language Models Exhibit Human-like Structural Priming Effects? 
JaapJumelet - WillemZuidemaUniversity of Amsterdam - ArabellaSinclairUniversity of Aberdeen + WillemZuidemaUniversity of Amsterdam + ArabellaSinclairUniversity of Aberdeen 14727-14742 We explore which linguistic factors—at the sentence and token level—play an important role in influencing language model predictions, and investigate whether these are reflective of results found in humans and human corpora (Gries and Kootstra, 2017). We make use of the structural priming paradigm—where recent exposure to a structure facilitates processing of the same structure—to investigate where priming effects manifest, and what factors predict them. We find these effects can be explained via the inverse frequency effect found in human priming, where rarer elements within a prime increase priming effects, as well as lexical dependence between prime and target. Our results provide an important piece in the puzzle of understanding how properties within their context affect structural prediction in language models. 2024.findings-acl.877 @@ -18057,14 +18057,14 @@ YuhanWu HongchengGuo RuitongGanThe Hong Kong Polytechnic University, Hong Kong Polytechnic University - ZehaoNi - JianYangAlibaba Group - ManZhang + ZehaoNi + JianYangAlibaba Group + ManZhang ZhaoxiangZhangInstitute of automation, Chinese academy of science, Chinese Academy of Sciences - WanliOuyangShanghai AI Lab + WanliOuyangShanghai AI Lab KeXuBeijing University of Aeronautics and Astronautics WenhaoHuang - JieFuHong Kong University of Science and Technology + JieFuHong Kong University of Science and Technology JunranPeng 14743-14777 The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. 
In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in LLMs. RoleLLM comprises four stages: (1) Role Profile Construction for 100 roles; (2) Context-Based Instruction Generation (Context-Instruct) for role-specific knowledge extraction; (3) Role Prompting using GPT (RoleGPT) for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT) for fine-tuning open-source models along with role customization. By Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing with 168,093 samples. Moreover, RoCIT on RoleBench yields RoleLLaMA (English) and RoleGLM (Chinese), significantly enhancing role-playing abilities and even achieving comparable results with RoleGPT (using GPT-4). @@ -18089,10 +18089,10 @@ Views Are My Own, but Also Yours: Benchmarking Theory of Mind Using Common Ground AdilSoubkiState University of New York at Stony Brook JohnMurzaku, State University of New York at Stony Brook - ArashYousefi JordehiUniversity of Guilan + ArashYousefi JordehiUniversity of Guilan PeterZengState University of New York at Stony Brook MagdalenaMarkowska - Seyed AbolghasemMirroshandelUniversity of Guilan + Seyed AbolghasemMirroshandelUniversity of Guilan OwenRambowStony Brook University 14815-14823 Evaluating the theory of mind (ToM) capabilities of language models (LMs) has recently received a great deal of attention. However, many existing benchmarks rely on synthetic data, which risks misaligning the resulting experiments with human behavior. We introduce the first ToM dataset based on naturally occurring spoken dialogs, Common-ToM, and show that LMs struggle to demonstrate ToM. We then show that integrating a simple, explicit representation of beliefs improves LM performance on Common-ToM. 
@@ -18102,7 +18102,7 @@ <fixed-case>MAPLE</fixed-case>: Multilingual Evaluation of Parameter Efficient Finetuning of Large Language Models - DivyanshuAggarwal + DivyanshuAggarwal AshutoshSathe IshaanWattsGoogle DeepMind SunayanaSitaramMicrosoft @@ -18129,8 +18129,8 @@ Multi-Task Transfer Matters During Instruction-Tuning DavidMuellerJohns Hopkins University - MarkDredzeDepartment of Computer Science, Whiting School of Engineering - NicholasAndrewsJohns Hopkins University + MarkDredzeDepartment of Computer Science, Whiting School of Engineering + NicholasAndrewsJohns Hopkins University 14880-14891 Instruction-tuning trains a language model on hundreds of tasks jointly to improve a model’s ability to learn in-context; however, the mechanisms that drive in-context learning are poorly understood and, as a result, the role of instruction-tuning on in-context generalization is poorly understood as well. In this work, we study the impact of instruction-tuning on multi-task transfer: how well a model’s parameters adapt to an unseen task via fine-tuning. We find that instruction-tuning negatively impacts a model’s transfer to unseen tasks, and that model transfer and in-context generalization are highly correlated, suggesting that this catastrophic forgetting may impact in-context learning. We study methods to improve model transfer, finding that multi-task training—how well the training tasks are optimized—can significantly impact ICL generalization; additionally, we find that continual training on unsupervised pre-training data can mitigate forgetting and improve ICL generalization as well. Finally, we demonstrate that, early into training, the impact of instruction-tuning on model transfer to tasks impacts in-context generalization on that task. Overall, we provide significant evidence that multi-task transfer is deeply connected to a model’s ability to learn a task in-context. 
2024.findings-acl.883 @@ -18139,7 +18139,7 @@ What Makes a Good Order of Examples in In-Context Learning - QiGuo + QiGuo LeiyuWangnanjing university YidongWang WeiYePeking University @@ -18155,10 +18155,10 @@ YunyeGongSRI International RobikShresthaRochester Institute of Technology JaredClaypooleSRI International - MichaelCogswellSRI International + MichaelCogswellSRI International ArijitRayBoston University - ChristopherKananUniversity of Rochester - AjayDivakaranSRI International + ChristopherKananUniversity of Rochester + AjayDivakaranSRI International 14905-14918 We propose a novel VQA dataset, BloomVQA, to facilitate comprehensive evaluation of large vision-language models on comprehension tasks. Unlike current benchmarks that often focus on fact-based memorization and simple reasoning tasks without theoretical grounding, we collect multiple-choice samples based on picture stories that reflect different levels of comprehension, as laid out in Bloom’s Taxonomy, a classic framework for learning assessment widely adopted in education research. Our data maps to a novel hierarchical graph representation which enables automatic data augmentation and novel measures characterizing model consistency. We perform graded evaluation and reliability analysis on recent multi-modal models. In comparison to low-level tasks, we observe decreased performance on tasks requiring advanced comprehension and cognitive skills with up to 38.0% drop in VQA accuracy. In comparison to earlier models, GPT-4V demonstrates improved accuracy over all comprehension levels and also shows a tendency of bypassing visual inputs especially for higher-level tasks. Current models also show consistency patterns misaligned with human comprehension in various scenarios, demonstrating the need for improvement based on theoretically-grounded criteria. The dataset can be accessed at https://huggingface.co/datasets/ygong/BloomVQA. 
2024.findings-acl.885 @@ -18167,7 +18167,7 @@ <fixed-case>A</fixed-case>ttribution<fixed-case>B</fixed-case>ench: How Hard is Automatic Attribution Evaluation? - YifeiLi + YifeiLi XiangYueCarnegie Mellon University ZeyiLiaoOhio State University, Columbus HuanSunThe Ohio State University, Columbus @@ -18193,10 +18193,10 @@ <fixed-case>I</fixed-case>nstruct<fixed-case>E</fixed-case>d: Soft-Instruction Tuning for Model Editing with Hops XiaoQiHan RuLiShanxi University - XiaoliLi + XiaoliLi JiyeLiangShanxi University - ZifangZhang - JeffPanUniversity of Edinburgh, University of Edinburgh + ZifangZhang + JeffPanUniversity of Edinburgh, University of Edinburgh 14953-14968 The task of model editing becomes popular for correcting inaccurate or outdated parametric knowledge in Large Language Models (LLMs). However, there are major limitations of state of the art (SOTA) model editing methods, including the excessive memorization issue caused by the direct editing methods, as well as the error propagation and knowledge conflict issues from the memory enhancement methods, resulting in hindering models’ *portability*, e.g., the ability to transfer the new knowledge to related one-hop or multi-hop content. To address these issues, we propose the InstructEd method, the idea of which is to insert soft instructions into the attention module so as to facilitate interactions between instructions and questions and to understand and utilize new facts. Our main findings are: (i) InstructEd has achieved SOTA performance on three datasets for one-hop/multi-hop evaluation with LLaMAs and GPT2, achieving 10% (5%) improvement in one-hop (multi-hop) model editing. (ii) Different from earlier methods on editing parameters in FFN, we show that editing attention can also help. (iii) Model editing is highly related to retrieval augmented methods, which can help improve the locality of model editing while slightly decrease the editing performance with hops. 
2024.findings-acl.888 @@ -18205,16 +18205,16 @@ <fixed-case>TLCR</fixed-case>: Token-Level Continuous Reward for Fine-grained Reinforcement Learning from Human Feedback - EunseopYoonKAIST - Hee SukYoonKorea Advanced Institute of Science & Technology + EunseopYoonKAIST + Hee SukYoonKorea Advanced Institute of Science & Technology SooHwanEomKorea Advanced Institute of Science & Technology GunsooHanKakao Brain DanielNamKakao Brain Corp. DaejinJoKorea University and Kakao Brain Kyoung-WoonOnKakao - MarkHasegawa-JohnsonUniversity of Illinois, Urbana Champaign + MarkHasegawa-JohnsonUniversity of Illinois, Urbana Champaign SungwoongKimKorea University - ChangYooKorea Advanced Institute of Science and Technology + ChangYooKorea Advanced Institute of Science and Technology 14969-14981 Reinforcement Learning from Human Feedback (RLHF) leverages human preference data to train language models to align more closely with human essence. These human preference data, however, are labeled at the sequence level, creating a mismatch between sequence-level preference labels and tokens, which are autoregressively generated from the language model. Although several recent approaches have tried to provide token-level (i.e., dense) rewards for each individual token, these typically rely on predefined discrete reward values (e.g., positive: +1, negative: -1, neutral: 0), failing to account for varying degrees of preference inherent to each token. To address this limitation, we introduce TLCR (Token-Level Continuous Reward) for RLHF, which incorporates a discriminator trained to distinguish positive and negative tokens, and the confidence of the discriminator is used to assign continuous rewards to each token considering the context. Extensive experiments show that our proposed TLCR leads to consistent performance improvements over previous sequence-level or token-level discrete rewards on open-ended generation benchmarks. 
2024.findings-acl.889 @@ -18224,16 +18224,16 @@ Found in the middle: Calibrating Positional Attention Bias Improves Long Context Utilization Cheng-YuHsiehUniversity of Washington - Yung-SungChuangMassachusetts Institute of Technology + Yung-SungChuangMassachusetts Institute of Technology Chun-LiangLiGoogle ZifengWangGoogle LongLeGoogle AbhishekKumarGoogle DeepMind - JamesGlass + JamesGlass AlexanderRatnerDepartment of Computer Science, University of Washington Chen-YuLeeGoogle RanjayKrishnaDepartment of Computer Science - TomasPfisterGoogle + TomasPfisterGoogle 14982-14995 Large language models (LLMs), even when specifically trained to process long input contexts, struggle to capture relevant information located in the middle of their input. This phenomenon has been known as the lost-in-the-middle problem. In this work, we make three contributions. First, we set out to understand the factors that cause this phenomenon. In doing so, we establish a connection between lost-in-the-middle to LLMs’ intrinsic attention bias: LLMs exhibit an U-shaped attention bias where the tokens at the beginning and at the end of its input receive higher attention, regardless of their relevance. Second, we mitigate this positional bias through a calibration mechanism, found-in-the-middle, that allows the model to attend to contexts faithfully according to their relevance, even though when they are in the middle. Third, we show found-in-the-middle not only achieves better performance in locating relevant information within a long context, but also eventually leads to improved retrieval-augmented generation (RAG) performance across various tasks, outperforming existing methods by up to 10 percentage point. These findings open up future directions in understanding LLM attention bias and its potential consequences. 
2024.findings-acl.890 @@ -18243,7 +18243,7 @@ S3-<fixed-case>DST</fixed-case>: Structured Open-Domain Dialogue Segmentation and State Tracking in the Era of <fixed-case>LLM</fixed-case>s Sarkar Snigdha SarathiDas - ChiragShahUniversity of Washington + ChiragShahUniversity of Washington MengtingWanMicrosoft JenniferNevillePurdue University and Purdue University LongqiYangMicrosoft @@ -18258,11 +18258,11 @@ Set the Clock: Temporal Alignment of Pretrained Language Models - BowenZhao - ZanderBrumbaughDepartment of Computer Science + BowenZhao + ZanderBrumbaughDepartment of Computer Science YizhongWangDepartment of Computer Science, University of Washington HannanehHajishirziUniversity of Washington, University of Washington, Allen Institute for Artificial Intelligence and University of Washington, Seattle - NoahSmithUniversity of Washington and Allen Institute for Artificial Intelligence + NoahSmithUniversity of Washington and Allen Institute for Artificial Intelligence 15015-15040 Language models (LMs) are trained on web text originating from many points in time and, in general, without any explicit temporal grounding. This work investigates the temporal chaos of pretrained LMs and explores various methods to align their internal knowledge to a target time, which we call “temporal alignment.” To do this, we first automatically construct a dataset containing 20K time-sensitive questions and their answers for each year from 2000 to 2023. Based on this dataset, we empirically show that pretrained LMs (e.g., LLaMa2), despite having a recent pretraining cutoff (e.g., 2022), mostly answer questions using earlier knowledge (e.g., in 2019). We then develop several methods, from prompting to finetuning, to align LMs to use their most recent knowledge when answering questions, and investigate various factors in this alignment. Our experiments demonstrate that aligning LLaMa2 to the year 2022 can enhance its performance by up to 62% according to that year’s answers. 
This improvement occurs even without explicitly mentioning time information, indicating the possibility of aligning models’ internal sense of time after pretraining. Finally, we find that alignment to a historical time is also possible, with up to 2.8\times the performance of the unaligned LM in 2010 if finetuning models to that year. These findings hint at the sophistication of LMs’ internal knowledge organization and the necessity of tuning them properly. 2024.findings-acl.892 @@ -18274,7 +18274,7 @@ BeyzaErmisCohere AI LuizaPozzobon SaraHookerCohere For AI - PatrickLewis + PatrickLewis 15041-15058 To date, toxicity mitigation in language models has almost entirely been focused on single-language settings. As language models embrace multilingual capabilities, it’s crucial our safety measures keep pace. Recognizing this research gap, our approach expands the scope of conventional toxicity mitigation to address the complexities presented by multiple languages. In the absence of sufficient annotated datasets across languages, we employ translated data to evaluate and enhance our mitigation techniques. We also compare finetuning mitigation approaches against retrieval-augmented techniques under both static and continual toxicity mitigation scenarios. This allows us to examine the effects of translation quality and the cross-lingual transfer on toxicity mitigation. We also explore how model size and data quantity affect the success of these mitigation efforts. Covering nine languages, our study represents a broad array of linguistic families and levels of resource availability, ranging from high to mid-resource languages. Through comprehensive experiments, we provide insights into the complexities of multilingual toxicity mitigation, offering valuable insights and paving the way for future research in this increasingly important field. 
2024.findings-acl.893 @@ -18286,9 +18286,9 @@ AnshArora XuanliHeUniversity College London, University of London MaximilianMozesCohere - SrinibasSwain - MarkDrasMacquarie University - QiongkaiXuMacquarie University + SrinibasSwain + MarkDrasMacquarie University + QiongkaiXuMacquarie University 15059-15075 The democratization of pre-trained language models through open-source initiatives has rapidly advanced innovation and expanded access to cutting-edge technologies. However, this openness also brings significant security risks, including backdoor attacks, where hidden malicious behaviors are triggered by specific inputs, compromising natural language processing (NLP) system integrity and reliability. This paper suggests that merging a backdoored model with other homogeneous models can significantly remediate backdoor vulnerabilities even if such models are not entirely secure. In our experiments, we verify our hypothesis on various models (BERT-Base, RoBERTa-Large, Llama2-7B, and Mistral-7B) and datasets (SST-2, OLID, AG News, and QNLI). Compared to multiple advanced defensive approaches, our method offers an effective and efficient inference-stage defense against backdoor attacks on classification and instruction-tuned tasks without additional resources or specific knowledge. Our approach consistently outperforms recent advanced baselines, leading to an average of about 75% reduction in the attack success rate. Since model merging has been an established approach for improving model performance, the extra advantage it provides regarding defense can be seen as a cost-free bonus. 
2024.findings-acl.894 @@ -18297,9 +18297,9 @@ Enhancing Sentence Simplification in <fixed-case>P</fixed-case>ortuguese: Leveraging Paraphrases, Context, and Linguistic Features - ArthurScalercio - MariaFinattoUniversidade Federal do Rio Grande do Sul - AlinePaesUniversidade Federal Fluminense + ArthurScalercio + MariaFinattoUniversidade Federal do Rio Grande do Sul + AlinePaesUniversidade Federal Fluminense 15076-15091 Automatic text simplification focuses on transforming texts into a more comprehensible version without sacrificing their precision. However, automatic methods usually require (paired) datasets that can be rather scarce in languages other than English. This paper presents a new approach to automatic sentence simplification that leverages paraphrases, context, and linguistic attributes to overcome the absence of paired texts in Portuguese. We frame the simplification problem as a textual style transfer task and learn a style representation using the sentences around the target sentence in the document and its linguistic attributes. Moreover, unlike most unsupervised approaches that require style-labeled training data, we fine-tune strong pre-trained models using sentence-level paraphrases instead of annotated data. Our experiments show that our model achieves remarkable results, surpassing the current state-of-the-art (BART+ACCESS) while competitively matching a Large Language Model. 
2024.findings-acl.895 @@ -18323,9 +18323,9 @@ Toward Reliable Ad-hoc Scientific Information Extraction: A Case Study on Two Materials Dataset - SatanuGhosh + SatanuGhosh NealBrodnik - CarolinaFrey + CarolinaFrey CollinHolgateUniversity of California, Santa Barbara TresaPollockUniversity of California-Santa Barbara SamanthaDalyUniversity of Michigan Ann Arbor @@ -18339,7 +18339,7 @@ Structural Optimization Ambiguity and Simplicity Bias in Unsupervised Neural Grammar Induction JinwookParkGwangju Institute of Science and Technology - KangilKimGwangju Institute of Science and Technology + KangilKimGwangju Institute of Science and Technology 15124-15139 Neural parameterization has significantly advanced unsupervised grammar induction. However, training these models with a traditional likelihood loss for all possible parses exacerbates two issues: 1) *structural optimization ambiguity* that arbitrarily selects one among structurally ambiguous optimal grammars despite the specific preference of gold parses, and 2) *structural simplicity bias* that leads a model to underutilize rules to compose parse trees. These challenges subject unsupervised neural grammar induction (UNGI) to inevitable prediction errors, high variance, and the necessity for extensive grammars to achieve accurate predictions. This paper tackles these issues, offering a comprehensive analysis of their origins. As a solution, we introduce *sentence-wise parse-focusing* to reduce the parse pool per sentence for loss evaluation, using the structural bias from pre-trained parsers on the same dataset. In unsupervised parsing benchmark tests, our method significantly improves performance while effectively reducing variance and bias toward overly simplistic parses. Our research promotes learning more compact, accurate, and consistent explicit grammars, facilitating better interpretability. 
2024.findings-acl.898 @@ -18353,8 +18353,8 @@ FlorianLuisierGoogle GuolongSuGoogle XiaoyuSunGoogle - Ramya SreeBoppanaGoogle - ZilongWangUniversity of California, San Diego + Ramya SreeBoppanaGoogle + ZilongWangUniversity of California, San Diego ZifengWangGoogle JiaqiMuGoogle HaoZhang @@ -18369,9 +18369,9 @@ <fixed-case>DBQR</fixed-case>-<fixed-case>QA</fixed-case>: A Question Answering Dataset on a Hybrid of Database Querying and Reasoning RungsimanNararatwong - Chung-ChiChenAIST, National Institute of Advanced Industrial Science and Technology + Chung-ChiChenAIST, National Institute of Advanced Industrial Science and Technology NatthawutKertkeidkachornJapan Advanced Institute of Science and Technology, Tokyo Institute of Technology - HiroyaTakamuraAIST, National Institute of Advanced Industrial Science and Technology + HiroyaTakamuraAIST, National Institute of Advanced Industrial Science and Technology RyutaroIchiseNational Intitute of Informatics and Tokyo Institute of Technology, Tokyo Institute of Technology 15169-15182 This paper introduces the Database Querying and Reasoning Dataset for Question Answering (DBQR-QA), aimed at addressing the gap in current question-answering (QA) research by emphasizing the essential processes of database querying and reasoning to answer questions. Specifically designed to accommodate sequential questions and multi-hop queries, DBQR-QA more accurately mirrors the dynamics of real-world information retrieval and analysis, with a particular focus on the financial reports of US companies. The dataset’s construction, the challenges encountered during its development, the performance of large language models on this dataset, and a human evaluation are thoroughly discussed to illustrate the dataset’s complexity and highlight future research directions in querying and reasoning tasks. 
@@ -18382,12 +18382,12 @@ <fixed-case>N</fixed-case>ote<fixed-case>C</fixed-case>hat: A Dataset of Synthetic Patient-Physician Conversations Conditioned on Clinical Notes JundaWang - ZonghaiYaoUniversity of Massachusetts at Amherst - ZhichaoYangUniversity of Massachusetts, Amherst + ZonghaiYaoUniversity of Massachusetts at Amherst + ZhichaoYangUniversity of Massachusetts, Amherst HuixueZhou RumengLiUniversity of Massachusetts, Amherst XunWangMicrosoft - YuchengXu + YuchengXu HongYuColumbia University 15183-15201 We introduce NoteChat, a novel cooperative multi-agent framework leveraging Large Language Models (LLMs) to generate patient-physician dialogues. NoteChat embodies the principle that an ensemble of role-specific LLMs, through structured role-play and strategic prompting, can perform their assigned roles more effectively. The synergy among these role-playing LLMs results in a cohesive and efficient dialogue generation. Evaluation on MTS-dialogue, a benchmark dataset for patient-physician dialogues-note pairs, shows that models trained with the augmented synthetic patient-physician dialogues by NoteChat outperforms other state-of-the-art models for generating clinical notes. Our comprehensive automatic and human evaluation demonstrates that NoteChat substantially surpasses state-of-the-art models like ChatGPT and GPT-4 up to 22.78% by domain experts in generating superior synthetic patient-physician dialogues based on clinical notes. NoteChat has the potential to engage patients directly and help clinical documentation, a leading cause of physician burnout. 
@@ -18398,7 +18398,7 @@ Model Editing at Scale leads to Gradual and Catastrophic Forgetting AkshatGuptaUniversity of California, Berkeley - AnuragRao + AnuragRao GopalaAnumanchipalliUniversity of California, Berkeley 15202-15232 Editing knowledge in large language models is an attractive capability that allows us to correct incorrectly learned facts during pre-training, as well as update the model with an ever-growing list of new facts. While existing model editing techniques have shown promise, they are usually evaluated using metrics for reliability, specificity and generalization over one or few edits. We argue that for model editing to have practical utility, we must be able to make multiple edits to the same model. With this in mind, we evaluate current model editing methods at scale, focusing on two state of the art methods - ROME and MEMIT. With the lens of scalability, we evaluate model editing methods for three crucial properties - editing proficiency, fact forgetting and downstream performance. We find that as a model is edited sequentially with multiple facts, it continually becomes less editable, forgets previously edited facts and loses the ability to perform downstream tasks. For ROME and MEMIT, this “forgetting” happens in two phases - an initial gradual but progressive forgetting phase followed by an abrupt or catastrophic forgetting. Both gradual and catastrophic forgetting limit the usefulness of model editing methods at scale - the former makes model editing less effective as multiple edits are made to the model while the latter caps the scalability of such model editing methods. Our analysis also highlights other key limitations of ROME and MEMIT at scale. With our work, we push for better evaluation of model editing and development of model editing methods keeping scalability in mind. 
@@ -18408,13 +18408,13 @@ 3<fixed-case>MVRD</fixed-case>: Multimodal Multi-task Multi-teacher Visually-Rich Form Document Understanding - YihaoDing - LorenzoVaianiPolytechnic Institute of Turin + YihaoDing + LorenzoVaianiPolytechnic Institute of Turin CarenHanUniversity of Melbourne, University of Western Australia and University of Sydney - JeanLee - PaoloGarzaPolytechnic Institute of Turin - JosiahPoonUniversity of Sydney - LucaCaglieroPolytechnic Institute of Turin + JeanLee + PaoloGarzaPolytechnic Institute of Turin + JosiahPoonUniversity of Sydney + LucaCaglieroPolytechnic Institute of Turin 15233-15244 This paper presents a groundbreaking multimodal, multi-task, multi-teacher joint-grained knowledge distillation model for visually-rich form document understanding. The model is designed to leverage insights from both fine-grained and coarse-grained levels by facilitating a nuanced correlation between token and entity representations, addressing the complexities inherent in form documents. Additionally, we introduce new inter-grained and cross-grained loss functions to further refine diverse multi-teacher knowledge distillation transfer process, presenting distribution gaps and a harmonised understanding of form documents. Through a comprehensive evaluation across publicly available form document understanding datasets, our proposed model consistently outperforms existing baselines, showcasing its efficacy in handling the intricate structures and content of visually complex form documents. 
2024.findings-acl.903 @@ -18424,9 +18424,9 @@ Faithful Persona-based Conversational Dataset Generation with Large Language Models PegahJandaghi - XianghaiShengGoogle - XinyiBaiGoogle - JayPujaraUniversity of Southern California + XianghaiShengGoogle + XinyiBaiGoogle + JayPujaraUniversity of Southern California HakimSidahmed 15245-15270 High-quality conversational datasets are essential for developing AI models that can communicate with users.One way to foster deeper interactions between a chatbot and its user is through *personas*, aspects of the user’s character that provide insights into their personality, motivations, and behaviors.Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations.The Generator is an LLM prompted to output conversations.The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations.These experts select the best generated conversations, which we then use to improve the Generator.We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat.We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during an AI detection test decreases from 17.2% to 8.8% over three iterations. 
@@ -18439,11 +18439,11 @@ ZhiyangXu ChaoFengUniversity of Michigan - Ann Arbor and University of Electronic Science and Technology of China RulinShao - TrevorAshbyVirginia Polytechnic Institute and State University + TrevorAshbyVirginia Polytechnic Institute and State University YingShen DiJinMeta YuChengThe Chinese University of Hong Kong - QifanWangMeta AI + QifanWangMeta AI LifuHuangVirginia Tech 15271-15342 Despite vision-language models’ (VLMs) remarkable capabilities as versatile visual assistants, two substantial challenges persist within the existing VLM frameworks: (1) lacking task diversity in pretraining and visual instruction tuning, and (2) annotation error and bias in GPT-4 synthesized instruction tuning data. Both challenges lead to issues such as poor generalizability, hallucination, and catastrophic forgetting. To address these challenges, we construct Vision-Flan, the most diverse publicly available visual instruction tuning dataset to date, comprising 187 diverse tasks and 1,664,261 instances sourced from academic datasets, and each task is accompanied by an expert-written instruction. In addition, we propose a two-stage instruction tuning framework, in which VLMs are firstly finetuned on Vision-Flan and further tuned on GPT-4 synthesized data. We find this two-stage tuning framework significantly outperforms the traditional single-stage visual instruction tuning framework and achieves the state-of-the-art performance across a wide range of multi-modal evaluation benchmarks. Finally, we conduct in-depth analyses to understand visual instruction tuning and our findings reveal that: (1) GPT-4 synthesized data does not substantially enhance VLMs’ capabilities but rather modulates the model’s responses to human-preferred formats; (2) A minimal quantity (e.g., 1,000) of GPT-4 synthesized data can effectively align VLM responses with human-preference; (3) Visual instruction tuning mainly helps large-language models (LLMs) to understand visual features. 
@@ -18467,9 +18467,9 @@ ClaireJin SudhaRaoMicrosoft XiangyuPengSalesForce.com - PortiaBotchwayVanderbilt University + PortiaBotchwayVanderbilt University JessicaQuayeHarvard University - ChrisBrockettMicrosoft + ChrisBrockettMicrosoft BillDolan 15353-15368 Advancements in large language models (LLMs) are revolutionizing interactive game design, enabling dynamic plotlines and interactions between players and non-player characters (NPCs). However, LLMs may exhibit flaws such as hallucinations, forgetfulness, or misinterpretations of prompts, causing logical inconsistencies and unexpected deviations from intended designs. Automated techniques for detecting such game bugs are still lacking. To address this, we propose a systematic LLM-based method for automatically identifying such bugs from player game logs, eliminating the need for collecting additional data such as post-play surveys. Applied to a text-based game DejaBoom!, our approach effectively identifies bugs inherent in LLM-powered interactive games, surpassing unstructured LLM-powered bug-catching methods and filling the gap in automated detection of logical and design flaws. @@ -18489,7 +18489,7 @@ Challenges to Evaluating the Generalization of Coreference Resolution Models: A Measurement Modeling Perspective - IanPoradaMcGill University + IanPoradaMcGill University AlexandraOlteanuResearch, Microsoft KaheerSuleman AdamTrischler @@ -18502,9 +18502,9 @@ <fixed-case>SAGA</fixed-case>: A Participant-specific Examination of Story Alternatives and Goal Applicability for a Deeper Understanding of Complex Events - SaiVallurupalli + SaiVallurupalli KatrinErkUniversity of Texas, Austin - FrancisFerraroUniversity of Maryland, Baltimore County + FrancisFerraroUniversity of Maryland, Baltimore County 15396-15420 Interpreting and assessing goal driven actions is vital to understanding and reasoning over complex events. 
It is important to be able to acquire the knowledge needed for this understanding, though doing so is challenging. We argue that such knowledge can be elicited through a participant achievement lens. We analyze a complex event in a narrative according to the intended achievements of the participants in that narrative, the likely future actions of the participants, and the likelihood of goal success. We collect 6.3K high quality goal and action annotations reflecting our proposed participant achievement lens, with an average weighted Fleiss-Kappa IAA of 80%. Our collection contains annotated alternate versions of each narrative. These alternate versions vary minimally from the “original” story, but can license drastically different inferences. Our findings suggest that while modern large language models can reflect some of the goal-based knowledge we study, they find it challenging to fully capture the design and intent behind concerted actions, even when the model pretraining included the data from which we extracted the goal knowledge. We show that smaller models fine-tuned on our dataset can achieve performance surpassing larger models. 2024.findings-acl.910 @@ -18514,10 +18514,10 @@ <fixed-case>SLIDE</fixed-case>: A Framework Integrating Small and Large Language Models for Open-Domain Dialogues Evaluation KunZhao - BohaoYangUniversity of Manchester - ChenTang + BohaoYangUniversity of Manchester + ChenTang ChenghuaLinUniversity of Manchester - LiangZhanUniversity of Pittsburgh + LiangZhanUniversity of Pittsburgh 15421-15435 The long-standing one-to-many problem of gold standard responses in open-domain dialogue systems presents challenges for automatic evaluation metrics. Though prior works have demonstrated some success by applying powerful Large Language Models (LLMs), existing approaches still struggle with the one-to-many problem, and exhibit subpar performance in domain-specific scenarios. 
We assume the commonsense reasoning biases within LLMs may hinder their performance in domain-specific evaluations. To address both issues, we propose a novel framework SLIDE (Small and Large Integrated for Dialogue Evaluation), that leverages both a small, specialised model (SLM), and LLMs for the evaluation of open domain dialogues. Our approach introduces several techniques: (1) Contrastive learning to differentiate between robust and non-robust response embeddings; (2) A novel metric for semantic sensitivity that combines embedding cosine distances with similarity learned through neural networks, and (3) A strategy for incorporating the evaluation results from both the SLM and LLMs. Our empirical results demonstrate that our approach achieves state-of-the-art performance in both the classification and evaluation tasks, and additionally the SLIDE evaluator exhibits better correlation with human judgements. Our code is available at https://github.com/hegehongcha/SLIDE-ACL2024. 2024.findings-acl.911 @@ -18541,7 +18541,7 @@ What Makes Language Models Good-enough? DaikiAsami - SakuSugawaraNational Institute of Informatics + SakuSugawaraNational Institute of Informatics 15453-15467 Psycholinguistic research suggests that humans may build a representation of linguistic input that is ‘good-enough’ for the task at hand. This study examines what architectural features make language models learn human-like good-enough language processing. We focus on the number of layers and self-attention heads in Transformers. We create a good-enough language processing (GELP) evaluation dataset (7,680 examples), which is designed to test the effects of two plausibility types, eight construction types, and three degrees of memory cost on language processing. To annotate GELP, we first conduct a crowdsourcing experiment whose design follows prior psycholinguistic studies. 
Our model evaluation against the annotated GELP then reveals that the full model as well as models with fewer layers and/or self-attention heads exhibit a good-enough performance. This result suggests that models with shallower depth and fewer heads can learn good-enough language processing. 2024.findings-acl.913 @@ -18578,15 +18578,15 @@ Knowledge-Infused Prompting: Assessing and Advancing Clinical Text Data Generation with Large Language Models RanXuEmory University - HejieCuiStanford University - YueYuGeorgia Institute of Technology + HejieCuiStanford University + YueYuGeorgia Institute of Technology XuanKan - WenqiShiUniversity of Texas Southwestern Medical Center + WenqiShiUniversity of Texas Southwestern Medical Center YuchenZhuangGeorgia Institute of Technology May DongmeiWang WeiJinEmory University JoyceHoEmory University - CarlYangEmory University + CarlYangEmory University 15496-15523 Clinical natural language processing faces challenges like complex medical terminology and clinical contexts. Recently, large language models (LLMs) have shown promise in this domain. Yet, their direct deployment can lead to privacy issues and are constrained by resources. To address this challenge, we delve into synthetic clinical text generation with LLMs for clinical NLP tasks. We propose an innovative, resource-efficient approach, ClinGen, which infuses knowledge into the process. Our model involves clinical knowledge extraction and context-informed LLM prompting. Both clinical topics and writing styles are drawn from external domain-specific knowledge graphs and LLMs to guide data generation. Our extensive empirical study across 8 clinical NLP tasks and 18 datasets reveals that ClinGen consistently enhances performance across various tasks by 7.7%-8.7% on average, effectively aligning the distribution of real datasets and enriching the diversity of generated training instances. 
2024.findings-acl.916 @@ -18621,9 +18621,9 @@ <fixed-case>TELLER</fixed-case>: A Trustworthy Framework for Explainable, Generalizable and Controllable Fake News Detection - HuiLiuCity University of Hong Kong - WenyaWangNanyang Technological University - HaoruLi + HuiLiuCity University of Hong Kong + WenyaWangNanyang Technological University + HaoruLi HaoliangLiCity University of Hong Kong 15556-15583 The proliferation of fake news has emerged as a severe societal problem, raising significant interest from industry and academia. While existing deep-learning based methods have made progress in detecting fake news accurately, their reliability may be compromised caused by the non-transparent reasoning processes, poor generalization abilities and inherent risks of integration with large language models (LLMs). To address this challenge, we propose TELLER, a novel framework for trustworthy fake news detection that prioritizes explainability, generalizability and controllability of models. This is achieved via a dual-system framework that integrates cognition and decision systems, adhering to the principles above. The cognition system harnesses human expertise to generate logical predicates, which guide LLMs in generating human-readable logic atoms. Meanwhile, the decision system deduces generalizable logic rules to aggregate these atoms, enabling the identification of the truthfulness of the input news across diverse domains and enhancing transparency in the decision-making process. Finally, we present comprehensive evaluation results on four datasets, demonstrating the feasibility and trustworthiness of our proposed framework. 
@@ -18658,7 +18658,7 @@ A Meta-Learning Perspective on Transformers for Causal Language Modeling XinboWuUniversity of Illinois, Urbana Champaign - LavVarshneyUniversity of Illinois at Urbana-Champaign + LavVarshneyUniversity of Illinois at Urbana-Champaign 15612-15622 The Transformer architecture has become prominent in developing large causal language models. However, mechanisms to explain its capabilities are not well understood. Focused on the training process, here we establish a meta-learning view of the Transformer architecture when trained for the causal language modeling task, by explicating an inner optimization process that may happen within the Transformer. Further, from within the inner optimization, we discover and theoretically analyze a special characteristic of the norms of learned token representations within Transformer-based causal language models. Our analysis is supported by experiments conducted on pre-trained large language models and real-world data. 2024.findings-acl.922 @@ -18668,15 +18668,15 @@ <fixed-case>PL</fixed-case>a<fixed-case>D</fixed-case>: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs RongzhiZhangGeorgia Institute of Technology and Zhejiang University - JiamingShenGoogle DeepMind + JiamingShenGoogle DeepMind TianqiLiuGoogle HaoruiWangGeorgia Institute of Technology - ZhenQinGoogle + ZhenQinGoogle FengHanResearch, Google JialuLiuGoogle Research SimonBaumgartnerGoogle - MichaelBenderskyGoogle - ChaoZhangGeorgia Institute of Technology + MichaelBenderskyGoogle + ChaoZhangGeorgia Institute of Technology 15623-15636 Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings. Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models. 
However, traditional KD techniques face specific challenges when applied to LLMs, including restricted access to LLM outputs, significant teacher-student capacity gaps, and the inherited mis-calibration issue. In this work, we present PLaD, a novel preference-based LLM distillation framework. PLaD exploits the teacher-student capacity discrepancy to generate pseudo-preference pairs where teacher outputs are preferred over student outputs. Then, PLaD leverages a ranking loss to re-calibrate the student’s estimation of sequence likelihood, which steers the student’s focus towards understanding the relative quality of outputs instead of simply imitating the teacher. PLaD bypasses the need for access to teacher LLM’s internal states, tackles the student’s expressivity limitations, and mitigates the student mis-calibration issue. Through extensive experiments on two sequence generation tasks and with various LLMs, we demonstrate the effectiveness of our proposed PLaD framework. 2024.findings-acl.923 @@ -18703,9 +18703,9 @@ KexunZhangCarnegie Mellon University YeeChoi ZhenqiaoSong - TaiqiHe + TaiqiHe William YangWangUC Santa Barbara - LeiLiSchool of Computer Science, Carnegie Mellon University + LeiLiSchool of Computer Science, Carnegie Mellon University 15654-15669 How can large language models (LLMs) process and translate endangered languages? Many languages lack a large corpus to train a decent LLM; therefore existing LLMs rarely perform well in unseen, endangered languages. On the contrary, we observe that 2000 endangered languages, though without a large corpus, have a grammar book or a dictionary. We propose LingoLLM, a training-free approach to enable an LLM to process unseen languages that hardly occur in its pre-training. Our key insight is to demonstrate linguistic knowledge of an unseen language in an LLM’s prompt, including a dictionary, a grammar book, and morphologically analyzed input text. 
We implement LingoLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages. Our results show that LingoLLM elevates translation capability from GPT-4’s 0 to 10.5 BLEU for 10 language directions. Our findings demonstrate the tremendous value of linguistic knowledge in the age of LLMs for endangered languages. Our data, code, and model generations will be released to the public. Our data, code, and model generations can be found at https://github.com/LLiLab/llm4endangeredlang. 2024.findings-acl.925 @@ -18714,7 +18714,7 @@ From Tarzan to <fixed-case>T</fixed-case>olkien: Controlling the Language Proficiency Level of <fixed-case>LLM</fixed-case>s for Content Generation - AliMalikStanford University + AliMalikStanford University StephenMayhewDuolingo ChristopherPiech KlintonBicknellDuolingo @@ -18741,9 +18741,9 @@ <fixed-case>CT</fixed-case>ool<fixed-case>E</fixed-case>val: A <fixed-case>C</fixed-case>hinese Benchmark for <fixed-case>LLM</fixed-case>-Powered Agent Evaluation in Real-World <fixed-case>API</fixed-case> Interactions - ZishanGuo + ZishanGuo YufeiHuang - DeyiXiongTianjin University + DeyiXiongTianjin University 15711-15724 Assessing the capabilities of large language models (LLMs) as agents in decision making and operational tasks is crucial for the development of LLM-as-agent service. We propose CToolEval, a benchmark designed to evaluate LLMs in the context of Chinese societal applications, featuring 398 APIs across 27 widely-used Apps (e.g., Apps for shopping, map, music, travel, etc.) that cover 14 domains. We further present an evaluation framework that simulates real-life scenarios, to facilitate the assessment of tool invocation ability of LLMs for tool learning and task completion ability for user interation. 
Our extensive experiments with CToolEval evaluate 11 LLMs, revealing that while GPT-3.5-turbo excels in tool invocation, Chinese LLMs usually struggle with issues like hallucination and a lack of comprehensive tool understanding. Our findings highlight the need for further refinement in decision-making capabilities of LLMs, offering insights into bridging the gap between current functionalities and agent-level performance. To promote further research for LLMs to fully act as reliable agents in complex, real-world situations, we release our data and codes at https://github.com/tjunlp-lab/CToolEval. 2024.findings-acl.928 @@ -18753,11 +18753,11 @@ Token Alignment via Character Matching for Subword Completion BenAthiwaratkunAmazon - ShiqiWangAmazon + ShiqiWangAmazon MingyueShangAmazon YuchenTian ZijianWangAmazon AWS AI Labs - Sujan KumarGonugondlaAmazon + Sujan KumarGonugondlaAmazon Sanjay KrishnaGoudaAmazon RobertKwiatkowskiAmazon RameshNallapatiAmazon Web Services @@ -18800,7 +18800,7 @@ Language-Informed Beam Search Decoding for Multilingual Machine Translation YilinYangOregon State University StefanLeeOregon State University - PrasadTadepalliOregon State University and Oregon State University + PrasadTadepalliOregon State University and Oregon State University 15761-15772 Beam search decoding is the de-facto method for decoding auto-regressive Neural Machine Translation (NMT) models, including multilingual NMT where the target language is specified as an input. However, decoding multilingual NMT models commonly produces off-target translations – yielding translation outputs not in the intended language.In this paper, we first conduct an error analysis of off-target translations for a strong multilingual NMT model and identify how these decodings are produced during beam search. 
We then propose Language-informed Beam Search (LiBS), a general decoding algorithm incorporating an off-the-shelf Language Identification (LiD) model into beam search decoding to reduce off-target translations. LiBS is an inference-time procedure that is NMT-model agnostic and does not require any additional parallel data. Results show that our proposed LiBS algorithm on average improves +1.1 BLEU and +0.9 BLEU on WMT and OPUS datasets, and reduces off-target rates from 22.9% to 7.7% and 65.8% to 25.3% respectively. 2024.findings-acl.932 @@ -18821,8 +18821,8 @@ The <fixed-case>PGNSC</fixed-case> Benchmark: How Do We Predict Where Information Spreads? - AlexanderTaylorUCLA Computer Science Department, University of California, Los Angeles - WeiWangUniversity of California, Los Angeles + AlexanderTaylorUCLA Computer Science Department, University of California, Los Angeles + WeiWangUniversity of California, Los Angeles 15787-15803 Social networks have become ideal vehicles for news dissemination because posted content is easily able to reach users beyond a news outlet’s direct audience. Understanding how information is transmitted among communities of users is a critical step towards understanding the impact social networks have on real-world events. Two significant barriers in this vein of work are identifying user clusters and meaningfully characterizing these communities. Thus, we propose the PGNSC benchmark, which builds information pathways based on the audiences of influential news sources and uses their content to characterize the communities. We present methods of aggregating these news-source-centric communities and for constructing the community feature representations that are used sequentially to construct information pathway prediction pipelines. Lastly, we perform extensive experiments to demonstrate the performance of baseline pipeline constructions and to highlight the possibilities for future work. 
2024.findings-acl.934 @@ -18878,10 +18878,10 @@ A Critical Study of What Code-<fixed-case>LLM</fixed-case>s (Do Not) Learn - AbhinavAnandTechnische Universität Darmstadt + AbhinavAnandTechnische Universität Darmstadt ShwetaVermaTechnische Universität Darmstadt KrishnaNarasimhan - MiraMeziniTechnische Universität Darmstadt + MiraMeziniTechnische Universität Darmstadt 15869-15889 Large Language Models trained on code corpora (code-LLMs) have demonstrated impressive performance in various coding assistance tasks. However, despite their increased size and training dataset, code-LLMs still have limitations such as suggesting codes with syntactic errors, variable misuse etc. Some studies argue that code-LLMs perform well on coding tasks because they use self-attention and hidden representations to encode relations among input tokens. However, previous works have not studied what code properties are not encoded by code-LLMs. In this paper, we conduct a fine-grained analysis of attention maps and hidden representations of code-LLMs. Our study indicates that code-LLMs only encode relations among specific subsets of input tokens. Specifically, by categorizing input tokens into syntactic tokens and identifiers, we found that models encode relations among syntactic tokens and among identifiers, but they fail to encode relations between syntactic tokens and identifiers. We also found that fine-tuned models encode these relations poorly compared to their pre-trained counterparts. Additionally, larger models with billions of parameters encode significantly less information about code than models with only a few hundred million parameters. 
2024.findings-acl.939 @@ -18890,7 +18890,7 @@ Visual In-Context Learning for Large Vision-Language Models - YuchengZhouUniversity of Macau + YuchengZhouUniversity of Macau XiangLi QianningWang JianbingShenUniversity of Macau @@ -18908,7 +18908,7 @@ Si-QingChen FuruWeiMicrosoft Research DongyanZhaoPeking University - RuiYanRenmin University of China + RuiYanRenmin University of China 15903-15918 In this paper, we introduce SCALE, a collaborative framework that connects a compact Specialized Translation Model (STM) and a general-purpose Large Language Model (LLM) as one unified translation engine. By introducing translation from STM into the triplet in-context demonstrations, SCALE unlocks refinement and pivoting ability of LLM, thus 1) mitigating language bias of LLMs and parallel data bias of STMs, 2) enhancing LLM speciality without sacrificing generality, and 3) facilitating continual learning in a LLM-tuning-free way.Our comprehensive experiments show that SCALE significantly outperforms both LLMs (GPT-4, GPT-3.5) and supervised models (NLLB, M2M) in either high-resource or challenging low-resource settings. Moreover SCALE shows great scalability by only updating the lightweight STM and witness consistent system improvement, an averaged 4 BLEURT score across 4 languages without tuning LLM. Interestingly, SCALE could also effectively exploit the existing language bias of LLMs by using an English-centric STM as a pivot to conduct translation between any language pairs, outperforming GPT-4 by an average of 6 COMET points across eight translation directions. Furthermore we provide an in-depth analysis of SCALE’s robustness, translation characteristics, latency costs and inherent language bias, providing solid foundation for future studies exploring the potential synergy between LLMs and more specialized models. 
2024.findings-acl.941 @@ -18930,12 +18930,12 @@ Retrieval-Augmented Retrieval: Large Language Models are Strong Zero-Shot Retriever TaoShenOracle - GuodongLongUniversity of Technology Sydney + GuodongLongUniversity of Technology Sydney XiuboGengMicrosoft ChongyangTaoBeihang University YibinLeiUniversity of Amsterdam - TianyiZhouUniversity of Maryland, College Park - MichaelBlumensteinUniversity of Technology Sydney + TianyiZhouUniversity of Maryland, College Park + MichaelBlumensteinUniversity of Technology Sydney DaxinJiangMicrosoft 15933-15946 We propose a simple method that applies a large language model (LLM) to large-scale retrieval in zero-shot scenarios. Our method, the Large language model as Retriever (LameR), is built upon no other neural models but an LLM in a retrieval-augmented retrieval fashion, while breaking brute-force combinations of retrievers with LLMs and lifting the performance of zero-shot retrieval to be very competitive on benchmark datasets. Essentially, we propose to augment a query with its potential answers by prompting LLMs with a composition of the query and the query’s in-domain candidates. The candidates, regardless of correct or wrong, are obtained by a vanilla retrieval procedure on the target collection. As a part of the prompts, they are likely to help LLM generate more precise answers by pattern imitation or candidate summarization. Even if all the candidates are wrong, the prompts at least make LLM aware of in-collection patterns and genres. Moreover, due to the low performance of a self-supervised retriever, the LLM-based query augmentation becomes less effective as the retriever bottlenecks the whole pipeline. Therefore, we propose to leverage a non-parametric lexicon-based method (e.g., BM25) as the retrieval module to capture query-document overlap in a literal fashion. As such, LameR makes the retrieval procedure transparent to the LLM, thus circumventing the bottleneck. 
@@ -18945,11 +18945,11 @@ A Survey on Predicting the Factuality and the Bias of News Media - PreslavNakov + PreslavNakov JisunAn HaewoonKwak - Muhammad ArslanManzoor - Zain MuhammadMujahid + Muhammad ArslanManzoor + Zain MuhammadMujahid Husrev TahaSencar 15947-15962 The present level of proliferation of fake, biased, and propagandistic content online has made it impossible to fact-check every single suspicious claim or article, either manually or automatically. An increasing number of scholars are focusing on a coarser granularity, aiming to profile entire news outlets, which allows fast identification of potential “fake news” by checking the reliability of their source. Source factuality is also an important element of systems for automatic fact-checking and “fake news” detection, as they need to assess the reliability of the evidence they retrieve online. Political bias detection, which in the Western political landscape is about predicting left-center-right bias, is an equally important topic, which has experienced a similar shift toward profiling entire news outlets. Moreover, there is a clear connection between the two, as highly biased media are less likely to be factual; yet, the two problems have been addressed separately. In this survey, we review the state of the art on media profiling for factuality and bias, arguing for the need to model them jointly. We also shed light on some of the major challenges for modeling bias and factuality jointly. We further discuss interesting recent advances in using different information sources and modalities, which go beyond the text of the articles the target news outlet has published. Finally, we discuss current challenges and outline future research directions. 
@@ -18971,9 +18971,9 @@ Improving Multi-hop Logical Reasoning in Knowledge Graphs with Context-Aware Query Representation Learning JeonghoonKim - HeesooJung + HeesooJung HyejuJangIndiana University/Purdue University at Indianapolis - HogunParkSungkyunkwan University + HogunParkSungkyunkwan University 15978-15991 Multi-hop logical reasoning on knowledge graphs is a pivotal task in natural language processing, with numerous approaches aiming to answer First-Order Logic (FOL) queries. Recent geometry (e.g., box, cone) and probability (e.g., beta distribution)-based methodologies have effectively addressed complex FOL queries. However, a common challenge across these methods lies in determining accurate geometric bounds or probability parameters for these queries. The challenge arises because existing methods rely on linear sequential operations within their computation graphs, overlooking the logical structure of the query and the relation-induced information that can be gleaned from the relations of the query, which we call the context of the query. To address the problem, we propose a model-agnostic methodology that enhances the effectiveness of existing multi-hop logical reasoning approaches by fully integrating the context of the FOL query graph. Our approach distinctively discerns (1) the structural context inherent to the query structure and (2) the relation-induced context unique to each node in the query graph as delineated in the corresponding knowledge graph. This dual-context paradigm helps nodes within a query graph attain refined internal representations throughout the multi-hop reasoning steps. Through experiments on two datasets, our method consistently enhances the three multi-hop reasoning foundation models, achieving performance improvements of up to 19.5%. Our codes are available at https://github.com/kjh9503/caqr. 
2024.findings-acl.946 @@ -18985,10 +18985,10 @@ YuzhaoHeng ChunyuanDengRice University YitongLi - YueYuGeorgia Institute of Technology - YinghaoLi + YueYuGeorgia Institute of Technology + YinghaoLi RongzhiZhangGeorgia Institute of Technology and Zhejiang University - ChaoZhangGeorgia Institute of Technology + ChaoZhangGeorgia Institute of Technology 15992-16030 Although Large Language Models (LLMs) exhibit remarkable adaptability across domains, these models often fall short in structured knowledge extraction tasks such as named entity recognition (NER). This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets. Our approach diverges from the basic class-conditional prompts by instructing LLMs to self-reflect on the specific domain, thereby generating domain-relevant attributes (such as category and emotions for movie reviews), which are utilized for creating attribute-rich training data. Furthermore, we preemptively generate entity terms and then develop NER context data around these entities, effectively bypassing the LLMs’ challenges with complex structures. Our experiments across both general and niche domains reveal significant performance enhancements over conventional data generation methods while being more cost-effective than existing alternatives. 2024.findings-acl.947 @@ -19011,8 +19011,8 @@ A Large Collection of Model-generated Contradictory Responses for Consistency-aware Dialogue Systems ShikiSatoCyberAgent, Inc. ReinaAkamaTohoku University and RIKEN - JunSuzukiTohoku University - KentaroInuiMohamed bin Zayed University of Artificial Intelligence, RIKEN and Tohoku University + JunSuzukiTohoku University + KentaroInuiMohamed bin Zayed University of Artificial Intelligence, RIKEN and Tohoku University 16047-16062 Mitigating the generation of contradictory responses poses a substantial challenge in dialogue response generation. 
The quality and quantity of available contradictory response data play a vital role in suppressing these contradictions, offering two significant benefits. First, having access to large contradiction data enables a comprehensive examination of their characteristics. Second, data-driven methods to mitigate contradictions may be enhanced with large-scale contradiction data for training. Nevertheless, no attempt has been made to build an extensive collection of model-generated contradictory responses. In this paper, we build a large dataset of response generation models’ contradictions for the first time. Then, we acquire valuable insights into the characteristics of model-generated contradictions through an extensive analysis of the collected responses. Lastly, we also demonstrate how this dataset substantially enhances the performance of data-driven contradiction suppression methods. 2024.findings-acl.949 @@ -19025,7 +19025,7 @@ RisakoAndoKeio University TakanobuMorishita HirohikoAbeKeio University - KojiMineshimaKeio University + KojiMineshimaKeio University MitsuhiroOkada 16063-16077 This paper explores the question of how accurately current large language models can perform logical reasoning in natural language, with an emphasis on whether these models exhibit reasoning biases similar to humans. Specifically, our study focuses on syllogistic reasoning, a form of deductive reasoning extensively studied in cognitive science as a natural form of human reasoning. We present a syllogism dataset called NeuBAROCO, which consists of syllogistic reasoning problems in English and Japanese. This dataset was originally designed for psychological experiments to assess human reasoning capabilities using various forms of syllogisms. Our experiments with leading large language models indicate that these models exhibit reasoning biases similar to humans, along with other error tendencies. 
Notably, there is significant room for improvement in reasoning problems where the relationship between premises and hypotheses is neither entailment nor contradiction. We also present experimental results and in-depth analysis using a new Chain-of-Thought prompting method, which asks LLMs to translate syllogisms into abstract logical expressions and then explain their reasoning process. Our analysis using this method suggests that the primary limitations of LLMs lie in the reasoning process itself rather than the interpretation of syllogisms. @@ -19052,7 +19052,7 @@ <fixed-case>DIMSIM</fixed-case>: Distilled Multilingual Critics for <fixed-case>I</fixed-case>ndic Text Simplification SnehaMondalGoogle RitikaRitikaGoogle - AshishAgrawal + AshishAgrawal PreethiJyothiIndian Institute of Technology Bombay AravindanRaghuveerGoogle 16093-16109 @@ -19066,7 +19066,7 @@ DongkyuLee ChandanaSatya PrakashAmazon JackFitzGeraldAmazon - JensLehmannAmazon, Technische Universität Dresden, University of Bonn and Fraunhofer IAIS + JensLehmannAmazon, Technische Universität Dresden, University of Bonn and Fraunhofer IAIS 16110-16121 Leveraging external knowledge is crucial for achieving high performance in knowledge-intensive tasks, such as question answering. The retrieve-and-read approach is widely adopted for integrating external knowledge into a language model. However, this approach suffers from increased computational cost and latency due to the long context length, which grows proportionally with the number of retrieved knowledge. Furthermore, existing retrieval-augmented models typically retrieve information from a single type of knowledge source, limiting their scalability to diverse knowledge sources with varying structures. In this work, we introduce an efficient memory-augmented transformer called MATTER, designed to retrieve relevant knowledge from multiple heterogeneous knowledge sources. 
Specifically, our model retrieves and reads from both unstructured sources (paragraphs) and semi-structured sources (QA pairs) in the form of fixed-length neural memories. We demonstrate that our model outperforms existing efficient retrieval-augmented models on popular QA benchmarks in terms of both accuracy and speed. Furthermore, MATTER achieves competitive results compared to conventional read-and-retrieve models while having 100x throughput during inference. 2024.findings-acl.953 @@ -19088,12 +19088,12 @@ Chain-of-History Reasoning for Temporal Knowledge Graph Forecasting - YuweiXia + YuweiXia DingWang - QiangLiuInstitute of Automation, Chinese Academy of Sciences - LiangWang + QiangLiuInstitute of Automation, Chinese Academy of Sciences + LiangWang ShuWuInstitute of automation, Chinese academy of science, Chinese Academy of Sciences - Xiao-YuZhangInstitute of Information Engineering, Chinese Academy of Sciences + Xiao-YuZhangInstitute of Information Engineering, Chinese Academy of Sciences 16144-16159 Temporal Knowledge Graph (TKG) forecasting aims to predict future facts based on given histories. Most recent graph-based models excel at capturing structural information within TKGs but lack semantic comprehension abilities. Nowadays, with the surge of LLMs, the LLM-based TKG prediction model has emerged. However, the existing LLM-based model exhibits three shortcomings: (1) It only focuses on the first-order history for prediction while ignoring high-order historical information, resulting in the provided information for LLMs being extremely limited. (2) LLMs struggle with optimal reasoning performance under heavy historical information loads. (3) For TKG prediction, the temporal reasoning capability of LLM alone is limited. To address the first two challenges, we propose Chain-of-History (CoH) reasoning which explores high-order histories step-by-step, achieving effective utilization of high-order historical information for LLMs on TKG prediction. 
To address the third issue, we design CoH as a plug-and-play module to enhance the performance of graph-based models for TKG prediction. Extensive experiments on three datasets and backbones demonstrate the effectiveness of CoH. 2024.findings-acl.955 @@ -19102,10 +19102,10 @@ Can <fixed-case>LLM</fixed-case>s Speak For Diverse People? Tuning <fixed-case>LLM</fixed-case>s via Debate to Generate Controllable Controversial Statements - MingLiUniversity of Maryland, College Park + MingLiUniversity of Maryland, College Park JiuhaiChen LichangChen - TianyiZhouUniversity of Maryland, College Park + TianyiZhouUniversity of Maryland, College Park 16160-16176 Making LLMs speak for different, especially minority groups of people, and generate statements supporting their diverse or even controversial perspectives is critical to creating an inclusive environment. However, existing LLMs lack sufficient controllability to the stance of their generated content, which often contains inconsistent, neutral, or biased statements. In this paper, we improve the controllability of LLMs in generating statements supporting an argument the user defined in the prompt. We find that multi-round debates between two LLMs with opposite stances generate higher-quality and more salient statements for each, which are important training data to improve the controllability of LLMs. Motivated by this, we develop a novel debate & tuning (“DEBATUNE”) pipeline finetuning LLMs to generate the statements obtained via debate. To examine DEBATUNE, we curate the largest dataset of debate topics so far, which covers 710 controversial topics and corresponding arguments for each topic. Evaluations by the GPT-4 judge with a novel controversy controllability metric show that LLMs’ capability of generating diverse perspectives is significantly improved by DEBATUNE. Moreover, such controllability can be generalized to unseen topics, generating high-quality statements supporting controversial arguments. 
2024.findings-acl.956 @@ -19115,10 +19115,10 @@ Label-aware Hard Negative Sampling Strategies with Momentum Contrastive Learning for Implicit Hate Speech Detection JaehoonKimHanyang University - SeungwanJin + SeungwanJin SohyunPark SomeenPark - KyungsikHanHanyang University + KyungsikHanHanyang University 16177-16188 Detecting implicit hate speech that is not directly hateful remains a challenge. Recent research has attempted to detect implicit hate speech by applying contrastive learning to pre-trained language models such as BERT and RoBERTa, but the proposed models still do not have a significant advantage over cross-entropy loss-based learning. We found that contrastive learning based on randomly sampled batch data does not encourage the model to learn hard negative samples. In this work, we propose Label-aware Hard Negative sampling strategies (LAHN) that encourage the model to learn detailed features from hard negative samples, instead of naive negative samples in random batch, using momentum-integrated contrastive learning. LAHN outperforms the existing models for implicit hate speech detection both in- and cross-datasets. The code is available at https://github.com/Hanyang-HCC-Lab/LAHN 2024.findings-acl.957 @@ -19127,12 +19127,12 @@ Selective Reflection-Tuning: Student-Selected Data Recycling for <fixed-case>LLM</fixed-case> Instruction-Tuning - MingLiUniversity of Maryland, College Park + MingLiUniversity of Maryland, College Park LichangChen JiuhaiChen ShwaiHeUniversity of Maryland, College Park JiuxiangGuAdobe Systems - TianyiZhouUniversity of Maryland, College Park + TianyiZhouUniversity of Maryland, College Park 16189-16211 Instruction tuning is critical to large language models (LLMs) for achieving better instruction following and task adaptation capabilities but its success heavily relies on the training data quality. 
Many recent methods focus on improving the data quality but often overlook the compatibility of the data with the student model being finetuned. This paper introduces Selective Reflection-Tuning, a novel paradigm that synergizes a teacher LLM’s reflection and introspection for improving existing data quality with the data selection capability of the student LLM, to automatically refine existing instruction-tuning data. This teacher-student collaboration produces high-quality and student-compatible instruction-response pairs, resulting in sample-efficient instruction tuning and LLMs of superior performance. Selective Reflection-Tuning is a data augmentation and synthesis that generally improves LLM finetuning and self-improvement without collecting brand-new data. We apply our method to Alpaca and WizardLM data and achieve much stronger and top-tier 7B and 13B LLMs. 2024.findings-acl.958 @@ -19168,17 +19168,17 @@ <fixed-case>C</fixed-case>ontext<fixed-case>BLIP</fixed-case>: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions - HonglinLin + HonglinLin SiyuLi GuoshunNanBeijing University of Posts and Telecommunications - ChaoyueTang - XuetingWang - JingxinXuBeijing University of Posts and Telecommunications + ChaoyueTang + XuetingWang + JingxinXuBeijing University of Posts and Telecommunications RongYankai - ZhouzhiliZhouzhiliGuangzhou University + ZhouzhiliZhouzhiliGuangzhou University YutongGaoBeijing jiaotong univercity, National Taipei University of Technology, Northeastern University and Minzu University of China QimeiCuiBeijing University of Posts and Telecommunications - XiaofengTao + XiaofengTao 16240-16258 Image retrieval from contextual descriptions (IRCD) aims to identify an image within a set of minimally contrastive candidates based on linguistically complex text. Despite the success of VLMs, they still significantly lag behind human performance in IRCD. 
The main challenges lie in aligning key contextual cues in two modalities, where these subtle cues are concealed in tiny areas of multiple contrastive images and within the complex linguistics of textual descriptions. This motivates us to propose ContextBLIP, a simple yet effective method that relies on a doubly contextual alignment scheme for challenging IRCD. Specifically, 1) our model comprises a multi-scale adapter, a matching loss, and a text-guided masking loss. The adapter learns to capture fine-grained visual cues. The two losses enable iterative supervision for the adapter, gradually highlighting the focal patches of a single image to the key textual cues. We term such a way as intra-contextual alignment. 2) Then, ContextBLIP further employs an inter-context encoder to learn dependencies among candidates, facilitating alignment between the text to multiple images. We term this step as inter-contextual alignment. Consequently, the nuanced cues concealed in each modality can be effectively aligned. Experiments on two benchmarks show the superiority of our method. We observe that ContextBLIP can yield comparable results with GPT-4V, despite involving about 7,500 times fewer parameters. 2024.findings-acl.961 @@ -19205,7 +19205,7 @@ JaehongKim ChaeyoonJeong SeongchanPark - MeeyoungChaKorea Advanced Institute of Science & Technology + MeeyoungChaKorea Advanced Institute of Science & Technology WonjaeLeeKorea Advanced Institute of Science and Technology 16274-16289 Understanding the interplay between emotions in language and user behaviors is critical. We study how moral emotions shape the political participation of users based on cross-cultural online petition data. To quantify moral emotions, we employ a context-aware NLP model that is designed to capture the subtle nuances of emotions across cultures. 
For model training, we construct and share a moral emotion dataset comprising nearly 50,000 petition sentences in Korean and English each, along with emotion labels annotated by a fine-tuned LLM. We examine two distinct types of user participation: general support (i.e., registered signatures of petitions) and active support (i.e., sharing petitions on social media). We discover that moral emotions like other-suffering increase both forms of participation and help petitions go viral, while self-conscious have the opposite effect. The most prominent moral emotion, other-condemning, led to polarizing responses among the audience. In contrast, other-praising was perceived differently by culture; it led to a rise in active support in Korea but a decline in the UK. Our findings suggest that both moral emotions embedded in language and cultural perceptions are critical to shaping the public’s political discourse. @@ -19229,8 +19229,8 @@ <fixed-case>CF</fixed-case>-<fixed-case>TCIR</fixed-case>: A Compositor-Free Framework for Hierarchical Text-Conditioned Image Retrieval YuchenYang - YuWangShanghai Jiao Tong University - YanfengWangShanghai Jiao Tong University + YuWangShanghai Jiao Tong University + YanfengWangShanghai Jiao Tong University 16315-16325 In text-conditioned image retrieval (TCIR), the combination of a reference image and modification text forms a query tuple, aiming to locate the most congruent target image within a dataset. The advantages of rich image semantic information and text flexibility are combined in this manner for more accurate retrieval. While traditional techniques often employ attention-driven compositors to craft a unified image-text representation, our paper introduces a compositor-free framework, CF-TCIR, which eschews the standard compositor. 
Compositor-based methods are designed to learn a joint representation of images and text, but they struggle to directly capture the correlations between attributes across the image and text modalities. Instead, we reformulate the retrieval process as a cross-modal interaction between a synthesized image feature and its corresponding text descriptor. This novel methodology offers advantages in terms of computational efficiency, scalability, and superior performance. To optimize the retrieval performance, we advocate a tiered retrieval mechanism, blending both coarse-grain and fine-grain paradigms. Moreover, to enrich the contextual relationship within the query tuple, we integrate a generative cross-modal alignment technique, ensuring synchronization of sequential attributes between image and text data. 2024.findings-acl.965 @@ -19241,7 +19241,7 @@ <fixed-case>DMIN</fixed-case>: A Discourse-specific Multi-granularity Integration Network for Conversational Aspect-based Sentiment Quadruple Analysis PeijieHuangSouth China Agricultural University XishengXiaoSouth China Agricultural University - YuhongXuSouth China Agricultural University + YuhongXuSouth China Agricultural University JiaweiChen 16326-16338 Conversational Aspect-based Sentiment Quadruple Analysis (DiaASQ) aims to extract fine-grained sentiment quadruples from dialogues. Previous research has primarily concentrated on enhancing token-level interactions, still lacking in sufficient modeling of the discourse structure information in dialogue. Firstly, it does not incorporate interactions among different utterances in the encoding stage, resulting in a limited token-level context understanding for subsequent modules. Secondly, it ignores the critical fact that discourse information is naturally organized at the utterance level and learning it solely at the token level is incomplete. 
In this work, we strengthen the token-level encoder by utilizing a discourse structure called “thread” and graph convolutional networks to enhance the token interaction among different utterances. Moreover, we propose an utterance-level encoder to learn the structured speaker and reply information, providing a macro understanding of dialogue discourse. Furthermore, we introduce a novel Multi-granularities Integrator to integrate token-level and utterance-level representations, resulting in a comprehensive and cohesive dialogue contextual understanding. Experiments on two datasets demonstrate that our model achieves state-of-the-art performance. Our codes are publicly available at https://github.com/SIGSDSscau/DMIN. @@ -19252,7 +19252,7 @@ Are Decoder-Only Language Models Better than Encoder-Only Language Models in Understanding Word Meaning? Muhammad RezaQorib - GeonsikMoon + GeonsikMoon Hwee TouNg 16339-16347 The natural language processing field has been evolving around language models for the past few years, from the usage of n-gram language models for re-ranking, to transfer learning with encoder-only (BERT-like) language models, and finally to large language models (LLMs) as general solvers. LLMs are dominated by the decoder-only type, and they are popular for their efficacy in numerous tasks. LLMs are regarded as having strong comprehension abilities and strong capabilities to solve new unseen tasks. As such, people may quickly assume that decoder-only LLMs always perform better than the encoder-only ones, especially for understanding word meaning. In this paper, we demonstrate that decoder-only LLMs perform worse on word meaning comprehension than an encoder-only language model that has vastly fewer parameters. 
@@ -19274,11 +19274,11 @@ On the Robustness of Document-Level Relation Extraction Models to Entity Name Variations ShiaoMengTsinghua University - XumingHuThe Hong Kong University of Science and Technology (Guangzhou) and Hong Kong University of Science and Technology + XumingHuThe Hong Kong University of Science and Technology (Guangzhou) and Hong Kong University of Science and Technology AiweiLiuTsinghua University, Tsinghua University FukunMaTsinghua University, Tsinghua University YawenYangTsinghua University, Tsinghua University - ShuangLiTencent + ShuangLiTencent LijieWenSchool of Software, Tsinghua University 16362-16374 Driven by the demand for cross-sentence and large-scale relation extraction, document-level relation extraction (DocRE) has attracted increasing research interest. Despite the continuous improvement in performance, we find that existing DocRE models which initially perform well may make more mistakes when merely changing the entity names in the document, hindering the generalization to novel entity names. To this end, we systematically investigate the robustness of DocRE models to entity name variations in this work. We first propose a principled pipeline to generate entity-renamed documents by replacing the original entity names with names from Wikidata. By applying the pipeline to DocRED and Re-DocRED datasets, we construct two novel benchmarks named Env-DocRED and Env-Re-DocRED for robustness evaluation. Experimental results show that both three representative DocRE models and two in-context learned large language models consistently lack sufficient robustness to entity name variations, particularly on cross-sentence relation instances and documents with more entities. Finally, we propose an entity variation robust training method which not only improves the robustness of DocRE models but also enhances their understanding and reasoning capabilities. 
We further verify that the basic idea of this method can be effectively transferred to in-context learning for DocRE as well. @@ -19288,7 +19288,7 @@ <fixed-case>RESEMO</fixed-case>: A Benchmark <fixed-case>C</fixed-case>hinese Dataset for Studying Responsive Emotion from Social Media Content - BoHuUniversity of Science and Technology of China + BoHuUniversity of Science and Technology of China MengZhang ChenfeiXieUniversity of Science and Technology of China YuanheTianUniversity of Washington, Seattle @@ -19315,7 +19315,7 @@ <fixed-case>KEEP</fixed-case> <fixed-case>CHATTING</fixed-case>! An Attractive Dataset for Continuous Conversation Agents YiheWang - JinLiuWuhan University + JinLiuWuhan University YaoWanHuazhong University of Science and Technology YitongLiHuawei Technologies Co., Ltd. ZifengLiu @@ -19329,10 +19329,10 @@ <fixed-case>R</fixed-case>e<fixed-case>P</fixed-case>air: Automated Program Repair with Process-based Feedback YuzeZhaoUniversity of Science and Technology of China - ZhenyaHuangUniversity of Science and Technology of China - YixiaoMaUniversity of Science and Technology of China - RuiLi - KaiZhang + ZhenyaHuangUniversity of Science and Technology of China + YixiaoMaUniversity of Science and Technology of China + RuiLi + KaiZhang HaoJiangUniversity of Science and Technology of China QiLiuUniversity of Science and Technology of China LinboZhu @@ -19348,7 +19348,7 @@ YangXuHarbin Institute of Technology YunlongFeng HonglinMuHarbin Institute Of Technology - YutaiHou + YutaiHou YitongLiHuawei Technologies Co., Ltd. XinghaoWang WanjunZhongByteDance Inc. 
@@ -19369,7 +19369,7 @@ JialiChengUniversity of Massachusetts at Lowell NidhiVakilUniversity of Massachusetts, Lowell HadiAmiriUniversity of Massachusetts Lowell - Leo AnthonyCeliMassachusetts Institute of Technology and Beth Israel Deaconess Medical Center + Leo AnthonyCeliMassachusetts Institute of Technology and Beth Israel Deaconess Medical Center 16442-16455 Medical decisions directly impact individuals’ health and well-being. Extracting decision spans from clinical notes plays a crucial role in understanding medical decision-making processes. In this paper, we develop a new dataset called “MedDec,” which contains clinical notes of eleven different phenotypes (diseases) annotated by ten types of medical decisions. We introduce the task of medical decision extraction, aiming to jointly extract and classify different types of medical decisions within clinical notes. We provide a comprehensive analysis of the dataset, develop a span detection model as a baseline for this task, evaluate recent span detection approaches, and employ a few metrics to measure the complexity of data samples. Our findings shed light on the complexities inherent in clinical decision extraction and enable future work in this area of research. The dataset and code are available through https://github.com/CLU-UML/MedDec. 
2024.findings-acl.975

From 24518e9ffb1cac9ce570f46381346e4019efaa9d Mon Sep 17 00:00:00 2001
From: Matt Post
Date: Sat, 20 Sep 2025 20:25:45 -0400
Subject: [PATCH 2/7] Take volume ID from file if not specified

---
 bin/ingest_orcids.py | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/bin/ingest_orcids.py b/bin/ingest_orcids.py
index 00ae547530..ab4e38b9a7 100755
--- a/bin/ingest_orcids.py
+++ b/bin/ingest_orcids.py
@@ -69,11 +69,11 @@ def parse_paper_yaml(paper_path: str) -> List[Dict[str, str]]:
 @click.argument(
     'full_volume_id',
     type=str,
-    required=True,
+    required=False,
 )
 def main(
     paper_yaml: str,
-    full_volume_id: str,
+    full_volume_id: str = None,
 ):
     anthology_datadir = Path(sys.argv[0]).parent / ".." / "data"
     # anthology = Anthology(
@@ -86,6 +86,10 @@ def main(
     # people = AnthologyIndex(srcdir=anthology_datadir)
     # people.bibkeys = load_bibkeys(anthology_datadir)
 
+    if full_volume_id is None:
+        full_volume_id = Path(paper_yaml).name.replace(".yaml", "")
+        print(f"Taking full volume ID from file name: {full_volume_id}", file=sys.stderr)
+
     # Load the papers.yaml file, skipping non-archival papers
     papers = [p for p in parse_paper_yaml(paper_yaml) if p["archival"]]
     # print(f"Found {len(papers)} archival papers", file=sys.stderr)

From 91037651de90600a744ab175b4dc3dadf0e32f44 Mon Sep 17 00:00:00 2001
From: Matt Post
Date: Sat, 20 Sep 2025 20:29:56 -0400
Subject: [PATCH 3/7] Add ORCID iDs for 2024.acl-demos

---
 data/xml/2024.acl.xml | 110 +++++++++++++++++++++---------------------
 1 file changed, 55 insertions(+), 55 deletions(-)

diff --git a/data/xml/2024.acl.xml b/data/xml/2024.acl.xml
index 4175361f2e..311ecc8ae2 100644
--- a/data/xml/2024.acl.xml
+++ b/data/xml/2024.acl.xml
@@ -13262,7 +13262,7 @@ <fixed-case>O</fixed-case>pen<fixed-case>VNA</fixed-case>: A Framework for Analyzing the Behavior of Multimodal Language Understanding System under Noisy Scenarios - ZiqiYuan + ZiqiYuan BaozhengZhang HuaXuTsinghua University, Tsinghua
University ZhiyunLiang @@ -13278,7 +13278,7 @@ HaoFeiNational University of Singapore MeishanZhangHarbin Institute of Technology (Shenzhen), China and Tianjin University, China MinZhangHarbin Institute of Technology, Shenzhen - Tat-SengChuaNational University of Singapore + Tat-SengChuaNational University of Singapore 19-30 Structured Natural Language Processing (XNLP) is an important subset of NLP that entails understanding the underlying semantic or syntactic structure of texts, which serves as a foundational component for many downstream applications. Despite certain recent efforts to explore universal solutions for specific categories of XNLP tasks, a comprehensive and effective approach for unifying all XNLP tasks long remains underdeveloped. Meanwhile, while XNLP demonstration systems are vital for researchers exploring various XNLP tasks, existing platforms can be limited to, e.g., supporting few XNLP tasks, lacking interactivity and universalness. To this end, we propose an advanced XNLP demonstration system, where we leverage LLM to achieve universal XNLP, with one model for all with high generalizability. Overall, our system advances in multiple aspects, including universal XNLP modeling, high performance, interpretability, scalability, and interactivity, offering a unified platform for exploring diverse XNLP tasks in the community. 2024.acl-demos.3 @@ -13288,7 +13288,7 @@ Towards the <fixed-case>T</fixed-case>op<fixed-case>M</fixed-case>ost: A Topic Modeling System Toolkit XiaobaoWuNanyang Technological University - FengjunPan + FengjunPan Anh TuanLuuNanyang Technological University 31-41 Topic models have a rich history with various applications and have recently been reinvigorated by neural topic modeling. However, these numerous topic models adopt totally distinct datasets, implementations, and evaluations. This impedes quick utilization and fair comparisons, and thereby hinders their research progress and applications. 
To tackle this challenge, we in this paper propose a Topic Modeling System Toolkit (TopMost). Compared to existing toolkits, TopMost stands out by supporting more extensive features. It covers a broader spectrum of topic modeling scenarios with their complete lifecycles, including datasets, preprocessing, models, training, and evaluations. Thanks to its highly cohesive and decoupled modular design, TopMost enables rapid utilization, fair comparisons, and flexible extensions of diverse cutting-edge topic models. Our code, tutorials, and documentation are available at https://github.com/bobxwu/topmost. @@ -13298,7 +13298,7 @@ Wordflow: Social Prompt Engineering for Large Language Models - ZijieWangGeorgia Institute of Technology + ZijieWangGeorgia Institute of Technology AishwaryaChakravarthy DavidMunechika Duen HorngChauGeorgia Institute of Technology @@ -13311,7 +13311,7 @@ <fixed-case>LM</fixed-case> Transparency Tool: Interactive Tool for Analyzing Transformer Language Models IgorTufanovFacebook - KarenHambardzumyanFacebook and University College London, University of London + KarenHambardzumyanFacebook and University College London, University of London JavierFerrando ElenaVoitaFAIR at Meta AI and University of Amsterdam 51-60 @@ -13326,8 +13326,8 @@ HanZhangXidian University BinWang LiziLiaoSingapore Management University - QianLiuUniversity of Auckland - ErikCambriaNanyang Technological University + QianLiuUniversity of Auckland + ErikCambriaNanyang Technological University 61-71 This paper introduces EmpathyEar, a pioneering open-source, avatar-based multimodal empathetic chatbot, to fill the gap in traditional text-only empathetic response generation (ERG) systems. 
Leveraging the advancements of a large language model, combined with multimodal encoders and generators, EmpathyEar supports user inputs in any combination of text, sound, and vision, and produces multimodal empathetic responses, offering users, not just textual responses but also digital avatars with talking faces and synchronized speeches. A series of emotion-aware instruction-tuning is performed for comprehensive emotional understanding and generation capabilities. In this way, EmpathyEar provides users with responses that achieve a deeper emotional resonance, closely emulating human-like empathy. The system paves the way for the next emotional intelligence, for which we open-source the code for public access. 2024.acl-demos.7 @@ -13336,8 +13336,8 @@ <fixed-case>O</fixed-case>pen<fixed-case>W</fixed-case>eb<fixed-case>A</fixed-case>gent: An Open Toolkit to Enable Web Agents on Large Language Models - Iat LongIong - XiaoLiu + Iat LongIong + XiaoLiu YuxuanChen HanyuLai ShuntianYaoBeijing University of Posts and Telecommunications @@ -13353,13 +13353,13 @@ <fixed-case>E</fixed-case>asy<fixed-case>E</fixed-case>dit: An Easy-to-use Knowledge Editing Framework for Large Language Models - PengWang + PengWang NingyuZhangZhejiang University BozhongTian ZekunXi YunzhiYao ZiwenXuZhejiang University - MengruWangZhejiang University + MengruWangZhejiang University ShengyuMao XiaohanWangZhejiang University SiyuanCheng @@ -13375,7 +13375,7 @@ <fixed-case>E</fixed-case>asy<fixed-case>I</fixed-case>nstruct: An Easy-to-use Instruction Processing Framework for Large Language Models - YixinOu + YixinOu NingyuZhangZhejiang University HonghaoGui ZiwenXuZhejiang University @@ -13398,7 +13398,7 @@ YuyangHuang ZixunLuUniversity of Southern California TianliTong - JonathanMayUniversity of Southern California and USC/ISI + JonathanMayUniversity of Southern California and USC/ISI 107-116 Following the rapid progress in natural language processing (NLP) models, language models are applied to 
increasingly more complex interactive tasks such as negotiations and conversation moderations. Having human evaluators directly interact with these NLP models is essential for adequately evaluating the performance on such interactive tasks. We develop BotEval, an easily customizable, open-source, evaluation toolkit that focuses on enabling human-bot interactions as part of the evaluation process, as opposed to human evaluators making judgements for a static input. BotEval balances flexibility for customization and user-friendliness by providing templates for common use cases that span various degrees of complexity and built-in compatibility with popular crowdsourcing platforms.We showcase the numerous useful features of BotEval through a study that evaluates the performance of various chatbots on their effectiveness for conversational moderation and discuss how BotEval differs from other annotation tools. 2024.acl-demos.11 @@ -13407,9 +13407,9 @@ <fixed-case>G</fixed-case>en<fixed-case>GO</fixed-case>: <fixed-case>ACL</fixed-case> Paper Explorer with Semantic Features - SotaroTakeshitaUniversität Mannheim - SimonePonzettoUniversity of Mannheim - KaiEckertMannheim University of Applied Sciences + SotaroTakeshitaUniversität Mannheim + SimonePonzettoUniversity of Mannheim + KaiEckertMannheim University of Applied Sciences 117-126 We present GenGO, a system for exploring papers published in ACL conferences. Paper data stored in our database is enriched with multi-aspect summaries, extracted named entities, a field of study label, and text embeddings by our data processing pipeline. These metadata are used in our web-based user interface to enable researchers to quickly find papers relevant to their interests, and grasp an overview of papers without reading full-text of papers. To make GenGO to be available online as long as possible, we design GenGO to be simple and efficient to reduce maintenance and financial costs. 
In addition, the modularity of our data processing pipeline lets developers easily extend it to add new features. We make our code available to foster open development and transparency: https://gengo.sotaro.io. 2024.acl-demos.12 @@ -13418,8 +13418,8 @@ <fixed-case>NLP</fixed-case>-<fixed-case>KG</fixed-case>: A System for Exploratory Search of Scientific Literature in Natural Language Processing - TimSchopf - FlorianMatthesTechnische Universität München + TimSchopf + FlorianMatthesTechnische Universität München 127-135 Scientific literature searches are often exploratory, whereby users are not yet familiar with a particular field or concept but are interested in learning more about it. However, existing systems for scientific literature search are typically tailored to keyword-based lookup searches, limiting the possibilities for exploration. We propose NLP-KG, a feature-rich system designed to support the exploration of research literature in unfamiliar natural language processing (NLP) fields. In addition to a semantic search, NLP-KG allows users to easily find survey papers that provide a quick introduction to a field of interest. Further, a Fields of Study hierarchy graph enables users to familiarize themselves with a field and its related areas. Finally, a chat interface allows users to ask questions about unfamiliar concepts or specific articles in NLP and obtain answers grounded in knowledge retrieved from scientific publications. Our system provides users with comprehensive exploration possibilities, supporting them in investigating the relationships between different fields, understanding unfamiliar concepts in NLP, and finding relevant research literature. Demo, video, and code are available at: https://github.com/NLP-Knowledge-Graph/NLP-KG-WebApp. 
2024.acl-demos.13 @@ -13439,9 +13439,9 @@ <fixed-case>JORA</fixed-case>: <fixed-case>JAX</fixed-case> Tensor-Parallel <fixed-case>L</fixed-case>o<fixed-case>RA</fixed-case> Library for Retrieval Augmented Fine-Tuning - AniqueTahirArizona State University - LuChengUniversity of Illinois at Chicago - HuanLiuArizona State University + AniqueTahirArizona State University + LuChengUniversity of Illinois at Chicago + HuanLiuArizona State University 152-159 The scaling of Large Language Models (LLMs) for retrieval-based tasks, particularly in Retrieval Augmented Generation (RAG), faces significant memory constraints, especially when fine-tuning extensive prompt sequences. Current open-source libraries support full-model inference and fine-tuning across multiple GPUs but fall short of accommodating the efficient parameter distribution required for retrieved context. Addressing this gap, we introduce a novel framework for PEFT-compatible fine-tuning of GPT models, leveraging distributed training. Our framework uniquely utilizes JAX’s just-in-time (JIT) compilation and tensor-sharding for efficient resource management, thereby enabling accelerated fine-tuning with reduced memory requirements. This advancement significantly improves the scalability and feasibility of fine-tuning LLMs for complex RAG applications, even on systems with limited GPU resources. Our experiments show more than 12x improvement in runtime compared to Hugging Face/DeepSpeed implementation with four GPUs while consuming less than half the VRAM per GPU. 
2024.acl-demos.15 @@ -13464,7 +13464,7 @@ <fixed-case>IMGTB</fixed-case>: A Framework for Machine-Generated Text Detection Benchmarking MichalSpiegelKempelen Institute of Intelligent Technologies - DominikMackoKempelen Institute of Intelligent Technologies + DominikMackoKempelen Institute of Intelligent Technologies 172-179 In the era of large language models generating high quality texts, it is a necessity to develop methods for detection of machine-generated text to avoid their harmful use or simply for annotation purposes. It is, however, also important to properly evaluate and compare such developed methods. Recently, a few benchmarks have been proposed for this purpose; however, integration of newest detection methods is rather challenging, since new methods appear each month and provide slightly different evaluation pipelines.In this paper, we present the IMGTB framework, which simplifies the benchmarking of machine-generated text detection methods by easy integration of custom (new) methods and evaluation datasets. In comparison to existing frameworks, it enables to objectively compare statistical metric-based zero-shot detectors with classification-based detectors and with differently fine-tuned detectors. Its configurability and flexibility makes research and development of new detection methods easier, especially their comparison to the existing state-of-the-art detectors. The default set of analyses, metrics and visualizations offered by the tool follows the established practices of machine-generated text detection benchmarking found in state-of-the-art literature. 
2024.acl-demos.17 @@ -13475,9 +13475,9 @@ <fixed-case>D</fixed-case>rug<fixed-case>W</fixed-case>atch: A Comprehensive Multi-Source Data Visualisation Platform for Drug Safety Information ArtemBobrovKing’s College London, University of London DomantasSaltenis - ZhaoyueSunUniversity of Warwick + ZhaoyueSunUniversity of Warwick GabrielePergolaUniversity of Warwick - YulanHeKing’s College London, University of London + YulanHeKing’s College London, University of London 180-189 Drug safety research is crucial for maintaining public health, often requiring comprehensive data support. However, the resources currently available to the public are limited and fail to provide a comprehensive understanding of the relationship between drugs and their side effects. This paper introduces “DrugWatch”, an easy-to-use and interactive multi-source information visualisation platform for drug safety study. It allows users to understand common side effects of drugs and their statistical information, flexibly retrieve relevant medical reports, or annotate their own medical texts with our automated annotation tool. Supported by NLP technology and enriched with interactive visual components, we are committed to providing researchers and practitioners with a one-stop information analysis, retrieval, and annotation service. The demonstration video is available at https://www.youtube.com/watch?v=RTqDgxzETjw. We also deployed an online demonstration system at https://drugwatch.net/. 2024.acl-demos.18 @@ -13491,7 +13491,7 @@ JiaxuanLiTianjin University RenrenJin YufeiHuang - LingShi + LingShi JunhuiZhang XinmengJi TingtingCui @@ -13499,7 +13499,7 @@ JinwangSong HongyingZanZhengzhou University SunLiChina Academy of Information and Communications Technology - DeyiXiongTianjin University + DeyiXiongTianjin University 190-210 The rapid development of Chinese large language models (LLMs) poses big challenges for efficient LLM evaluation. 
While current initiatives have introduced new benchmarks or evaluation platforms for assessing Chinese LLMs, many of these focus primarily on capabilities, usually overlooking potential alignment and safety issues. To address this gap, we introduce OpenEval, an evaluation testbed that benchmarks Chinese LLMs across capability, alignment and safety. For capability assessment, we include 12 benchmark datasets to evaluate Chinese LLMs from 4 sub-dimensions: NLP tasks, disciplinary knowledge, commonsense reasoning and mathematical reasoning. For alignment assessment, OpenEval contains 7 datasets that examines the bias, offensiveness and illegalness in the outputs yielded by Chinese LLMs. To evaluate safety, especially anticipated risks (e.g., power-seeking, self-awareness) of advanced LLMs, we include 6 datasets. In addition to these benchmarks, we have implemented a phased public evaluation and benchmark update strategy to ensure that OpenEval is in line with the development of Chinese LLMs or even able to provide cutting-edge benchmark datasets to guide the development of Chinese LLMs. In our first public evaluation, we have tested a range of Chinese LLMs, spanning from 7B to 72B parameters, including both open-source and proprietary models. Evaluation results indicate that while Chinese LLMs have shown impressive performance in certain tasks, more attention should be directed towards broader aspects such as commonsense reasoning, alignment, and safety. 2024.acl-demos.19 @@ -13550,7 +13550,7 @@ HanghaoWu JiajieZhangNortheastern University XuHanTsinghua University, Tsinghua University - ZhiyuanLiuTsinghua University + ZhiyuanLiuTsinghua University MaosongSun 247-257 Evaluation is pivotal for honing Large Language Models (LLMs), pinpointing their capabilities and guiding enhancements. The rapid development of LLMs calls for a lightweight and easy-to-use framework for swift evaluation deployment. 
However, due to the various implementation details to consider, developing a comprehensive evaluation platform is never easy. Existing platforms are often complex and poorly modularized, hindering seamless incorporation into researcher’s workflows. This paper introduces UltraEval, a user-friendly evaluation framework characterized by lightweight, comprehensiveness, modularity, and efficiency. We identify and reimplement three core components of model evaluation (models, data, and metrics). The resulting composability allows for the free combination of different models, tasks, prompts, and metrics within a unified evaluation workflow. Additionally, UltraEval supports diverse models owing to a unified HTTP service and provides sufficient inference acceleration. @@ -13561,9 +13561,9 @@ <fixed-case>P</fixed-case>y<fixed-case>F</fixed-case>oma: a Python finite-state compiler module MansHuldenUniversity of Colorado at Boulder - MichaelGinnUniversity of Colorado at Boulder + MichaelGinnUniversity of Colorado at Boulder MiikkaSilfverbergUniversity of British Columbia - MichaelHammondUniversity of Arizona + MichaelHammondUniversity of Arizona 258-265 We describe PyFoma, an open-source Python module for constructing weighted and unweighted finite-state transducers and automata from regular expressions, string rewriting rules, right-linear grammars, or low-level state/transition manipulation. A large variety of standard algorithms for working with finite-state machines is included, with a particular focus on the needs of linguistic and NLP applications. The data structures and code in the module are designed for legibility to allow for potential use in teaching the theory and algorithms associated with finite-state machines. 
2024.acl-demos.24 @@ -13581,7 +13581,7 @@ KaihuaZhu SiliangXu ShizheDiaoHong Kong University of Science and Technology - TongZhangUIUC + TongZhangUIUC 266-277 The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. Then sources’ credibility is leveraged for information verification. Besides determining the veracity of news, we also provide transparent evidence and reasoning to support its conclusions, resulting in the interpretability and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection. 2024.acl-demos.25 @@ -13591,7 +13591,7 @@ string2string: A Modern Python Library for String-to-String Algorithms MiracSuzgunStanford University - StuartShieberHarvard University + StuartShieberHarvard University DanJurafskyStanford University 278-285 We introduce **string2string**, an open-source library that offers a comprehensive suite of efficient algorithms for a broad range of string-to-string problems. 
It includes traditional algorithmic solutions as well as recent advanced neural approaches to tackle various problems in string alignment, distance measurement, lexical and semantic search, and similarity analysis, along with several helpful visualization tools and metrics to facilitate the interpretation and analysis of these methods. Notable algorithms featured in the library include the Smith-Waterman algorithm for pairwise local alignment, the Hirschberg algorithm for global alignment, the Wagner-Fischer algorithm for edit distance, BARTScore and BERTScore for similarity analysis, the Knuth-Morris-Pratt algorithm for lexical search, and Faiss for semantic search. In addition, it wraps existing efficient and widely-used implementations of certain frameworks and metrics, such as sacreBLEU and ROUGE. Overall, the library aims to provide extensive coverage and increased flexibility in comparison to existing libraries for strings. It can be used for many downstream applications, tasks, and problems in natural-language processing, bioinformatics, and computational social sciences. It is implemented in Python, easily installable via pip, and accessible through a simple API. 
Source code, documentation, and tutorials are all available on our GitHub page: https://github.com/stanfordnlp/string2string* Documentation: https://string2string.readthedocs.io/en/latest/* GitHub page: https://github.com/stanfordnlp/string2string* Short video: https://drive.google.com/file/d/1IT-pBACDVUoEHewk__5Pz5mU5oAMq5k_/view?usp=sharing @@ -13626,13 +13626,13 @@ ChenhuiShenNational University of Singapore Yew KenChia XingxuanLi - JianyuWangAlibaba DAMO Academy + JianyuWangAlibaba DAMO Academy QingyuTannational university of singaore, National University of Singapore LiyingCheng GuanzhengChen - YueDengSchool of Computer Science and Engineering, Nanyang Technological University + YueDengSchool of Computer Science and Engineering, Nanyang Technological University SenYangThe Chinese University of Hong Kong - ChaoqunLiu + ChaoqunLiu HangZhang LidongBingAlibaba Group 294-304 @@ -13659,7 +13659,7 @@ LeiZang JiaotuanWang ChenyiZhuang - JinjieGu + JinjieGu 315-325 Automatic Chinese classical poetry generation has attracted much research interest, but achieving effective control over format and content simultaneously remains challenging. Traditional systems usually accept keywords as user inputs, resulting in limited control over content. Large language models (LLMs) improve content control by allowing unrestricted user instructions, but the token-by-token generation process frequently makes format errors. Motivated by this, we propose CharPoet, a Chinese classical poetry generation system based on token-free LLM, which provides effective control over both format and content. Our token-free architecture generates in a character-by-character manner, enabling precise control over the number of characters. 
Pruned from existing token-based LLMs, CharPoet inherits their pretrained capabilities and can generate poetry following instructions like “Write me a poem for my mother’s birthday.” CharPoet achieves format accuracy above 0.96, outperforming Jiuge-GPT-2 (0.91) and GPT-4 (0.38). In terms of content quality, CharPoet surpasses traditional systems including Jiuge, and is comparable to other LLMs. Our system is open source and available at https://modelscope.cn/models/CharPoet/CharPoet. A video demonstration of CharPoet is available at https://youtu.be/voZ25qEp3Dc. 2024.acl-demos.30 @@ -13669,9 +13669,9 @@ <fixed-case>ITAKE</fixed-case>: Interactive Unstructured Text Annotation and Knowledge Extraction System with <fixed-case>LLM</fixed-case>s and <fixed-case>M</fixed-case>odel<fixed-case>O</fixed-case>ps JiaheSong - HongxinDing - ZhiyuanWang - YongxinXu + HongxinDing + ZhiyuanWang + YongxinXu YashaWang JunfengZhaoPeking University 326-334 @@ -13690,7 +13690,7 @@ YugeTu PengkaiLiCentral South University LeiShi - ZhiyuanLiuTsinghua University + ZhiyuanLiuTsinghua University MaosongSun 335-345 Despite advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), their integration into language-grounded, human-like embodied agents remains incomplete, hindering complex real-life task performance in 3D environments. Existing integrations often feature limited open-sourcing, challenging collective progress in this field. We introduce LEGENT, an open, scalable platform for developing embodied agents using LLMs and LMMs. LEGENT offers a dual approach: a rich 3D environment with interactive, communicable, and actionable agents, paired with a user-friendly interface, and a sophisticated data generation pipeline utilizing advanced algorithms to exploit supervision from simulated worlds at scale. 
In our experiments, an embryonic vision-language-action model trained on LEGENT-generated data surpasses GPT-4V in embodied tasks, showcasing promising generalization capabilities. The demo video is available at the following link https://video.legent.ai. @@ -13701,8 +13701,8 @@ Variationist: Exploring Multifaceted Variation and Bias in Written Language Data AlanRamponiFondazione Bruno Kessler - CamillaCasulaUniversity of Trento and Fondazione Bruno Kessler - StefanoMenini + CamillaCasulaUniversity of Trento and Fondazione Bruno Kessler + StefanoMenini 346-354 Exploring and understanding language data is a fundamental stage in all areas dealing with human language. It allows NLP practitioners to uncover quality concerns and harmful biases in data before training, and helps linguists and social scientists to gain insight into language use and human behavior. Yet, there is currently a lack of a unified, customizable tool to seamlessly inspect and visualize language variation and bias across multiple variables, language units, and diverse metrics that go beyond descriptive statistics. In this paper, we introduce Variationist, a highly-modular, extensible, and task-agnostic tool that fills this gap. Variationist handles at once a potentially unlimited combination of variable types and semantics across diversity and association metrics with regards to the language unit of choice, and orchestrates the creation of up to five-dimensional interactive charts for over 30 variable type-semantics combinations. Through our case studies on computational dialectology, human label variation, and text generation, we show how Variationist enables researchers from different disciplines to effortlessly answer specific research questions or unveil undesired associations in language data. A Python library, code, documentation, and tutorials are made publicly available to the research community. 
2024.acl-demos.33 @@ -13711,10 +13711,10 @@ An <fixed-case>LLM</fixed-case>-based Knowledge Synthesis and Scientific Reasoning Framework for Biomedical Discovery - OskarWysocki + OskarWysocki Magdalena.wysocka@cruk.manchester.ac.ukMagdalena.wysocka@cruk.manchester.ac.ukNA - DaniloCarvalhoUniversity of Manchester - AlexBogatu + DaniloCarvalhoUniversity of Manchester + AlexBogatu Danilo.miranda@idiap.chDanilo.miranda@idiap.chNA Maxime.delmas@idiap.chMaxime.delmas@idiap.chNA Harriet.unsworth@cruk.manchester.ac.ukHarriet.unsworth@cruk.manchester.ac.ukNA @@ -13761,7 +13761,7 @@ XiaoxueCheng GeyangGuoGeorgia Institute of Technology HanPeng - BowenZhengRenmin University of China + BowenZhengRenmin University of China YiruTang YingqianMin YushuoChen @@ -13774,7 +13774,7 @@ JunyiLi KunZhouRenmin University of China XinZhaoRenmin University of China - Ji-RongWenRenmin University of China + Ji-RongWenRenmin University of China 388-399 To facilitate the research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. This library is featured with three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) more practical consideration, especially on user-friendliness and efficiency. With our library, users can easily reproduce existing methods, train new models, and conduct comprehensive performance comparisons. To rigorously test LLMBox, we conduct extensive experiments in a diverse coverage of evaluation settings, and experimental results demonstrate the effectiveness and efficiency of our library in supporting various implementations related to LLMs. The detailed introduction and usage guidance can be found at https://github.com/RUCAIBox/LLMBox. 
2024.acl-demos.37 @@ -13783,8 +13783,8 @@ <fixed-case>L</fixed-case>lama<fixed-case>F</fixed-case>actory: Unified Efficient Fine-Tuning of 100+ Language Models - YaoweiZheng - RichongZhang + YaoweiZheng + RichongZhang JunhaoZhang YanhanYe ZheyanLuo @@ -13839,7 +13839,7 @@ Topic Modeling for Short Texts with Large Language Models TomokiDoi MasaruIsonuma - HitomiYanakathe University of Tokyo + HitomiYanakathe University of Tokyo 21-33 As conventional topic models rely on word co-occurrence to infer latent topics, topic modeling for short texts has been a long-standing challenge. Large Language Models (LLMs) can potentially overcome this challenge by contextually learning the meanings of words via pretraining. In this paper, we study two approaches to using LLMs for topic modeling: parallel prompting and sequential prompting. Input length limitations prevent LLMs from processing many texts at once. However, an arbitrary number of texts can be handled by LLMs by splitting the texts into smaller subsets and processing them in parallel or sequentially. Our experimental results demonstrate that our methods can identify more coherent topics than existing ones while maintaining the diversity of the induced topics. Furthermore, we found that the inferred topics cover the input texts to some extent, while hallucinated topics are hardly generated. 2024.acl-srw.3 @@ -13908,7 +13908,7 @@ Fine-Tuning <fixed-case>ASR</fixed-case> models for Very Low-Resource Languages: A Study on Mvskoke - JuliaMainzinger + JuliaMainzinger Gina-AnneLevowUniversity of Washington and University of Washington 76-82 Recent advancements in multilingual models for automatic speech recognition (ASR) have been able to achieve a high accuracy for languages with extremely limited resources. This study examines ASR modeling for the Mvskoke language, an indigenous language of America. 
The parameter efficiency of adapter training is contrasted with training entire models, and it is demonstrated how performance varies with different amounts of data. Additionally, the models are evaluated with trigram language model decoding, and the outputs are compared across different types of speech recordings. Results show that training an adapter is both parameter efficient and gives higher accuracy for a relatively small amount of data. @@ -14020,8 +14020,8 @@ Action Inference for Destination Prediction in Vision-and-Language Navigation AnirudhKondapallyHonda R&D Co., Ltd. - KentaroYamadaHonda R&D Co., Ltd. - HitomiYanakathe University of Tokyo + KentaroYamadaHonda R&D Co., Ltd. + HitomiYanakathe University of Tokyo 192-199 Vision-and-Language Navigation (VLN) encompasses interacting with autonomous vehicles using language and visual input from the perspective of mobility.Most of the previous work in this field focuses on spatial reasoning and the semantic grounding of visual information.However, reasoning based on the actions of pedestrians in the scene is not much considered.In this study, we provide a VLN dataset for destination prediction with action inference to investigate the extent to which current VLN models perform action inference.We introduce a crowd-sourcing process to construct a dataset for this task in two steps: (1) collecting beliefs about the next action for a pedestrian and (2) annotating the destination considering the pedestrian’s next action.Our benchmarking results of the models on destination prediction lead us to believe that the models can learn to reason about the effect of the action and the next action on the destination to a certain extent.However, there is still much scope for improvement. 2024.acl-srw.26 @@ -14054,7 +14054,7 @@ Compromesso! 
<fixed-case>I</fixed-case>talian Many-Shot Jailbreaks undermine the safety of Large Language Models FabioPernisi DirkHovy - PaulRöttger + PaulRöttger 245-251 As diverse linguistic communities and users adopt Large Language Models (LLMs), assessing their safety across languages becomes critical. Despite ongoing efforts to align these models with safe and ethical guidelines, they can still be induced into unsafe behavior with jailbreaking, a technique in which models are prompted to act outside their operational guidelines. What research has been conducted on these vulnerabilities was predominantly on English, limiting the understanding of LLM behavior in other languages. We address this gap by investigating Many-Shot Jailbreaking (MSJ) in Italian, underscoring the importance of understanding LLM behavior in different languages. We base our analysis on a newly created Italian dataset to identify unique safety vulnerabilities in 4 families of open-source LLMs.We find that the models exhibit unsafe behaviors even with minimal exposure to harmful prompts, and–more alarmingly–this tendency rapidly escalates with more demonstrations. 2024.acl-srw.29 @@ -14065,7 +14065,7 @@ <fixed-case>V</fixed-case>i<fixed-case>M</fixed-case>ed<fixed-case>AQA</fixed-case>: A <fixed-case>V</fixed-case>ietnamese Medical Abstractive Question-Answering Dataset and Findings of Large Language Model Minh-NamTran Phu-VinhNguyen - LongNguyenHo Chi Minh city University of Science, Vietnam National University + LongNguyenHo Chi Minh city University of Science, Vietnam National University DienDinh 252-260 Question answering involves creating answers to questions. With the growth of large language models, the ability of question-answering systems has dramatically improved. However, there is a lack of Vietnamese abstractive question-answering datasets, especially in the medical domain. Therefore, this research aims to mitigate this gap by introducing ViMedAQA. 
This **Vi**etnamese **Med**ical **A**bstractive **Q**uestion-**A**nswering dataset covers four topics in the Vietnamese medical domain, including body parts, disease, drugs and medicine. Additionally, the empirical results on the proposed dataset examine the capability of the large language models in the Vietnamese medical domain, including reasoning, memorizing and awareness of essential information. @@ -14102,7 +14102,7 @@ <fixed-case>H</fixed-case>omophone2<fixed-case>V</fixed-case>ec: Embedding Space Analysis for Empirical Evaluation of Phonological and Semantic Similarity SophieWuMcGill University AnitaZhengMcGill University - JoeyChuangMcGill University + JoeyChuangMcGill University 287-292 This paper introduces a novel method for empirically evaluating the relationship between the phonological and semantic similarity of linguistic units using embedding spaces. Chinese character homophones are used as a proof-of-concept. We employ cosine similarity as a proxy for semantic similarity between characters, and compare relationships between phonologically-related characters and baseline characters (chosen as similar-frequency characters). We show there is a strongly statistically significant positive semantic relationship among different Chinese characters at varying levels of sound-sharing. We also perform some basic probing using t-SNE and UMAP visualizations, and indicate directions for future applications of this method. 
2024.acl-srw.34 From 0349833d9b0167c4ea0348e69011bcde3188a06c Mon Sep 17 00:00:00 2001 From: Matt Post Date: Sat, 20 Sep 2025 20:38:07 -0400 Subject: [PATCH 4/7] Ingest ORCID iDs for 2024.alvr-1 --- data/xml/2024.alvr.xml | 46 +++++++++++++++++++++--------------------- 1 file changed, 23 insertions(+), 23 deletions(-) diff --git a/data/xml/2024.alvr.xml b/data/xml/2024.alvr.xml index 487ad2f4e3..d365f8797c 100644 --- a/data/xml/2024.alvr.xml +++ b/data/xml/2024.alvr.xml @@ -22,7 +22,7 @@ <fixed-case>WISMIR</fixed-case>3: A Multi-Modal Dataset to Challenge Text-Image Retrieval Approaches - FlorianSchneiderUniversität Hamburg + FlorianSchneiderUniversität Hamburg ChrisBiemannU Hamburg 1-6 This paper presents WISMIR3, a multi-modal dataset comprising roughly 300K text-image pairs from Wikipedia. With a sophisticated automatic ETL pipeline, we scraped, filtered, and transformed the data so that WISMIR3 intrinsically differs from other popular text-image datasets like COCO and Flickr30k. We prove this difference by comparing various linguistic statistics between the three datasets computed using the pipeline. The primary purpose of WISMIR3 is to use it as a benchmark to challenge state-of-the-art text-image retrieval approaches, which already reach around 90% Recall@5 scores on the mentioned popular datasets. Therefore, we ran several text-image retrieval experiments on our dataset using current models, which show that the models, in fact, perform significantly worse compared to evaluation results on COCO and Flickr30k. In addition, for each text-image pair, we release features computed by Faster-R-CNN and CLIP models. With this, we want to ease and motivate the use of the dataset for other researchers. 
@@ -34,7 +34,7 @@ m<fixed-case>BLIP</fixed-case>: Efficient Bootstrapping of Multilingual Vision-<fixed-case>LLM</fixed-case>s GregorGeigleBayerische Julius-Maximilians-Universität Würzburg AbhayJain - RaduTimofteBayerische Julius-Maximilians-Universität Würzburg and ETH Zurich + RaduTimofteBayerische Julius-Maximilians-Universität Würzburg and ETH Zurich GoranGlavašJulius-Maximilians-Universität Würzburg 7-25 Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to ‘understand’ the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at https://github.com/gregor-ge/mBLIP. 
@@ -57,10 +57,10 @@ Negative Object Presence Evaluation (<fixed-case>NOPE</fixed-case>) to Measure Object Hallucination in Vision-Language Models - HolyLoveniaAI Singapore + HolyLoveniaAI Singapore WenliangDaiNVIDIA - SamuelCahyawijaya - ZiweiJiHong Kong University of Science and Technology + SamuelCahyawijaya + ZiweiJiHong Kong University of Science and Technology PascaleFungHKUST 37-58 Object hallucination poses a significant challenge in vision-language (VL) models, often leading to the generation of nonsensical or unfaithful responses with non-existent objects. However, the absence of a general measurement for evaluating object hallucination in VL models has hindered our understanding and ability to mitigate this issue. In this work, we present NOPE (Negative Object Presence Evaluation), a novel benchmark designed to assess object hallucination in VL models through visual question answering (VQA). We propose a cost-effective and scalable approach utilizing large language models to generate 29.5k synthetic negative pronoun (NegP) data of high quality for NOPE. We extensively investigate the performance of 10 state-of-the-art VL models in discerning the non-existence of objects in visual questions, where the ground truth answers are denoted as (e.g., “none”). Additionally, we evaluate their standard performance on visual questions on 9 other VQA datasets. Through our experiments, we demonstrate that no VL model is immune to the vulnerability of object hallucination, as all models achieve accuracy below 10% on NegP. Furthermore, we uncover that lexically diverse visual questions, question types with large scopes, and scene-relevant objects capitalize the risk of object hallucination in VL models. @@ -71,8 +71,8 @@ How and where does <fixed-case>CLIP</fixed-case> process negation? 
VincentQuantmeyer - PabloMosteiroUtrecht University - AlbertGattUtrecht University + PabloMosteiroUtrecht University + AlbertGattUtrecht University 59-72 Various benchmarks have been proposed to test linguistic understanding in pre-trained vision & language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al., 2022) which we use to test models’ understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they do not reveal anything about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we approach these questions through an in-depth analysis of the text encoder in CLIP (Radford et al., 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold. We demonstrate how methods from the language model interpretability literature (e.g., causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding. 2024.alvr-1.5 @@ -84,7 +84,7 @@ MalvinaNikandrouHeriot-Watt University GeorgiosPantazopoulos IoannisKonstasHeriot-Watt University - AlessandroSugliaHeriot-Watt University + AlessandroSugliaHeriot-Watt University 73-85 Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. 
Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that each modality evolves at different rates across a continuum of tasks and that this behavior occurs in established encoder-only models as well as modern recipes for developing Vision & Language (VL) models. Motivated by this observation, we propose a modality-aware feature distillation (MAFED) approach which outperforms existing baselines across models of varying scale in three multimodal continual learning settings. Furthermore, we provide ablations showcasing that modality-aware distillation complements experience replay. Overall, our results emphasize the importance of addressing modality-specific dynamics to prevent forgetting in multimodal continual learning. 2024.alvr-1.6 @@ -120,10 +120,10 @@ Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples - Philipp J.RöschBundeswehr University Munich - NorbertOswald - MichaelaGeierhosUniversität der Bundeswehr München - JindřichLibovickýCharles University Prague + Philipp J.RöschBundeswehr University Munich + NorbertOswald + MichaelaGeierhosUniversität der Bundeswehr München + JindřichLibovickýCharles University Prague 102-115 Current vision-language models leveraging contrastive learning often face limitations in developing fine-grained conceptual understanding. This is due to random negative samples during pretraining, causing almost exclusively very dissimilar concepts to be compared in the loss function. Consequently, the models struggle with fine-grained semantic differences. To address this problem, we introduce a novel pretraining method incorporating synthetic hard negative text examples. The hard negatives replace terms corresponding to visual concepts, leading to a more fine-grained visual and textual concept alignment. 
Further, we introduce InpaintCOCO, a new challenging dataset for assessing the fine-grained alignment of colors, objects, and sizes in vision-language models. We created the dataset using generative inpainting from COCO images by changing the visual concepts so that the images no longer match their original captions. Our results show significant improvements in fine-grained concept understanding across various vision-language datasets, including our InpaintCOCO dataset. 2024.alvr-1.9 @@ -134,13 +134,13 @@ Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities ShiyuXia JunyuXiong - HaoyuDong + HaoyuDong JianboZhao YuzhangTian - MengyuZhouMicrosoft Research - YeyeHeMicrosoft - ShiHanMicrosoft Research Asia - DongmeiZhangMicrosoft and Microsoft + MengyuZhouMicrosoft Research + YeyeHeMicrosoft + ShiHanMicrosoft Research Asia + DongmeiZhangMicrosoft and Microsoft 116-128 This paper explores capabilities of Vision Language Models on spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. 
Our findings reveal that VLMs demonstrate promising OCR capabilities but produce unsatisfactory results due to cell omission and misalignment, and they notably exhibit insufficient spatial and format recognition skills, motivating future work to enhance VLMs’ spreadsheet data comprehension capabilities using our methods to generate extensive spreadsheet-image pairs in various settings. 2024.alvr-1.10 @@ -150,9 +150,9 @@ <fixed-case>S</fixed-case>lide<fixed-case>AVSR</fixed-case>: A Dataset of Paper Explanation Videos for Audio-Visual Speech Recognition HaoWang - ShuheiKuritaNational Institute of Informatics and New York University + ShuheiKuritaNational Institute of Informatics and New York University ShuichiroShimizu - DaisukeKawaharaWaseda University + DaisukeKawaharaWaseda University 129-137 Audio-visual speech recognition (AVSR) is a multimodal extension of automatic speech recognition (ASR), using video as a complement to audio. In AVSR, considerable efforts have been directed at datasets for facial features such as lip-readings, while they often fall short in evaluating the image comprehension capabilities in broader contexts. In this paper, we construct SlideAVSR, an AVSR dataset using scientific paper explanation videos. SlideAVSR provides a new benchmark where models transcribe speech utterances with texts on the slides on the presentation recordings. As technical terminologies that are frequent in paper explanations are notoriously challenging to transcribe without reference texts, our SlideAVSR dataset spotlights a new aspect of AVSR problems. As a simple yet effective baseline, we propose DocWhisper, an AVSR model that can refer to textual information from slides, and confirm its effectiveness on SlideAVSR. 
2024.alvr-1.11 @@ -162,7 +162,7 @@ Causal and Temporal Inference in Visual Question Generation by Utilizing Pre-trained Models ZhanghaoHu - FrankKellerUniversity of Edinburgh + FrankKellerUniversity of Edinburgh 138-154 Visual Question Generation is a task at the crossroads of visual and language learning, impacting broad domains like education, medicine, and social media. While existing pre-trained models excel in fact-based queries with image pairs, they fall short of capturing human-like inference, particularly in understanding causal and temporal relationships within videos. Additionally, the computational demands of prevalent pre-training methods pose challenges. In response, our study introduces a framework that leverages vision-text matching pre-trained models to guide language models in recognizing event-entity relationships within videos and generating inferential questions. Demonstrating efficacy on the NExT-QA dataset, which is designed for causal and temporal inference in visual question answering, our method successfully guides pre-trained language models in recognizing video content. We present methodologies for abstracting causal and temporal relationships between events and entities, pointing out the importance of consistent relationships among input frames during training and inference phases and suggesting an avenue for future exploration. 2024.alvr-1.12 @@ -173,7 +173,7 @@ Improving Vision-Language Cross-Lingual Transfer with Scheduled Unfreezing MaxReinhardt GregorGeigleBayerische Julius-Maximilians-Universität Würzburg - RaduTimofteBayerische Julius-Maximilians-Universität Würzburg and ETH Zurich + RaduTimofteBayerische Julius-Maximilians-Universität Würzburg and ETH Zurich GoranGlavašJulius-Maximilians-Universität Würzburg 155-166 Large-scale pretraining of vision-language (VL) models brought dramatic improvements across numerous tasks, from visual question-answering to cross-modal retrieval but these gains are mostly limited to English. 
Massively multilingual VL encoder models (mVLMs) hold promise for other languages: after fine-tuning on only English task data, they can perform the task in other languages in what is termed zero-shot cross-lingual transfer (ZS-XLT). Still, ZS-XLT sees a large performance gap to English, especially for low-resource languages. In this work, we reduce this gap with a fine-tuning strategy known as Scheduled Unfreezing (SUF): instead of updating all parameters from the start, we begin with the top layer(s) of the vision-language encoder and gradually unfreeze (i.e., update) its layers top to bottom. SUF forces reliance on encoder’s representations from higher layers: the fact that in multilingual models these representations encode higher-level semantics rather than low-level language-specific idiosyncrasies, we hypothesize, should render SUF beneficial for ZS-XLT. Experiments with two mVLMs (UC2 & CCLM) on three downstream tasks (xGQA, XVNLI, xFlickrCo) show that SUF brings consistent gains in ZS-XLT, especially for visual Q&A (xGQA) by up to 10 points. @@ -215,7 +215,7 @@ AdrianZiupkaHasso Plattner Institute Lucie-AiméeKaffeeHugging Face RussaBiswasHasso Plattner Institute - GerardDe MeloHasso Plattner Institute and University of Potsdam + GerardDe MeloHasso Plattner Institute and University of Potsdam 186-194 Describing Wikimedia Commons images using Wikidata’s structured data enables a wide range of automation tasks, such as search and organization, as well as downstream tasks, such as labeling images or training machine learning models. However, there is currently a lack of structured data-labelled images on Wikimedia Commons. To close this gap, we propose the task of Visual Entity Linking (VEL) for Wikimedia Commons, in which we create new labels for Wikimedia Commons images from Wikidata items. VEL is a crucial tool for improving information retrieval, search, content understanding, cross-modal applications, and various machine-learning tasks.
In this paper, we propose a method to create new labels for Wikimedia Commons images from Wikidata items. To this end, we create a novel dataset leveraging community-created structured data on Wikimedia Commons and fine-tuning pre-trained models based on the CLIP architecture. Although the best-performing models show promising results, the study also identifies key challenges of the data and the task. 2024.alvr-1.16 @@ -225,7 +225,7 @@ <fixed-case>V</fixed-case>erb<fixed-case>CLIP</fixed-case>: Improving Verb Understanding in Vision-Language Models with Compositional Structures HadiWazni - Kin IanLoUniversity College London, University of London + Kin IanLoUniversity College London, University of London MehrnooshSadrzadehUniversity College London 195-201 Verbs describe the dynamics of interactions between people, objects, and their environments. They play a crucial role in language formation and understanding. Nonetheless, recent vision-language models like CLIP predominantly rely on nouns and have a limited account of verbs. This limitation affects their performance in tasks requiring action recognition and scene understanding. In this work, we introduce VerbCLIP, a verb-centric vision-language model which learns meanings of verbs based on a compositional approach to statistical machine learning. Our methods significantly outperform CLIP in zero-shot performance on the VALSE, VL-Checklist, and SVO-Probes datasets, with improvements of +2.38%, +3.14%, and +1.47%, without fine-tuning. Fine-tuning resulted in further improvements, with gains of +2.85% and +9.2% on the VALSE and VL-Checklist datasets. 
From 708d0cb79beaf7ca08303a470bdd7b806f55f502 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Sat, 20 Sep 2025 20:49:15 -0400 Subject: [PATCH 5/7] Ingest ORCID iDs for ACL 2024 workshops --- data/xml/2024.arabicnlp.xml | 272 ++++++++++++++++----------------- data/xml/2024.argmining.xml | 38 ++--- data/xml/2024.c3nlp.xml | 36 ++--- data/xml/2024.climatenlp.xml | 52 +++---- data/xml/2024.cmcl.xml | 60 ++++---- data/xml/2024.conda.xml | 8 +- data/xml/2024.gebnlp.xml | 72 ++++----- data/xml/2024.hucllm.xml | 22 +-- data/xml/2024.kallm.xml | 24 +-- data/xml/2024.knowledgenlp.xml | 20 +-- data/xml/2024.knowllm.xml | 28 ++-- data/xml/2024.langmol.xml | 38 ++--- data/xml/2024.loresmt.xml | 34 ++--- data/xml/2024.nlp4convai.xml | 28 ++-- data/xml/2024.nlrse.xml | 6 +- data/xml/2024.privatenlp.xml | 38 ++--- data/xml/2024.sdp.xml | 70 ++++----- data/xml/2024.sighan.xml | 62 ++++---- data/xml/2024.sigturk.xml | 14 +- data/xml/2024.smm4h.xml | 112 +++++++------- data/xml/2024.splurobonlp.xml | 6 +- data/xml/2024.teachingnlp.xml | 48 +++--- data/xml/2024.textgraphs.xml | 56 +++---- data/xml/2024.wassa.xml | 102 ++++++------- 24 files changed, 623 insertions(+), 623 deletions(-) diff --git a/data/xml/2024.arabicnlp.xml b/data/xml/2024.arabicnlp.xml index 0588c627d9..dc85807fb9 100644 --- a/data/xml/2024.arabicnlp.xml +++ b/data/xml/2024.arabicnlp.xml @@ -46,11 +46,11 @@ Synthetic <fixed-case>A</fixed-case>rabic Medical Dialogues Using Advanced Multi-Agent <fixed-case>LLM</fixed-case> Techniques - MariamALMutairi + MariamALMutairi LulwahAlKulaib - MelikeAktasVirginia Polytechnic Institute and State University - SaraAlsalamahVirginia Polytechnic Institute and State University - Chang-TienLuVirginia Tech + MelikeAktasVirginia Polytechnic Institute and State University + SaraAlsalamahVirginia Polytechnic Institute and State University + Chang-TienLuVirginia Tech 11-26 The increasing use of artificial intelligence in healthcare requires robust datasets for training and validation, 
particularly in the domain of medical conversations. However, the creation and accessibility of such datasets in Arabic face significant challenges, especially due to the sensitivity and privacy concerns that are associated with medical conversations. These conversations are rarely recorded or preserved, making the availability of comprehensive Arabic medical dialogue datasets scarce. This limitation slows down not only the development of effective natural language processing models but also restricts the opportunity for open comparison of algorithms and their outcomes. Recent advancements in large language models (LLMs) like ChatGPT, GPT-4, Gemini-pro, and Claude-3 show promising capabilities in generating synthetic data. To address this gap, we introduce a novel Multi-Agent LLM approach capable of generating synthetic Arabic medical dialogues from patient notes, regardless of the original language. This development presents a significant step towards overcoming the barriers in dataset availability, enhancing the potential for broader research and application in AI-driven medical dialogue systems. 2024.arabicnlp-1.2 @@ -61,7 +61,7 @@ <fixed-case>A</fixed-case>u<fixed-case>RED</fixed-case>: Enabling <fixed-case>A</fixed-case>rabic Rumor Verification using Evidence from Authorities over <fixed-case>T</fixed-case>witter FatimaHaouariUniversity of Qatar TamerElsayedQatar University - ReemSuwailehHamad Bin Khalifa University + ReemSuwailehHamad Bin Khalifa University 27-41 Diverging from the trend of the previous rumor verification studies, we introduce the new task of rumor verification using evidence that are exclusively captured from authorities, i.e., entities holding the right and knowledge to verify corresponding information. To enable research on this task for Arabic low-resourced language, we construct and release the first Authority-Rumor-Evidence Dataset (AuRED). 
The dataset comprises 160 rumors expressed in tweets and 692 Twitter timelines of authorities containing about 34k tweets. Additionally, we explore how existing evidence retrieval and claim verification models for fact-checking perform on our task under both the cross-lingual zero-shot and in-domain fine-tuning setups. Our experiments show that although evidence retrieval models perform relatively well on the task establishing strong baselines, there is still a big room for improvement. However, existing claim verification models perform poorly on the task no matter how good the retrieval performance is. The results also show that stance detection can be useful for evidence retrieval. Moreover, existing fact-checking datasets showed a potential in transfer learning to our task, however, further investigation using different datasets and setups is required. 2024.arabicnlp-1.3 @@ -76,7 +76,7 @@ FatemaNassar FadhlEryaniEberhard-Karls-Universität Tübingen HoudaBouamorCarnegie Mellon University - NizarHabashNew York University Abu Dhabi + NizarHabashNew York University Abu Dhabi 42-54 Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. 
We make our code, data, and pretrained models publicly available. 2024.arabicnlp-1.4 @@ -87,8 +87,8 @@ Strategies for <fixed-case>A</fixed-case>rabic Readability Modeling JuanLiberato BasharAlhafniNew York University - MuhamedKhalilNew York University - NizarHabashNew York University Abu Dhabi + MuhamedKhalilNew York University + NizarHabashNew York University Abu Dhabi 55-66 Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic’s morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available. 2024.arabicnlp-1.5 @@ -108,8 +108,8 @@ Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis - SabriBoughorbelQatar Computing Research Institute - Md RizwanParvezQatar Computing Research Institute and Bosch + SabriBoughorbelQatar Computing Research Institute + Md RizwanParvezQatar Computing Research Institute and Bosch MajdHawaslyQatar Computing Research Institute 73-88 Training LLMs in low resources languages usually utilizes machine translation (MT) data augmentation from English language.
However, translation brings a number of challenges: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, the quality of the data degrades causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality stories generated by a capable LLM in Arabic, representing 1% of the original training data. We show, using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the translation pitfalls. We illustrate the improvement through case studies of linguistic and cultural bias issues. @@ -143,7 +143,7 @@ Large Language Models as Legal Translators of <fixed-case>A</fixed-case>rabic Legislation: Do <fixed-case>C</fixed-case>hat<fixed-case>GPT</fixed-case> and <fixed-case>G</fixed-case>emini Care for Context and Terminology? - KhadijaAit ElFqih + KhadijaAit ElFqih JohannaMonti 111-122 Accurate translation of terminology and adaptation to in-context information is a pillar to high quality translation. Recently, there is a remarkable interest towards the use and the evaluation of Large Language Models (LLMs) particularly for Machine Translation tasks. 
Nevertheless, despite their recent advancement and ability to understand and generate human-like language, these LLMs are still far from perfect, especially in domain-specific scenarios, and need to be thoroughly investigated. This is particularly evident in automatically translating legal terminology from Arabic into English and French, where, beyond the inherent complexities of legal language and specialised translations, technical limitations of LLMs further hinder accurate generation of text. In this paper, we present a preliminary evaluation of two evolving LLMs, namely GPT-4 Generative Pre-trained Transformer and Gemini, as legal translators of Arabic legislatives to test their accuracy and the extent to which they care for context and terminology across two language pairs (AR→EN / AR→FR). The study targets the evaluation of Zero-Shot prompting for in-context and out-of-context scenarios of both models relying on a gold standard dataset, verified by professional translators who are also experts in the field. We evaluate the results applying the Multidimensional Quality Metrics to classify translation errors. Moreover, we also evaluate the general LLMs outputs to verify their correctness, consistency, and completeness. In general, our results show that the models are far from perfect and recall for more fine-tuning efforts using specialised terminological data in the legal domain from Arabic into English and French. @@ -153,9 +153,9 @@ Towards Zero-Shot Text-To-Speech for <fixed-case>A</fixed-case>rabic Dialects - KhaiDoan + KhaiDoan AbdulWaheedMohamed bin Zayed University of Artificial Intelligence - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 123-129 Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, however, it still lags behind due to insufficient resources. 
We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance while capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic. 2024.arabicnlp-1.11 @@ -167,7 +167,7 @@ SalimaMdhaffarUniversité d’Avignon HarounElleuchElyadata FethiBougareselyadata - YannickEstèveUniversity of Avignon + YannickEstèveUniversity of Avignon 130-139 Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks, including Spoken Language Understanding (SLU) and Automatic Speech Recognition (ASR). For instance, fine-tuning SSL models for such tasks has shown significant potential, leading to improvements in the SOTA performance across challenging datasets. In contrast to existing research, this paper contributes by comparing the effectiveness of SSL approaches in the context of (i) the low-resource Spoken Tunisian Arabic Dialect and (ii) its combination with a low-resource SLU and ASR scenario, where only a few semantic annotations are available for fine-tuning. We conducted experiments using many SSL speech encoders on the TARIC-SLU dataset. We used speech encoders that were pre-trained on either monolingual or multilingual speech data. Some of them have also been refined without in-domain nor Tunisian data through a multimodal supervised teacher-student learning.
The study made in this paper yields numerous significant findings that we will discuss in the paper. 2024.arabicnlp-1.12 @@ -178,7 +178,7 @@ <fixed-case>A</fixed-case>rabic Automatic Story Generation with Large Language Models Ahmed OumarEl-Shangiti FakhraddinAlwajih - MuhammadAbdul-Mageed + MuhammadAbdul-Mageed 140-152 Large language models (LLMs) have recently emerged as a powerful tool for a wide range of language generation tasks. Nevertheless, this progress has been slower in Arabic. In this work, we focus on the task of generating stories from LLMs. For our training, we use stories acquired through machine translation (MT) as well as GPT-4. For the MT data, we develop a careful pipeline that ensures we acquire high-quality stories. For our GPT-4 data, we introduce crafted prompts that allow us to generate data well-suited to the Arabic context in both Modern Standard Arabic (MSA) and two Arabic dialects (Egyptian and Moroccan). For example, we generate stories tailored to various Arab countries on a wide host of topics. Our manual evaluation shows that our model fine-tuned on these training datasets can generate coherent stories that adhere to our instructions. We also conduct an extensive automatic and human evaluation comparing our models against state-of-the-art proprietary and open-source models. Our datasets and models will be made publicly available at https://github.com/UBC-NLP/arastories. 2024.arabicnlp-1.13 @@ -187,11 +187,11 @@ <fixed-case>A</fixed-case>lcla<fixed-case>M</fixed-case>: <fixed-case>A</fixed-case>rabic Dialect Language Model - MurtadhaAhmedZhuiyi AI Lab - SaghirAlfaslyMayo Clinic + MurtadhaAhmedZhuiyi AI Lab + SaghirAlfaslyMayo Clinic BoWen JamalAddeen - MohammedAhmedNorthwest Polytechnical University Xi’an and Dalanj University + MohammedAhmedNorthwest Polytechnical University Xi’an and Dalanj University YunfengLiu 153-159 Pre-trained Language Models (PLMs) are integral to many modern natural language processing (NLP) systems. 
Although multilingual models cover a wide range of languages, they often grapple with challenges like high inference costs and a lack of diverse non-English training data. Arabic-specific PLMs are trained predominantly on modern standard Arabic, which compromises their performance on regional dialects. To tackle this, we construct an Arabic dialectal corpus comprising 3.4M sentences gathered from social media platforms. We utilize this corpus to expand the vocabulary and retrain a BERT-based model from scratch. Named AlcLaM, our model was trained using only 13GB of text, which represents a fraction of the data used by existing models such as CAMeL, MARBERT, and ArBERT, compared to 7.8%%, and 21.3%, respectively. Remarkably, AlcLaM demonstrates superior performance on a variety of Arabic NLP tasks despite the limited training data. AlcLaM is available at: https://github.com/amurtadha/Alclam. @@ -203,8 +203,8 @@ Data Augmentation for Speech-Based Diacritic Restoration SaraShatnawi SawsanAlqahtaniPrincess Nourah Bint Abdulrahman University - ShadyShehataMohamed bin Zayed University of Artificial Intelligence - HananAldarmakiMohamed bin Zayed University of Artificial Intelligence + ShadyShehataMohamed bin Zayed University of Artificial Intelligence + HananAldarmakiMohamed bin Zayed University of Artificial Intelligence 160-169 This paper describes a data augmentation technique for boosting the performance of speech-based diacritic restoration. Our experiments demonstrate the utility of this appraoch, resulting in improved generalization of all models across different test sets. In addition, we describe the first multi-modal diacritic restoration model, utilizing both speech and text as input modalities. This type of model can be used to diacritize speech transcripts. Unlike previous work that relies on an external ASR model, the proposed model is far more compact and efficient. 
While the multi-modal framework does not surpass the ASR-based model for this task, it offers a promising approach for improving the efficiency of speech-based diacritization, with a potential for improvement using data augmentation and other methods. 2024.arabicnlp-1.15 @@ -234,11 +234,11 @@ John vs. Ahmed: Debate-Induced Bias in Multilingual <fixed-case>LLM</fixed-case>s - AnastasiiaDemidovaMohamed bin Zayed University of Artificial Intelligence + AnastasiiaDemidovaMohamed bin Zayed University of Artificial Intelligence HaninAtwany - NourRabih + NourRabih SanadSha’ban - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 193-209 Large language models (LLMs) play a crucial role in a wide range of real world applications. However, concerns about their safety and ethical implications are growing. While research on LLM safety is expanding, there is a noticeable gap in evaluating safety across multiple languages, especially in Arabic and Russian. We address this gap by exploring biases in LLMs across different languages and contexts, focusing on GPT-3.5 and Gemini. Through carefully designed argument-based prompts and scenarios in Arabic, English, and Russian, we examine biases in cultural, political, racial, religious, and gender domains. Our findings reveal biases in these domains. In particular, our investigation uncovers subtle biases where each model tends to present winners as those speaking the primary language the model is prompted with. Our study contributes to ongoing efforts to ensure justice and equality in LLM development and emphasizes the importance of further research towards responsible progress in this field. 
2024.arabicnlp-1.18 @@ -250,7 +250,7 @@ GaganBhatia El Moatez BillahNagoudiUniversity of British Columbia FakhraddinAlwajih - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 210-224 Arabic Optical Character Recognition (OCR) and Handwriting Recognition (HWR) pose unique challenges due to the cursive and context-sensitive nature of the Arabic script. This study introduces ***Qalam***, a novel foundation model designed for Arabic OCR and HWR, built on a SwinV2 encoder and RoBERTa decoder architecture. Our model significantly outperforms existing methods, achieving a Word Error Rate (WER) of just 0.80% in HWR tasks and 1.18% in OCR tasks. We train ***Qalam*** on a diverse dataset, including over 4.5 million images from Arabic manuscripts and a synthetic dataset comprising 60k image-text pairs. Notably, ***Qalam*** demonstrates exceptional handling of Arabic diacritics, a critical feature in Arabic scripts. Furthermore, it shows a remarkable ability to process high-resolution inputs, addressing a common limitation in current OCR systems. These advancements underscore ***Qalam***’s potential as a leading solution for Arabic script recognition, offering a significant leap in accuracy and efficiency. 
2024.arabicnlp-1.19 @@ -275,7 +275,7 @@ <fixed-case>CATT</fixed-case>: Character-based <fixed-case>A</fixed-case>rabic Tashkeel Transformer - FarisAlasmary + FarisAlasmary OrjuwanZaafaraniAbjad AhmadGhannamSamsung 250-257 @@ -286,7 +286,7 @@ Picking Up Where the Linguist Left Off: Mapping Morphology to Phonology through Learning the Residuals - SalamKhalifaState University of New York, Stony Brook + SalamKhalifaState University of New York, Stony Brook AbdelrahimQaddoumi EllenBroselowState University of New York at Stony Brook OwenRambowStony Brook University @@ -301,7 +301,7 @@ AlcidesAlcoba Inciarte Sang YunKwonUniversity of British Columbia El Moatez BillahNagoudiUniversity of British Columbia - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 265-282 Development of pre-trained language models has predominantly relied on large amounts of datasets. However, this dependence on abundant data has limited the applicability of these models in low-resource settings. In this work, we investigate the utility of exploiting synthetic datasets acquired from different sources to pre-train language models for Arabic. Namely, we leverage data derived based on four different methods: optical character recognition (OCR), automatic speech recognition (ASR), machine translation (MT), and generative language models. We use these datasets to pre-train models in three different architectures: encoder-only (BERTBase), encoder-decoder (T5), and decoder-only (GPT-2). We test the capabilities of resulting models on Arabic natural language understanding (NLU) tasks using the ORCA benchmark. Our results show that utilizing synthetic data can achieve performance comparable to, or even surpassing, those trained on gold data. For example, our model based on a GPT-2 architecture trained on a combined synthetic dataset surpasses the baseline model ARBERTv2. 
Overall, our models pre-trained on synthetic data demonstrate robust performance across various tasks. This highlights the potential of synthetic datasets in augmenting language model training in low-resource settings. 2024.arabicnlp-1.23 @@ -310,11 +310,11 @@ Benchmarking <fixed-case>LL</fixed-case>a<fixed-case>MA</fixed-case>-3 on <fixed-case>A</fixed-case>rabic Language Generation Tasks - Md Tawkat IslamKhondakerUniversity of British Columbia + Md Tawkat IslamKhondakerUniversity of British Columbia NumaanNaeemMohamed bin Zayed University of Artificial Intelligence FatimahKhan AbdelRahimElmadanyUniversity of British Columbia - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 283-297 Open-sourced large language models (LLMs) have exhibited remarkable performance in a variety of NLP tasks, often catching up with the closed-sourced LLMs like ChatGPT. Among these open LLMs, LLaMA-3-70B has emerged as the most recent and the most prominent one. However, how LLaMA-3-70B would situate itself in multilingual settings, especially in a rich morphological language like Arabic, has yet to be explored. In this work, we focus to bridge this gap by evaluating LLaMA-3-70B on a diverse set of Arabic natural language generation (NLG) benchmarks. To the best of our knowledge, this is the first study that comprehensively evaluates LLaMA-3-70B on tasks related to Arabic natural language generation. Our study reveals that LLaMA-3-70B lags behind the closed LLMs like ChatGPT, both in modern standard Arabic (MSA) and dialectal Arabic (DA). We further compare the performance of LLaMA-3-70B with our smaller and dedicated finetuned Arabic models. We find that both LLaMA-3-70B and ChatGPT are outperformed by comparatively smaller dedicated Arabic models, indicating the scope for potential improvement with Arabic-focused LLMs. 
2024.arabicnlp-1.24 @@ -326,8 +326,8 @@ MuhammedSaeed AsimMohamed MukhtarMohamed - ShadyShehataMohamed bin Zayed University of Artificial Intelligence - MuhammadAbdul-MageedUniversity of British Columbia + ShadyShehataMohamed bin Zayed University of Artificial Intelligence + MuhammadAbdul-MageedUniversity of British Columbia 298-308 The Coptic language, rooted in the historical landscapes of Egypt, continues to serve as a vital liturgical medium for the Coptic Orthodox and Catholic Churches across Egypt, North Sudan, Libya, and the United States, with approximately ten million speakers worldwide. However, the scarcity of digital resources in Coptic has resulted in its exclusion from digital systems, thereby limiting its accessibility and preservation in modern technological contexts. Our research addresses this issue by developing the most extensive parallel Coptic-centered corpus to date. This corpus comprises over 8,000 parallel sentences between Arabic and Coptic, and more than 24,000 parallel sentences between English and Coptic. We have also developed the first neural machine translation system between Coptic, English, and Arabic. Lastly, we evaluate the capability of leading proprietary Large Language Models (LLMs) to translate to and from Coptic using a few-shot learning approach (in-context learning). Our code and data are available at https://github.com/UBC-NLP/copticmt. 2024.arabicnlp-1.25 @@ -338,7 +338,7 @@ Event-Arguments Extraction Corpus and Modeling using <fixed-case>BERT</fixed-case> for <fixed-case>A</fixed-case>rabic AlaaAljabariBirzeit University LinaDuaibes - MustafaJarrarBirzeit University + MustafaJarrarBirzeit University MohammedKhaliliaQualtrics XM and Birzeit University 309-319 Event-argument extraction is a challenging task, particularly in Arabic due to sparse linguistic resources. To fill this gap, we introduce the corpus (550k tokens) as an extension of Wojood, enriched with event-argument annotations. 
We used three types of event arguments: agent, location, and date, which we annotated as relation types. Our inter-annotator agreement evaluation resulted in 82.23% Kappa score and 87.2% F_1-score. Additionally, we propose a novel method for event relation extraction using BERT, in which we treat the task as text entailment. This method achieves an F_1-score of 94.01%.To further evaluate the generalization of our proposed method, we collected and annotated another out-of-domain corpus (about 80k tokens) called and used it as a second test set, on which our approach achieved promising results (83.59% F_1-score). Last but not least, we propose an end-to-end system for event-arguments extraction. This system is implemented as part of SinaTools, and both corpora are publicly available at https://sina.birzeit.edu/wojood @@ -350,7 +350,7 @@ Dallah: A Dialect-Aware Multimodal Large Language Model for <fixed-case>A</fixed-case>rabic FakhraddinAlwajih GaganBhatia - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 320-336 Recent advancements have significantly enhanced the capabilities of Multimodal Large Language Models (MLLMs) in generating and understanding image-to-text content. Despite these successes, progress is predominantly limited to English due to the scarcity of high-quality multimodal resources in other languages. This limitation impedes the development of competitive models in languages such as Arabic. To alleviate this situation, we introduce an efficient Arabic multimodal assistant, dubbed ***Dallah***, that utilizes an advanced language model based on LLaMA-2 to facilitate multimodal interactions. ***Dallah*** demonstrates state-of-the-art performance in Arabic MLLMs. Through fine-tuning six Arabic dialects, ***Dallah*** showcases its capability to handle complex dialectal interactions incorporating both textual and visual elements. 
The model excels in two benchmark tests: one evaluating its performance on Modern Standard Arabic (MSA) and another specifically designed to assess dialectal responses. Beyond its robust performance in multimodal interaction tasks, ***Dallah*** has the potential to pave the way for further development of dialect-aware Arabic MLLMs. 2024.arabicnlp-1.27 @@ -363,9 +363,9 @@ SalamAlbatarni SohailaEltanboulyUniversity of Qatar EmanZahran - HamdoElhuseyinBingol University + HamdoElhuseyinBingol University TamerElsayedQatar University - WalidMassoud + WalidMassoud HoudaBouamorCarnegie Mellon University 337-351 Automated Essay Scoring (AES) has emerged as a significant research problem within natural language processing, providing valuable support for educators in assessing student writing skills. In this paper, we introduce QAES, the first publicly available trait-specific annotations for Arabic AES, built on the Qatari Corpus of Argumentative Writing (QCAW). QAES includes a diverse collection of essays in Arabic, each of them annotated with holistic and trait-specific scores, including relevance, organization, vocabulary, style, development, mechanics, and grammar. In total, it comprises 195 Arabic essays (with lengths ranging from 239 to 806 words) across two distinct argumentative writing tasks. We benchmark our dataset against the state-of-the-art English baselines and a feature-based approach. In addition, we discuss the adopted guidelines and the challenges encountered during the annotation process. Finally, we provide insights into potential areas for improvement and future directions in Arabic AES research. 
@@ -393,8 +393,8 @@ <fixed-case>A</fixed-case>rabic<fixed-case>NLU</fixed-case> 2024: The First <fixed-case>A</fixed-case>rabic Natural Language Understanding Shared Task MohammedKhaliliaQualtrics XM and Birzeit University SanadMalaysha - ReemSuwailehHamad Bin Khalifa University - MustafaJarrarBirzeit University + ReemSuwailehHamad Bin Khalifa University + MustafaJarrarBirzeit University AlaaAljabariBirzeit University TamerElsayedQatar University ImedZitouniGoogle @@ -409,8 +409,8 @@ TasneemWael EmanElrefai MohamedMakram - SaharSelim - GhadaKhoribaNile University + SaharSelim + GhadaKhoribaNile University 372-376 This paper presents a novel approach to Ara-bic Word Sense Disambiguation (WSD) lever-aging transformer-based models to tackle thecomplexities of the Arabic language. Utiliz-ing the SALMA dataset, we applied severaltechniques, including Sentence Transformerswith Siamese networks and the SetFit frame-work optimized for few-shot learning. Our ex-periments, structured around a robust evalua-tion framework, achieved a promising F1-scoreof up to 71%, securing second place in theArabicNLU 2024: The First Arabic NaturalLanguage Understanding Shared Task compe-tition. These results demonstrate the efficacyof our approach, especially in dealing with thechallenges posed by homophones, homographs,and the lack of diacritics in Arabic texts. Theproposed methods significantly outperformedtraditional WSD techniques, highlighting theirpotential to enhance the accuracy of Arabicnatural language processing applications. 
2024.arabicnlp-1.31 @@ -430,7 +430,7 @@ rematchka at <fixed-case>A</fixed-case>rabic<fixed-case>NLU</fixed-case>2024: Evaluating Large Language Models for <fixed-case>A</fixed-case>rabic Word Sense and Location Sense Disambiguation - ReemAbdel-SalamFaculty of Engineering Cairo University, Cairo University + ReemAbdel-SalamFaculty of Engineering Cairo University, Cairo University 383-392 Natural Language Understanding (NLU) plays a vital role in Natural Language Processing (NLP) by facilitating semantic interactions. Arabic, with its diverse morphology, poses a challenge as it allows multiple interpretations of words, leading to potential misunderstandings and errors in NLP applications. In this paper, we present our approach for tackling Arabic NLU shared tasks for word sense disambiguation (WSD) and location mention disambiguation (LMD). Various approaches have been investigated from zero-shot inference of large language models (LLMs) to fine-tuning of pre-trained language models (PLMs). The best approach achieved 57% on WSD task ranking third place, while for the LMD task, our best systems achieved 94% MRR@1 ranking first place. 
2024.arabicnlp-1.33 @@ -440,11 +440,11 @@ <fixed-case>A</fixed-case>ra<fixed-case>F</fixed-case>in<fixed-case>NLP</fixed-case> 2024: The First <fixed-case>A</fixed-case>rabic Financial <fixed-case>NLP</fixed-case> Shared Task SanadMalaysha - MoEl-HajLancaster University + MoEl-HajLancaster University SaadEzziniLancaster University MohammedKhaliliaQualtrics XM and Birzeit University - MustafaJarrarBirzeit University - SultanAlmujaiwelKing Saud University + MustafaJarrarBirzeit University + SultanAlmujaiwelKing Saud University IsmailBerradaMohammed VI Polytechnic University HoudaBouamorCarnegie Mellon University 393-402 @@ -481,7 +481,7 @@ HossamElkordiAlexandria University AhmedSakrAlexandria University MarwanTorkiAlexandria University - NagwaEl-Makky + NagwaEl-Makky 415-421 Arabic banking intent detection represents a challenging problem across multiple dialects. It imposes generalization difficulties due to the scarcity of Arabic language and its dialects resources compared to English. We propose a methodology that leverages contrastive training to overcome this limitation. We also augmented the data with several dialects using a translation model. Our experiments demonstrate the ability of our approach in capturing linguistic nuances across different Arabic dialects as well as accurately differentiating between banking intents across diverse linguistic landscapes. This would enhance multi-dialect banking services in the Arab world with limited Arabic language resources. Using our proposed method we achieved second place on subtask 1 leaderboard of the AraFinNLP2024 shared task with micro-F1 score of 0.8762 on the test split. 
2024.arabicnlp-1.37 @@ -494,8 +494,8 @@ SymomShohan Md.Hossain JawadHossain - ShawlyAhsanChittagong University of Engineering and Technology - Mohammed MoshiulHoqueChittagong University of Engineering and Technology + ShawlyAhsanChittagong University of Engineering and Technology + Mohammed MoshiulHoqueChittagong University of Engineering and Technology 422-427 Intention detection is a crucial aspect of natural language understanding (NLU), focusing on identifying the primary objective underlying user input. In this work, we present a transformer-based method that excels in determining the intent of Arabic text within the banking domain. We explored several machine learning (ML), deep learning (DL), and transformer-based models on an Arabic banking dataset for intent detection. Our findings underscore the challenges that traditional ML and DL models face in understanding the nuances of various Arabic dialects, leading to subpar performance in intent detection. However, the transformer-based methods, designed to tackle such complexities, significantly outperformed the other models in classifying intent across different Arabic dialects. Notably, the AraBERTv2 model achieved the highest micro F1 score of 82.08% in ArBanking77 dataset, a testament to its effectiveness in this context. This achievement, which contributed to our work being ranked 5^{th} in the shared task, AraFinNLP2024, highlights the importance of developing models that can effectively handle the intricacies of Arabic language processing and intent detection. 2024.arabicnlp-1.38 @@ -505,7 +505,7 @@ <fixed-case>SENIT</fixed-case> at <fixed-case>A</fixed-case>ra<fixed-case>F</fixed-case>in<fixed-case>NLP</fixed-case>2024: trust your model or combine two AbdelmomenNasr - MoezBen HajHmidaNational Engineering School of Tunis + MoezBen HajHmidaNational Engineering School of Tunis 428-432 We describe our submitted system to the 2024 Shared Task on The Arabic Financial NLP (Malaysha et al., 2024). 
We tackled Subtask 1, namely Multi-dialect Intent Detection. We used state-of-the-art pretrained contextualized text representation models and fine-tuned them according to the downstream task at hand. We started by finetuning multilingual BERT and various Arabic variants, namely MARBERTV1, MARBERTV2, and CAMeLBERT. Then, we employed an ensembling technique to improve our classification performance combining MARBERTV2 and CAMeLBERT embeddings. The findings indicate that MARBERTV2 surpassed all the other models mentioned. 2024.arabicnlp-1.39 @@ -527,7 +527,7 @@ AsmaaRamadan ManarAmr MarwanTorkiAlexandria University - NagwaEl-Makky + NagwaEl-Makky 441-445 Intent detection, also called intent classification or recognition, is an NLP technique to comprehend the purpose behind user utterances. This paper focuses on Multi-dialect Arabic intent detection in banking, utilizing the ArBanking77 dataset. Our method employs an ensemble of fine-tuned BERT-based models, integrating contrastive loss for training. To enhance generalization to diverse Arabic dialects, we augment the ArBanking77 dataset, originally in Modern Standard Arabic (MSA) and Palestinian, with additional dialects such as Egyptian, Moroccan, and Saudi, among others. Our approach achieved an F1-score of 0.8771, ranking first in subtask-1 of the AraFinNLP shared task 2024. 
2024.arabicnlp-1.41 @@ -537,8 +537,8 @@ <fixed-case>BFCI</fixed-case> at <fixed-case>A</fixed-case>ra<fixed-case>F</fixed-case>in<fixed-case>NLP</fixed-case>2024: Support Vector Machines for <fixed-case>A</fixed-case>rabic Financial Text Classification NsrinAshrafFaculty of Computer and Artificial Intelligence - HamadaNayelBenha University - MohammedAldawsariPrince Sattam bin Abdulaziz University + HamadaNayelBenha University + MohammedAldawsariPrince Sattam bin Abdulaziz University HosahalliShashirekhaMangalore University TarekElshishtawyBenha University 446-449 @@ -549,8 +549,8 @@ dz<fixed-case>F</fixed-case>in<fixed-case>N</fixed-case>lp at <fixed-case>A</fixed-case>ra<fixed-case>F</fixed-case>in<fixed-case>NLP</fixed-case>: Improving Intent Detection in Financial Conversational Agents - MohamedLichouriUniversité des Sciences et de la Technologie Houari Boumediène - KhaledLounnasUniversité des Sciences et de la Technologie Houari Boumediène + MohamedLichouriUniversité des Sciences et de la Technologie Houari Boumediène + KhaledLounnasUniversité des Sciences et de la Technologie Houari Boumediène AmzianeZakaria 450-455 In this paper, we present our dzFinNlp team’s contribution for intent detection in financial conversational agents, as part of the AraFinNLP shared task. We experimented with various models and feature configurations, including traditional machine learning methods like LinearSVC with TF-IDF, as well as deep learning models like Long Short-Term Memory (LSTM). Additionally, we explored the use of transformer-based models for this task. Our experiments show promising results, with our best model achieving a micro F1-score of 93.02% and 67.21% on the ArBanking77 dataset, in the development and test sets, respectively. 
@@ -560,13 +560,13 @@ <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Propagandistic Techniques Detection in Unimodal and Multimodal <fixed-case>A</fixed-case>rabic Content - MaramHasanainQatar Computing Research Institute - Md. AridHasanUniversity of New Brunswick + MaramHasanainQatar Computing Research Institute + Md. AridHasanUniversity of New Brunswick FatemaAhmadHamad Bin Khalifa University - ReemSuwailehHamad Bin Khalifa University - Md. RafiulBiswas - WajdiZaghouani - FirojAlamQatar Computing Research Institute + ReemSuwailehHamad Bin Khalifa University + Md. RafiulBiswas + WajdiZaghouani + FirojAlamQatar Computing Research Institute 456-466 We present an overview of the second edition of the ArAIEval shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. In this edition, ArAIEval offers two tasks: (i) detection of propagandistic textual spans with persuasion techniques identification in tweets and news articles, and (ii) distinguishing between propagandistic and non-propagandistic memes. A total of 14 teams participated in the final evaluation phase, with 6 and 9 teams participating in Tasks 1 and 2, respectively. Finally, 11 teams submitted system description papers. Across both tasks, we observed that fine-tuning transformer models such as AraBERT was at the core of the majority of the participating systems. We provide a description of the task setup, including a description of the dataset construction and the evaluation setup. We further provide a brief overview of the participating systems. All datasets and evaluation scripts are released to the research community. We hope this will enable further research on these important tasks in Arabic. 
2024.arabicnlp-1.44 @@ -575,11 +575,11 @@ <fixed-case>M</fixed-case>eme<fixed-case>M</fixed-case>ind at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Generative Augmentation and Feature Fusion for Multimodal Propaganda Detection in <fixed-case>A</fixed-case>rabic Memes through Advanced Language and Vision Models - UzairShah - Md. RafiulBiswas - MarcoAgusHamad Bin Khalifa University + UzairShah + Md. RafiulBiswas + MarcoAgusHamad Bin Khalifa University MowafaHouseh - WajdiZaghouaniNorthwestern University + WajdiZaghouaniNorthwestern University 467-472 Detecting propaganda in multimodal content, such as memes, is crucial for combating disinformation on social media. This paper presents a novel approach for the ArAIEval 2024 shared Task 2 on Multimodal Propagandistic Memes Classification, involving text, image, and multimodal classification of Arabic memes. For text classification (Task 2A), we fine-tune state-of-the-art Arabic language models and use ChatGPT4-generated synthetic text for data augmentation. For image classification (Task 2B), we fine-tune ResNet18, EfficientFormerV2, and ConvNeXt-tiny architectures with DALL-E-2-generated synthetic images. For multimodal classification (Task 2C), we combine ConvNeXt-tiny and BERT architectures in a fusion layer to enhance binary classification. Our results show significant performance improvements with data augmentation for text and image classification models and with the fusion layer for multimodal classification. We highlight challenges and opportunities for future research in multimodal propaganda detection in Arabic content, emphasizing the need for robust and adaptable models to combat disinformation. 
2024.arabicnlp-1.45 @@ -588,13 +588,13 @@ <fixed-case>ASOS</fixed-case> at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Integrating Text and Image Embeddings for Multimodal Propaganda Detection in <fixed-case>A</fixed-case>rabic Memes - YasserAlhabashiPrince Sultan University + YasserAlhabashiPrince Sultan University AbdullahAlharbi SamarAhmad SerrySibaeeprince sultan university OmerNacar - LahouariGhouti - AnisKoubaaPrince sultan university + LahouariGhouti + AnisKoubaaPrince sultan university 473-477 This paper describes our participation in the ArAIEval Shared Task 2024, focusing on Task 2C, which challenges participants to detect propagandistic elements in multimodal Arabic memes. The challenge involves analyzing both the textual and visual components of memes to identify underlying propagandistic messages. Our approach integrates the capabilities of MARBERT and ResNet50, top-performing pre-trained models for text and image processing, respectively. Our system architecture combines these models through a fusion layer that integrates and processes the extracted features, creating a comprehensive representation that is more effective in detecting nuanced propaganda. Our proposed system achieved significant success, placing second with an F1 score of 0.7987. 2024.arabicnlp-1.46 @@ -625,7 +625,7 @@ Nullpointer at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: <fixed-case>A</fixed-case>rabic Propagandist Technique Detection with Token-to-Word Mapping in Sequence Tagging - AbrarAbir + AbrarAbir KemalOflazerCarnegie Mellon University 489-493 This paper investigates the optimization of propaganda technique detection in Arabic text, including tweets & news paragraphs, from ArAIEval shared task 1. 
Our approach involves fine-tuning the AraBERT v2 model with a neural network classifier for sequence tagging.Experimental results show relying on the first token of the word for technique prediction produces the best performance. In addition, incorporating genre information as a feature further enhances the model’s performance. Our system achieved a score of 25.41, placing us 4th on the leaderboard. Subsequent post-submission improvements further raised our score to 26.68. @@ -635,9 +635,9 @@ <fixed-case>M</fixed-case>eme<fixed-case>M</fixed-case>ind at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Spotting Persuasive Spans in <fixed-case>A</fixed-case>rabic Text with Persuasion Techniques Identification - Md. RafiulBiswas + Md. RafiulBiswas ZubairShah - WajdiZaghouaniNorthwestern University + WajdiZaghouaniNorthwestern University 494-500 This paper focuses on detecting propagandistic spans and persuasion techniques in Arabic text from tweets and news paragraphs. Each entry in the dataset contains a text sample and corresponding labels that indicate the start and end positions of propaganda techniques within the text. Tokens falling within a labeled span were assigned ’B’ (Begin) or ’I’ (Inside) tags, ’O’, corresponding to the specific propaganda technique. Using attention masks, we created uniform lengths for each span and assigned BIO tags to each token based on the provided labels. Then, we used AraBERT-base pre-trained model for Arabic text tokenization and embeddings with a token classification layer to identify propaganda techniques. Our training process involves a two-phase fine-tuning approach. First, we train only the classification layer for a few epochs, followed by full model fine-tuning, updating all parameters. This methodology allows the model to adapt to the specific characteristics of the propaganda detection task while leveraging the knowledge captured by the pretrained AraBERT model. 
Our approach achieved an F1 score of 0.2774, securing the 3rd position in the leaderboard of Task 1. 2024.arabicnlp-1.50 @@ -647,7 +647,7 @@ <fixed-case>CLTL</fixed-case> at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Multimodal Propagandistic Memes Classification Using Transformer Models YeshanWangVrije Universiteit Amsterdam - IliaMarkovVrije Universiteit Amsterdam + IliaMarkovVrije Universiteit Amsterdam 501-506 We present the CLTL system designed for the ArAIEval Shared Task 2024 on multimodal propagandistic memes classification in Arabic. The challenge was divided into three subtasks: identifying propagandistic content from textual modality of memes (subtask 2A), from visual modality of memes (subtask 2B), and in a multimodal scenario when both modalities are combined (subtask 2C). We explored various unimodal transformer models for Arabic language processing (subtask 2A), visual models for image processing (subtask 2B), and concatenated text and image embeddings using the Multilayer Perceptron fusion module for multimodal propagandistic memes classification (subtask 2C). Our system achieved 77.96% for subtask 2A, 71.04% for subtask 2B, and 79.80% for subtask 2C, ranking 2nd, 1st, and 3rd on the leaderboard. 2024.arabicnlp-1.51 @@ -669,7 +669,7 @@ <fixed-case>A</fixed-case>lex<fixed-case>UNLP</fixed-case>-<fixed-case>MZ</fixed-case> at <fixed-case>A</fixed-case>r<fixed-case>AIE</fixed-case>val Shared Task: Contrastive Learning, <fixed-case>LLM</fixed-case> Features Extraction and Multi-Objective Optimization for <fixed-case>A</fixed-case>rabic Multi-Modal Meme Propaganda Detection MohamedZaytoonAlexandria University - NagwaEl-Makky + NagwaEl-Makky MarwanTorkiAlexandria University 512-517 The rise of memes as a tool for spreading propaganda presents a significant challenge in the current digital environment. In this paper, we outline our work for the ArAIEval Shared Task2 in ArabicNLP 2024. 
This study introduces a method for identifying propaganda in Arabic memes using a multimodal system that combines textual and visual indicators to enhance the result. Our approach achieves the first place in text classification with Macro-F1 of 78.69%, the third place in image classification with Macro-F1 of 65.92%, and the first place in multimodal classification with Macro-F1 of 80.51% @@ -682,9 +682,9 @@ SymomShohan Md.Hossain AshrafulParan - ShawlyAhsanChittagong University of Engineering and Technology + ShawlyAhsanChittagong University of Engineering and Technology JawadHossain - Mohammed MoshiulHoqueChittagong University of Engineering and Technology + Mohammed MoshiulHoqueChittagong University of Engineering and Technology 518-523 Detecting propagandistic spans and identifying persuasion techniques are crucial for promoting informed decision-making, safeguarding democratic processes, and fostering a media environment characterized by integrity and transparency. Various machine learning (Logistic Regression, Random Forest, and Multinomial Naive Bayes), deep learning (CNN, CNN+LSTM, CNN+BiLSTM), and transformer-based (AraBERTv2, AraBERT-NER, CamelBERT, BERT-Base-Arabic) models were exploited to perform the task. The evaluation results indicate that CamelBERT achieved the highest micro-F1 score (24.09%), outperforming CNN+LSTM and AraBERTv2. The study found that most models struggle to detect propagandistic spans when multiple spans are present within the same article. Overall, the model’s performance secured a 6^{th} place ranking in the ArAIEval Shared Task-1. 
2024.arabicnlp-1.54 @@ -703,13 +703,13 @@ The <fixed-case>FIGNEWS</fixed-case> Shared Task on News Media Narratives - WajdiZaghouani - MustafaJarrarBirzeit University - NizarHabashNew York University Abu Dhabi + WajdiZaghouani + MustafaJarrarBirzeit University + NizarHabashNew York University Abu Dhabi HoudaBouamorCarnegie Mellon University ImedZitouniGoogle MonaDiabCarnegie Mellon University - SamhaaEl-Beltagy + SamhaaEl-Beltagy MuhammedAbuOdehNew York University, Abu Dhabi 530-547 We present an overview of the FIGNEWS shared task, organized as part of the Arabic-NLP 2024 conference co-located with ACL 2024. The shared task addresses bias and propaganda annotation in multilingual news posts. We focus on the early days of the Israel War on Gaza as a case study. The task aims to foster collaboration in developing annotation guidelines for subjective tasks by creating frameworks for analyzing diverse narratives highlighting potential bias and propaganda. In a spirit of fostering and encouraging diversity, we address the problem from a multilingual perspective, namely within five languages: English, French, Arabic, Hebrew, and Hindi. A total of 17 teams participated in two annotation subtasks: bias (16 teams) and propaganda (6 teams). The teams competed in four evaluation tracks: guidelines development, annotation quality, annotation quantity, and consistency. Collectively, the teams produced 129,800 data points. Key findings and implications for the field are discussed.
@@ -719,14 +719,14 @@ Narrative Navigators at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: New Frontiers in Bias and Propaganda Annotation Techniques - MaryamAlEmadi + MaryamAlEmadi JanaElMesselmani LynaBermak GoumanaAbdullah Esra’aSharqawi AnissaJrad ZiedZouabiIHET Sidi Dhrif - WajdiZaghouaniNorthwestern University + WajdiZaghouaniNorthwestern University 548-554 This paper presents our team’s contribution to the FIGNEWS 2024 Shared Task, which involved annotating bias and propaganda in news coverage of the Israel-Palestine conflict. We developed comprehensive guidelines and employed a rigorous methodology to analyze 2,200 news posts from several official Facebook accounts of news websites in multiple languages. Our team, Narrative Navigators, achieved third place in both the Bias Guidelines and Bias Consistency tracks, demonstrating the effectiveness of our approach. We achieved an IAA Kappa score of 39.4 for bias annotation and 12.8 for propaganda detection. These findings and our performance underscore the need for enhanced media literacy and further research to counter the impact of biased and misleading information on public understanding of the conflict. 2024.arabicnlp-1.57 @@ -739,7 +739,7 @@ MohsenMahmoodzadehHasti Innovation Trading VanoosheNazari RaziehBahmanyarSharif University of Technology - KathrynBurrowsMadonna University + KathrynBurrowsMadonna University 555-560 In this study, we present a novel approach to annotating bias and propaganda in social media data by leveraging topic modeling techniques. Utilizing the BERTopic tool, we performed topic modeling on the FIGNEWS Shared-task dataset, which initially comprised 13,500 samples. From this dataset, we identified 35 distinct topics and selected approximately 50 representative samples from each topic, resulting in a subset of 1,812 samples. These selected samples were meticulously annotated for bias and propaganda labels. 
Subsequently, we employed multiple methods like KNN, SVC, XGBoost, and RAG to develop a classifier capable of detecting bias and propaganda within social media content. Our approach demonstrates the efficacy of using topic modeling for efficient data subset selection and provides a robust foundation for improving the accuracy of bias and propaganda detection in large-scale social media datasets. 2024.arabicnlp-1.58 @@ -778,8 +778,8 @@ SadafZiafatSadafZiafat MominaIshfaq AlishbaSuboor - HammadAfzalNational University of Science and Technology - SeemabLatifNational University of Science and Technology + HammadAfzalNational University of Science and Technology + SeemabLatifNational University of Science and Technology 573-579 In this paper, we present our methodology and findings from participating in the FIGNEWS 2024 shared task on annotating news fragments on the Gaza-Israel war for bias and propaganda detection. The task aimed to refine the FIGNEWS 2024 annotation guidelines and to contribute to the creation of a comprehensive dataset to advance research in this field. Our team employed a multi-faceted approach to ensure high accuracy in data annotations. Our results highlight key challenges in detecting bias and propaganda, such as the need for more comprehensive guidelines. Our team ranked first in all tracks for propaganda annotation. For Bias, the team stood in first place for the Guidelines and IAA tracks, and in second place for the Quantity and Consistency tracks. 
2024.arabicnlp-1.61 @@ -788,7 +788,7 @@ Bias Bluff Busters at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Developing Guidelines to Make Bias Conscious - JasminHeierliZHAW - Zürcher Hochschule für Angewandte Wissenschaften + JasminHeierliZHAW - Zürcher Hochschule für Angewandte Wissenschaften SilviaParetiAi4privacy SerenaParetiCatholic University of the Sacred Heart TatianaLando @@ -800,8 +800,8 @@ Ceasefire at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Automated Detection and Annotation of Media Bias Using Large Language Models - NoorSadiah - SaraAl-Emadi + NoorSadiah + SaraAl-Emadi SumayaRahman 590-600 In this paper, we present our approach for FIGNEWS Subtask 1, which focuses on detecting bias in news media narratives about the Israel war on Gaza. We used a Large Language Model (LLM) and prompt engineering, using GPT-3.5 Turbo API, to create a model that automatically flags biased news media content with 99% accuracy. This approach provides Natural Language Processing (NLP) researchers with a robust and effective solution for automating bias detection in news media narratives using supervised learning algorithms. Additionally, this paper provides a detailed analysis of the labeled content, offering valuable insights into media bias in conflict reporting. Our work advances automated content analysis and enhances understanding of media bias. @@ -812,7 +812,7 @@ <fixed-case>S</fixed-case>ahara Pioneers at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Data Annotation Guidelines for Propaganda Detection in News Items MarwaSolla - HassanEbrahem + HassanEbrahem AlyaIssaUniversity of Tripoli HarmainHarmainUniversity of Tripoli AbdusalamNwesri @@ -828,7 +828,7 @@ BlqeesAl Busaidi MalathAl-Sibani Hiba Salim MuhammadAl-Siyabi - NajmaAl ZidjalySultan Qaboos University + NajmaAl ZidjalySultan Qaboos University 609-613 In this study, we aimed to identify biased language in a dataset provided by the FIGNEWS 2024 committee on the Gaza-Israel war. 
We classified entries into seven categories: Unbiased, Biased against Palestine, Biased against Israel, Biased against Others, Biased against both Palestine and Israel, Unclear, and Not Applicable. Our team reviewed the literature to develop a codebook of terminologies and definitions. By coding each example, we sought to detect language tendencies used by media outlets when reporting on the same event. The primary finding was that most examples were classified as “Biased against Palestine,” as all examined language data used one-sided terms to describe the October 7 event. The least used category was “Not Applicable,” reserved for irrelevant examples or those lacking context. It is recommended to use neutral and balanced language when reporting volatile political news. 2024.arabicnlp-1.65 @@ -837,10 +837,10 @@ The <fixed-case>C</fixed-case>yber<fixed-case>E</fixed-case>quity Lab at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Annotating a Corpus of <fixed-case>F</fixed-case>acebook Posts to Label Bias and Propaganda in <fixed-case>G</fixed-case>aza-<fixed-case>I</fixed-case>srael War Coverage in Five Languages - MohammedHelalBirzeit University - RadiJarrarBirzeit University + MohammedHelalBirzeit University + RadiJarrarBirzeit University MohammedAlkhanafsehBirzeit University - AbdallahKarakraBirzeit University + AbdallahKarakraBirzeit University RubaAwadallah 614-619 This paper presents The_CyberEquity_Lab team’s participation in the FIGNEWS 2024 Shared Task (Zaghouani, et al., 2024). The task is to annotate a corpus of Facebook posts into bias and propaganda in covering the Gaza-Israel war. The posts represent news articles written in five different languages. The paper presents the guidelines of annotation that the team has adhered in identifying both bias and propaganda in coverage of this continuous conflict. 
@@ -851,7 +851,7 @@ <fixed-case>BSC</fixed-case>-<fixed-case>LANGTECH</fixed-case> at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Exploring Semi-Automatic Bias Annotation using Frame Analysis ValleRuiz-Fernández - JoséSaizBarcelona Supercomputing Center + JoséSaizBarcelona Supercomputing Center AitorGonzalez-Agirre 620-629 This paper introduces the methodology of BSC-LANGTECH team for the FIGNEWS 2024 Shared Task on News Media Narratives. Following the bias annotation subtask, we apply the theory and methods of framing analysis to develop guidelines to annotate bias in the corpus provided by the task organizators. The manual annotation of a subset, with which a moderate IAA agreement has been achieved, is further used in Deep Learning techniques to explore automatic annotation and test the reliability of our framework. @@ -861,13 +861,13 @@ <fixed-case>G</fixed-case>roningen<fixed-case>A</fixed-case>nnotates<fixed-case>G</fixed-case>aza at the <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Analyzing Bias in Conflict Narratives - KhalidKhatibUniversity of Groningen + KhalidKhatibUniversity of Groningen SaraGemelliUniversity of Bergamo SaskiaHeisterborg PrithaMajumdar - GosseMinnema + GosseMinnema AriannaMuti - NoaSolissaUniversity of Groningen + NoaSolissaUniversity of Groningen 630-639 In this paper we report the development of our annotation methodology for the shared task FIGNEWS 2024. The objective of the shared task is to look into the layers of bias in how the war on Gaza is represented in media narrative. Our methodology follows the prescriptive paradigm, in which guidelines are detailed and refined through an iterative process in which edge cases are discussed and converged. Our IAA score (Krippendorff’s \alpha) is 0.420, highlighting the challenging and subjective nature of the task. Our results show that 52% of posts were unbiased, 42% biased against Palestine, 5% biased against Israel, and 3% biased against both. 
16% were unclear or not applicable. 2024.arabicnlp-1.68 @@ -878,9 +878,9 @@ Sina at <fixed-case>F</fixed-case>ig<fixed-case>N</fixed-case>ews 2024: Multilingual Datasets Annotated with Bias and Propaganda. LinaDuaibes AreejJaberPalestine Technical University - Kadoorie - MustafaJarrarBirzeit University + MustafaJarrarBirzeit University AhmadQadiaaup.edu - MaisQandeelÖrebro University + MaisQandeelÖrebro University 640-645 The proliferation of bias and propaganda on social media is an increasingly significant concern, leading to the development of techniques for automatic detection. This article presents a multilingual corpus of 12,000 Facebook posts fully annotated for bias and propaganda. The corpus was created as part of the FigNews 2024 Shared Task on News Media Narratives for framing the Israeli War on Gaza. It covers various events during the War from October 7, 2023 to January 31, 2024. The corpus comprises 12,000 posts in five languages (Arabic, Hebrew, English, French, and Hindi), with 2,400 posts for each language. The annotation process involved 10 graduate students specializing in Law. The Inter-Annotator Agreement (IAA) was used to evaluate the annotations of the corpus, with an average IAA of 80.8% for bias and 70.15% for propaganda annotations. Our team was ranked among the best-performing teams in both Bias and Propaganda subtasks. The corpus is open-source and available at https://sina.birzeit.edu/fada 2024.arabicnlp-1.69 @@ -891,7 +891,7 @@ <fixed-case>SQU</fixed-case>ad at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: Unmasking Bias in Social Media Through Data Analysis and Annotation AsmahanAl-Mamari FatmaAl-Farsi - NajmaZidjalySultan Qaboos University + NajmaZidjalySultan Qaboos University 646-650 This paper is a part of the FIGNEWS 2024 Datathon Shared Task and it aims to investigate bias and double standards in media coverage of the Gaza-Israel 2023-2024 conflict through a comprehensive analysis of news articles.
The methodology integrated both manual labeling as well as the application of a natural language processing (NLP) tool, which is the Facebook/BART-large-MNLI model. The annotation process involved categorizing the dataset based on identified biases, following a set of guidelines in which categories of bias were defined by the team. The findings revealed that most of the media texts provided for analysis included bias against Palestine, whether it was through the use of biased vocabulary or even tone. It was also found that texts written in Hebrew contained the most bias against Palestine. In addition, when comparing annotations done by AAI-1 and AAI-2, the results turned out to be very similar, which might be mainly due to the clear annotation guidelines set by the annotators themselves. Thus, we recommend the use of clear guidelines to facilitate the process of annotation by future researchers. 2024.arabicnlp-1.70 @@ -922,7 +922,7 @@ The Guidelines Specialists at <fixed-case>FIGNEWS</fixed-case> 2024 Shared Task: An annotation guideline to Unravel Bias in News Media Narratives Using a Linguistic Approach - GhizlaneBourahouat + GhizlaneBourahouat SamarAmer 672-676 This article presents the participation of “The Guideline Specialists” in the FIGNEWS 2024 Shared Task, which aims to unravel bias and propaganda in news media narratives surrounding the Gaza-Israel 2023-2024 war. Leveraging innovative annotation methodologies and drawing on a diverse team of annotators, our approach focuses on meticulously annotating news articles using a linguistic approach to uncover the intricate nuances of bias. By incorporating detailed examples and drawing on related work that show how language structure represented in the use of passive voice or the use of nominalization and the choice of vocabulary carry bias, our findings provide valuable insights into the representation of the Gaza-Israel conflict across various languages and cultures. 
The guideline we developed detected the bias against Gaza, against Israel and others by setting keywords that are based on linguistic background tested by the AntConc concordance tool. The result was an annotation guideline that have a solid base. Through this collaborative effort, we developed a guideline that contributes to fostering a deeper understanding of media narratives during one of the most critical moments in recent history. @@ -932,11 +932,11 @@ <fixed-case>KSAA</fixed-case>-<fixed-case>CAD</fixed-case> Shared Task: Contemporary <fixed-case>A</fixed-case>rabic Dictionary for Reverse Dictionary and Word Sense Disambiguation - WaadAlshammariKing Salman Global Academy for Arabic Language + WaadAlshammariKing Salman Global Academy for Arabic Language AmalAlmazruaKing Abdulaziz City for Science and Technology (KACST) - AsmaAl Wazrah + AsmaAl Wazrah RawanAlmatham - MuneeraAlhoshanKing Abdulaziz City for Science and Technology (KACST) + MuneeraAlhoshanKing Abdulaziz City for Science and Technology (KACST) AbdulrahmanAlosaimyAl-Imam Mohamed Ibn Saud Islamic University 677-685 This paper outlines the KSAA-CAD shared task, highlighting the Contemporary Arabic Language Dictionary within the scenario of developing a Reverse Dictionary (RD) system and enhancing Word Sense Disambiguation (WSD) capabilities. The first KSAA-RD (Al-Matham et al., 2023) highlighted significant gaps in the domain of RDs, which are designed to retrieve words by their meanings or definitions. This shared task comprises two tasks: RD and WSD. The RD task focuses on identifying word embeddings that most accurately match a given definition, termed a “gloss,” in Arabic. Conversely, the WSD task involves determining the specific meaning of a word in context, particularly when the word has multiple meanings. The winning team achieved the highest-ranking score of 0.0644 in RD using Electra embeddings. 
In this paper, we describe the methods employed by the participating teams and provide insights into the future direction of KSAA-CAD. @@ -970,8 +970,8 @@ AbdullahAlharbi SamarAhmad OmerNacar - AnisKoubaaPrince sultan university - LahouariGhouti + AnisKoubaaPrince sultan university + LahouariGhouti 697-703 Semantic search tasks have grown extremely fast following the advancements in large language models, including the Reverse Dictionary and Word Sense Disambiguation in Arabic. This paper describes our participation in the Contemporary Arabic Dictionary Shared Task. We propose two models that achieved first place in both tasks. We conducted comprehensive experiments on the latest five multilingual sentence transformers and the Arabic BERT model for semantic embedding extraction. We achieved a ranking score of 0.06 for the reverse dictionary task, which is double than last year’s winner. We had an accuracy score of 0.268 for the Word Sense Disambiguation task. 2024.arabicnlp-1.77 @@ -980,8 +980,8 @@ Baleegh at <fixed-case>KSAA</fixed-case>-<fixed-case>CAD</fixed-case> 2024: Towards Enhancing <fixed-case>A</fixed-case>rabic Reverse Dictionaries - MaisAlheraki - SouhamMeshoul + MaisAlheraki + SouhamMeshoul 704-708 The domain of reverse dictionaries (RDs), while advancing in languages like English and Chinese, remains underdeveloped for Arabic. This study attempts to explore a data-driven approach to enhance word retrieval processes in Arabic RDs. The research focuses on the ArabicNLP 2024 Shared Task, named KSAA-CAD, which provides a dictionary dataset of 39,214 word-gloss pairs, each with a corresponding target word embedding. The proposed solution aims to surpass the baseline performance by employing SOTA deep learning models and innovative data expansion techniques. 
The methodology involves enriching the dataset with contextually relevant examples, training a T5 model to align the words to their glosses in the space, and evaluating the results on the shared task metrics. We find that our model is closely aligned with the baseline performance on bertseg and bertmsa targets, however does not perform well on electra target, suggesting the need for further exploration. 2024.arabicnlp-1.78 @@ -990,14 +990,14 @@ <fixed-case>NADI</fixed-case> 2024: The Fifth Nuanced <fixed-case>A</fixed-case>rabic Dialect Identification Shared Task - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia AmrKelegUniversity of Edinburgh, University of Edinburgh AbdelRahimElmadanyUniversity of British Columbia - ChiyuZhangUniversity of British Columbia + ChiyuZhangUniversity of British Columbia InjyHamedUniversity of Stuttgart, Universität Stuttgart - WalidMagdyUniversity of Edinburgh + WalidMagdyUniversity of Edinburgh HoudaBouamorCarnegie Mellon University - NizarHabashNew York University Abu Dhabi + NizarHabashNew York University Abu Dhabi 709-728 We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI’s objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on prespecified tasks. NADI 2024 targeted both dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of whom 12 teams have participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. 
The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE for Subtask 2, and 20.44 BLEU in Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI. 2024.arabicnlp-1.79 @@ -1006,9 +1006,9 @@ <fixed-case>A</fixed-case>rabic Train at <fixed-case>NADI</fixed-case> 2024 shared task: <fixed-case>LLM</fixed-case>s’ Ability to Translate <fixed-case>A</fixed-case>rabic Dialects into <fixed-case>M</fixed-case>odern <fixed-case>S</fixed-case>tandard <fixed-case>A</fixed-case>rabic - AnastasiiaDemidovaMohamed bin Zayed University of Artificial Intelligence + AnastasiiaDemidovaMohamed bin Zayed University of Artificial Intelligence HaninAtwany - NourRabih + NourRabih SanadSha’ban 729-734 Navigating the intricacies of machine translation (MT) involves tackling the nuanced disparities between Arabic dialects and Modern Standard Arabic (MSA), presenting a formidable obstacle. In this study, we delve into Subtask 3 of the NADI shared task (CITATION), focusing on the translation of sentences from four distinct Arabic dialects into MSA. Our investigation explores the efficacy of various models, including Jais, NLLB, GPT-3.5, and GPT-4, in this dialect-to-MSA translation endeavor. Our findings reveal that Jais surpasses all other models, boasting an average BLEU score of 19.48 in the combination of zero- and few-shot setting, whereas NLLB exhibits the least favorable performance, garnering a BLEU score of 8.77. 
@@ -1020,7 +1020,7 @@ <fixed-case>A</fixed-case>lex<fixed-case>UNLP</fixed-case>-<fixed-case>STM</fixed-case> at <fixed-case>NADI</fixed-case> 2024 shared task: Quantifying the <fixed-case>A</fixed-case>rabic Dialect Spectrum with Contrastive Learning, Weighted Sampling, and <fixed-case>BERT</fixed-case>-based Regression Ensemble AbdelrahmanSakr MarwanTorkiAlexandria University - NagwaEl-Makky + NagwaEl-Makky 735-741 Recognizing the nuanced spectrum of dialectness in Arabic text poses a significant challenge for natural language processing (NLP) tasks. Traditional dialect identification (DI) methods treat the task as binary, overlooking the continuum of dialect variation present in Arabic speech and text. In this paper, we describe our submission to the NADI shared Task of ArabicNLP 2024. We participated in Subtask 2 - ALDi Estimation, which focuses on estimating the Arabic Level of Dialectness (ALDi) for Arabic text, indicating how much it deviates from Modern Standard Arabic (MSA) on a scale from 0 to 1, where 0 means MSA and 1 means high divergence from MSA. We explore diverse training approaches, including contrastive learning, applying a random weighted sampler along with fine-tuning a regression task based on the AraBERT model, after adding a linear and non-linear layer on top of its pooled output. Finally, performing a brute force ensemble strategy increases the performance of our system. Our proposed solution achieved a Root Mean Squared Error (RMSE) of 0.1406, ranking second on the leaderboard. 
2024.arabicnlp-1.81 @@ -1029,10 +1029,10 @@ <fixed-case>NLP</fixed-case>_<fixed-case>DI</fixed-case> at <fixed-case>NADI</fixed-case> 2024 shared task: Multi-label <fixed-case>A</fixed-case>rabic Dialect Classifications with an Unsupervised Cross-Encoder - VaniKanjirangatDalle Molle Institute for Artificial Intelligence USI-SUPSI + VaniKanjirangatDalle Molle Institute for Artificial Intelligence USI-SUPSI TanjaSamardzicUniversity of Zurich - LjiljanaDolamicarmasuisse - FabioRinaldiIDSIA + LjiljanaDolamicarmasuisse + FabioRinaldiIDSIA 742-747 We report the approaches submitted to the NADI 2024 Subtask 1: Multi-label country-level Dialect Identification (MLDID). The core part was to adapt the information from multi-class data for a multi-label dialect classification task. We experimented with supervised and unsupervised strategies to tackle the task in this challenging setting. Under the supervised setup, we used the model trained using NADI 2023 data and devised approaches to convert the multi-class predictions to multi-label by using information from the confusion matrix or using calibrated probabilities. Under unsupervised settings, we used the Arabic-based sentence encoders and multilingual cross-encoders to retrieve similar samples from the training set, considering each test input as a query. The associated labels are then assigned to the input query. We also tried different variations, such as co-occurring dialects derived from the provided development set. We obtained the best validation performance of 48.5% F-score using one of the variations with an unsupervised approach and the same approach yielded the best test result of 43.27% (Ranked 2). 
2024.arabicnlp-1.82 @@ -1044,8 +1044,8 @@ OmerNacar SerrySibaeeprince sultan university AbdullahAlharbi - LahouariGhouti - AnisKoubaaPrince sultan university + LahouariGhouti + AnisKoubaaPrince sultan university 748-753 This study undertakes a comprehensive investigation of transformer-based models to advance Arabic language processing, focusing on two pivotal aspects: the estimation of Arabic Level of Dialectness and dialectal sentence-level machine translation into Modern Standard Arabic. We conducted various evaluations of different sentence transformers across a proposed regression model, showing that the MARBERT transformer-based proposed regression model achieved the best root mean square error of 0.1403 for Arabic Level of Dialectness estimation. In parallel, we developed bi-directional translation models between Modern Standard Arabic and four specific Arabic dialects—Egyptian, Emirati, Jordanian, and Palestinian—by fine-tuning and evaluating different sequence-to-sequence transformers. This approach significantly improved translation quality, achieving a BLEU score of 0.1713. We also enhanced our evaluation capabilities by integrating MSA predictions from the machine translation model into our Arabic Level of Dialectness estimation framework, forming a comprehensive pipeline that not only demonstrates the effectiveness of our methodologies but also establishes a new benchmark in the deployment of advanced Arabic NLP technologies. 
2024.arabicnlp-1.83 @@ -1054,8 +1054,8 @@ dz<fixed-case>NLP</fixed-case> at <fixed-case>NADI</fixed-case> 2024 Shared Task: Multi-Classifier Ensemble with Weighted Voting and <fixed-case>TF</fixed-case>-<fixed-case>IDF</fixed-case> Features - MohamedLichouriUniversité des Sciences et de la Technologie Houari Boumediène - KhaledLounnasUniversité des Sciences et de la Technologie Houari Boumediène + MohamedLichouriUniversité des Sciences et de la Technologie Houari Boumediène + KhaledLounnasUniversité des Sciences et de la Technologie Houari Boumediène ZahafNadjib RabiaiAyoub 754-757 @@ -1079,7 +1079,7 @@ Alson at <fixed-case>NADI</fixed-case> 2024 shared task: Alson - A fine-tuned model for <fixed-case>A</fixed-case>rabic Dialect Translation - MananAlMusallamAl-Imam Mohamed Ibn Saud Islamic University + MananAlMusallamAl-Imam Mohamed Ibn Saud Islamic University SamarAhmad 764-768 DA-MSA Machine Translation is a recent challenge due to the multitude of Arabic dialects and their variations. In this paper, we present our results within the context of Subtask 3 of the NADI-2024 Shared Task (Abdul-Mageed et al., 2024), that is, DA-MSA Machine Translation. We utilized the DIALECTS-MSA MADAR corpus (Bouamor et al., 2018), the Emi-NADI corpus for the Emirati dialect (Khered et al., 2023), and we augmented the Palestinian and Jordanian datasets based on NADI 2021. Our approach involves developing sentence-level machine translations from Palestinian, Jordanian, Emirati, and Egyptian dialects to Modern Standard Arabic (MSA). To address this challenge, we fine-tuned models such as (Nagoudi et al., 2022) AraT5v2-msa-small, AraT5v2-msa-base, and (Elmadany et al., 2023) AraT5v2-base-1024 to compare their performance. Among these, the AraT5v2-base-1024 model achieved the best accuracy, with a BLEU score of 0.1650 on the development set and 0.1746 on the test set.
@@ -1098,8 +1098,8 @@ <tex-math>StanceEval 2024: The First Arabic Stance Detection Shared Task</tex-math> - NoraAlturayeifImam Abdulrahman Bin Faisal University - HamzahLuqmanKing Fahad University of Petroleum and Minerals + NoraAlturayeifImam Abdulrahman Bin Faisal University + HamzahLuqmanKing Fahad University of Petroleum and Minerals ZaidAlyafeai AsmaYamani 774-782 @@ -1122,7 +1122,7 @@ <fixed-case>ANLP</fixed-case> <fixed-case>RG</fixed-case> at <fixed-case>S</fixed-case>tance<fixed-case>E</fixed-case>val2024: Comparative Evaluation of Stance, Sentiment and Sarcasm Detection MezghaniAmalUniversité Virtuelle de Tunis RahmaBoujelbane - MariemEllouze + MariemEllouze 788-793 As part of our study, we worked on three tasks: stance detection, sarcasm detection and sentiment analysis using fine-tuning techniques on BERT-based models. Fine-tuning parameters were carefully adjusted over multiple iterations to maximize model performance. The three tasks are essential in the field of natural language processing (NLP) and present unique challenges. Stance detection is a critical task aimed at identifying a writer’s stances or viewpoints in relation to a topic. Sarcasm detection seeks to spot sarcastic expressions, while sentiment analysis determines the attitude expressed in a text. After numerous experiments, we identified Arabert-twitter as the model offering the best performance for all three tasks. In particular, it achieves a macro F-score of 78.08% for stance detection, a macro F1-score of 59.51% for sarcasm detection and a macro F1-score of 64.57% for sentiment detection.
.Our source code is available at https://github.com/MezghaniAmal/Mawqif 2024.arabicnlp-1.90 @@ -1131,9 +1131,9 @@ dz<fixed-case>S</fixed-case>tance at <fixed-case>S</fixed-case>tance<fixed-case>E</fixed-case>val2024: <fixed-case>A</fixed-case>rabic Stance Detection based on Sentence Transformers - MohamedLichouriUniversité des Sciences et de la Technologie Houari Boumediène - KhaledLounnasUniversité des Sciences et de la Technologie Houari Boumediène - OuarasRafik + MohamedLichouriUniversité des Sciences et de la Technologie Houari Boumediène + KhaledLounnasUniversité des Sciences et de la Technologie Houari Boumediène + OuarasRafik MohamedABi AnisGuechtouli 794-799 @@ -1186,7 +1186,7 @@ MohamedBadran Mo’menHamdyAlexandria University MarwanTorkiAlexandria University - NagwaEl-Makky + NagwaEl-Makky 823-827 Stance detection, an evolving task in natural language processing, involves understanding a writer’s perspective on certain topics by analyzing his written text and interactions online, especially on social media platforms. In this paper, we outline our submission to the StanceEval task, leveraging the Mawqif dataset featured in The Second Arabic Natural Language Processing Conference. Our task is to detect writers’ stances (Favor, Against, or None) towards three selected topics (COVID-19 vaccine, digital transformation, and women empowerment). We present our approach primarily relying on a contrastive loss ensemble strategy. Our proposed approach achieved an F1-score of 0.8438 and ranked first in the stanceEval 2024 task. 
The code and checkpoints are availableat https://github.com/MBadran2000/Mawqif.git 2024.arabicnlp-1.96 @@ -1219,7 +1219,7 @@ <fixed-case>PICT</fixed-case> at <fixed-case>S</fixed-case>tance<fixed-case>E</fixed-case>val2024: Stance Detection in <fixed-case>A</fixed-case>rabic using Ensemble of Large Language Models IshaanShukla AnkitVaidya - GeetanjaliKaleSCTR’s Pune Institute of Computer Technology + GeetanjaliKaleSCTR’s Pune Institute of Computer Technology 837-841 This paper outlines our approach to the StanceEval 2024- Arabic Stance Evaluation shared task. The goal of the task was to identify the stance, one out of three (Favor, Against or None) towards tweets based on three topics, namely- COVID-19 Vaccine, Digital Transformation and Women Empowerment. Our approach consists of fine-tuning BERT-based models efficiently for both, Single-Task Learning as well as Multi-Task Learning, the details of which are discussed. Finally, an ensemble was implemented on the best-performing models to maximize overall performance. We achieved a macro F1 score of 78.02% in this shared task. Our codebase is available publicly. 
2024.arabicnlp-1.99 @@ -1228,7 +1228,7 @@ <fixed-case>TAO</fixed-case> at <fixed-case>S</fixed-case>tance<fixed-case>E</fixed-case>val2024 Shared Task: <fixed-case>A</fixed-case>rabic Stance Detection using <fixed-case>A</fixed-case>ra<fixed-case>BERT</fixed-case> - AnasMelhemPalestine Technical University - Kadoorie + AnasMelhemPalestine Technical University - Kadoorie OsamaHamedPalestine Technical University - Kadoorie ThaerSammarPalestine Technical University - Kadoorie 842-846 @@ -1239,12 +1239,12 @@ <fixed-case>W</fixed-case>ojood<fixed-case>NER</fixed-case> 2024: The Second <fixed-case>A</fixed-case>rabic Named Entity Recognition Shared Task - MustafaJarrarBirzeit University - NaghamHamadBirzeit University and Palestine Technical University - Kadoorie + MustafaJarrarBirzeit University + NaghamHamadBirzeit University and Palestine Technical University - Kadoorie MohammedKhaliliaQualtrics XM and Birzeit University - BasharTalafha + BasharTalafha AbdelRahimElmadanyUniversity of British Columbia - MuhammadAbdul-MageedUniversity of British Columbia + MuhammadAbdul-MageedUniversity of British Columbia 847-857 We present WojoodNER-2024, the second Arabic Named Entity Recognition (NER) Shared Task. In WojoodNER-2024, we focus on fine-grained Arabic NER. We provided participants with a new Arabic fine-grained NER dataset called Wojoodfine, annotated with subtypes of entities. WojoodNER-2024 encompassed three subtasks: (i) Closed-Track Flat Fine-Grained NER, (ii) Closed-Track Nested Fine-Grained NER, and (iii) an Open-Track NER for the Israeli War on Gaza. A total of 43 unique teams registered for this shared task. Five teams participated in the Flat Fine-Grained Subtask, among which two teams tackled the Nested Fine-Grained Subtask and one team participated in the Open-Track NER Subtask. The winning teams achieved F_1 scores of 91% and 92% in the Flat Fine-Grained and Nested Fine-Grained Subtasks, respectively. 
The sole team in the Open-Track Subtask achieved an F_1 score of 73.7%. 2024.arabicnlp-1.101 @@ -1253,12 +1253,12 @@ mu<fixed-case>NER</fixed-case>a at <fixed-case>W</fixed-case>ojood<fixed-case>NER</fixed-case> 2024: Multi-tasking <fixed-case>NER</fixed-case> Approach - NoufAlotaibiSaudi Data and AI Authority - HaneenAlhomoud + NoufAlotaibiSaudi Data and AI Authority + HaneenAlhomoud HananMurayshidKing Abdulaziz City for Science and Technology - WaadAlshammariKing Salman Global Academy for Arabic Language + WaadAlshammariKing Salman Global Academy for Arabic Language NoufAlshalawiKing Abdulaziz City for Science and Technology - SakharAlkhereyfKing Abdulaziz City for Science and Technology + SakharAlkhereyfKing Abdulaziz City for Science and Technology 858-866 This paper presents our system “muNERa”, submitted to the WojoodNER 2024 shared task at the second ArabicNLP conference. We participated in two subtasks, the flat and nested fine-grained NER sub-tasks (1 and 2). muNERa achieved first place in the nested NER sub-task and second place in the flat NER sub-task. The system is based on the TANL framework (CITATION),by using a sequence-to-sequence structured language translation approach to model both tasks. We utilize the pre-trained AraT5v2-base model as the base model for the TANL framework. The best-performing muNERa model achieves 91.07% and 90.26% for the F-1 scores on the test sets for the nested and flat subtasks, respectively. 2024.arabicnlp-1.102 diff --git a/data/xml/2024.argmining.xml b/data/xml/2024.argmining.xml index 272b37ed81..bdfe32d345 100644 --- a/data/xml/2024.argmining.xml +++ b/data/xml/2024.argmining.xml @@ -19,7 +19,7 @@ <fixed-case>ARIES</fixed-case>: A General Benchmark for Argument Relation Identification DebelaGemechu RamonRuiz-DolzUniversity of Dundee - ChrisReedUniversity of Dundee + ChrisReedUniversity of Dundee 1-14 Measuring advances in argument mining is one of the main challenges in the area. 
Different theories of argument, heterogeneous annotations, and a varied set of argumentation domains make it difficult to contextualise and understand the results reported in different work from a general perspective. In this paper, we present ARIES, a general benchmark for Argument Relation Identification aimed at providing with a standard evaluation for argument mining research. ARIES covers the three different language modelling approaches: sequence and token modelling, and sequence-to-sequence-to-sequence alignment, together with the three main Transformer-based model architectures: encoder-only, decoder-only, and encoder-decoder. Furthermore, the benchmark consists of eight different argument mining datasets, covering the most common argumentation domains, and standardised with the same annotation structures. This paper provides a first comprehensive and comparative set of results in argument mining across a broad range of configurations to compare with, both advancing the state-of-the-art, and establishing a standard way to measure future advances in the area. Across varied task setups and architectures, our experiments reveal consistent challenges in cross-dataset evaluation, with notably poor results. Given the models’ struggle to acquire transferable skills, the task remains challenging, opening avenues for future research. 2024.argmining-1.1 @@ -29,7 +29,7 @@ Detecting Scientific Fraud Using Argument Mining GabrielFreedman - FrancescaToniImperial College London + FrancescaToniImperial College London 15-28 A proliferation of fraudulent scientific research in recent years has precipitated a greater interest in more effective methods of detection. There are many varieties of academic fraud, but a particularly challenging type to detect is the use of paper mills and the faking of peer-review. 
To the best of our knowledge, there have so far been no attempts to automate this process.The complexity of this issue precludes the use of heuristic methods, like pattern-matching techniques, which are employed for other types of fraud. Our proposed method in this paper uses techniques from the Computational Argumentation literature (i.e. argument mining and argument quality evaluation). Our central hypothesis stems from the assumption that articles that have not been subject to the proper level of scrutiny will contain poorly formed and reasoned arguments, relative to legitimately published papers. We use a variety of corpora to test this approach, including a collection of abstracts taken from retracted papers. We show significant improvement compared to a number of baselines, suggesting that this approach merits further investigation. 2024.argmining-1.2 @@ -39,10 +39,10 @@ <fixed-case>D</fixed-case>eep<fixed-case>CT</fixed-case>-enhanced Lexical Argument Retrieval AlexanderBondarenkoFriedrich-Schiller Universität Jena and Universität Leipzig - MaikFröbeMartin-Luther Universität Halle-Wittenberg + MaikFröbeMartin-Luther Universität Halle-Wittenberg DanikHollatzMartin-Luther-Universität Halle-Wittenberg - JanMerkerFriedrich-Schiller Universität Jena - MatthiasHagenFriedrich-Schiller Universität Jena + JanMerkerFriedrich-Schiller Universität Jena + MatthiasHagenFriedrich-Schiller Universität Jena 29-35 The recent Touché lab’s argument retrieval task focuses on controversial topics like ‘Should bottled water be banned?’ and asks to retrieve relevant pro/con arguments. Interestingly, the most effective systems submitted to that task still are based on lexical retrieval models like BM25. In other domains, neural retrievers that capture semantics are more effective than lexical baselines. To add more “semantics” to argument retrieval, we propose to combine lexical models with DeepCT-based document term weights. 
Our evaluation shows that our approach is more effective than all the systems submitted to the Touché lab while being on par with modern neural re-rankers that themselves are computationally more expensive. 2024.argmining-1.3 @@ -52,8 +52,8 @@ Exploiting Dialogue Acts and Context to Identify Argumentative Relations in Online Debates StefanoMezza - WayneWobckeUniversity of New South Wales - AlanBlair + WayneWobckeUniversity of New South Wales + AlanBlair 36-45 Argumentative Relation Classification is the task of determining the relationship between two contributions in the context of an argumentative dialogue. Existing models in the literature rely on a combination of lexical features and pre-trained language models to tackle this task; while this approach is somewhat effective, it fails to take into account the importance of pragmatic features such as the illocutionary force of the argument or the structure of previous utterances in the discussion; relying solely on lexical features also produces models that over-fit their initial training set and do not scale to unseen domains. In this work, we introduce ArguNet, a new model for Argumentative Relation Classification which relies on a combination of Dialogue Acts and Dialogue Context to improve the representation of argument structures in opinionated dialogues. We show that our model achieves state-of-the-art results on the Kialo benchmark test set, and provide evidence of its robustness in an open-domain scenario. 
2024.argmining-1.4 @@ -63,7 +63,7 @@ Multi-Task Learning Improves Performance in Deep Argument Mining Models AmirhosseinFarzamDuke University, Duke University - ShashankShekharNew York University + ShashankShekharNew York University IsaacMehlhaffTexas A&M University - College Station and Texas A&M University - College Station MarcoMorucci 46-58 @@ -84,12 +84,12 @@ <fixed-case>MAMK</fixed-case>it: A Comprehensive Multimodal Argument Mining Toolkit - EleonoraManciniUniversity of Bologna - FedericoRuggeriUniversity of Bologna + EleonoraManciniUniversity of Bologna + FedericoRuggeriUniversity of Bologna StefanoColamonacoUniversity of Bologna AndreaZeccaUniversity of Bologna SamueleMarroUniversity of Bologna - PaoloTorroniUniversity of Bologna + PaoloTorroniUniversity of Bologna 69-82 Multimodal Argument Mining (MAM) is a recent area of research aiming to extend argument analysis and improve discourse understanding by incorporating multiple modalities. Initial results confirm the importance of paralinguistic cues in this field. However, the research community still lacks a comprehensive platform where results can be easily reproduced, and methods and models can be stored, compared, and tested against a variety of benchmarks. To address these challenges, we propose MAMKit, an open, publicly available, PyTorch toolkit that consolidates datasets and models, providing a standardized platform for experimentation. MAMKit also includes some new baselines, designed to stimulate research on text and audio encoding and fusion for MAM tasks. Our initial results with MAMKit indicate that advancements in MAM require novel annotation processes to encompass auditory cues effectively. 
2024.argmining-1.7 @@ -182,9 +182,9 @@ Sövereign at The Perspective Argument Retrieval Shared Task 2024: Using <fixed-case>LLM</fixed-case>s with Argument Mining RobertGünzler ÖzgeSevgili - SteffenRemusUniversität Hamburg + SteffenRemusUniversität Hamburg ChrisBiemannU Hamburg - IrinaNikishina + IrinaNikishina 150-158 This paper presents the Sövereign submission for the shared task on perspective argument retrieval for the Argument Mining Workshop 2024. The main challenge is to perform argument retrieval considering socio-cultural aspects such as political interests, occupation, age, and gender. To address the challenge, we apply open-access Large Language Models (Mistral-7b) in a zero-shot fashion for re-ranking and explicit similarity scoring. Additionally, we combine different features in an ensemble setup using logistic regression. Our system ranks second in the competition for all test set rounds on average for the logistic regression approach using LLM similarity scores as a feature. In addition to the description of the approach, we also provide further results of our ablation study. Our code will be open-sourced upon acceptance. 2024.argmining-1.15 @@ -204,7 +204,7 @@ Twente-<fixed-case>BMS</fixed-case>-<fixed-case>NLP</fixed-case> at <fixed-case>P</fixed-case>erspective<fixed-case>A</fixed-case>rg 2024: Combining Bi-Encoder and Cross-Encoder for Argument Retrieval LeixinZhang - DanielBraunUniversity of Twente + DanielBraunUniversity of Twente 164-168 The paper describes our system for the Perspective Argument Retrieval Shared Task. The shared task consists of three scenarios in which relevant political arguments have to be retrieved based on queries (Scenario 1). In Scenario 2 explicit socio-cultural properties are provided and in Scenario 3 implicit socio-cultural properties within the arguments have to be used. We combined a Bi-Encoder and a Cross-Encoder to retrieve relevant arguments for each query. 
For the third scenario, we extracted linguistic features to predict socio-demographic labels as a separate task. However, the socio-demographic match task proved challenging due to the constraints of argument lengths and genres. The described system won both tracks of the shared task. 2024.argmining-1.17 @@ -214,10 +214,10 @@ <fixed-case>GESIS</fixed-case>-<fixed-case>DSM</fixed-case> at <fixed-case>P</fixed-case>erpective<fixed-case>A</fixed-case>rg2024: A Matter of Style? Socio-Cultural Differences in Argumentation MaximilianMaurerGESIS Leibniz Institute for the Social Sciences - JuliaRombergGESIS Leibniz Institute for the Social Sciences - MyrtheReuverVrije Universiteit Amsterdam + JuliaRombergGESIS Leibniz Institute for the Social Sciences + MyrtheReuverVrije Universiteit Amsterdam NegashWeldekiros - GabriellaLapesaGESIS – Leibniz Institute for the Social Sciences and Heinrich-Heine University Düsseldorf + GabriellaLapesaGESIS – Leibniz Institute for the Social Sciences and Heinrich-Heine University Düsseldorf 169-181 This paper describes the contribution of team GESIS-DSM to the Perspective Argument Retrieval Task, a task on retrieving socio-culturally relevant and diverse arguments for different user queries. Our experiments and analyses aim to explore the nature of the socio-cultural specialization in argument retrieval: (how) do the arguments written by different socio-cultural groups differ? We investigate the impact of content and style for the task of identifying arguments relevant to a query and a certain demographic attribute. In its different configurations, our system employs sentence embedding representations, arguments generated with Large Language Model, as well as stylistic features. final method places third overall in the shared task, and, in comparison, does particularly well in the most difficult evaluation scenario, where the socio-cultural background of the argument author is implicit (i.e. has to be inferred from the text). 
This result indicates that socio-cultural differences in argument production may indeed be a matter of style. 2024.argmining-1.18 @@ -226,9 +226,9 @@ <fixed-case>XFACT</fixed-case> Team0331 at <fixed-case>P</fixed-case>erspective<fixed-case>A</fixed-case>rg2024: Sampling from Bounded Clusters for Diverse Relevant Argument Retrieval - Wan JuKang + Wan JuKang JiyoungHankaist - JaeminJung + JaeminJung JamesThorneKAIST 182-188 This paper reports on the argument mining system submitted to the ArgMining workshop 2024 for The Perspective Argument Retrieval Shared Task (Falk et al., 2024). We com- bine the strengths of a smaller Sentence BERT model and a Large Language Model: the for- mer is fine-tuned for a contrastive embedding objective and a classification objective whereas the latter is invoked to augment the query and populate the latent space with diverse relevant arguments. We conduct an ablation study on these components to find that each contributes substantially to the diversity and relevance cri- teria for the top-k retrieval of arguments from the given corpus. diff --git a/data/xml/2024.c3nlp.xml b/data/xml/2024.c3nlp.xml index eb4c419019..b1f73a2bc0 100644 --- a/data/xml/2024.c3nlp.xml +++ b/data/xml/2024.c3nlp.xml @@ -40,9 +40,9 @@ Conformity, Confabulation, and Impersonation: Persona Inconstancy in Multi-Agent <fixed-case>LLM</fixed-case> Collaboration - RazanBaltaji - BabakHemmatian - LavVarshneyUniversity of Illinois at Urbana-Champaign + RazanBaltaji + BabakHemmatian + LavVarshneyUniversity of Illinois at Urbana-Champaign 17-31 This study explores the sources of instability in maintaining cultural personas and opinions within multi-agent LLM systems. Drawing on simulations of inter-cultural collaboration and debate, we analyze agents’ pre- and post-discussion private responses alongside chat transcripts to assess the stability of cultural personas and the impact of opinion diversity on group outcomes. 
Our findings suggest that multi-agent discussions can encourage collective decisions that reflect diverse perspectives, yet this benefit is tempered by the agents’ susceptibility to conformity due to perceived peer pressure and challenges in maintaining consistent personas and opinions. Counterintuitively, instructions that encourage debate in support of one’s opinions increase the rate of instability. Without addressing the factors we identify, the full potential of multi-agent frameworks for producing more culturally diverse AI outputs will remain untapped. 2024.c3nlp-1.2 @@ -51,10 +51,10 @@ Synchronizing Approach in Designing Annotation Guidelines for Multilingual Datasets: A <fixed-case>COVID</fixed-case>-19 Case Study Using <fixed-case>E</fixed-case>nglish and <fixed-case>J</fixed-case>apanese Tweets - KikiFerawati + KikiFerawati Wan JouSheKyoto Institute of Technology - ShokoWakamiyaNara Institute of Science and Technology - EijiAramakiNara Institute of Science and Technology, Japan + ShokoWakamiyaNara Institute of Science and Technology + EijiAramakiNara Institute of Science and Technology, Japan 32-41 The difference in culture between the U.S. and Japan is a popular subject for Western vs. Eastern cultural comparison for researchers. One particular challenge is to obtain and annotate multilingual datasets. In this study, we utilized COVID-19 tweets from the two countries as a case study, focusing particularly on discussions concerning masks. The annotation task was designed to gain insights into societal attitudes toward the mask policies implemented in both countries. The aim of this study is to provide a practical approach for the annotation task by thoroughly documenting how we aligned the multilingual annotation guidelines to obtain a comparable dataset. We proceeded to document the effective practices during our annotation process to synchronize our multilingual guidelines. 
Furthermore, we discussed difficulties caused by differences in expression style and culture, and potential strategies that helped improve our agreement scores and reduce discrepancies between the annotation results in both languages. These findings offer an alternative method for synchronizing multilingual annotation guidelines and achieving feasible agreement scores for cross-cultural annotation tasks. This study resulted in a multilingual guideline in English and Japanese to annotate topics related to public discourses about COVID-19 masks in the U.S. and Japan. 2024.c3nlp-1.3 @@ -63,11 +63,11 @@ <fixed-case>CRAFT</fixed-case>: Extracting and Tuning Cultural Instructions from the Wild - BinWangI2R, A*STAR + BinWangI2R, A*STAR GeyuLinInstitute of Infocomm Research, A*STAR ZhengyuanLiuI2R ChengweiWei - NancyChen + NancyChen 42-47 Large language models (LLMs) have rapidly evolved as the foundation of various natural language processing (NLP) applications. Despite their wide use cases, their understanding of culturally-related concepts and reasoning remains limited. Meantime, there is a significant need to enhance these models’ cultural reasoning capabilities, especially concerning underrepresented regions. This paper introduces a novel pipeline for extracting high-quality, culturally-related instruction tuning datasets from vast unstructured corpora. We utilize a self-instruction generation pipeline to identify cultural concepts and trigger instruction. By integrating with a general-purpose instruction tuning dataset, our model demonstrates enhanced capabilities in recognizing and understanding regional cultural nuances, thereby enhancing its reasoning capabilities. We conduct experiments across three regions: Singapore, the Philippines, and the United States, achieving performance improvement of up to 6%. 
Our research opens new avenues for extracting cultural instruction tuning sets directly from unstructured data, setting a precedent for future innovations in the field. 2024.c3nlp-1.4 @@ -86,15 +86,15 @@ Do Multilingual Large Language Models Mitigate Stereotype Bias? ShangruiNie - MichaelFrommFraunhofer Institute IAIS, Fraunhofer IAIS - CharlesWelchMcMaster University + MichaelFrommFraunhofer Institute IAIS, Fraunhofer IAIS + CharlesWelchMcMaster University RebekkaGörgeFraunhofer Institute IAIS, Fraunhofer IAIS - AkbarKarimiRheinische Friedrich-Wilhelms Universität Bonn + AkbarKarimiRheinische Friedrich-Wilhelms Universität Bonn JoanPlepiRheinische Friedrich-Wilhelms Universität Bonn NaziaMowmitaFraunhofer Institute IAIS, Fraunhofer IAIS and Rheinische Friedrich-Wilhelms-Universität Bonn NicolasFlores-HerrMax-Planck Institute and Fraunhofer Institute IAIS, Fraunhofer IAIS MehdiAliFraunhofer Institute IAIS, Fraunhofer IAIS - LucieFlekRheinische Friedrich-Wilhelms Universität Bonn + LucieFlekRheinische Friedrich-Wilhelms Universität Bonn 65-83 While preliminary findings indicate that multilingual LLMs exhibit reduced bias compared to monolingual ones, a comprehensive understanding of the effect of multilingual training on bias mitigation, is lacking. This study addresses this gap by systematically training six LLMs of identical size (2.6B parameters) and architecture: five monolingual models (English, German, French, Italian, and Spanish) and one multilingual model trained on an equal distribution of data across these languages, all using publicly available data. To ensure robust evaluation, standard bias benchmarks were automatically translated into the five target languages and verified for both translation quality and bias preservation by human annotators. Our results consistently demonstrate that multilingual training effectively mitigates bias. 
Moreover, we observe that multilingual models achieve not only lower bias but also superior prediction accuracy when compared to monolingual models with the same amount of training data, model architecture, and size. 2024.c3nlp-1.6 @@ -103,7 +103,7 @@ Sociocultural Considerations in Monitoring Anti-<fixed-case>LGBTQ</fixed-case>+ Content on Social Media - SidneyWongUniversity of Canterbury + SidneyWongUniversity of Canterbury 84-97 The purpose of this paper is to ascertain the influence of sociocultural factors (i.e., social, cultural, and political) in the development of hate speech detection systems. We set out to investigate the suitability of using open-source training data to monitor levels of anti-LGBTQ+ content on social media across different national-varieties of English. Our findings suggests the social and cultural alignment of open-source hate speech data sets influences the predicted outputs. Furthermore, the keyword-search approach of anti-LGBTQ+ slurs in the development of open-source training data encourages detection models to overfit on slurs; therefore, anti-LGBTQ+ content may go undetected. We recommend combining empirical outputs with qualitative insights to ensure these systems are fit for purpose. 2024.c3nlp-1.7 @@ -112,10 +112,10 @@ Are Generative Language Models Multicultural? A Study on <fixed-case>H</fixed-case>ausa Culture and Emotions using <fixed-case>C</fixed-case>hat<fixed-case>GPT</fixed-case> - IbrahimAhmadNortheastern University - ShiranDudyNortheastern University - ResmiRamachandranpillaiInstitute for Experiential AI and Linköping University - KennethChurchNortheastern University + IbrahimAhmadNortheastern University + ShiranDudyNortheastern University + ResmiRamachandranpillaiInstitute for Experiential AI and Linköping University + KennethChurchNortheastern University 98-106 Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. 
However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa’s culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis. We also used two similarity metrics to measure the alignment between human and ChatGPT responses. We also collect human participants ratings and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits some gaps and biases in its knowledge and awareness of Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages. 2024.c3nlp-1.8 @@ -125,7 +125,7 @@ Computational Language Documentation: Designing a Modular Annotation and Data Management Tool for Cross-cultural Applicability AlexandraO’NeilIndiana University at Bloomington - DanielSwansonIndiana University + DanielSwansonIndiana University ShobhanaChelliahIndiana University at Bloomington 107-116 While developing computational language documentation tools, researchers must center the role of language communities in the process by carefully reflecting on and designing tools to support the varying needs and priorities of different language communities. This paper provides an example of how cross-cultural considerations discussed in literature about language documentation, data sovereignty, and community-led documentation projects can motivate the design of a computational language documentation tool by reflecting on our design process as we work towards developing an annotation and data management tool. 
We identify three recurring themes for cross-cultural consideration in the literature - Linguistic Sovereignty, Cultural Specificity, and Reciprocity - and present eight essential features for an annotation and data management tool that reflect these themes. diff --git a/data/xml/2024.climatenlp.xml b/data/xml/2024.climatenlp.xml index 3325d6ef1f..8496c699cf 100644 --- a/data/xml/2024.climatenlp.xml +++ b/data/xml/2024.climatenlp.xml @@ -57,7 +57,7 @@ My Climate Advisor: An Application of <fixed-case>NLP</fixed-case> in Climate Adaptation for Agriculture VincentNguyen - SarvnazKarimiCSIRO + SarvnazKarimiCSIRO WillowHallgrenCSIRO AshleyHarkin MaheshPrakash @@ -70,8 +70,8 @@ Generative Debunking of Climate Misinformation FranciscoZanartuUniversity of Melbourne - YuliaOtmakhovaThe University of Melbourne - JohnCook + YuliaOtmakhovaThe University of Melbourne + JohnCook LeaFrermannUniversity of Melbourne 46-62 Misinformation about climate change causes numerous negative impacts, necessitating corrective responses. Psychological research has offered various strategies for reducing the influence of climate misinformation, such as the fact-myth-fallacy-fact-structure. However, practically implementing corrective interventions at scale represents a challenge. Automatic detection and correction of misinformation offers a solution to the misinformation problem. This study documents the development of large language models that accept as input a climate myth and produce a debunking that adheres to the fact-myth-fallacy-fact (“truth sandwich”) structure, by incorporating contrarian claim classification and fallacy detection into an LLM prompting framework. We combine open (Mixtral, Palm2) and proprietary (GPT-4) LLMs with prompting strategies of varying complexity. Experiments reveal promising performance of GPT-4 and Mixtral if combined with structured prompts. We identify specific challenges of debunking generation and human evaluation, and map out avenues for future work. 
We release a dataset of high-quality truth-sandwich debunkings, source code and a demo of the debunking system. @@ -82,7 +82,7 @@ Decoding Climate Disagreement: A Graph Neural Network-Based Approach to Understanding Social Media Dynamics RuiranSuUniversity of Oxford - JanetPierrehumbertUniversity of Oxford + JanetPierrehumbertUniversity of Oxford 63-81 This paper presents the ClimateSent-GAT Model, a novel approach that combines Graph Attention Networks (GATs) with natural language processing techniques to accurately identify and predict disagreements within Reddit comment-reply pairs. Our model classifies disagreements into three categories: agree, disagree, and neutral. Leveraging the inherent graph structure of Reddit comment-reply pairs, the model significantly outperforms existing benchmarks by capturing complex interaction patterns and sentiment dynamics. This research advances graph-based NLP methodologies and provides actionable insights for policymakers and educators in climate science communication. 
2024.climatenlp-1.5 @@ -91,8 +91,8 @@ Evaluating <fixed-case>C</fixed-case>hat<fixed-case>N</fixed-case>et<fixed-case>Z</fixed-case>ero, an <fixed-case>LLM</fixed-case>-Chatbot to Demystify Climate Pledges - AngelHsuUniversity of North Carolina at Chapel Hill - MasonLaneyUniversity of North Carolina at Chapel Hill + AngelHsuUniversity of North Carolina at Chapel Hill + MasonLaneyUniversity of North Carolina at Chapel Hill JiZhangArboretica DiegoManyaUniversity of North Carolina at Chapel Hill LindaFarczadiArboretica @@ -106,16 +106,16 @@ Using <fixed-case>LLM</fixed-case>s to Build a Database of Climate Extreme Impacts NiLiVrije Universiteit Brussel ShorouqZahraRISE Research Institutes of Sweden AB - MarianaBritoHelmholtz Zentrum München - ClareFlynn + MarianaBritoHelmholtz Zentrum München + ClareFlynn OlofGörnerup - KoffiWorou + KoffiWorou MurathanKurfali ChanjuanMeng - WimThiery - JakobZscheischlerHelmholtz Centre for Environmental Research - UFZ + WimThiery + JakobZscheischlerHelmholtz Centre for Environmental Research - UFZ GabrieleMessoriUppsala University and Stockholm University - JoakimNivreUppsala University + JoakimNivreUppsala University 93-110 To better understand how extreme climate events impact society, we need to increase the availability of accurate and comprehensive information about these impacts. We propose a method for building large-scale databases of climate extreme impacts from online textual sources, using LLMs for information extraction in combination with more traditional NLP techniques to improve accuracy and consistency. We evaluate the method against a small benchmark database created by human experts and find that extraction accuracy varies for different types of information. We compare three different LLMs and find that, while the commercial GPT-4 model gives the best performance overall, the open-source models Mistral and Mixtral are competitive for some types of information. 
2024.climatenlp-1.7 @@ -124,8 +124,8 @@ Envisioning <fixed-case>NLP</fixed-case> for intercultural climate communication - StevenBirdCharles Darwin University - AngelinaAquinoCharles Darwin University + StevenBirdCharles Darwin University + AngelinaAquinoCharles Darwin University IanGumbulaCharles Darwin University 111-122 Climate communication is often seen by the NLP community as an opportunity for machine translation, applied to ever smaller languages. However, over 90% the world’s linguistic diversity comes from languages with ‘primary orality’ and mostly spoken in non-Western oral societies. A case in point is the Aboriginal communities of Northern Australia, where we have been conducting workshops on climate communication, revealing shortcomings in existing communication practices along with new opportunities for improving intercultural communication. We present a case study of climate communication in an oral society, including the voices of many local people, and draw several lessons for the research program of NLP in the climate space. @@ -156,9 +156,9 @@ Large Scale Narrative Messaging around Climate Change: A Cross-Cultural Comparison HaiqiZhou - David GHobson + David GHobson DerekRuths - AndrewPiper + AndrewPiper 143-155 In this study, we explore the use of Large Language Models (LLMs) such as GPT-4 to extract and analyze the latent narrative messaging in climate change-related news articles from North American and Chinese media. By defining “narrative messaging” as the intrinsic moral or lesson of a story, we apply our model to a dataset of approximately 15,000 news articles in English and Mandarin, categorized by climate-related topics and ideological groupings. Our findings reveal distinct differences in the narrative values emphasized by different cultural and ideological contexts, with North American sources often focusing on individualistic and crisis-driven themes, while Chinese sources emphasize developmental and cooperative narratives. 
This work demonstrates the potential of LLMs in understanding and influencing climate communication, offering new insights into the collective belief systems that shape public discourse on climate change across different cultures. 2024.climatenlp-1.11 @@ -178,8 +178,8 @@ Structuring Sustainability Reports for Environmental Standards with <fixed-case>LLM</fixed-case>s guided by Ontology - AidaUsmanovaLeuphana Universitüt Lüneburg - RicardoUsbeckLeuphana Universitüt Lüneburg + AidaUsmanovaLeuphana Universitüt Lüneburg + RicardoUsbeckLeuphana Universitüt Lüneburg 168-177 Following the introduction of the European Sustainability Reporting Standard (ESRS), companies will have to adapt to a new policy and provide mandatory sustainability reports. However, implementing such reports entails a challenge, such as the comprehension of a large number of textual information from various sources. This task can be accelerated by employing Large Language Models (LLMs) and ontologies to effectively model the domain knowledge. In this study, we extended an existing ontology to model ESRS Topical Standard for disclosure. The developed ontology would enable automated reasoning over the data and assist in constructing Knowledge Graphs (KGs). Moreover, the proposed ontology extension would also help to identify gaps in companies’ sustainability reports with regard to the ESRS requirements. Additionally, we extracted knowledge from corporate sustainability reports via LLMs guided with a proposed ontology and developed their KG representation. 
2024.climatenlp-1.13 @@ -188,11 +188,11 @@ Unlearning Climate Misinformation in Large Language Models - MichaelFore + MichaelFore SimranjitSinghMicrosoft ChaehongLee AmritanshuPandeyUniversity of Vermont - AntoniosAnastasopoulosAthena Research Center and George Mason University + AntoniosAnastasopoulosAthena Research Center and George Mason University DimitriosStamoulisMicrosoft 178-192 Misinformation regarding climate change is a key roadblock in addressing one of the most serious threats to humanity. This paper investigates factual accuracy in large language models (LLMs) regarding climate information. Using true/false labeled Q&A data for fine-tuning and evaluating LLMs on climate-related claims, we compare open-source models, assessing their ability to generate truthful responses to climate change questions. We investigate the detectability of models intentionally poisoned with false climate information, finding that such poisoning may not affect the accuracy of a model’s responses in other domains. Furthermore, we compare the effectiveness of unlearning algorithms, fine-tuning, and Retrieval-Augmented Generation (RAG) for factually grounding LLMs on climate change topics. Our evaluation reveals that unlearning algorithms can be effective for nuanced conceptual claims, despite previous findings suggesting their inefficacy in privacy contexts. These insights aim to guide the development of more factually reliable LLMs and highlight the need for additional work to secure LLMs against misinformation attacks. 
@@ -202,13 +202,13 @@ Statements: Universal Information Extraction from Tables with Large Language Models for <fixed-case>ESG</fixed-case> <fixed-case>KPI</fixed-case>s - LokeshMishraIBM Research + LokeshMishraIBM Research SohaylDhibi YusikKimInternational Business Machines - CesarBerrospi RamisInternational Business Machines + CesarBerrospi RamisInternational Business Machines ShubhamGuptaInternational Business Machines - MicheleDolfiInternational Business Machines - PeterStaar + MicheleDolfiInternational Business Machines + PeterStaar 193-214 Environment, Social, and Governance (ESG) KPIs assess an organization’s performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports. 
2024.climatenlp-1.15 @@ -253,7 +253,7 @@ <fixed-case>SDG</fixed-case> target detection in environmental reports using Retrieval-augmented Generation with <fixed-case>LLM</fixed-case>s - DarioGarigliottiUniversity of Bergen + DarioGarigliottiUniversity of Bergen 241-250 With the consolidation of Large Language Models (LLM) as a dominant component in approaches for multiple linguistic tasks, the interest in these technologies has greatly increased within a variety of areas and domains. A particular scenario of information needs where to exploit these approaches is climate-aware NLP. Paradigmatically, the vast manual labour of inspecting long, heterogeneous documents to find environment-relevant expressions and claims suits well within a recently established Retrieval-augmented Generation (RAG) framework. In this paper, we tackle two dual problems within environment analysis dealing with the common goal of detecting a Sustainable Developmental Goal (SDG) target being addressed in a textual passage of an environmental assessment report. We develop relevant test collections, and propose and evaluate a series of methods within the general RAG pipeline, in order to assess the current capabilities of LLMs for the tasks of SDG target evidence identification and SDG target detection. 2024.climatenlp-1.19 @@ -263,7 +263,7 @@ Assessing the Effectiveness of <fixed-case>GPT</fixed-case>-4o in Climate Change Evidence Synthesis and Systematic Assessments: Preliminary Insights ElphinJoePennsylvania State University - SaiKoneruPennsylvania State University + SaiKoneruPennsylvania State University ChristineKirchhoffPennsylvania State University 251-257 In this research short, we examine the potential of using GPT-4o, a state-of-the-art large language model (LLM) to undertake evidence synthesis and systematic assessment tasks. Traditional workflows for such tasks involve large groups of domain experts who manually review and synthesize vast amounts of literature. 
The exponential growth of scientific literature and recent advances in LLMs provide an opportunity to complementing these traditional workflows with new age tools. We assess the efficacy of GPT-4o to do these tasks on a sample from the dataset created by the Global Adaptation Mapping Initiative (GAMI) where we check the accuracy of climate change adaptation related feature extraction from the scientific literature across three levels of expertise. Our results indicate that while GPT-4o can achieve high accuracy in low-expertise tasks like geographic location identification, their performance in intermediate and high-expertise tasks, such as stakeholder identification and assessment of depth of the adaptation response, is less reliable. The findings motivate the need for designing assessment workflows that utilize the strengths of models like GPT-4o while also providing refinements to improve their performance on these tasks. diff --git a/data/xml/2024.cmcl.xml b/data/xml/2024.cmcl.xml index 116d505180..0401efb9ee 100644 --- a/data/xml/2024.cmcl.xml +++ b/data/xml/2024.cmcl.xml @@ -54,7 +54,7 @@ Do large language models resemble humans in language use? - ZhenguangCai + ZhenguangCai XufengDuan DavidHaslett ShuqiWang @@ -67,8 +67,8 @@ The Curious Case of Representational Alignment: Unravelling Visio-Linguistic Tasks in Emergent Communication - TomKouwenhovenLeiden University, Leiden University - MaxPeeperkorn + TomKouwenhovenLeiden University, Leiden University + MaxPeeperkorn BramVan DijkLeiden University TessaVerhoefLeiden University, Leiden University 57-71 @@ -81,7 +81,7 @@ Hierarchical syntactic structure in human-like language models MichaelWolfman DonaldDunagan - JonathanBrennanUniversity of Michigan - Ann Arbor + JonathanBrennanUniversity of Michigan - Ann Arbor JohnHaleJohns Hopkins University, University of Georgia and DeepMind 72-80 Language models (LMs) are a meeting point for cognitive modeling and computational linguistics. 
How should they be designed to serve as adequate cognitive models? To address this question, this study contrasts two Transformer-based LMs that share the same architecture. Only one of them analyzes sentences in terms of explicit hierarchical structure. Evaluating the two LMs against fMRI time series via the surprisal complexity metric, the results implicate the superior temporal gyrus. These findings underline the need for hierarchical sentence structures in word-by-word models of human language comprehension. @@ -92,11 +92,11 @@ Do <fixed-case>LLM</fixed-case>s Agree with Humans on Emotional Associations to Nonsense Words? YuiMiyakawaNagoya University - ChihayaMatsuhira + ChihayaMatsuhira HirotakaKato TakatsuguHirayamaUniversity of Human Environments - TakahiroKomamizuNagoya University - IchiroIdeNagoya University and Nagoya University + TakahiroKomamizuNagoya University + IchiroIdeNagoya University and Nagoya University 81-85 Understanding human perception of nonsense words is helpful to devise product and character names that match their characteristics. Previous studies have suggested the usefulness of Large Language Models (LLMs) for estimating such human perception, but they did not focus on its emotional aspects. Hence, this study aims to elucidate the relationship of emotions evoked by nonsense words between humans and LLMs. Using a representative LLM, GPT-4, we reproduce the procedure of an existing study to analyze evoked emotions of humans for nonsense words. A positive correlation of 0.40 was found between the emotion intensity scores reproduced by GPT-4 and those manually annotated by humans. Although the correlation is not very high, this demonstrates that GPT-4 may agree with humans on emotional associations to nonsense words. 
Considering that the previous study reported that the correlation among human annotators was about 0.68 on average and that between a regression model trained on the annotations for real words and humans was 0.17, GPT-4’s agreement with humans is notably strong. 2024.cmcl-1.7 @@ -106,8 +106,8 @@ Large language models fail to derive atypicality inferences in a human-like manner CharlotteKurchUniversität des Saarlandes - MargaritaRyzhova - VeraDembergUniversität des Saarlandes + MargaritaRyzhova + VeraDembergUniversität des Saarlandes 86-100 Recent studies have claimed that large language models (LLMs) are capable of drawing pragmatic inferences (Qiu et al., 2023; Hu et al., 2022; Barattieri di San Pietro et al., 2023). The present paper sets out to test LLM’s abilities on atypicality inferences, a type of pragmatic inference that is triggered through informational redundancy. We test several state-of-the-art LLMs in a zero-shot setting and find that LLMs fail to systematically fail to derive atypicality inferences. Our robustness analysis indicates that when inferences are seemingly derived in a few-shot settings, these results can be attributed to shallow pattern matching and not pragmatic inferencing. We also analyse the performance of the LLMs at the different derivation steps required for drawing atypicality inferences – our results show that models have access to script knowledge and can use it to identify redundancies and accommodate the atypicality inference. The failure instead seems to stem from not reacting to the subtle maxim of quantity violations introduced by the informationally redundant utterances. 
2024.cmcl-1.8 @@ -117,7 +117,7 @@ Predict but Also Integrate: an Analysis of Sentence Processing Models for <fixed-case>E</fixed-case>nglish and <fixed-case>H</fixed-case>indi NinaDelcaro - LucaOnnisUniversity of Oslo + LucaOnnisUniversity of Oslo RaquelAlhamaUniversity of Amsterdam, University of Amsterdam 101-108 Fluent speakers make implicit predictions about forthcoming linguistic items while processing sentences, possibly to increase efficiency in real-time comprehension. However, the extent to which prediction is the primary mode of processing human language is widely debated. The human language processor may also gain efficiency by integrating new linguistic information with prior knowledge and the preceding context, without actively predicting. At present, the role of probabilistic integration, as well as its computational foundation, remains relatively understudied. Here, we explored whether a Delayed Recurrent Neural Network (d-RNN, Turek et al., 2020), as an implementation of both prediction and integration, can explain patterns of human language processing over and above the contribution of a purely predictive RNN model. We found that incorporating integration contributes to explaining variability in eye-tracking data for English and Hindi. @@ -129,7 +129,7 @@ Transformer Attention vs Human Attention in Anaphora Resolution AnastasiaKozlova AlbinaAkhmetgareeva - AigulKhanovaHigher School of Economics + AigulKhanovaHigher School of Economics SemenKudriavtsev AlenaFenogenovaSaluteDevices 109-122 @@ -149,9 +149,9 @@ Daily auditory environments in <fixed-case>F</fixed-case>rench-speaking infants: A longitudinal dataset - EstelleHervé - ClémentFrançoisCNRS - LaurentPrevotUniversité d’Aix-Marseille + EstelleHervé + ClémentFrançoisCNRS + LaurentPrevotUniversité d’Aix-Marseille 132-151 Babies’ daily auditory environment plays a crucial role in language development. 
Most previous research estimating the quantitative and qualitative aspects of early speech inputs has predominantly focused on English- and Spanish-speaking families. In addition, validation studies for daylong recordings’ analysis tools are scarce on French data sets. In this paper, we present a French corpus of daylong audio recordings longitudinally collected with the LENA (Language ENvironment Analysis) system from infants aged 3 to 24 months. We conduct a thorough exploration of this data set, which serves as a quality check for both the data and the analysis tools. We evaluate the reliability of LENA metrics by systematically comparing them with those obtained from the ChildProject set of tools and by checking the known dynamics of the metrics with age. These metrics are also used to replicate, on our data set, findings from (Warlaumont et al, 2014) about the increase of infants’ speech vocalizations and temporal contingencies between infants and caregivers with age. 2024.cmcl-1.12 @@ -160,10 +160,10 @@ Analysing and Validating Language Complexity Metrics Across <fixed-case>S</fixed-case>outh <fixed-case>A</fixed-case>merican Indigenous Languages - FelipeSerrasUniversidade de São Paulo - MiguelCarpiUniversidade de São Paulo - MatheusBrancoUniversity of São Paulo, Universidade de São Paulo - MarceloFingerUniversidade de São Paulo + FelipeSerrasUniversidade de São Paulo + MiguelCarpiUniversidade de São Paulo + MatheusBrancoUniversity of São Paulo, Universidade de São Paulo + MarceloFingerUniversidade de São Paulo 152-165 Language complexity is an emerging concept critical for NLP and for quantitative and cognitive approaches to linguistics. In this work, we evaluate the behavior of a set of compression-based language complexity metrics when applied to a large set of native South American languages. 
Our goal is to validate the desirable properties of such metrics against a more diverse set of languages, guaranteeing the universality of the techniques developed on the basis of this type of theoretical artifact. Our analysis confirmed with statistical confidence most propositions about the metrics studied, affirming their robustness, despite showing less stability than when the same metrics were applied to Indo-European languages. We also observed that the trade-off between morphological and syntactic complexities is strongly related to language phylogeny. 2024.cmcl-1.13 @@ -175,7 +175,7 @@ DaphneWangQuandela MehrnooshSadrzadehUniversity College London MilošStanojevićUniversity College London, University of London and Google DeepMind - Wing-YeeChowUniversity College London, University of London + Wing-YeeChowUniversity College London, University of London RichardBrehenyUniversity College London, University of London 166-176 Psycholinguistic experiments reveal that efficiency of human language use is founded on predictions at both syntactic and lexical levels. Previous models of human prediction exploiting LLMs have used an information theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path, and outperformed surprisal. 
@@ -186,8 +186,8 @@ Morphology Matters: Probing the Cross-linguistic Morphological Generalization Abilities of Large Language Models through a Wug Test DangAnh - LimorRavivMax-Planck Institute - LukasGalkeMax Planck Institute for Psycholinguistics + LimorRavivMax-Planck Institute + LukasGalkeMax Planck Institute for Psycholinguistics 177-188 We develop a multilingual version of the Wug Test, an artificial word completion experiment that is typically used to test the morphological knowledge of children, and apply it to the GPT family of large language models (LLMs). LLMs’ performance on this test was evaluated by native speakers of six different languages, who judged whether the inflected and derived forms generated by the models conform to the morphological rules of their language. Our results show that LLMs can generalize their morphological knowledge to new, unfamiliar words, but that their success in generating the “correct” generalization (as judged by native human speakers) is predicted by a language’s morphological complexity (specifically, integrative complexity). We further find that the amount of training data has surprisingly little on LLMs’ morphological generalization abilities within the scope of the analyzed languages. These findings highlight that “morphology matters”, and have important implications for improving low-resource language modeling. 2024.cmcl-1.15 @@ -198,7 +198,7 @@ Evaluating Grammatical Well-Formedness in Large Language Models: A Comparative Study with Human Judgments ZhuangQiu XufengDuan - ZhenguangCai + ZhenguangCai 189-198 Research in artificial intelligence has witnessed the surge of large language models (LLMs) demonstrating improved performance in various natural language processing tasks. This has sparked significant discussions about the extent to which large language models emulate human linguistic cognition and usage. 
This study delves into the representation of grammatical well-formedness in LLMs, which is a critical aspect of linguistic knowledge. In three preregistered experiments, we collected grammaticality judgment data for over 2400 English sentences with varying structures from ChatGPT and Vicuna, comparing them with human judgment data. The results reveal substantial alignment in the assessment of grammatical correctness between LLMs and human judgments, albeit with LLMs often showing more conservative judgments for grammatical correctness or incorrectness. 2024.cmcl-1.16 @@ -209,7 +209,7 @@ What does Kiki look like? Cross-modal associations between speech sounds and visual shapes in vision-and-language models TessaVerhoefLeiden University, Leiden University KianaShahrasbi - TomKouwenhovenLeiden University, Leiden University + TomKouwenhovenLeiden University, Leiden University 199-213 Humans have clear cross-modal preferences when matching certain novel words to visual shapes. Evidence suggests that these preferences play a prominent role in our linguistic processing, language learning, and the origins of signal-meaning mappings. With the rise of multimodal models in AI, such as vision-and-language (VLM) models, it becomes increasingly important to uncover the kinds of visio-linguistic associations these models encode and whether they align with human representations. Informed by experiments with humans, we probe and compare four VLMs for a well-known human cross-modal preference, the bouba-kiki effect. We do not find conclusive evidence for this effect but suggest that results may depend on features of the models, such as architecture design, model size, and training details. Our findings inform discussions on the origins of the bouba-kiki effect in human cognition and future developments of VLMs that align well with human cross-modal associations. 
2024.cmcl-1.17 @@ -219,8 +219,8 @@ Evaluating Semantic Relations in Predicting Textual Labels for Images of Abstract and Concrete Concepts TarunTaterUniversität Stuttgart - SabineSchulte Im WaldeUniversity of Stuttgart - DiegoFrassinelliLudwig-Maximilians-Universität München + SabineSchulte Im WaldeUniversity of Stuttgart + DiegoFrassinelliLudwig-Maximilians-Universität München 214-220 This study investigates the performance of SigLIP, a state-of-the-art Vision-Language Model (VLM), in predicting labels for images depicting 1,278 concepts. Our analysis across 300 images per concept shows that the model frequently predicts the exact user-tagged labels, but similarly, it often predicts labels that are semantically related to the exact labels in various ways: synonyms, hypernyms, co-hyponyms, and associated words, particularly for abstract concepts. We then zoom into the diversity of the user tags of images and word associations for abstract versus concrete concepts. Surprisingly, not only abstract but also concrete concepts exhibit significant variability, thus challenging the traditional view that representations of concrete concepts are less diverse. 2024.cmcl-1.18 @@ -229,7 +229,7 @@ Diachronic change in verb usage statistics predicts differences in sentence processing across the lifespan - EllisCainUniversity of California, Merced + EllisCainUniversity of California, Merced RachelRyskinUniversity of California at Merced 221-230 Diachronic corpus analyses reveal that syntactic usage patterns change over time. Are these changes reflected in differences in language processing across the human lifespan? We use the attachment of with-prepositional phrases (PPs) as a case study for investigating this question: a with-PP can attach to a verb, describing an instrument with which to perform the action (e.g., Slice the cake [with a knife]), or to a direct object (DO), modifying the noun (e.g., Slice the cake [with the pink frosting]). 
The relative frequencies of the instrument and modifier constructions differ depending on the verb in the sentence — the ‘verb bias’. Using two diachronic corpora, Syntgram and CCOHA, we analyzed the co-occurrence statistics of 27 verbs and instrument vs. modifier with-PPs. Between the 1940s and the 2000s, some verbs were more instrument-biased (i.e., more likely to co-occur with with-PPs that attach to the verb than the DO) than others and co-occurrence patterns were more similar for temporally close decades, suggesting subtle diachronic changes in usage patterns. We collected sentence interpretation data probing with-PP attachment preferences in participants ranging in age from 25 to 75. Interpretations of globally ambiguous sentences (e.g., Pet the rabbit with the towel) differed depending on the verb (i.e., some verbs elicit more instrument than modifier interpretations of the PP than others and vice versa) and on the age of the participant. In particular, verbs which became less instrument-biased over time elicited more instrument interpretations among older adults than young adults, suggesting that variation in language comprehension can be in part predicted from the corpus statistics of the time periods that an individual experienced. @@ -242,7 +242,7 @@ EmilySadlier-Brown MillieLou MiikkaSilfverbergUniversity of British Columbia - CarlaKamUniversity of British Columbia + CarlaKamUniversity of British Columbia 231-241 This paper investigates the adverbial discourse particle actually. We compare LLM and human performance on cloze tests involving actually on examples sourced from the Providence Corpus of speech around children. We explore the impact of utterance context on cloze test performance. We find that context is always helpful, though the extent to which additional context is helpful, and what relative placement of context (i.e. before or after the masked word) is most helpful differs for individual models and humans. 
The best-performing LLM, GPT-4, narrowly outperforms humans. In an additional experiment, we explore cloze performance on synthetic LLM-generated examples, and find that several models vastly outperform humans. 2024.cmcl-1.20 @@ -251,8 +251,8 @@ <fixed-case>LLM</fixed-case>s’ morphological analyses of complex <fixed-case>FST</fixed-case>-generated <fixed-case>F</fixed-case>innish words - AnssiMoisioAalto University - MathiasCreutzUniversity of Helsinki + AnssiMoisioAalto University + MathiasCreutzUniversity of Helsinki MikkoKurimoAalto University 242-254 Rule-based language processing systems have been overshadowed by neural systems in terms of utility, but it remains unclear whether neural NLP systems, in practice, learn the grammar rules that humans use. This work aims to shed light on the issue by evaluating state-of-the-art LLMs in a task of morphological analysis of complex Finnish noun forms. We generate the forms using an FST tool, and they are unlikely to have occurred in the training sets of the LLMs, therefore requiring morphological generalisation capacity. We find that GPT-4-turbo has some difficulties in the task while GPT-3.5-turbo struggles and smaller models Llama2-70B and Poro-34B fail nearly completely. @@ -265,7 +265,7 @@ GuojunWuUniversity of Zurich LenaBolligerUniversity of Zurich DavidReichUniversität Potsdam - LenaJägerUniversity of Zurich and Universität Potsdam + LenaJägerUniversity of Zurich and Universität Potsdam 255-263 Eye movements in reading reveal humans’ cognitive processes involved in language understanding. The duration a reader’s eyes fixate on a word has been used as a measure of the visual attention given to that word or its significance to the reader. This study investigates the correlation between the importance attributed to input tokens by language models (LMs) on the one hand and humans, in the form of fixation durations, on the other hand. 
While previous research on the internal processes of LMs have employed the models’ attention weights, recent studies have argued in favor of gradient-based methods. Moreover, previous approaches to interpret LMs’ internals with human gaze have neglected the tasks readers performed during reading, even though psycholinguistic research underlines that reading patterns are task-dependent. We therefore employ a gradient-based saliency method to measure the importance of input tokens when LMs are targeted on specific tasks, and we find that task specificity plays a crucial role in the correlation between human- and model-assigned importance. Our implementation is available at https://github.com/gjwubyron/Scan. 2024.cmcl-1.22 diff --git a/data/xml/2024.conda.xml b/data/xml/2024.conda.xml index 74f916eecb..fe07afccbb 100644 --- a/data/xml/2024.conda.xml +++ b/data/xml/2024.conda.xml @@ -27,7 +27,7 @@ ChuangLiuTianjin University RenrenJin MarkSteedmanUniversity of Edinburgh - DeyiXiongTianjin University + DeyiXiongTianjin University 1-12 Chinese LLMs demonstrate impressive performance on NLP tasks, particularly on discipline knowledge benchmarks, with some results approaching those of GPT-4. Previous research has viewed these advancements as potential outcomes of data contamination or leakage, prompting efforts to create new detection methods and address evaluation issues in LLM benchmarks. However, there has been a lack of comprehensive assessment of the evolution of Chinese LLMs. To address this gap, this paper offers a thorough investigation of Chinese LLMs on discipline knowledge evaluation, delving into the advancements of various LLMs, including a group of related models and others. Specifically, we have conducted six assessments ranging from knowledge memorization to comprehension for robustness, encompassing tasks like predicting incomplete questions and options, identifying behaviors by the contaminational fine-tuning, and answering rephrased questions. 
Experimental findings indicate a positive correlation between the release time of LLMs and their memorization capabilities, but they struggle with variations in original question-options pairs. Additionally, our findings suggest that question descriptions have a more significant impact on LLMs’ performance. 2024.conda-1.1 @@ -37,8 +37,8 @@ Confounders in Instance Variation for the Analysis of Data Contamination BehzadMehrbakhshUniversidad Politécnica de Valencia - DarioGarigliottiUniversity of Bergen - FernandoMartínez-PlumedUniversitat Politècnica de València + DarioGarigliottiUniversity of Bergen + FernandoMartínez-PlumedUniversitat Politècnica de València JoseHernandez-OralloUniversitat Politecnica de Valencia 13-21 Test contamination is a serious problem for the evaluation of large language models (LLMs) because it leads to the overestimation of their performance and a quick saturation of benchmarks, even before the actual capability is achieved. One strategy to address this issue is the (adversarial) generation of variations, by including different exemplars and different rephrasings of the questions. However, these two interventions can lead to instances that can be more difficult (accumulating on the expected loss of performance by partly removing the contamination) but also to instances that can be less difficult (cancelling the expected loss of performance), which would make contamination undetectable. Understanding these two phenomena in terms of instance difficulty is critical to determine and measure contamination. In this paper we conduct a comprehensive analysis of these two interventions on an addition task with fine-tuned LLAMA-2 models. 
@@ -49,7 +49,7 @@ A Taxonomy for Data Contamination in Large Language Models MedhaPalavalli - AmandaBertschCarnegie Mellon University + AmandaBertschCarnegie Mellon University MatthewGormleySolventum and School of Computer Science, Carnegie Mellon University 22-40 Large language models pretrained on extensive web corpora demonstrate remarkable performance across a wide range of downstream tasks. However, a growing concern is data contamination, where evaluation datasets may unintentionally be contained in the pretraining corpus, inflating model performance. Decontamination, the process of detecting and removing such data, is a potential solution; yet these contaminants may originate from altered versions of the test set, evading detection during decontamination. How different types of contamination impact the performance of language models on downstream tasks is not fully understood. We present a taxonomy that categorizes the various types of contamination encountered by LLMs during the pretraining phase and identify which types pose the highest risk. We analyze the impact of contamination on two key NLP tasks—summarization and question answering—revealing how different types of contamination influence task performance during evaluation. diff --git a/data/xml/2024.gebnlp.xml b/data/xml/2024.gebnlp.xml index 230efb4a11..ee86d0d492 100644 --- a/data/xml/2024.gebnlp.xml +++ b/data/xml/2024.gebnlp.xml @@ -23,7 +23,7 @@ A Parameter-Efficient Multi-Objective Approach to Mitigate Stereotypical Bias in Language Models YifanWang - VeraDembergUniversität des Saarlandes + VeraDembergUniversität des Saarlandes 1-19 Pre-trained language models have shown impressive abilities of understanding and generating natural languages. However, they typically inherit undesired human-like bias and stereotypes from training data, which raises concerns about putting these models into use in real-world scenarios. 
Although prior research has proposed to reduce bias using different fairness objectives, they usually fail to capture different representations of bias and, therefore, struggle with fully debiasing models. In this work, we introduce a multi-objective probability alignment approach to overcome current challenges by incorporating multiple debiasing losses to locate and penalize bias in different forms. Compared to existing methods, our proposed method can more effectively and comprehensively reduce stereotypical bias, and maintains the language ability of pre-trained models at the same time. Besides, we adopt prefix-tuning to optimize fairness objectives, and results show that it can achieve better bias removal than full fine-tuning while requiring much fewer computational resources. Our code and data are available at https://github.com/Ewanwong/debias_NLG. 2024.gebnlp-1.1 @@ -45,7 +45,7 @@ We Don’t Talk About That: Case Studies on Intersectional Analysis of Social Bias in Large Language Models - HannahDevinney + HannahDevinney JennyBjörklundUppsala University HenrikBjörklundDept. Computing Science, Umeå University 33-44 @@ -57,7 +57,7 @@ An Explainable Approach to Understanding Gender Stereotype Text ManuelaJeyaraj - SarahDelanyTechnological University Dublin + SarahDelanyTechnological University Dublin 45-59 Gender Stereotypes refer to the widely held beliefs and assumptions about the typical traits, behaviours, and roles associated with a collective group of individuals of a particular gender in society. These typical beliefs about how people of a particular gender are described in text can cause harmful effects to individuals leading to unfair treatment. In this research, the aim is to identify the words and language constructs that can influence a text to be considered a gender stereotype. To do so, a transformer model with attention is fine-tuned for gender stereotype detection. 
Thereafter, words/language constructs used for the model’s decision are identified using a combined use of attention- and SHAP (SHapley Additive exPlanations)-based explainable approaches. Results show that adjectives and verbs were highly influential in predicting gender stereotypes. Furthermore, applying sentiment analysis showed that words describing male gender stereotypes were more positive than those used for female gender stereotypes. 2024.gebnlp-1.4 @@ -66,9 +66,9 @@ A Fairness Analysis of Human and <fixed-case>AI</fixed-case>-Generated Student Reflection Summaries - Bhiman KumarBaghel - Arun BalajieeLekshmi Narayanan - Michael MillerYoder + Bhiman KumarBaghel + Arun BalajieeLekshmi Narayanan + Michael MillerYoder 60-77 This study examines the fairness of human- and AI-generated summaries of student reflections in university STEM classes, focusing on potential gender biases. Using topic modeling, we first identify topics that are more prevalent in reflections from female students and others that are more common among male students. We then analyze whether human and AI-generated summaries reflect the concerns of students of any particular gender over others. Our analysis reveals that though human-generated and extractive AI summarization techniques do not show a clear bias, abstractive AI-generated summaries exhibit a bias towards male students. Pedagogical themes are over-represented from male reflections in these summaries, while concept-specific topics are under-represented from female reflections. This research contributes to a deeper understanding of AI-generated bias in educational contexts, highlighting the need for future work on mitigating these biases. 
2024.gebnlp-1.5 @@ -77,7 +77,7 @@ On Shortcuts and Biases: How Finetuned Language Models Distinguish Audience-Specific Instructions in <fixed-case>I</fixed-case>talian and <fixed-case>E</fixed-case>nglish - NicolaFantonUniversity of Stuttgart, Universität Stuttgart + NicolaFantonUniversity of Stuttgart, Universität Stuttgart MichaelRothUniversity of Stuttgart 78-93 Instructional texts for different audience groups can help to address specific needs, but at the same time run the risk of perpetrating biases. In this paper, we extend previous findings on disparate social norms and subtle stereotypes in wikiHow in two directions: We explore the use of fine-tuned language models to determine how audience-specific instructional texts can be distinguished and we transfer the methodology to another language, Italian, to identify cross-linguistic patterns. We find that language models mostly rely on group terms, gender markings, and attributes reinforcing stereotypes. @@ -90,8 +90,8 @@ AleixSant CarlosEscolanoBarcelona Supercomputing Center AudreyMashBarcelona Supercomputing Center - FrancescaDe Luca FornaciariBarcelona Supercomputing Center and Universidad del País Vasco - MaiteMeleroBarcelona Supercomputing Center + FrancescaDe Luca FornaciariBarcelona Supercomputing Center and Universidad del País Vasco + MaiteMeleroBarcelona Supercomputing Center 94-139 This paper studies gender bias in machine translation through the lens of Large Language Models (LLMs). Four widely-used test sets are employed to benchmark various base LLMs, comparing their translation quality and gender bias against state-of-the-art Neural Machine Translation (NMT) models for English to Catalan (En → Ca) and English to Spanish (En → Es) translation directions. Our findings reveal pervasive gender bias across all models, with base LLMs exhibiting a higher degree of bias compared to NMT models. To combat this bias, we explore prompting engineering techniques applied to an instruction-tuned LLM. 
We identify a prompt structure that significantly reduces gender bias by up to 12% on the WinoMT evaluation dataset compared to more straightforward prompts. These results significantly reduce the gender bias accuracy gap between LLMs and traditional NMT systems. 2024.gebnlp-1.7 @@ -100,11 +100,11 @@ Detecting Gender Discrimination on Actor Level Using Linguistic Discourse Analysis - StefanieUrchsLudwig-Maximilians-Universität München and Hochschule München - VeronikaThurnerHochschule München - MatthiasAßenmacherLudwig-Maximilians-Universität München + StefanieUrchsLudwig-Maximilians-Universität München and Hochschule München + VeronikaThurnerHochschule München + MatthiasAßenmacherLudwig-Maximilians-Universität München ChristianHeumannLudwig-Maximilians-Universität München - StephanieThiemichenHochschule München + StephanieThiemichenHochschule München 140-149 With the usage of tremendous amounts of text data for training powerful large language models such as ChatGPT, the issue of analysing and securing data quality has become more pressing than ever. Any biases, stereotypes and discriminatory patterns that exist in the training data can be reproduced, reinforced or broadly disseminated by the models in production. Therefore, it is crucial to carefully select and monitor the text data that is used as input to train the model. Due to the vast amount of training data, this process needs to be (at least partially) automated. In this work, we introduce a novel approach for automatically detecting gender discrimination in text data on the actor level based on linguistic discourse analysis. Specifically, we combine existing information extraction (IE) techniques to partly automate the qualitative research done in linguistic discourse analysis. We focus on two important steps: Identifying the respectiveperson-named-entity (an actor) and all forms it is referred to (Nomination), and detecting the characteristics it is ascribed (Predication). 
Asa proof of concept, we integrate these two steps into a pipeline for automated text analysis. The separate building blocks of the pipeline could be flexibly adapted, extended, and scaled for bigger datasets to accommodate a wide range of usage scenarios and specific ML tasks or help social scientists with analysis tasks. We showcase and evaluate our approach on several real and simulated exemplary texts. 2024.gebnlp-1.8 @@ -125,7 +125,7 @@ Towards Fairer <fixed-case>NLP</fixed-case> Models: Handling Gender Bias In Classification Tasks NasimSobhani - SarahDelanyTechnological University Dublin + SarahDelanyTechnological University Dublin 167-178 Measuring and mitigating gender bias in natural language processing (NLP) systems is crucial to ensure fair and ethical AI. However, a key challenge is the lack of explicit gender information in many textual datasets. This paper proposes two techniques, Identity Term Sampling (ITS) and Identity Term Pattern Extraction (ITPE), as alternatives to template-based approaches for measuring gender bias in text data. These approaches identify test data for measuring gender bias in the dataset itself and can be used to measure gender bias on any NLP classifier. We demonstrate the use of these approaches for measuring gender bias across various NLP classification tasks, including hate speech detection, fake news identification, and sentiment analysis. Additionally, we show how these techniques can benefit gender bias mitigation, proposing a variant of Counterfactual Data Augmentation (CDA), called Gender-Selective CDA (GS-CDA), which reduces the amount of data augmentation required in training data while effectively mitigating gender bias and maintaining overall classification performance. 
2024.gebnlp-1.10 @@ -136,7 +136,7 @@ Investigating Gender Bias in <fixed-case>STEM</fixed-case> Job Advertisements MalikaDikshit HoudaBouamorCarnegie Mellon University - NizarHabashNew York University Abu Dhabi + NizarHabashNew York University Abu Dhabi 179-189 Gender inequality has been historically prevalent in academia, especially within the fields of Science, Technology, Engineering, and Mathematics (STEM). In this study, we propose to examine gender bias in academic job descriptions in the STEM fields. We go a step further than previous studies that merely identify individual words as masculine-coded and feminine-coded and delve into the contextual language used in academic job advertisements. We design a novel approach to detect gender biases in job descriptions using Natural Language Processing techniques. Going beyond binary masculine-feminine stereotypes, we propose three big group types to understand gender bias in the language of job descriptions, namely agentic, balanced, and communal. We cluster similar information in job descriptions into these three groups using contrastive learning and various clustering techniques. This research contributes to the field of gender bias detection by providing a novel approach and methodology for categorizing gender bias in job descriptions, which can aid more effective and targeted job advertisements that will be equally appealing across all genders. 2024.gebnlp-1.11 @@ -145,10 +145,10 @@ Dissecting Biases in Relation Extraction: A Cross-Dataset Analysis on People’s Gender and Origin - MarcoStranisci - Pere-LluísHuguet Cabot - ElisaBassignana - RobertoNavigliSapienza University of Rome + MarcoStranisci + Pere-LluísHuguet Cabot + ElisaBassignana + RobertoNavigliSapienza University of Rome 190-202 Relation Extraction (RE) is at the core of many Natural Language Understanding tasks, including knowledge-base population and Question Answering. 
However, any Natural Language Processing system is exposed to biases, and the analysis of these has not received much attention in RE. We propose a new method for inspecting bias in the RE pipeline, which is completely transparent in terms of interpretability. Specifically, in this work we analyze biases related to gender and place of birth. Our methodology includes (i) obtaining semantic triplets (subject, object, semantic relation) involving ‘person’ entities from RE resources, (ii) collecting meta-information (‘gender’ and ‘place of birth’) using Entity Linking technologies, and then (iii) analyze the distribution of triplets across different groups (e.g., men versus women). We investigate bias at two levels: In the training data of three commonly used RE datasets (SREDFM, CrossRE, NYT), and in the predictions of a state-of-the-art RE approach (ReLiK). To enable cross-dataset analysis, we introduce a taxonomy of relation types mapping the label sets of different RE datasets to a unified label space. Our findings reveal that bias is a compounded issue affecting underrepresented groups within data and predictions for RE. 2024.gebnlp-1.12 @@ -157,7 +157,7 @@ Gender Bias in <fixed-case>T</fixed-case>urkish Word Embeddings: A Comprehensive Study of Syntax, Semantics and Morphology Across Domains - DuyguAltinok + DuyguAltinok 203-218 Gender bias in word representations has emerged as a prominent research area in recent years. While numerous studies have focused on measuring and addressing bias in English word embeddings, research on the Turkish language remains limited. This work aims to bridge this gap by conducting a comprehensive evaluation of gender bias in Turkish word embeddings, considering the dimensions of syntax, semantics, and morphology. We employ subword-based static word vectors trained on three distinct domains: web crawl, academical text, and medical text. 
Through the analysis of gender-associated words in each domain, we not only uncover gender bias but also gain insights into the unique characteristics of these domains. Additionally, we explore the influence of Turkish suffixes on word gender, providing a novel perspective on gender bias. Our findings reveal the pervasive nature of gender biases across various aspects of the Turkish language, including word frequency, semantics, parts-of-speech, and even the smallest linguistic unit - suffixes. Notably, we demonstrate that the majority of noun and verb lemmas, as well as adverbs and adjectives, exhibit masculine gendering in the general-purpose written language. This study is the first of its kind to offer a comprehensive examination of gender bias in the Turkish language. 2024.gebnlp-1.13 @@ -169,7 +169,7 @@ HaotianZhu KexinGaoUniversity of Washington FeiXiaUniversity of Washington, Seattle - MariOstendorfUniversity of Washington + MariOstendorfUniversity of Washington 219-236 Gender bias has been extensively studied in both the educational field and the Natural Language Processing (NLP) field, the former using human coding to identify patterns associated with and causes of gender bias in text and the latter to detect, measure and mitigate gender bias in NLP output and models. This work aims to use NLP to facilitate automatic, quantitative analysis of educational text within the framework of a gender bias taxonomy. Analyses of both educational texts and a lexical resource (WordNet) reveal patterns of bias that can inform and aid educators in updating textbooks and lexical resources and in designing assessment items. 
2024.gebnlp-1.14 @@ -181,7 +181,7 @@ SarthakGargApple MozhdehGheiniUSC/ISI ClaraEmmanuelUniversity of New South Wales - TatianaLikhomanenkoApple + TatianaLikhomanenkoApple QinGaoApple MatthiasPaulikApple 237-254 @@ -192,12 +192,12 @@ Beyond Binary Gender Labels: Revealing Gender Bias in <fixed-case>LLM</fixed-case>s through Gender-Neutral Name Predictions - ZhiwenYouUniversity of Illinois at Urbana-Champaign + ZhiwenYouUniversity of Illinois at Urbana-Champaign HaeJinLee - ShubhanshuMishrashubhanshu.com + ShubhanshuMishrashubhanshu.com SullamJeoung ApratimMishraUniversity of Illinois at Urbana-Champaign - JinseokKim + JinseokKim JanaDiesnerTechnische Universität München 255-268 Name-based gender prediction has traditionally categorized individuals as either female or male based on their names, using a binary classification system. That binary approach can be problematic in the cases of gender-neutral names that do not align with any one gender, among other reasons. Relying solely on binary gender categories without recognizing gender-neutral names can reduce the inclusiveness of gender prediction tasks. We introduce an additional gender category, i.e., “neutral”, to study and address potential gender biases in Large Language Models (LLMs). We evaluate the performance of several foundational and large language models in predicting gender based on first names only. Additionally, we investigate the impact of adding birth years to enhance the accuracy of gender prediction, accounting for shifting associations between names and genders over time. Our findings indicate that most LLMs identify male and female names with high accuracy (over 80%) but struggle with gender-neutral names (under 40%), and the accuracy of gender prediction is higher for English-based first names than non-English names. The experimental results show that incorporating the birth year does not improve the overall accuracy of gender prediction, especially for names with evolving gender associations. 
We recommend using caution when applying LLMs for gender identification in downstream tasks, particularly when dealing with non-binary gender labels. @@ -207,7 +207,7 @@ Is there Gender Bias in Dependency Parsing? Revisiting “Women’s Syntactic Resilience” - PaulGo + PaulGo AgnieszkaFalenskaInterchange Forum for Reflecting on Intelligent Systems, University of Stuttgart 269-279 In this paper, we revisit the seminal work of Garimella et al. 2019, who reported that dependency parsers learn demographically-related signals from their training data and perform differently on sentences authored by people of different genders. We re-run all the parsing experiments from Garimella et al. 2019 and find that their results are not reproducible. Additionally, the original patterns suggesting the presence of gender biases fail to generalize to other treebank and parsing architecture. Instead, our data analysis uncovers methodological shortcomings in the initial study that artificially introduced differences into female and male datasets during preprocessing. These disparities potentially compromised the validity of the original conclusions. @@ -217,7 +217,7 @@ From ‘Showgirls’ to ‘Performers’: Fine-tuning with Gender-inclusive Language for Bias Reduction in <fixed-case>LLM</fixed-case>s - MarionBartl + MarionBartl SusanLeavyUniversity College Dublin 280-294 Gender bias is not only prevalent in Large Language Models (LLMs) and their training data, but also firmly ingrained into the structural aspects of language itself. 
Therefore, adapting linguistic structures within LLM training data to promote gender-inclusivity can make gender representations within the model more inclusive.The focus of our work are gender-exclusive affixes in English, such as in ‘show-girl’ or ‘man-cave’, which can perpetuate gender stereotypes and binary conceptions of gender.We use an LLM training dataset to compile a catalogue of 692 gender-exclusive terms along with gender-neutral variants and from this, develop a gender-inclusive fine-tuning dataset, the ‘Tiny Heap’. Fine-tuning three different LLMs with this dataset, we observe an overall reduction in gender-stereotyping tendencies across the models. Our approach provides a practical method for enhancing gender inclusivity in LLM training data and contributes to incorporating queer-feminist linguistic activism in bias mitigation research in NLP. @@ -228,9 +228,9 @@ Sociodemographic Bias in Language Models: A Survey and Forward Path VipulGuptaPennsylvania State University - PranavNarayanan Venkit - ShomirWilsonPennsylvania State University - RebeccaPassonneauPennsylvania State University + PranavNarayanan Venkit + ShomirWilsonPennsylvania State University + RebeccaPassonneauPennsylvania State University 295-322 Sociodemographic bias in language models (LMs) has the potential for harm when deployed in real-world settings. This paper presents a comprehensive survey of the past decade of research on sociodemographic bias in LMs, organized into a typology that facilitates examining the different aims: types of bias, quantifying bias, and debiasing techniques. We track the evolution of the latter two questions, then identify current trends and their limitations, as well as emerging techniques. To guide future research towards more effective and reliable solutions, and to help authors situate their work within this broad landscape, we conclude with a checklist of open questions. 2024.gebnlp-1.19 @@ -239,8 +239,8 @@ Stop! 
In the Name of Flaws: Disentangling Personal Names and Sociodemographic Attributes in <fixed-case>NLP</fixed-case> - VagrantGautamSaarland University - ArjunSubramonianUniversity of California, Los Angeles + VagrantGautamSaarland University + ArjunSubramonianUniversity of California, Los Angeles AnneLauscherUniversität Hamburg OsKeyes 323-337 @@ -252,7 +252,7 @@ Evaluating Gender Bias in Multilingual Multimodal <fixed-case>AI</fixed-case> Models: Insights from an <fixed-case>I</fixed-case>ndian Context KshitishGhate - ArjunChoudhry + ArjunChoudhry VanyaBannihatti Kumar 338-350 We evaluate gender biases in multilingual multimodal image and text models in two settings: text-to-image retrieval and text-to-image generation, to show that even seemingly gender-neutral traits generate biased results. We evaluate our framework in the context of people from India, working with two languages: English and Hindi. We work with frameworks built around mCLIP-based models to ensure a thorough evaluation of recent state-of-the-art models in the multilingual setting due to their potential for widespread applications. We analyze the results across 50 traits for retrieval and 8 traits for generation, showing that current multilingual multimodal models are biased towards men for most traits, and this problem is further exacerbated for lower-resource languages like Hindi. We further discuss potential reasons behind this observation, particularly stemming from the bias introduced by the pretraining datasets. @@ -263,7 +263,7 @@ Detecting and Mitigating <fixed-case>LGBTQIA</fixed-case>+ Bias in Large <fixed-case>N</fixed-case>orwegian Language Models SelmaBergstrand - BjörnGambäckNorwegian University of Science and Technology + BjörnGambäckNorwegian University of Science and Technology 351-364 The paper aims to detect and mitigate LGBTQIA+ bias in large language models (LLMs). As the usage of LLMs quickly increases, so does the significance of the harms they may cause due to bias. 
The research field of bias in LLMs has seen massive growth, but few attempts have been made to detect or mitigate other biases than gender bias, and most focus has been on English LLMs. This work shows experimentally that LLMs may cause representational harms towards LGBTQIA+ individuals when evaluated on sentence completion tasks and on a benchmark dataset constructed from stereotypes reported by the queer community of Norway, collected through a survey in order to directly involve the affected community. Furthermore, Norwegian training corpora are probed for queer bias, revealing strong associations between queer terms and anti-queer slurs, as well as words related to pedophilia. Finally, a fine-tuning-based debiasing method is applied to two Norwegian LLMs. This method does not consistently reduce bias, but shows that queer bias can be altered, laying the foundation for future debiasing approaches. By shedding light on the severe discrimination that can occur through the usage of LLMs, this paper contributes to the ongoing fight for equal rights for the LGBTQIA+ community. 2024.gebnlp-1.22 @@ -273,7 +273,7 @@ Whose wife is it anyway? Assessing bias against same-gender relationships in machine translation IanStewartPacific Northwest National Laboratory - RadaMihalceaUniversity of Michigan + RadaMihalceaUniversity of Michigan 365-375 Machine translation often suffers from biased data and algorithms that can lead to unacceptable errors in system output. While bias in gender norms has been investigated, less is known about whether MT systems encode bias about social relationships, e.g., “the lawyer kissed her wife.” We investigate the degree of bias against same-gender relationships in MT systems, using generated template sentences drawn from several noun-gender languages (e.g., Spanish) and comprised of popular occupation nouns. 
We find that three popular MT services consistently fail to accurately translate sentences concerning relationships between entities of the same gender. The error rate varies considerably based on the context, and same-gender sentences referencing high female-representation occupations are translated with lower accuracy. We provide this work as a case study in the evaluation of intrinsic bias in NLP systems with respect to social relationships. 2024.gebnlp-1.23 @@ -310,7 +310,7 @@ AgnieszkaFalenskaInterchange Forum for Reflecting on Intelligent Systems, University of Stuttgart SeraphinaGoldfarb-Tarrant RafaelMosqueraDynabench - DeboraNozzaBocconi University + DeboraNozzaBocconi University EduardoSánchezFAIR, Meta 399-404 We describe the details of the Shared Task of the 5th ACL Workshop on Gender Bias in Natural Language Processing (GeBNLP 2024). The task uses dataset to investigate the quality of Machine Translation systems on a particular case of gender robustness. We report baseline results as well as the results of the first participants. The shared task will be permanently available in the Dynabench platform. diff --git a/data/xml/2024.hucllm.xml b/data/xml/2024.hucllm.xml index b53df0c24e..8972c2a13c 100644 --- a/data/xml/2024.hucllm.xml +++ b/data/xml/2024.hucllm.xml @@ -25,8 +25,8 @@ Human Speech Perception in Noise: Can Large Language Models Paraphrase to Improve It? 
AnupamaChingacham MiaoranZhangSaarland University - VeraDembergUniversität des Saarlandes - DietrichKlakowSaarland University + VeraDembergUniversität des Saarlandes + DietrichKlakowSaarland University 1-15 Large Language Models (LLMs) can generate text by transferring style attributes like formality resulting in formal or informal text.However, instructing LLMs to generate text that when spoken, is more intelligible in an acoustically difficult environment, is an under-explored topic.We conduct the first study to evaluate LLMs on a novel task of generating acoustically intelligible paraphrases for better human speech perception in noise.Our experiments in English demonstrated that with standard prompting, LLMs struggle to control the non-textual attribute, i.e., acoustic intelligibility, while efficiently capturing the desired textual attributes like semantic equivalence. To remedy this issue, we propose a simple prompting approach, prompt-and-select, which generates paraphrases by decoupling the desired textual and non-textual attributes in the text generation pipeline.Our approach resulted in a 40% relative improvement in human speech perception, by paraphrasing utterances that are highly distorted in a listening condition with babble noise at signal-to-noise ratio (SNR) -5 dB. This study reveals the limitation of LLMs in capturing non-textual attributes, and our proposed method showcases the potential of using LLMs for better human speech perception in noise. 
2024.hucllm-1.1 @@ -35,11 +35,11 @@ Human-Centered Design Recommendations for <fixed-case>LLM</fixed-case>-as-a-judge - QianPanIBM, International Business Machines + QianPanIBM, International Business Machines ZahraAshktorab MichaelDesmond MartínSantillán Cooper - JamesJohnson + JamesJohnson RahulNairIBM Research Europe ElizabethDalyIBM Research WernerGeyer @@ -63,8 +63,8 @@ To What Extent Are Large Language Models Capable of Generating Substantial Reflections for Motivational Interviewing Counseling Chatbots? A Human Evaluation - ErkanBasar - IrisHendrickxRadboud University Nijmegen, the Netherlands + ErkanBasar + IrisHendrickxRadboud University Nijmegen, the Netherlands EmielKrahmerTilburg University Gert-JanBruijn TiborBosseRadboud University @@ -76,12 +76,12 @@ Vision-Language Models under Cultural and Inclusive Considerations - AntoniaKaramolegkou - PhillipRust - RuixiangCui + AntoniaKaramolegkou + PhillipRust + RuixiangCui YongCao AndersSøgaardCopenhagen University - DanielHershcovichUniversity of Copenhagen + DanielHershcovichUniversity of Copenhagen 53-66 Large Vision Language Models can be used to assist visually impaired individuals by describing images they capture in their daily lives. Current evaluation datasets may not reflect the diverse cultural user backgrounds nor the situational context of this use case. To address this problem, we create a survey to determine caption preferences and propose a culture-centric evaluation benchmark by filtering VizWiz, an existing dataset with images taken by people who are blind. We then evaluate different models and prompts, investigating their reliability as visual assistants. While the evaluation results for state-of-the-art models seem promising, we identified some weak spots such as hallucinations and problems with conventional evaluation metrics. Our survey, data, code, and model outputs will be publicly available. 
2024.hucllm-1.5 @@ -92,7 +92,7 @@ Evaluating Large Language Models on Social Signal Sensitivity: An Appraisal Theory Approach ZhenWu RitamDuttCarnegie Mellon University - CarolynRoseSchool of Computer Science, Carnegie Mellon University + CarolynRoseSchool of Computer Science, Carnegie Mellon University 67-80 We present a framework to assess the sensitivity of Large Language Models (LLMs) to textually embedded social signals using an Appraisal Theory perspective. We report on an experiment that uses prompts encoding three dimensions of social signals: Affect, Judgment, and Appreciation. In response to the prompt, an LLM generates both an analysis (Insight) and a conversational Response, which are analyzed in terms of sensitivity to the signals. We quantitatively evaluate the output text through topical analysis of the Insight and predicted social intelligence scores of the Response in terms of empathy and emotional polarity. Key findings show that LLMs are more sensitive to positive signals. The personas impact Responses but not the Insight. We discuss how our framework can be extended to a broader set of social signals, personas, and scenarios to evaluate LLM behaviors under various conditions. 
2024.hucllm-1.6 diff --git a/data/xml/2024.kallm.xml b/data/xml/2024.kallm.xml index 70c53048ff..37728c278e 100644 --- a/data/xml/2024.kallm.xml +++ b/data/xml/2024.kallm.xml @@ -27,7 +27,7 @@ YeonSeonwoo SeunghyunYoonAdobe Research JamesThorneKAIST - AliceOhKorea Advanced Institute of Science and Technology + AliceOhKorea Advanced Institute of Science and Technology 1-11 Application of LLM to database queries on natural language sentences has demonstrated impressive results in both single and multi-hop scenarios.In the existing methodologies, the requirement to re-encode query vectors at each stage for processing multi-hop queries presents a significant bottleneck to the inference speed.This paper proposes VKGFR (Virtual Knowledge Graph based Fact Retriever) that leverages large language models to extract representations corresponding to a sentence’s knowledge graph, significantly enhancing inference speed for multi-hop reasoning without performance loss.Given that both the queries and natural language database sentences can be structured as a knowledge graph, we suggest extracting a Virtual Knowledge Graph (VKG) representation from sentences with LLM.Over the pre-constructed VKG, our VKGFR conducts retrieval with a tiny model structure, showing performance improvements with higher computational efficiency. We evaluate VKGFR on the WikiNLDB and MetaQA dataset, designed for multi-hop database reasoning over text. The results indicate 13x faster inference speed on the WikiNLDB dataset without performance loss. 
2024.kallm-1.1 @@ -38,9 +38,9 @@ Zero- and Few-Shots Knowledge Graph Triplet Extraction with Large Language Models AndreaPapaluca DanielKreflLudwig-Maximilians-Universität München - SergioRodríguez MéndezAustralian National University - ArtemLenskyUniversity of New South Wales and University of Sydney, University of Sydney - HannaSuominenAustralian National University + SergioRodríguez MéndezAustralian National University + ArtemLenskyUniversity of New South Wales and University of Sydney, University of Sydney + HannaSuominenAustralian National University 12-23 In this work, we tested the Triplet Extraction (TE) capabilities of a variety of Large Language Models (LLMs) of different sizes in the Zero- and Few-Shots settings. In detail, we proposed a pipeline that dynamically gathers contextual information from a Knowledge Base (KB), both in the form of context triplets and of (sentence, triplets) pairs as examples, and provides it to the LLM through a prompt. The additional context allowed the LLMs to be competitive with all the older fully trained baselines based on the Bidirectional Long Short-Term Memory (BiLSTM) Network architecture. We further conducted a detailed analysis of the quality of the gathered KB context, finding it to be strongly correlated with the final TE performance of the model. In contrast, the size of the model appeared to only logarithmically improve the TE capabilities of the LLMs. We release the code on GitHub for reproducibility. 2024.kallm-1.2 @@ -61,8 +61,8 @@ Application of Generative <fixed-case>AI</fixed-case> as an Enterprise Wikibase Knowledge Graph <fixed-case>Q</fixed-case>&<fixed-case>A</fixed-case> System - RenêMendesUniversidade Presbiteriana Mackenzie - DimasOliveira + RenêMendesUniversidade Presbiteriana Mackenzie + DimasOliveira VictorGarcia 35-42 Generative AI and Large Language Models are increasingly used in business contexts. 
One application involves natural language conversations contextualized by company data, which can be accomplished by Enterprise Knowledge Graphs, standardized representations of data. This paper outlines an architecture for implementation of an Enterprise Knowledge Graph using open-source Wikibase software. Additionally, it is presented a Knowledge Graph Q&A System powered by Generative AI. @@ -73,7 +73,7 @@ <fixed-case>KGAST</fixed-case>: From Knowledge Graphs to Annotated Synthetic Texts NakanysethVuth - GillesSérassetUniversité Grenoble Alpes + GillesSérassetUniversité Grenoble Alpes DidierSchwabUniversité Grenoble Alpes 43-55 In recent years, the use of synthetic data, either as a complement or a substitute for original data, has emerged as a solution to challenges such as data scarcity and security risks. This paper is an initial attempt to automatically generate such data for Information Extraction tasks. We accomplished this by developing a novel synthetic data generation framework called KGAST, which leverages Knowledge Graphs and Large Language Models. In our preliminary study, we conducted simple experiments to generate synthetic versions of two datasets—a French security defense dataset and an English general domain dataset, after which we evaluated them both intrinsically and extrinsically. The results indicated that synthetic data can effectively complement original data, improving the performance of models on classes with limited training samples. This highlights KGAST’s potential as a tool for generating synthetic data for Information Extraction tasks. 
@@ -83,7 +83,7 @@ <fixed-case>HRG</fixed-case>raph: Leveraging <fixed-case>LLM</fixed-case>s for <fixed-case>HR</fixed-case> Data Knowledge Graphs with Information Propagation-based Job Recommendation - Azmine ToushikWasi + Azmine ToushikWasi 56-62 Knowledge Graphs (KGs) serving as semantic networks, prove highly effective in managing complex interconnected data in different domains, by offering a unified, contextualized, and structured representation with flexibility that allows for easy adaptation to evolving knowledge. Processing complex Human Resources (HR) data, KGs can help in different HR functions like recruitment, job matching, identifying learning gaps, and enhancing employee retention. Despite their potential, limited efforts have been made to implement practical HR knowledge graphs. This study addresses this gap by presenting a framework for effectively developing HR knowledge graphs from documents using Large Language Models. The resulting KG can be used for a variety of downstream tasks, including job matching, identifying employee skill gaps, and many more. In this work, we showcase instances where HR KGs prove instrumental in precise job matching, yielding advantages for both employers and employees. Empirical evidence from experiments with information propagation in KGs and Graph Neural Nets, along with case studies underscores the effectiveness of KGs in tasks such as job and employee recommendations and job area classification. 
Code and data are available at : https://github.com/azminewasi/HRGraph 2024.kallm-1.6 @@ -94,7 +94,7 @@ Adapting Multilingual <fixed-case>LLM</fixed-case>s to Low-Resource Languages with Knowledge Graphs via Adapters DaniilGurgurov MareikeHartmannUniversität des Saarlandes - SimonOstermannGerman Research Center for AI + SimonOstermannGerman Research Center for AI 63-74 This paper explores the integration of graph knowledge from linguistic ontologies into multilingual Large Language Models (LLMs) using adapters to improve performance for low-resource languages (LRLs) in sentiment analysis (SA) and named entity recognition (NER). Building upon successful parameter-efficient fine-tuning techniques, such as K-ADAPTER and MAD-X, we propose a similar approach for incorporating knowledge from multilingual graphs, connecting concepts in various languages with each other through linguistic relationships, into multilingual LLMs for LRLs. Specifically, we focus on eight LRLs — Maltese, Bulgarian, Indonesian, Nepali, Javanese, Uyghur, Tibetan, and Sinhala — and employ language-specific adapters fine-tuned on data extracted from the language-specific section of ConceptNet, aiming to enable knowledge transfer across the languages covered by the knowledge graph. We compare various fine-tuning objectives, including standard Masked Language Modeling (MLM), MLM with full-word masking, and MLM with targeted masking, to analyze their effectiveness in learning and integrating the extracted graph data. Through empirical evaluation on language-specific tasks, we assess how structured graph knowledge affects the performance of multilingual LLMs for LRLs in SA and NER, providing insights into the potential benefits of adapting language models for low-resource scenarios. 
2024.kallm-1.7 @@ -116,7 +116,7 @@ Educational Material to Knowledge Graph Conversion: A Methodology to Enhance Digital Education MiquelCanal-EsteveUniversidad de Alicante - YoanGutierrezUniversity of Alicante + YoanGutierrezUniversity of Alicante 85-91 This article argues that digital educational content should be structured as knowledge graphs (KGs). Unlike traditional repositories such as Moodle, a KG offers a more flexible representation of the relationships between concepts, facilitating intuitive navigation and discovery of connections. In addition, it integrates effectively with Large Language Models, enhancing personalized explanations, answers, and recommendations. This article studies different proposals based on semantics and knowledge modelling to determine the most appropriate ways to strengthen intelligent educational technologies. 2024.kallm-1.9 @@ -138,7 +138,7 @@ Zero-Shot Fact-Checking with Semantic Triples and Knowledge Graphs MoyYuanUniversity of Cambridge - AndreasVlachosUniversity of Cambridge + AndreasVlachosUniversity of Cambridge 105-115 Despite progress in automated fact-checking, most systems require a significant amount of labeled training data, which is expensive. In this paper, we propose a novel zero-shot method, which instead of operating directly on the claim and evidence sentences, decomposes them into semantic triples augmented using external knowledge graphs, and uses large language models trained for natural language inference. This allows it to generalize to adversarial datasets and domains that supervised models require specific training data for. Our empirical results show that our approach outperforms previous zero-shot approaches on FEVER, FEVER-Symmetric, FEVER 2.0, and Climate-FEVER, while being comparable or better than supervised models on the adversarial and the out-of-domain datasets. 
2024.kallm-1.11 @@ -151,7 +151,7 @@ TylerSadler Mohammad RezaTaesiri WenjieXu - MarekReformat + MarekReformat 116-124 Advanced language models with impressive capabilities to process textual information can more effectively extract high-quality triples, which are the building blocks of knowledge graphs. Our work examines language models’ abilities to extract entities and the relationships between them. We use a diverse data augmentation process to fine-tune large language models to extract triples from the text. Fine-tuning is performed using a mix of trainers from HuggingFace and five public datasets, such as different variations of the WebNLG, SKE, DocRed, FewRel, and KELM. Evaluation involves comparing model outputs with test-set triples based on several criteria, such as type, partial, exact, and strict accuracy.The obtained results outperform ChatGPT and even match or exceed the performance of GPT-4. 2024.kallm-1.12 diff --git a/data/xml/2024.knowledgenlp.xml b/data/xml/2024.knowledgenlp.xml index b587453e4c..1ba5d19fc2 100644 --- a/data/xml/2024.knowledgenlp.xml +++ b/data/xml/2024.knowledgenlp.xml @@ -26,9 +26,9 @@ <fixed-case>GAD</fixed-case>e<fixed-case>P</fixed-case>o: Graph-Assisted Declarative Pooling Transformers for Document-Level Relation Extraction AndreiComan - ChristosTheodoropoulos - Marie-FrancineMoensKU Leuven, KU Leuven - JamesHendersonIdiap Research Institute + ChristosTheodoropoulos + Marie-FrancineMoensKU Leuven, KU Leuven + JamesHendersonIdiap Research Institute 1-14 Document-level relation extraction typically relies on text-based encoders and hand-coded pooling heuristics to aggregate information learned by the encoder. In this paper, we leverage the intrinsic graph processing capabilities of the Transformer model and propose replacing hand-coded pooling methods with new tokens in the input, which are designed to aggregate information via explicit graph relations in the computation of attention weights. 
We introduce a joint text-graph Transformer model and a graph-assisted declarative pooling (GADePo) specification of the input, which provides explicit and high-level instructions for information aggregation. GADePo allows the pooling process to be guided by domain-specific knowledge or desired outcomes but still learned by the Transformer, leading to more flexible and customisable pooling strategies. We evaluate our method across diverse datasets and models and show that our approach yields promising results that are consistently better than those achieved by the hand-coded pooling functions. 2024.knowledgenlp-1.1 @@ -42,12 +42,12 @@ ChaithraBhat AnkitaGupta RuiyiZhangAdobe Systems - ShubhamAgarwalAdobe Systems + ShubhamAgarwalAdobe Systems KarishmaBaggaAdobe Research - SeunghyunYoonAdobe Research - NedimLipkaAdobe Systems + SeunghyunYoonAdobe Research + NedimLipkaAdobe Systems RyanRossiAdobe Research - FranckDernoncourtAdobe Systems + FranckDernoncourtAdobe Systems 15-29 Question-answering for domain-specific applications has recently attracted much interest due to the latest advancements in large language models (LLMs). However, accurately assessing the performance of these applications remains a challenge, mainly due to the lack of suitable benchmarks that effectively simulate real-world scenarios. To address this challenge, we introduce two product question-answering (QA) datasets focused on Adobe Acrobat and Photoshop products to help evaluate the performance of existing models on domain-specific product QA tasks. Additionally, we propose a novel knowledge-driven RAG-QA framework to enhance the performance of the models in the product QA task. Our experiments demonstrated that inducing domain knowledge through query reformulation allowed for increased retrieval and generative performance when compared to standard RAG-QA methods. This improvement, however, is slight, and thus illustrates the challenge posed by the datasets introduced. 
2024.knowledgenlp-1.2 @@ -56,9 +56,9 @@ Collecting High-quality Multi-modal Conversational Search Data for <fixed-case>E</fixed-case>-Commerce - MarcusCollinsAmazon + MarcusCollinsAmazon OlegRokhlenko - EugeneAgichteinAmazon and Emory University + EugeneAgichteinAmazon and Emory University ShervinMalmasiAmazon 30-43 Continued improvement of conversational assistants in knowledge-rich domains like E-Commerce requires large volumes of realistic high-quality conversation data to power increasingly sophisticated large language model chatbots, dialogue managers, response rankers, and recommenders. The problem is exacerbated for multi-modal interactions in realistic conversational product search and recommendation. Here, an artificial sales agent must interact intelligently with a customer using both textual and visual information and incorporate results from external search systems, such as a product catalog. Yet, it remains an open question how to best crowd-source large-scale, naturalistic multi-modal dialogue and action data, required to train such an artificial agent. We describe our crowd-sourced task where one worker (the Buyer) plays the role of the customer, and another (the Seller) plays the role of the sales agent. We identify subtle interactions between one worker’s environment and their partner’s behavior mediated by workers’ word choice. We find that limiting information presented to the Buyer, both in their backstory and by the Seller, improves conversation quality. We also show how conversations are improved through minimal automated Seller “coaching”. While typed and spoken messages are slightly different, the differences are not as large as frequently assumed. We plan to release our platform code and the resulting dialogues to advance research on conversational search agents. 
@@ -85,7 +85,7 @@ KoseiBuma ShoMiyakawaUniversity of Tsukuba, Tsukuba University TakehitoUtsuroUniversity of Tsukuba - MasaharuYoshiokaHokkaido University + MasaharuYoshiokaHokkaido University 59-72 This paper aims to augment fans’ ability to critique and explore information related to celebrities of interest. First, we collect posts from X (formerly Twitter) that discuss matters related to specific celebrities. For the collection of major impressions from these posts, we employ ChatGPT as a large language model (LLM) to analyze and summarize key sentiments. Next, based on collected impressions, we search for Web pages and collect the content of the top 30 ranked pages as the source for exploring the reasons behind those impressions. Once the Web page content collection is complete, we collect and aggregate detailed reasons for the impressions on the celebrities from the content of each page. For this part, we continue to use ChatGPT, enhanced by the retrieval augmented generation (RAG) framework, to ensure the reliability of the collected results compared to relying solely on the prior knowledge of the LLM. Evaluation results by comparing a reference that is manually collected and aggregated reasons with those predicted by ChatGPT revealed that ChatGPT achieves high accuracy in reason collection and aggregation. Furthermore, we compared the performance of ChatGPT with an existing model of mT5 in reason collection and confirmed that ChatGPT exhibits superior performance. 2024.knowledgenlp-1.5 diff --git a/data/xml/2024.knowllm.xml b/data/xml/2024.knowllm.xml index 68df7a269d..2174325c1b 100644 --- a/data/xml/2024.knowllm.xml +++ b/data/xml/2024.knowllm.xml @@ -50,7 +50,7 @@ WeidongGuoTencent ZhuweiRao YuXuTencent - DiNiuUniversity of Alberta and University of Alberta + DiNiuUniversity of Alberta and University of Alberta 27-31 Ensuring factual consistency between the summary and the original document is paramount in summarization tasks.
Consequently, considerable effort has been dedicated to detecting inconsistencies. With the advent of Large Language Models (LLMs), recent studies have begun to leverage their advanced language understanding capabilities for inconsistency detection. However, early attempts have shown that LLMs underperform traditional models due to their limited ability to follow instructions and the absence of an effective detection methodology. In this study, we reassess summary inconsistency detection with LLMs, comparing the performances of GPT-3.5 and GPT-4. To advance research in LLM-based inconsistency detection, we propose SIFiD (Summary Inconsistency Detection with Filtered Document) that identify key sentences within documents by either employing natural language inference or measuring semantic similarity between summaries and documents. 2024.knowllm-1.3 @@ -70,11 +70,11 @@ Retrieval-Augmented Knowledge Integration into Language Models: A Survey - YuxuanChenGerman Research Center for AI, German Research Center for AI and Freie Universität Berlin + YuxuanChenGerman Research Center for AI, German Research Center for AI and Freie Universität Berlin DanielRöderGerman Research Center for AI Justus-JonasErker - LeonhardHennigGerman Research Center for AI - PhilippeThomasGerman Research Center for AI + LeonhardHennigGerman Research Center for AI + PhilippeThomasGerman Research Center for AI SebastianMöller RolandRollerGerman Research Center for AI 45-63 @@ -85,7 +85,7 @@ <fixed-case>C</fixed-case>linical<fixed-case>RAG</fixed-case>: Enhancing Clinical Decision Support through Heterogeneous Knowledge Retrieval - YuxingLu + YuxingLu XukaiZhao JinzhuoWangPeking University 64-68 @@ -97,7 +97,7 @@ Modeling Uncertainty and Using Post-fusion as Fallback Improves Retrieval Augmented Generation with <fixed-case>LLM</fixed-case>s YeLiuSalesForce.com - RuiMengSalesForce Research + RuiMengSalesForce Research Meghana MoorthyBhatSalesforce Research ShafiqJotySalesForce.com and Nanyang 
Technological University CaimingXiongSalesforce Research @@ -113,7 +113,7 @@ <fixed-case>A</fixed-case>c<fixed-case>K</fixed-case>nowledge: Acquired Knowledge Representation by Small Language Model Without Pre-training SouravDas SanjayChatterji - ImonMukherjee + ImonMukherjee 83-95 Large language models (LLMs) are pre-trained on enormous amounts of text data and show acclaimed success in knowledge representation. However, there are two bottlenecks with this approach. (1) Pre-training data cannot be regularly updated once the models are deployed, and it is not very fruitful if the model cannot represent updated knowledge. (2) The consistently increasing size and computational resources make it difficult for non-commercial and individual researchers to fine-tune and scale these language models. Major LLMs with external knowledge are also proprietary. In this paper, we propose AcKnowledge, a framework wrapped around a small, non-pre-trained language model for an open-domain question-answering (QA) experiment. AcKnowledge learns relevant knowledge from the internet via meta-learning based on user questions, and re-learns from user feedback if knowledge is misrepresented. Our efficient knowledge representation framework avoids pre-training overhead while enabling updated information. Benchmarking shows competitive performance against similarly sized state-of-the-art (SoTA) LLMs on gold standard QA datasets, demonstrating the potential of integrating internet search and user feedback for improved performance and generalizability. 
2024.knowllm-1.8 @@ -125,10 +125,10 @@ JanHoffbauer SylwesterSawickiUniversität Potsdam MarcUlrichUniversität Potsdam - TolgaBuzKearney + TolgaBuzKearney KonstantinDoblerHasso Plattner Institute MoritzSchneiderHasso Plattner Institute - GerardDe MeloHasso Plattner Institute and University of Potsdam + GerardDe MeloHasso Plattner Institute and University of Potsdam 96-108 Powerful LLMs like ChatGPT are adopted rapidly for a wide array of tasks, but their limitations in domain-specific areas become apparent, particularly when prompted to recite facts. This is critical especially for knowledge workers, who are adopting LLM-based tools rapidly.While there are various techniques that can help ingest knowledge into LLMs such as instruction tuning and alignment, most have disadvantages. We examine the impact of prominent training techniques on LLMs’ knowledge accuracy using a knowledge-dense dataset that we curate from r/AskHistorians, a rich source of historical knowledge. We evaluate the impact of different models sizes from 1.3B to 7B parameters and other factors such as LoRA adapters, quantization, overfitting, and the inclusion of Reddit data in pretraining.In addition, we measure linguistic metrics and human and LLM-based preference. Our results suggest that pretraining and model size have a much stronger effect on knowledge accuracy than continued pretraining – unless the model is overfit to the tested knowledge.Fine-tuning on our Reddit dataset introduces less complex, but slightly more toxic language. Our study explores the challenges of injecting domain-specific datasets into LLMs and has implications for practitioners, e.g., when LLMs are to be fine-tuned with a company’s datasets. 
2024.knowllm-1.9 @@ -150,7 +150,7 @@ <fixed-case>P</fixed-case>rompt<fixed-case>RE</fixed-case>: Weakly-Supervised Document-Level Relation Extraction via Prompting-Based Data Programming ChufanGao XulinFan - JimengSunUniversity of Illinois, Urbana Champaign, College of Computing and Georgia Institute of Technology + JimengSunUniversity of Illinois, Urbana Champaign, College of Computing and Georgia Institute of Technology XuanWangVirginia Polytechnic Institute and State University 132-145 Relation extraction aims to classify the relationships between two entities into pre-defined categories. While previous research has mainly focused on sentence-level relation extraction, recent studies have expanded the scope to document-level relation extraction. Traditional relation extraction methods heavily rely on human-annotated training data, which is time-consuming and labor-intensive. To mitigate the need for manual annotation, recent weakly-supervised approaches have been developed for sentence-level relation extraction while limited work has been done on document-level relation extraction. Weakly-supervised document-level relation extraction faces significant challenges due to an imbalanced number “no relation” instances and the failure of directly probing pretrained large language models for document relation extraction. To address these challenges, we propose PromptRE, a novel weakly-supervised document-level relation extraction method that combines prompting-based techniques with data programming. Furthermore, PromptRE incorporates the label distribution and entity types as prior knowledge to improve the performance. By leveraging the strengths of both prompting and data programming, PromptRE achieves improved performance in relation classification and effectively handles the “no relation” problem. Experimental results on ReDocRED, a benchmark dataset for document-level relation extraction, demonstrate the superiority of PromptRE over baseline approaches. 
@@ -161,7 +161,7 @@ Patent Response System Optimised for Faithfulness: Procedural Knowledge Embodiment with Knowledge Graph and Retrieval Augmented Generation Jung-MeiChuNational Taiwan University - Hao-ChengLoJCIPRNET and National Taiwan University + Hao-ChengLoJCIPRNET and National Taiwan University JiehHsiangNational Taiwan University Chun-ChiehChoJCIPRNET 146-155 @@ -174,7 +174,7 @@ Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders JinseokKimSeoul National University JaewonJungSeoul National University - SangyeopKimCoxwave and Seoul National University + SangyeopKimCoxwave and Seoul National University SohhyungParkSeoul National University SungzoonChoSeoul National University 156-170 @@ -186,7 +186,7 @@ Measuring the Inconsistency of Large Language Models in Preferential Ranking XiutianZhaoUniversity of Edinburgh - KeWangHuawei Technologies Ltd. + KeWangHuawei Technologies Ltd. WeiPengHuawei Technologies Ltd. 171-176 Despite large language models’ (LLMs’) recent advancements, their bias and hallucination issues persist, and their ability to offer consistent and preferential rankings remains underexplored. This study investigates the capacity of LLMs to provide consistent ordinal preferences, a crucial aspect in scenarios lacking absolute answers. We introduce a formalization of consistency based on order theory, outlining criteria such as transitivity, asymmetry, reversibility, and independence from irrelevant alternatives. Our diagnostic experiments on selected state-of-the-art LLMs reveal their inability to meet these criteria, indicating a strong positional bias and poor transitivity, with preferences easily swayed by irrelevant alternatives. These findings highlight a significant inconsistency in LLM-generated preferential rankings, underscoring the need for further research to address these limitations. 
@@ -198,7 +198,7 @@ Retrieval-augmented generation in multilingual settings NadezhdaChirkovaNaver Labs Europe DavidRau - HervéDéjeanNaver Labs Europe + HervéDéjeanNaver Labs Europe ThibaultFormalNaver Labs Europe StéphaneClinchantNaver Labs Europe VassilinaNikoulinaNaver Labs Europe diff --git a/data/xml/2024.langmol.xml b/data/xml/2024.langmol.xml index 134f718778..e809c1f2a1 100644 --- a/data/xml/2024.langmol.xml +++ b/data/xml/2024.langmol.xml @@ -69,8 +69,8 @@ KamyarZeinalipour NedaJamshidi MonicaBianchiniUniversity of Siena - MarcoMagginiUniversity of Siena - MarcoGoriUniversity of Siena + MarcoMagginiUniversity of Siena + MarcoGoriUniversity of Siena 34-47 Pre-trained LLMs have demonstrated substantial capabilities across a range of conventional natural language processing (NLP) tasks, such as summarization and entity recognition. In this paper, we explore the application of LLMs in the generation of high-quality protein sequences. Specifically, we adopt a suite of pre-trained LLMs, including Mistral-7B, Llama-2-7B, Llama-3-8B, and gemma-7B, to produce valid protein sequences. All of these models are publicly available (https://github.com/KamyarZeinalipour/protein-design-LLMs).Unlike previous work in this field, our approach utilizes a relatively small dataset comprising 42,000 distinct human protein sequences. We retrain these models to process protein-related data, ensuring the generation of biologically feasible protein structures. Our findings demonstrate that even with limited data, the adapted models exhibit efficiency comparable to established protein-focused models such as ProGen varieties, ProtGPT2, and ProLLaMA, which were trained on millions of protein sequences. To validate and quantify the performance of our models, we conduct comparative analyses employing standard metrics such as pLDDT, RMSD, TM-score, and REU. 
Furthermore, we commit to making the trained versions of all four models publicly available, fostering greater transparency and collaboration in the field of computational biology. 2024.langmol-1.5 @@ -79,7 +79,7 @@ Enhanced <fixed-case>B</fixed-case>io<fixed-case>T</fixed-case>5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble - QizhiPeiRenmin University of China and Microsoft + QizhiPeiRenmin University of China and Microsoft LijunWuByteDance KaiyuanGao JinhuaZhu @@ -110,8 +110,8 @@ XiaozheWan LiweiLiu YamengLi - WenkaiXiangLingang laboratory - MingyueZhengShanghai Institute of Materia Medica + WenkaiXiangLingang laboratory + MingyueZhengShanghai Institute of Materia Medica 66-73 Large language models (LLMs) have made substantial strides, but their use in reliably tackling issues within specialized domains, particularly in interdisciplinary areas like pharmaceutical sciences, is hindered by data heterogeneity, knowledge complexity, unique objectives, and a spectrum of constraint conditions. In this area, diverse modalities such as nucleic acids, proteins, molecular structures, and natural language are often involved. We designed a specialized token set and introduced a new Mixture-of-Experts (MoEs) pre-training and fine-tuning strategy to unify these modalities in one model. With this strategy, we’ve created a multi-modal mixture-of-experts foundational model for pharmaceutical sciences, named SciMind. This model has undergone extensive pre-training on publicly accessible datasets including nucleic acid sequences, protein sequences, molecular structure strings, and biomedical texts, and delivers good performance on biomedical text comprehension, promoter prediction, protein function prediction, molecular description, and molecular generation. 
2024.langmol-1.8 @@ -122,7 +122,7 @@ Knowledge Graph Extraction from Total Synthesis Documents AndresM Bran ZlatkoJončev - PhilippeSchwallerSwiss Federal Institute of Technology Lausanne + PhilippeSchwallerSwiss Federal Institute of Technology Lausanne 74-84 Knowledge graphs (KGs) have emerged as a powerful tool for organizing and integrating complex information, making it a suitable format for scientific knowledge. However, translating scientific knowledge into KGs is challenging as a wide variety of styles and elements to present data and ideas is used. Although efforts for KG extraction (KGE) from scientific documents exist, evaluation remains challenging and field-dependent; and existing benchmarks do not focuse on scientific information. Furthermore, establishing a general benchmark for this task is challenging as not all scientific knowledge has a ground-truth KG representation, making any benchmark prone to ambiguity. Here we propose Graph of Organic Synthesis Benchmark (GOSyBench), a benchmark for KG extraction from scientific documents in chemistry, that leverages the native KG-like structure of synthetic routes in organic chemistry. We develop KG-extraction algorithms based on LLMs (GPT-4, Claude, Mistral) and VLMs (GPT-4o), the best of which reaches 73% recovery accuracy and 59% precision, leaving a lot of room for improvement. We expect GOSyBench can serve as a valuable resource for evaluating and advancing KGE methods in the scientific domain, ultimately facilitating better organization, integration, and discovery of scientific knowledge. 
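2024.langmol-1.9 @@ -133,13 +133,13 @@ <fixed-case>NLP</fixed-case>eople at <i> <fixed-case>L</fixed-case>+<fixed-case>M</fixed-case>-24</i> Shared Task: An Ensembled Approach for Molecule Captioning from <fixed-case>SMILES</fixed-case> ShinnosukeTanakaInternational Business Machines - CarolMakInternational Business Machines + CarolMakInternational Business Machines FlaviuCipciganInternational Business Machines - JamesBarry + JamesBarry MohabElkarefInternational Business Machines MovinaMoses VishnudevKuruvanthodiInternational Business Machines - GeethMel + GeethMel 85-90 This paper presents our approach submitted to the Language + Molecules 2024 (L+M-24) Shared Task in the Molecular Captioning track. The task involves generating captions that describe the properties of molecules that are provided in SMILES format. We propose a method for the task that decomposes the challenge of generating captions from SMILES into a classification problem, where we first predict the molecule’s properties. The molecules whose properties can be predicted with high accuracy show high translation metric scores in the caption generation by LLMs, while others produce low scores. Then we use the predicted properties to select the captions generated by different types of LLMs, and use that prediction as the final output. Our submission achieved an overall increase score of 15.21 on the dev set and 12.30 on the evaluation set, based on translation metrics and property metrics from the baseline. 2024.langmol-1.10 @@ -149,7 +149,7 @@ Knowlab’s Submission to <fixed-case>L</fixed-case>+<fixed-case>M</fixed-case> Shared Task: All you need is continued pretraining of chemistry texts even for molecule captioning YunsooKim - HonghanWuUniversity College London, University of London + HonghanWuUniversity College London, University of London 91-96 This paper presents our submission to the L+M-24 shared task, focused on translating molecular structures into natural language descriptions, known as the molecule captioning task. We selected a small language model (SLM), Phi-3-mini-4k, to evaluate the impact of continued pretraining and instruction tuning for domain-specific chemical knowledge. The Phi-3 model was continued pretrained with 90M chemistry textbooks and abstracts, followed by instruction tuning on 150K question answering sets of SMILES and general chemistry knowledge. Despite the continued pretraining phase not including direct exposure to SMILES representations, it significantly enhanced the Phi-3 model’s performance, a 300% increase for the BLEU scores, in the molecule captioning task. The code and model are released at https://github.com/bluesky333/Phi3KnowChem to facilitate research in chemical small language modeling. 2024.langmol-1.11 @@ -159,9 +159,9 @@ <fixed-case>M</fixed-case>ol2<fixed-case>L</fixed-case>ang-<fixed-case>VLM</fixed-case>: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion DuongTran - Nhat TruongPhamSungkyunkwan University - NguyenNguyen - BalachandranManavalanSung Kyun Kwan University + Nhat TruongPhamSungkyunkwan University + NguyenNguyen + BalachandranManavalanSung Kyun Kwan University 97-102 This paper introduces Mol2Lang-VLM, an enhanced method for refining generative pre-trained language models for molecule captioning using multimodal features to achieve more accurate caption generation. 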
2024.langmol-1.10 @@ -149,7 +149,7 @@ Knowlab’s Submission to <fixed-case>L</fixed-case>+<fixed-case>M</fixed-case> Shared Task: All you need is continued pretraining of chemistry texts even for molecule captioning YunsooKim - HonghanWuUniversity College London, University of London + HonghanWuUniversity College London, University of London 91-96 This paper presents our submission to the L+M-24 shared task, focused on translating molecular structures into natural language descriptions, known as the molecule captioning task. We selected a small language model (SLM), Phi-3-mini-4k, to evaluate the impact of continued pretraining and instruction tuning for domain-specific chemical knowledge. The Phi-3 model was continued pretrained with 90M chemistry textbooks and abstracts, followed by instruction tuning on 150K question answering sets of SMILES and general chemistry knowledge. Despite the continued pretraining phase not including direct exposure to SMILES representations, it significantly enhanced the Phi-3 model’s performance, a 300% increase for the BLEU scores, in the molecule captioning task. The code and model are released at https://github.com/bluesky333/Phi3KnowChem to facilitate research in chemical small language modeling. 2024.langmol-1.11 @@ -159,9 +159,9 @@ <fixed-case>M</fixed-case>ol2<fixed-case>L</fixed-case>ang-<fixed-case>VLM</fixed-case>: Vision- and Text-Guided Generative Pre-trained Language Models for Advancing Molecule Captioning through Multimodal Fusion DuongTran - Nhat TruongPhamSungkyunkwan University - NguyenNguyen - BalachandranManavalanSung Kyun Kwan University + Nhat TruongPhamSungkyunkwan University + NguyenNguyen + BalachandranManavalanSung Kyun Kwan University 97-102 This paper introduces Mol2Lang-VLM, an enhanced method for refining generative pre-trained language models for molecule captioning using multimodal features to achieve more accurate caption generation. 
Our approach leverages the encoder and decoder blocks of the Transformer-based architecture by introducing third sub-layers into both. Specifically, we insert sub-layers in the encoder to fuse features from SELFIES strings and molecular images, while the decoder fuses features from SMILES strings and their corresponding descriptions. Moreover, cross multi-head attention is employed instead of common multi-head attention to enable the decoder to attend to the encoder’s output, thereby integrating the encoded contextual information for better and more accurate caption generation. Performance evaluation on the CheBI-20 and L+M-24 benchmark datasets demonstrates Mol2Lang-VLM’s superiority, achieving higher accuracy and quality in caption generation compared to existing methods. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/mol2lang/. 2024.langmol-1.12 @@ -171,7 +171,7 @@ <fixed-case>DNA</fixed-case> Language Model and Interpretable Graph Neural Network Identify Genes and Pathways Involved in Rare Diseases AliSaadatEPFL - JacquesFellayEPFL + JacquesFellayEPFL 103-115 Identification of causal genes and pathways is a critical step for understanding the genetic underpinnings of rare diseases. We propose novel approaches to gene prioritization and pathway identification using DNA language model, graph neural networks, and genetic algorithm. Using HyenaDNA, a long-range genomic foundation model, we generated dynamic gene embeddings that reflect changes caused by deleterious variants. These gene embeddings were then utilized to identify candidate genes and pathways. We validated our method on a cohort of rare disease patients with partially known genetic diagnosis, demonstrating the re-identification of known causal genes and pathways and the detection of novel candidates. 
These findings have implications for the prevention and treatment of rare diseases by enabling targeted identification of new drug targets and therapeutic pathways. 2024.langmol-1.13 @@ -180,8 +180,8 @@ Repurformer: Transformers for Repurposing-Aware Molecule Generation - ChanghunLee - GyuminLeeKorea University + ChanghunLee + GyuminLeeKorea University 116-127 Generating as diverse molecules as possible with desired properties is crucial for drug discovery research, which invokes many approaches based on deep generative models today. Despite recent advancements in these models, particularly in variational autoencoders (VAEs), generative adversarial networks (GANs), Transformers, and diffusion models, a significant challenge known as the sample bias problem remains. This problem occurs when generated molecules targeting the same protein tend to be structurally similar, reducing the diversity of generation. To address this, we propose leveraging multi-hop relationships among proteins and compounds. Our model, Repurformer, integrates bi-directional pretraining with Fast Fourier Transform (FFT) and low-pass filtering (LPF) to capture complex interactions and generate diverse molecules. A series of experiments on BindingDB dataset confirm that Repurformer successfully creates substitutes for anchor compounds that resemble positive compounds, increasing diversity between the anchor and generated compounds. 
2024.langmol-1.14 @@ -190,10 +190,10 @@ <fixed-case>L</fixed-case>ang2<fixed-case>M</fixed-case>ol-Diff: A Diffusion-Based Generative Model for Language-to-Molecule Translation Leveraging <fixed-case>SELFIES</fixed-case> Representation - NguyenNguyen - Nhat TruongPhamSungkyunkwan University + NguyenNguyen + Nhat TruongPhamSungkyunkwan University DuongTran - BalachandranManavalanSung Kyun Kwan University + BalachandranManavalanSung Kyun Kwan University 128-134 Generating de novo molecules from textual descriptions is challenging due to potential issues with molecule validity in SMILES representation and limitations of autoregressive models. This work introduces Lang2Mol-Diff, a diffusion-based language-to-molecule generative model using the SELFIES representation. Specifically, Lang2Mol-Diff leverages the strengths of two state-of-the-art molecular generative models: BioT5 and TGM-DLM. By employing BioT5 to tokenize the SELFIES representation, Lang2Mol-Diff addresses the validity issues associated with SMILES strings. Additionally, it incorporates a text diffusion mechanism from TGM-DLM to overcome the limitations of autoregressive models in this domain. To the best of our knowledge, this is the first study to leverage the diffusion mechanism for text-based de novo molecule generation using the SELFIES molecular string representation. Performance evaluation on the L+M-24 benchmark dataset shows that Lang2Mol-Diff outperforms all existing methods for molecule generation in terms of validity. Our code and pre-processed data are available at https://github.com/nhattruongpham/mol-lang-bridge/tree/lang2mol/. 
2024.langmol-1.15 diff --git a/data/xml/2024.loresmt.xml b/data/xml/2024.loresmt.xml index 951fc348aa..3c1131e76c 100644 --- a/data/xml/2024.loresmt.xml +++ b/data/xml/2024.loresmt.xml @@ -51,7 +51,7 @@ <fixed-case>K</fixed-case>pop<fixed-case>MT</fixed-case>: Translation Dataset with Terminology for Kpop Fandom JiWooKimSung Kyun Kwan University YunsuKimaiXplain, Inc. - JinYeongBakSungkyunkwan University + JinYeongBakSungkyunkwan University 37-43 While machines learn from existing corpora, humans have the unique capability to establish and accept new language systems. This makes human form unique language systems within social groups. Aligning with this, we focus on a gap remaining in addressing translation challenges within social groups, where in-group members utilize unique terminologies. We propose KpopMT dataset, which aims to fill this gap by enabling precise terminology translation, choosing Kpop fandom as an initiative for social groups given its global popularity. Expert translators provide 1k English translations for Korean posts and comments, each annotated with specific terminology within social groups’ language systems. We evaluate existing translation systems including GPT models on KpopMT to identify their failure cases. Results show overall low scores, underscoring the challenges of reflecting group-specific terminologies and styles in translation. We make KpopMT publicly available. 2024.loresmt-1.3 @@ -61,8 +61,8 @@ Challenges in <fixed-case>U</fixed-case>rdu Machine Translation AbdulBasitLahore University of Management Sciences - Abdul HameedAzeemiLahore University of Management Sciences - Agha AliRazaLahore University of Management Sciences + Abdul HameedAzeemiLahore University of Management Sciences + Agha AliRazaLahore University of Management Sciences 44-49 Recent advancements in Neural Machine Translation (NMT) systems have significantly improved model performance on various translation benchmarks. 
However, these systems still face numerous challenges when translating low-resource languages such as Urdu. In this work, we highlight the specific issues faced by machine translation systems when translating Urdu language. We first conduct a comprehensive evaluation of English to Urdu Machine Translation with four diverse models: GPT-3.5 (a large language model), opus-mt-en-ur (a bilingual translation model), NLLB (a model trained for translating 200 languages), and IndicTrans2 (a specialized model for translating low-resource Indic languages). The results demonstrate that IndicTrans2 significantly outperforms other models in Urdu Machine Translation. To understand the differences in the performance of these models, we analyze the Urdu word distribution in different training datasets and compare the training methodologies. Finally, we uncover the specific translation issues and provide suggestions for improvements in Urdu machine translation systems. 2024.loresmt-1.4 @@ -99,7 +99,7 @@ AniruddhaRoy PretamRay AyushMaheshwari - SudeshnaSarkarIndian Institute of Technology Kharagpur, Dhirubhai Ambani Institute Of Information and Communication Technology + SudeshnaSarkarIndian Institute of Technology Kharagpur, Dhirubhai Ambani Institute Of Information and Communication Technology PawanGoyalIIT Kharagpur 64-73 Neural Machine Translation (NMT) remains a formidable challenge, especially when dealing with low-resource languages. Pre-trained sequence-to-sequence (seq2seq) multi-lingual models, such as mBART-50, have demonstrated impressive performance in various low-resource NMT tasks. However, their pre-training has been confined to 50 languages, leaving out support for numerous low-resource languages, particularly those spoken in the Indian subcontinent. Expanding mBART-50’s language support requires complex pre-training, risking performance decline due to catastrophic forgetting. 
Considering these expanding challenges, this paper explores a framework that leverages the benefits of a pre-trained language model along with knowledge distillation in a seq2seq architecture to facilitate translation for low-resource languages, including those not covered by mBART-50. The proposed framework employs a multilingual encoder-based seq2seq model as the foundational architecture and subsequently uses complementary knowledge distillation techniques to mitigate the impact of imbalanced training. Our framework is evaluated on three low-resource Indic languages in four Indic-to-Indic directions, yielding significant BLEU-4 and chrF improvements over baselines. Further, we conduct human evaluation to confirm effectiveness of our approach. Our code is publicly available at https://github.com/raypretam/Two-step-low-res-NMT. @@ -111,7 +111,7 @@ Leveraging <fixed-case>M</fixed-case>andarin as a Pivot Language for Low-Resource Machine Translation between <fixed-case>C</fixed-case>antonese and <fixed-case>E</fixed-case>nglish King YiuSuenFano Labs RudolfChowFano Labs - Albert Y.S.LamUniversity of Hong Kong and Fano Labs + Albert Y.S.LamUniversity of Hong Kong and Fano Labs 74-84 Cantonese, the second most prevalent Chinese dialect after Mandarin, has been relatively overlooked in machine translation (MT) due to a scarcity of bilingual resources. In this paper, we propose to leverage Mandarin, a high-resource language, as a pivot language for translating between Cantonese and English. Our method utilizes transfer learning from pre-trained Bidirectional and Auto-Regressive Transformer (BART) models to initialize auxiliary source-pivot and pivot-target MT models. The parameters of the trained auxiliary models are then used to initialize the source-target model. Based on our experiments, our proposed method outperforms several baseline initialization strategies, naive pivot translation, and two commercial translation systems in both translation directions. 
2024.loresmt-1.8 @@ -147,9 +147,9 @@ Tokenisation in Machine Translation Does Matter: The impact of different tokenisation approaches for <fixed-case>M</fixed-case>altese KurtAbela - KurtMicallefUniversity of Malta - MarcTantiUniversity of Malta - ClaudiaBorgUniversity of Malta + KurtMicallefUniversity of Malta + MarcTantiUniversity of Malta + ClaudiaBorgUniversity of Malta 109-120 In Machine Translation, various tokenisers are used to segment inputs before training a model. Despite tokenisation being mostly considered a solved problem for languages such as English, it is still unclear as to how effective different tokenisers are for morphologically rich languages. This study aims to explore how different approaches to tokenising Maltese impact machine translation results on the English-Maltese language pair.We observed that the OPUS-100 dataset has tokenisation inconsistencies in Maltese. We empirically found that training models on the original OPUS-100 dataset led to the generation of sentences with these issues.We therefore release an updated version of the OPUS-100 parallel English-Maltese dataset, referred to as OPUS-100-Fix, fixing these inconsistencies in Maltese by using the MLRS tokeniser. We show that after fixing the inconsistencies in the dataset, results on the fixed test set increase by 2.49 BLEU points over models trained on the original OPUS-100. We also experiment with different tokenisers, including BPE and SentencePiece to find the ideal tokeniser and vocabulary size for our setup, which was shown to be BPE with a vocabulary size of 8,000. Finally, we train different models in both directions for the ENG-MLT language pair using OPUS-100-Fix by training models from scratch as well as fine-tuning other pre-trained models, namely mBART-50 and NLLB, where a finetuned NLLB model performed the best. 2024.loresmt-1.11 @@ -160,7 +160,7 @@ Machine Translation Through Cultural Texts: Can Verses and Prose Help Low-Resource Indigenous Models? 
AntoineCadotteUniversité du Québec à Montréal NathalieAndré - FatihaSadatuniversité du Quebec à Montréal + FatihaSadatuniversité du Quebec à Montréal 121-127 We propose the first MT models for Innu-Aimun, an Indigenous language in Eastern Canada, in an effort to provide assistance tools for translation and language learning. This project is carried out in collaboration with an Innu community school and involves the participation of students in Innu-Aimun translation, within the framework of a meaningful consideration of Indigenous perspectives.Our contributions in this paper result from the three initial stages of this project. First, we aim to align bilingual Innu-Aimun/French texts with collaboration and participation of Innu-Aimun locutors. Second, we present the training and evaluation results of the MT models (both statistical and neural) based on these aligned corpora. And third, we collaboratively analyze some of the translations resulting from the MT models.We also see these developments for Innu-Aimun as a useful case study for answering a larger question: in a context where few aligned bilingual sentences are available for an Indigenous language, can cultural texts such as literature and poetry be used in the development of MT models? 2024.loresmt-1.12 @@ -169,8 +169,8 @@ Rule-Based, Neural and <fixed-case>LLM</fixed-case> Back-Translation: Comparative Insights from a Variant of <fixed-case>L</fixed-case>adin - SamuelFrontullUniversität Innsbruck - GeorgMoserUniversität Innsbruck + SamuelFrontullUniversität Innsbruck + GeorgMoserUniversität Innsbruck 128-138 This paper explores the impact of different back-translation approaches on machine translation for Ladin, specifically the Val Badia variant. Given the limited amount of parallel data available for this language (only 18k Ladin-Italian sentence pairs), we investigate the performance of a multilingual neural machine translation model fine-tuned for Ladin-Italian. 
In addition to the available authentic data, we synthesise further translations by using three different models: a fine-tuned neural model, a rule-based system developed specifically for this language pair, and a large language model. Our experiments show that all approaches achieve comparable translation quality in this low-resource scenario, yet round-trip translations highlight differences in model performance. 2024.loresmt-1.13 @@ -204,9 +204,9 @@ Adopting Ensemble Learning for Cross-lingual Classification of Crisis-related Text On Social Media - ShareefaAl Amer + ShareefaAl Amer MarkLee - PhillipSmithUniversity of Birmingham + PhillipSmithUniversity of Birmingham 159-165 Cross-lingual classification poses a significant challenge in Natural Language Processing (NLP), especially when dealing with languages with scarce training data. This paper delves into the adaptation of ensemble learning to address this challenge, specifically for disaster-related social media texts. Initially, we employ Machine Translation to generate a parallel corpus in the target language to mitigate the issue of data scarcity and foster a robust training environment. Following this, we implement the bagging ensemble technique, integrating multiple classifiers into a cohesive model that demonstrates enhanced performance over individual classifiers. Our experimental results reveal significant improvements in adapting models for Arabic, utilising only English training data and markedly outperforming models intended for linguistically similar languages to English, with our ensemble model achieving an accuracy and F1 score of 0.78 when tested on original Arabic data. This research makes a substantial contribution to the field of cross-lingual classification, establishing a new benchmark for enhancing the effectiveness of language transfer in linguistically challenging scenarios. 
2024.loresmt-1.16 @@ -217,7 +217,7 @@ Finetuning End-to-End Models for <fixed-case>E</fixed-case>stonian Conversational Spoken Language Translation TiiaSildam AndraVelve - TanelAlumäeTallinn University of Technology + TanelAlumäeTallinn University of Technology 166-174 This paper investigates the finetuning of end-to-end models for bidirectional Estonian-English and Estonian-Russian conversational speech-to-text translation. Due to the limited availability of speech translation data for Estonian, we created additional training data by web scraping and synthesizing data from speech recognition datasets using machine translation. We evaluated three publicly available end-to-end models: Whisper, OWSM 3.1, and SeamlessM4T. Our results indicate that fine-tuning with synthetic data enhances translation accuracy by a large margin, with SeamlessM4T matching or surpassing cascaded speech translation systems that use state-of-the-art speech recognition and machine translation models. 2024.loresmt-1.17 @@ -226,11 +226,11 @@ Benchmarking Low-Resource Machine Translation Systems - AnaSilvaUniversität Paderborn + AnaSilvaUniversität Paderborn NikitSrivastavaUniversität Paderborn TatianaMoteu Ngoli MichaelRöderPaderborn University - DiegoMoussallem + DiegoMoussallem Axel-CyrilleNgonga NgomoUniversität Paderborn 175-185 Assessing the performance of machine translation systems is of critical value, especially to languages with lower resource availability.Due to the large evaluation effort required by the translation task, studies often compare new systems against single systems or commercial solutions. Consequently, determining the best-performing system for specific languages is often unclear. This work benchmarks publicly available translation systems across 4 datasets and 26 languages, including low-resource languages. 
We consider both effectiveness and efficiency in our evaluation. Our results are made public through BENG—a FAIR benchmarking platform for Natural Language Generation tasks. @@ -242,7 +242,7 @@ Rosetta Balcanica: Deriving a “Gold Standard” Neural Machine Translation (<fixed-case>NMT</fixed-case>) Parallel Dataset from High-Fidelity Resources for <fixed-case>W</fixed-case>estern <fixed-case>B</fixed-case>alkan Languages EdmonBegoliThe University of Tennessee and Oak Ridge National Laboratory MariaMahbubOak Ridge National Laboratory - SudarshanSrinivasanOak Ridge National Laboratory + SudarshanSrinivasanOak Ridge National Laboratory 186-192 The Rosetta Balcanica is an ongoing effort in resource expansion for low-resource Western Balkans languages. This effort focuses on discovering and using accurately translated, officially mapped, and curated parallel language resources and their preparation and use as neural machine translation (NMT) datasets. Some of the guiding principles, practices, and methods employed by Rosetta Balcanica are generalizable and could apply to other low-resource language resource expansion efforts. With this goal in mind, we present our rationale and approach to discovering and using meticulously translated and officially curated low-resource language resources and our use of these resources to develop a parallel “gold standard” translation training resource. Secondly, we describe our specific methodology for NMT dataset development from these resources and its publication to a widely-used and accessible repository for natural language processing (Hugging Face Hub). Finally, we discuss the trade-offs and limitations of our current approach, and the roadmap for future development and the expansion of the current Rosetta Balcanica language resource. 
2024.loresmt-1.19 diff --git a/data/xml/2024.nlp4convai.xml b/data/xml/2024.nlp4convai.xml index 794f6f860d..ce3c443e93 100644 --- a/data/xml/2024.nlp4convai.xml +++ b/data/xml/2024.nlp4convai.xml @@ -26,9 +26,9 @@ On the Benchmarking of <fixed-case>LLM</fixed-case>s for Open-Domain Dialogue Evaluation - JohnMendonçaInstituto Superior Técnico - AlonLaviePhrase and School of Computer Science, Carnegie Mellon University - IsabelTrancosoInstituto Superior Técnico + JohnMendonçaInstituto Superior Técnico + AlonLaviePhrase and School of Computer Science, Carnegie Mellon University + IsabelTrancosoInstituto Superior Técnico 1-12 Large Language Models (LLMs) have showcased remarkable capabilities in various Natural Language Processing tasks. For automatic open-domain dialogue evaluation in particular, LLMs have been seamlessly integrated into evaluation frameworks, and together with human evaluation, compose the backbone of most evaluations. However, existing evaluation benchmarks often rely on outdated datasets and evaluate aspects like Fluency and Relevance, which fail to adequately capture the capabilities and limitations of state-of-the-art chatbot models. This paper critically examines current evaluation benchmarks, highlighting that the use of older response generators and quality aspects fail to accurately reflect modern chatbot capabilities. A small annotation experiment on a recent LLM-generated dataset (SODA) reveals that LLM evaluators such as GPT-4 struggle to detect actual deficiencies in dialogues generated by current LLM chatbots. 
2024.nlp4convai-1.1 @@ -37,7 +37,7 @@ Exploring Description-Augmented Dataless Intent Classification RuoyuHu - FoaadKhosmoodCalifornia Polytechnic State University, San Luis Obispo + FoaadKhosmoodCalifornia Polytechnic State University, San Luis Obispo AbbasEdalatImperial College London 13-36 In this work, we introduce several schemes to leverage description-augmented embedding similarity for dataless intent classification using current state-of-the-art (SOTA) text embedding models. We report results of our methods on four commonly used intent classification datasets and compare against previous works of a similar nature. Our work shows promising results for dataless classification scaling to a large number of unseen intents. We show competitive results and significant improvements (+6.12% Avg.) over strong zero-shot baselines, all without training on labelled or task-specific data. Furthermore, we provide qualitative error analysis of the shortfalls of this methodology to help guide future research in this area. 
@@ -46,7 +46,7 @@ Revealing User Familiarity Bias in Task-Oriented Dialogue via Interactive Evaluation - TakyoungKimUniversity of Illinois at Urbana-Champaign + TakyoungKimUniversity of Illinois at Urbana-Champaign JaminShinNAVER Young-HoKimNAVER AI Lab SanghwanBaeNAVER Cloud @@ -61,7 +61,7 @@ AnkitaGupta ChulakaGunasekaraInternational Business Machines HuiWanIBM Research AI - JatinGanhotraInternational Business Machines + JatinGanhotraInternational Business Machines SachindraJoshi MarinaDanilevskyInternational Business Machines 56-72 @@ -73,10 +73,10 @@ Engineering Conversational Search Systems: A Review of Applications, Architectures, and Functional Components - PhillipSchneider + PhillipSchneider WesselPoelmanKU Leuven - MichaelRovatsosUniversity of Edinburgh - FlorianMatthesTechnische Universität München + MichaelRovatsosUniversity of Edinburgh + FlorianMatthesTechnische Universität München 73-88 Conversational search systems enable information retrieval via natural language interactions, with the goal of maximizing users’ information gain over multiple dialogue turns. The increasing prevalence of conversational interfaces adopting this search paradigm challenges traditional information retrieval approaches, stressing the importance of better understanding the engineering process of developing these systems. We undertook a systematic literature review to investigate the links between theoretical studies and technical implementations of conversational search systems. Our review identifies real-world application scenarios, system architectures, and functional components. We consolidate our results by presenting a layered architecture framework and explaining the core functions of conversational search systems. Furthermore, we reflect on our findings in light of the rapid progress in large language models, discussing their capabilities, limitations, and directions for future research. 
2024.nlp4convai-1.5 @@ -90,7 +90,7 @@ DongkyuLee JoongboShinLG AI Research HyunkyungBaeLG AI Research - JeesooBangLG AI Research + JeesooBangLG AI Research SeonghwanKimLG AI Research Stanley JungkyuChoiLanguage Lab, LG AI Research HonglakLeeUniversity of Michigan - Ann Arbor and LG AI Research @@ -103,7 +103,7 @@ Chamain: Harmonizing Character Persona Integrity with Domain-Adaptive Knowledge in Dialogue Generation Seung-MooYangSeoul National University of Science and Technology JeehyunLee - Won IkChoSamsung Advanced Institute of Technology + Won IkChoSamsung Advanced Institute of Technology 101-113 Recent advances in large language models (LLMs) have shown their capacity for generating natural dialogues, leveraging extensive pre-trained knowledge. However, the seamless integration of domain-specific knowledge into dialogue agents, without undermining their personas or unique textual style, remains a challenging task. Traditional approaches, such as constructing knowledge-aware character dialogue datasets or training LLMs from the ground up, require considerable resources. Sequentially fine-tuning character chatbots across multiple datasets or applying existing merging techniques often leads to catastrophic forgetting, resulting in the loss of both knowledge and the character’s distinct persona. This compromises the model’s ability to consistently generate character-driven dialogues within a user-centric framework. In this context, we introduce a novel model merging method, Chamain, which effortlessly enhances the performance of character models, much like finding a “free lunch”. Chamain merges domain-specific knowledge into a character model by parameter-wise weight combination of instruction-tuned models and learns to reflect persona’s unique characteristics and style through Layer-wise merging. 
Our experiments demonstrate that Chamain effectively maintains style while also solving domain-specific problems to a certain extent compared to the baselines, even showing a higher style probability compared to the character model in legal QA. 2024.nlp4convai-1.7 @@ -112,9 +112,9 @@ Faithful Persona-based Conversational Dataset Generation with Large Language Models PegahJandaghi - XianghaiShengGoogle - XinyiBaiGoogle - JayPujaraUniversity of Southern California + XianghaiShengGoogle + XinyiBaiGoogle + JayPujaraUniversity of Southern California HakimSidahmed 114-139 High-quality conversational datasets are essential for developing AI models that can communicate with users. One way to foster deeper interactions between a chatbot and its user is through personas, aspects of the user’s character that provide insights into their personality, motivations, and behaviors. Training Natural Language Processing (NLP) models on a diverse and comprehensive persona-based dataset can lead to conversational models that create a deeper connection with the user, and maintain their engagement. In this paper, we leverage the power of Large Language Models (LLMs) to create a large, high-quality conversational dataset from a seed dataset. We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations. The Generator is an LLM prompted to output conversations. The Critic consists of a mixture of expert LLMs that control the quality of the generated conversations. These experts select the best generated conversations, which we then use to improve the Generator. We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat. 
We evaluate the quality of Synthetic-Persona-Chat and our generation framework on different dimensions through extensive experiments, and observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat during an AI detection test decreases from 17.2% to 8.8% over three iterations. diff --git a/data/xml/2024.nlrse.xml b/data/xml/2024.nlrse.xml index 1bd64bbc7d..88c5df6e37 100644 --- a/data/xml/2024.nlrse.xml +++ b/data/xml/2024.nlrse.xml @@ -61,7 +61,7 @@ Applying <fixed-case>RLAIF</fixed-case> for Code Generation with <fixed-case>API</fixed-case>-usage in Lightweight <fixed-case>LLM</fixed-case>s SujanDuttaRochester Institute of Technology SayantanMahinderApple - RavitejaAnanthaApple + RavitejaAnanthaApple BortikBandyopadhyayApple 39-45 Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model towards better alignment from smaller LLMs. We run our experiments on the Gorilla dataset and meticulously assess the quality of the model-generated code across various metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate accurately. Our approach significantly enhances the fine-tuned LLM baseline’s performance, achieving a 4.5% improvement in executability rate. 
Notably, a smaller LLM model (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate. @@ -71,7 +71,7 @@ <fixed-case>S</fixed-case>umm<fixed-case>EQ</fixed-case>u<fixed-case>AL</fixed-case>: Summarization Evaluation via Question Answering using Large Language Models JunyuanLiu - ZhengyanShi + ZhengyanShi AldoLipaniUniversity College London, University of London 46-55 Summarization is hard to evaluate due to its diverse and abstract nature. Although N-gram-based metrics like BLEU and ROUGE are prevalent, they often do not align well with human evaluations. While model-based alternatives such as BERTScore improve, they typically require extensive labelled data. The advent of Large Language Models (LLMs) presents a promising avenue for evaluation. To this end, we introduce SummEQuAL, a novel content-based framework using LLMs for unified, reproducible summarization evaluation. SummEQuAL evaluates summaries by comparing their content with the source document, employing a question-answering approach to gauge both recall and precision. To validate SummEQuAL’s effectiveness, we develop a dataset based on MultiWOZ. We conduct experiments on SummEval and our MultiWOZ-based dataset, showing that SummEQuAL largely improves the quality of summarization evaluation. Notably, SummEQuAL demonstrates a 19.7% improvement over QuestEval in terms of sample-level Pearson correlation with human assessments of consistency on the SummEval dataset. Furthermore, it exceeds the performance of the BERTScore baseline by achieving a 17.3% increase in Spearman correlation on our MultiWOZ-based dataset. Our study illuminates the potential of LLMs for a unified evaluation framework, setting a new paradigm for future summarization evaluation. 
@@ -81,7 +81,7 @@ <fixed-case>LOGIC</fixed-case>-<fixed-case>LM</fixed-case>++: Multi-Step Refinement for Symbolic Formulations ShashankKirtaniaMicrosoft - PriyanshuGuptaMicrosoft + PriyanshuGuptaMicrosoft ArjunRadhakrishnaMicrosoft 56-63 In this paper we examine the limitations of Large Language Models (LLMs) for complex reasoning tasks. While current approaches leverage formal languages as intermediate representation for these reasoning problems, they still struggle with generating intermediate for-mal specifications with great correctness and in refining these representations. To address these issues, this paper proposes Logic-LM++, an improvement on Logic-LM (Pan et al., 2023). It uses the ability of LLMs to do pairwise comparisons, allowing the evaluation of the refinements suggested by the LLM. The paper demonstrates that Logic-LM++ outperforms Logic-LM and LLM based techniques on natural language reasoning tasks on two datasets, FOLIO, ProofWriter and AR-LSAT. Logic-LM++ show an average improvement of 18.5% on standard prompting, 12.3% on chain of thought prompting and 5% on Logic-LM. 
diff --git a/data/xml/2024.privatenlp.xml b/data/xml/2024.privatenlp.xml index bfc9bb2b28..89de2ca40b 100644 --- a/data/xml/2024.privatenlp.xml +++ b/data/xml/2024.privatenlp.xml @@ -25,7 +25,7 @@ Noisy Neighbors: Efficient membership inference attacks against <fixed-case>LLM</fixed-case>s - FilippoGalliScuola Normale Superiore + FilippoGalliScuola Normale Superiore LucaMelisMeta TommasoCucinottaScuola Superiore Sant’Anna Pisa 1-6 @@ -36,7 +36,7 @@ Don’t forget private retrieval: distributed private similarity search for large language models GuyZyskind - TobinSouth + TobinSouth AlexPentlandMassachusetts Institute of Technology 7-19 While the flexible capabilities of large language models (LLMs) allow them to answer a range of queries based on existing learned knowledge, information retrieval to augment generation is an important tool to allow LLMs to answer questions on information not included in pre-training data. Such private information is increasingly being generated in a wide array of distributed contexts by organizations and individuals. Performing such information retrieval using neural embeddings of queries and documents always leaked information about queries and database content unless both were stored locally. We present Private Retrieval Augmented Generation (PRAG), an approach that uses multi-party computation (MPC) to securely transmit queries to a distributed set of servers containing a privately constructed database to return top-k and approximate top-k documents. This is a first-of-its-kind approach to dense information retrieval that ensures no server observes a client’s query or can see the database content. The approach introduces a novel MPC friendly protocol for inverted file approximate search (IVF) that allows for fast document search over distributed and private data in sublinear communication complexity. 
This work presents new avenues through which data for use in LLMs can be accessed and used without needing to centralize or forgo privacy. @@ -57,7 +57,7 @@ Protecting Privacy in Classifiers by Token Manipulation Re’emHarel YairElboherBen-Gurion University of the Negev - YuvalPinterBen-Gurion University of the Negev + YuvalPinterBen-Gurion University of the Negev 29-38 Using language models as a remote service entails sending private information to an untrusted provider. In addition, potential eavesdroppers can intercept the messages, thereby exposing the information. In this work, we explore the prospects of avoiding such data exposure at the level of text manipulation. We focus on text classification models, examining various token mapping and contextualized manipulation functions in order to see whether classifier accuracy may be maintained while keeping the original text unrecoverable. We find that although some token mapping functions are easy and straightforward to implement, they heavily influence performance on the downstream task, and via a sophisticated attacker can be reconstructed. In comparison, the contextualized manipulation provides an improvement in performance. 2024.privatenlp-1.4 @@ -65,9 +65,9 @@ A Collocation-based Method for Addressing Challenges in Word-level Metric Differential Privacy - StephenMeisenbacher + StephenMeisenbacher MaulikChevliTechnische Universität München - FlorianMatthesTechnische Universität München + FlorianMatthesTechnische Universität München 39-51 Applications of Differential Privacy (DP) in NLP must distinguish between the syntactic level on which a proposed mechanism operates, often taking the form of *word-level* or *document-level* privatization. Recently, several word-level *Metric* Differential Privacy approaches have been proposed, which rely on this generalized DP notion for operating in word embedding spaces. 
These approaches, however, often fail to produce semantically coherent textual outputs, and their application at the sentence- or document-level is only possible by a basic composition of word perturbations. In this work, we strive to address these challenges by operating *between* the word and sentence levels, namely with *collocations*. By perturbing n-grams rather than single words, we devise a method where composed privatized outputs have higher semantic coherence and variable length. This is accomplished by constructing an embedding model based on frequently occurring word groups, in which unigram words co-exist with bi- and trigram collocations. We evaluate our method in utility and privacy tests, which make a clear case for tokenization strategies beyond the word level. 2024.privatenlp-1.5 @@ -102,13 +102,13 @@ Unlocking the Potential of Large Language Models for Clinical Text Anonymization: A Comparative Study DavidPissarra - IsabelCuriosoFraunhofer Portugal AICOS + IsabelCuriosoFraunhofer Portugal AICOS JoãoAlveiraFraunhofer Portugal DuartePereiraNA - BrunoRibeiroFraunhofer Portugal AICOS + BrunoRibeiroFraunhofer Portugal AICOS TomásSouperUniversity of Notre Dame VascoGomesFraunhofer AICOS - AndréCarreiroFraunhofer AICOS + AndréCarreiroFraunhofer AICOS VitorRollaFraunhofer Portugal 74-84 Automated clinical text anonymization has the potential to unlock the widespread sharing of textual health data for secondary usage while assuring patient privacy. Despite the proposal of many complex and theoretically successful anonymization solutions in literature, these techniques remain flawed. As such, clinical institutions are still reluctant to apply them for open access to their data. Recent advances in developing Large Language Models (LLMs) pose a promising opportunity to further the field, given their capability to perform various tasks. This paper proposes six new evaluation metrics tailored to the challenges of generative anonymization with LLMs. 
Moreover, we present a comparative study of LLM-based methods, testing them against two baseline techniques. Our results establish LLM-based models as a reliable alternative to common approaches, paving the way toward trustworthy anonymization of clinical text. @@ -122,9 +122,9 @@ JoãoAlveiraFraunhofer Portugal DavidPissarra DuartePereiraNA - IsabelCuriosoFraunhofer Portugal AICOS - AndréCarreiroFraunhofer AICOS - HenriqueLopes CardosoUniversidade do Porto + IsabelCuriosoFraunhofer Portugal AICOS + AndréCarreiroFraunhofer AICOS + HenriqueLopes CardosoUniversidade do Porto 85-90 Anonymization of clinical text is crucial to allow the sharing and disclosure of health records while safeguarding patient privacy. However, automated anonymization processes are still highly limited in healthcare practice, as these systems cannot assure the anonymization of all private information. This paper explores the application of a novel technique that guarantees the removal of all sensitive information through the usage of text embeddings obtained from a de-identified dataset, replacing every word or sentence of a clinical note. We analyze the performance of different embedding techniques and models by evaluating them using recently proposed evaluation metrics. The results demonstrate that sentence replacement is better at keeping relevant medical information untouched, while the word replacement strategy performs better in terms of anonymization sensitivity. 2024.privatenlp-1.9 @@ -133,7 +133,7 @@ <fixed-case>P</fixed-case>ocket<fixed-case>LLM</fixed-case>: Enabling On-Device Fine-Tuning for Personalized <fixed-case>LLM</fixed-case>s DanPeng - ZhihuiFu + ZhihuiFu JunWangOPPO Research Institute 91-96 Recent advancements in large language models (LLMs) have indeed showcased their impressive capabilities. 
On mobile devices, the wealth of valuable, non-public data generated daily holds great promise for locally fine-tuning personalized LLMs, while maintaining privacy through on-device processing. However, the constraints of mobile device resources pose challenges to direct on-device LLM fine-tuning, mainly due to the memory-intensive nature of derivative-based optimization required for saving gradients and optimizer states. To tackle this, we propose employing derivative-free optimization techniques to enable on-device fine-tuning of LLM, even on memory-limited mobile devices. Empirical results demonstrate that the RoBERTa-large model and OPT-1.3B can be fine-tuned locally on the OPPO Reno 6 smartphone using around 4GB and 6.5GB of memory respectively, using derivative-free optimization techniques. This highlights the feasibility of on-device LLM fine-tuning on mobile devices, paving the way for personalized LLMs on resource-constrained devices while safeguarding data privacy. @@ -142,10 +142,10 @@ Smart Lexical Search for Label Flipping Adversial Attack - AlbertoGutiérrez-Megías - Salud MaríaJiménez-ZafraUniversidad de Jaén - L. AlfonsoUreñaUniversidad de Jaén - EugenioMartínez-CámaraUniversidad de Jaén + AlbertoGutiérrez-Megías + Salud MaríaJiménez-ZafraUniversidad de Jaén + L. AlfonsoUreñaUniversidad de Jaén + EugenioMartínez-CámaraUniversidad de Jaén 97-106 Language models are susceptible to vulnerability through adversarial attacks, using manipulations of the input data to disrupt their performance. Accordingly, it represents a cibersecurity leak. Data manipulations are intended to be unidentifiable by the learning model and by humans, small changes can disturb the final label of a classification task. Hence, we propose a novel attack built upon explainability methods to identify the salient lexical units to alter in order to flip the classification label. 
We asses our proposal on a disinformation dataset, and we show that our attack reaches high balance among stealthiness and efficiency. 2024.privatenlp-1.11 @@ -176,7 +176,7 @@ Improving Authorship Privacy: Adaptive Obfuscation with the Dynamic Selection of Techniques - HemanthKandulaRaytheon BBN + HemanthKandulaRaytheon BBN DamianosKarakos HaolingQiuRaytheon BBN Technologies Corp. BrianUlicny @@ -219,8 +219,8 @@ A Privacy-preserving Approach to Ingest Knowledge from Proprietary Web-based to Locally Run Models for Medical Progress Note Generation - SarveshSoniNational Institutes of Health - DinaDemner-FushmanNational Library of Medicine + SarveshSoniNational Institutes of Health + DinaDemner-FushmanNational Library of Medicine 178-183 Clinical documentation is correlated with increasing clinician burden, leading to the rise of automated methods to generate medical notes. Due to the sensitive nature of patient electronic health records (EHRs), locally run models are preferred for a variety of reasons including privacy, bias, and cost. However, most open-source locally run models (including medical-specific) are much smaller with limited input context size compared to the more powerful closed-source large language models (LLMs) generally available through web APIs (Application Programming Interfaces). In this paper, we propose a framework to harness superior reasoning capabilities and medical knowledge from closed-source online LLMs in a privacy-preserving manner and seamlessly incorporate it into locally run models. Specifically, we leverage a web-based model to distill the vast patient information available in EHRs into a clinically relevant subset without sending sensitive patient health information online and use this distilled knowledge to generate progress notes by a locally run model. 
Our ablation results indicate that the proposed framework improves the performance of the Mixtral model on progress note generation by 4.6 points on ROUGE (a text-matching based metric) and 7.56 points on MEDCON F1 (a metric that measures the clinical concepts overlap). 2024.privatenlp-1.18 diff --git a/data/xml/2024.sdp.xml b/data/xml/2024.sdp.xml index dac746c416..3ec3e33a61 100644 --- a/data/xml/2024.sdp.xml +++ b/data/xml/2024.sdp.xml @@ -27,13 +27,13 @@ Overview of the Fourth Workshop on Scholarly Document Processing TirthankarGhosalOak Ridge National Laboratory AmanpreetSinghAllen Institute for Artificial Intelligence - AnitaDe Waard - PhilippMayr + AnitaDe Waard + PhilippMayr AakankshaNaikAllen Institute for Artificial Intelligence and National Institutes of Health OrionWeller - YoonjooLeeKorea Advanced Institute of Science & Technology + YoonjooLeeKorea Advanced Institute of Science & Technology ZejiangShenMassachusetts Institute of Technology - YanxiaQinNational University of Singapore + YanxiaQinNational University of Singapore 1-6 The workshop on Scholarly Document Processing (SDP) started in 2020 to accelerate research, inform policy and educate the public on natural language processing for scientific text. The fourth iteration of the workshop, SDP24 was held at the 62nd Annual Meeting of the Association for Computational Linguistics (ACL24) as a hybrid event. The SDP workshop saw a great increase in interest, with 57 submissions, of which 28 were accepted. The program consisted of a research track, four invited talks and two shared tasks: 1) DAGPap24: Detecting automatically generated scientific papers and 2) Context24: Multimodal Evidence and Grounding Context Identification for Scientific Claims. The program was geared towards NLP, information extraction, information retrieval, and data mining for scholarly documents, with an emphasis on identifying and providing solutions to open challenges. 
2024.sdp-1.1 @@ -43,7 +43,7 @@ Overview of the <fixed-case>D</fixed-case>ag<fixed-case>P</fixed-case>ap24 Shared Task on Detecting Automatically Generated Scientific Paper SavvasChamezopoulosElsevier, USA DrahomiraHerrmannovaElsevier, USA - AnitaDe Waard + AnitaDe Waard DrahomiraHerrmannovaElsevier, USA DomenicRosatiDalhousie University, Canada YuryKashnitskyElsevier, USA @@ -69,7 +69,7 @@ Controllable Citation Sentence Generation with Language Models NianlongGuUniversity of Zurich - RichardHahnloserETHZ - ETH Zurich + RichardHahnloserETHZ - ETH Zurich 22-37 Citation generation aims to generate a citation sentence that refers to a chosen paper in the context of a manuscript. However, a rigid citation generation process is at odds with an author’s desire to control specific attributes, such as 1) the citation intent, e.g., either introducing background information or comparing results, and 2) keywords that should appear in the citation text. To provide these degrees of controllability during citation generation, we propose to integrate the manuscript context, the context of the referenced paper, and the desired control attributes into a structured template and use it to fine-tune a language model (LM) via next-token prediction. We then utilize Proximal Policy Optimization to directly optimize the LM in favor of a high score of our proposed controllability metric. The proposed workflow harmoniously combines citation attribute suggestion and conditional citation generation into one LM, allowing for better user control. 
2024.sdp-1.4 @@ -79,7 +79,7 @@ Toward Structured Related Work Generation with Novelty Statements KazuyaNishimura KuniakiSaitoBoston University - ToshoHirasawaOmron Sinic X + ToshoHirasawaOmron Sinic X YoshitakaUshikuRidge-i, OMRON SINIC X and National Institute of Advanced Industrial Science and Technology 38-57 To help readers understand the novelty and the research context, an excellent related work section is structured (i.e., the section consists of paragraphs determined by categorizing papers into several topics) and includes descriptions of novelty. However, previous studies viewed related work generation as multi-document summarization, and the structure and novelty statement are ignored in such studies. In this paper, we redefine the related work generation task as summarization with structure (i.e., multiple paragraphs with citation) and novelty statement. For this task, we propose a quality-oriented dataset and evaluation metrics. Experiments evaluated the state-of-the-art language models on our tasks, and we confirmed the issues with the current models and the validity of the evaluation indicators. @@ -90,8 +90,8 @@ Understanding Survey Paper Taxonomy about Large Language Models via Graph Representation Learning - JunZhuangBoise State University and Indiana University Purdue University Indianapolis - CaseyKenningtonBoise State University + JunZhuangBoise State University and Indiana University Purdue University Indianapolis + CaseyKenningtonBoise State University 58-69 As new research on Large Language Models (LLMs) continues, it is difficult to keep up with new research and models. To help researchers synthesize the new research many have written survey papers, but even those have become numerous. In this paper, we develop a method to automatically assign survey papers to a taxonomy. We collect the metadata of 144 LLM survey papers and explore three paradigms to classify papers within the taxonomy. 
Our work indicates that leveraging graph structure information on co-category graphs can significantly outperform the language models in two paradigms; pre-trained language models’ fine-tuning and zero-shot/few-shot classifications using LLMs. We find that our model surpasses an average human recognition level and that fine-tuning LLMs using weak labels generated by a smaller model, such as the GCN in this study, can be more effective than using ground-truth labels, revealing the potential of weak-to-strong generalization in the taxonomy classification task. 2024.sdp-1.6 @@ -99,7 +99,7 @@ Beyond Retrieval: Topic-based Alignment of Scientific Papers to Research Proposal - Rudra NathPalit + Rudra NathPalit ManasiPatwardhan LovekeshVig GautamShroff @@ -121,7 +121,7 @@ Cited Text Spans for Scientific Citation Text Generation - XiangciLi + XiangciLi Yi-HuiLee JessicaOuyangUniversity of Texas at Dallas 90-104 @@ -131,10 +131,10 @@ <fixed-case>C</fixed-case>ite<fixed-case>A</fixed-case>ssist: A System for Automated Preprint Citation and <fixed-case>B</fixed-case>ib<fixed-case>T</fixed-case>e<fixed-case>X</fixed-case> Generation - LarsKaesberg - TerryRuasGeorg-August Universität Göttingen - Jan PhilipWahleUniversity of Göttingen, Germany - BelaGippGeorg-August Universität Göttingen + LarsKaesberg + TerryRuasGeorg-August Universität Göttingen + Jan PhilipWahleUniversity of Göttingen, Germany + BelaGippGeorg-August Universität Göttingen 105-119 We present CiteAssist, a system to automate the generation of BibTeX entries for preprints, streamlining the process of bibliographic annotation. Our system extracts metadata, such as author names, titles, publication dates, and keywords, to create standardized annotations within the document. CiteAssist automatically attaches the BibTeX citation to the end of a PDF and links it on the first page of the document so other researchers gain immediate access to the correct citation of the article. 
This method promotes platform flexibility by ensuring that annotations remain accessible regardless of the repository used to publish or access the preprint. The annotations remain available even if the preprint is viewed externally to CiteAssist. Additionally, the system adds relevant related papers based on extracted keywords to the preprint, providing researchers with additional publications besides those in related work for further reading. Researchers can enhance their preprints organization and reference management workflows through a free and publicly available web interface. 2024.sdp-1.10 @@ -142,10 +142,10 @@ An end-to-end entity recognition and disambiguation framework for identifying Author Affiliation from literature publications - LianghongLin + LianghongLin Wenxixie-c@my.cityu.edu.hkWenxixie-c@my.cityu.edu.hkNA Spczili@speed-polyu.edu.hkSpczili@speed-polyu.edu.hkNA - TianyongHao + TianyongHao 120-129 Author affiliation information plays a key role in bibliometric analyses and is essential for evaluating studies. However, as author affiliation information has not been standardized, which leads to difficulties such as synonym ambiguity and incomplete data during automated processing. To address the challenge, this paper proposes an end-to-end entity recognition and disambiguation framework for identifying author affiliation from literature publications. For entity disambiguation, an algorithm combining word embedding and spatial embedding is presented considering that author affiliation texts often contain rich geographic information. The disambiguation algorithm utilizes the semantic information and geographic information, which effectively enhances entity recognition and disambiguation effect. In addition, the proposed framework facilitates the effective utilization of the extensive literature in the PubMed database for comprehensive bibliometric analysis. The experimental results verify the robustness and effectiveness of the algorithm. 
2024.sdp-1.11 @@ -165,9 +165,9 @@ <fixed-case>A</fixed-case>ffil<fixed-case>G</fixed-case>ood: Building reliable institution name disambiguation tools to improve scientific literature analysis - NicolauDuran-SilvaUniversitat Pompeu Fabra + NicolauDuran-SilvaUniversitat Pompeu Fabra PabloAccuostoUniversitat Pompeu Fabra - PiotrPrzybyła + PiotrPrzybyła HoracioSaggionUniversitat Pompeu Fabra and Universitat Pompeu Fabra 135-144 The accurate attribution of scientific works to research organizations is hindered by the lack of openly available manually annotated data–in particular when multilingual and complex affiliation strings are considered. The AffilGood framework introduced in this paper addresses this gap. We identify three sub-tasks relevant for institution name disambiguation and make available annotated datasets and tools aimed at each of them, including i) a dataset annotated with affiliation spans in noisy automatically-extracted strings; ii) a dataset annotated with named entities for the identification of organizations and their locations; iii) seven datasets annotated with the Research Organization Registry (ROR) identifiers for the evaluation of entity-linking systems. In addition, we describe, evaluate and make available newly developed tools that use these datasets to provide solutions for each of the identified sub-tasks. Our results confirm the value of the developed resources and methods in addressing key challenges in institution name disambiguation. @@ -177,8 +177,8 @@ Metadata Enhancement Using Large Language Models HyunjuSong - StevenBethardUniversity of Arizona - AndreaThomerUniversity of Arizona + StevenBethardUniversity of Arizona + AndreaThomerUniversity of Arizona 145-154 In the natural sciences, a common form of scholarly document is a physical sample record, which provides categorical and textual metadata for specimens collected and analyzed for scientific research. 
Physical sample archives like museums and repositories publish these records in data repositories to support reproducible science and enable the discovery of physical samples. However, the success of resource discovery in such interfaces depends on the completeness of the sample records. We investigate approaches for automatically completing the scientific metadata fields of sample records. We apply large language models in zero and few-shot settings and incorporate the hierarchical structure of the taxonomy. We show that a combination of record summarization, bottom-up taxonomy traversal, and few-shot prompting yield F1 as high as 0.928 on metadata completion in the Earth science domain. 2024.sdp-1.14 @@ -199,7 +199,7 @@ YacineBrihmoucheCriteo ThéoDelemazure AntoineGauquier - PierreSenellartEcole Normale Supérieure + PierreSenellartEcole Normale Supérieure 165-174 This paper explores the initial steps towards extracting information about theorems and proofs from scholarly documents to build a knowledge base of interlinked results. Specifically, we consider two main tasks: extracting results and their proofs from the PDFs of scientific articles and establishing which results are used in the proofs of others across the scientific literature. We discuss the problem statement, methodologies, and preliminary findings employed in both phases of our approach, highlighting the challenges faced. 
2024.sdp-1.16 @@ -207,7 +207,7 @@ AutoRef: Generating Refinements of Reviews Given Guidelines - SohamChitnis + SohamChitnis ManasiPatwardhanTata Consultancy Services Limited, India AshwinSrinivasan Tanmay TulsidasVerlekar @@ -237,7 +237,7 @@ HiroyukiShindoNara Institute of Science and Technology, Japan HirokiTeranishiNara Institute of Science and Technology, Japan and RIKEN HirokiOuchiNAIST - TaroWatanabeNara Institute of Science and Technology, Japan + TaroWatanabeNara Institute of Science and Technology, Japan 202-214 Tables in scientific papers contain crucial information, such as experimental results.Entity Linking (EL) is a promising technology that analyses tables and associates them with a knowledge base.EL for table cells requires identifying the referent concept of each cell while understanding the context relevant to each cell in the paper. However, extracting the relevant context from the paper is challenging because the relevant parts are scattered in the main text and captions.This study defines a rule-based method for extracting broad context from the main text, including table captions and sentences that mention the table.Furthermore, we propose synthetic context as a more refined context generated by large language models (LLMs).In a synthetic context, contexts from the entire paper are refined by summarizing, injecting supplemental knowledge, and clarifying the referent concept.We observe this approach improves accuracy for EL by more than 10 points on the S2abEL dataset, and our qualitative analysis suggests potential future works. 
2024.sdp-1.19 @@ -286,10 +286,10 @@ An Analysis of Tasks and Datasets in Peer Reviewing - MoritzStaudingerTechnische Universität Wien + MoritzStaudingerTechnische Universität Wien WojciechKusaAllegro - FlorinaPiroiTechnische Universität Wien - AllanHanburyComplexity Science Hub and Technische Universität Wien + FlorinaPiroiTechnische Universität Wien + AllanHanburyComplexity Science Hub and Technische Universität Wien 257-268 Taking note of the current challenges of the peer review system, this paper inventories the research tasks for analysing and possibly automating parts of the reviewing process, like matching submissions with a reviewer’s domain of expertise. For each of these tasks we list their associated datasets, analysing their quality in terms of available documentation of creation and use. Building up on this, we give a set of recommendations to take into account when collecting and releasing data. 2024.sdp-1.24 @@ -299,7 +299,7 @@ Zero-shot Scientific Claim Verification Using <fixed-case>LLM</fixed-case>s and Citation Text CarlosAlvarez MaxwellBennett - LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence + LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence 269-276 Due to rapidly changing and advancing science, it is important to check the veracity of scientific claims and whether they are supported by research evidence. Previous versions of this task depended on supervised training, where labeled datasets were constructed through manual claim writing and evidence identification, sometimes coupled with mining citation relationships in papers. In this work, we investigate whether zero-shot scientific claim verification could be enabled using large language models (LLMs) and distant supervision examples taken directly from citation texts. 
We derive an in-context learning (ICL) dataset, SCitance, consisting of citation sentences (“citances”), LLM-generated negations, evidence documents, and veracity labels, and find that prompting GPT-4 with ICL examples from this dataset yields comparable performance (within 1 point F1) to previous finetuned models trained on manually curated claim-evidence pairs. Our results suggest that prompting LLMs with citance-evidence pairs directly poses a viable alternative to finetuning scientific claim verification models with manually-curated data. 2024.sdp-1.25 @@ -317,7 +317,7 @@ <fixed-case>C</fixed-case>o<fixed-case>SAE</fixed-case>mb: Contrastive Section-aware Aspect Embeddings for Scientific Articles ShrutiSinghIIT Gandhinagar - MayankSinghIndian Institute of Technology Gandhinagar + MayankSinghIndian Institute of Technology Gandhinagar 283-292 Research papers are long documents that contain information about various aspects such as background, prior work, methodology, and results. Existing works on scientific document representation learning only leverage the title and abstract of the paper. We present CoSAEmb, a model that learns representations from the full text of 97402 scientific papers from the S2ORC dataset. We present a novel supervised contrastive training framework for long documents using triplet loss and margin gradation. Our framework can be used to learn representations of long documents with any existing encoder-only transformer model without retraining it from scratch. CoSAEmb shows improved performance on information retrieval from the paper’s full text in comparison to models trained only on paper titles and abstracts. We also evaluate CoSAEmb on SciRepEval and CSFCube benchmarks, showing comparable performance with existing state-of-the-art models. 
2024.sdp-1.27 @@ -335,7 +335,7 @@ Harnessing <fixed-case>CLIP</fixed-case> for Evidence Identification in Scientific Literature: A Multimodal Approach to Context24 Shared Task AnukritiKumar - LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence + LucyWangUniversity of Washington and Allen Institute for Artificial Intelligence 307-313 Knowing whether scientific claims are supported by evidence is fundamental to scholarly communication and evidence-based decision-making. We present our approach to Task 1 of the Context24 Shared Task—Contextualizing Scientific Figures and Tables (SDP@ACL2024), which focuses on identifying multimodal evidence from scientific publications that support claims. We finetune CLIP, a state-of-the-art model for image-text similarity tasks, to identify and rank figures and tables in papers that substantiate specific claims. Our methods focus on text and image preprocessing techniques and augmenting the organizer-provided training data with labeled examples from the SciMMIR and MedICaT datasets. Our best-performing model achieved NDCG@5 and NDCG@10 values of 0.26 and 0.30, respectively, on the Context24 test split. Our findings underscore the effectiveness of data augmentation and preprocessing in improving the model’s ability in evidence matching. 2024.sdp-1.29 @@ -343,13 +343,13 @@ <fixed-case>CSIRO</fixed-case> at Context24: Contextualising Scientific Figures and Tables in Scientific Literature - NecvaBölücüCSIRO + NecvaBölücüCSIRO VincentNguyen - RoelienTimmer + RoelienTimmer HuichenYang, CSIRO MaciejRybinski - StephenWanCSIRO - SarvnazKarimiCSIRO + StephenWanCSIRO + SarvnazKarimiCSIRO 314-323 Finding evidence for claims from content presented in experimental results of scientific articles is difficult. The evidence is often presented in the form of tables and figures, and correctly matching it to scientific claims presents automation challenges. 
The Context24 shared task is launched to support the development of systems able to verify claims by extracting supporting evidence from articles. We explore different facets of this shared task modelled as a search problem and as an information extraction task. We experiment with a range of methods in each of these categories for the two sub-tasks of evidence identification and grounding context identification in the Context24 shared task. 2024.sdp-1.30 @@ -357,7 +357,7 @@ <fixed-case>OSX</fixed-case> at Context24: How Well Can <fixed-case>GPT</fixed-case> Tackle Contexualizing Scientific Figures and Tables - ToshoHirasawaOmron Sinic X + ToshoHirasawaOmron Sinic X 324-331 Identifying the alignment between different parts of a scientific paper is fundamental to scholarly document processing.In the Context24 shared task, participants are given a scientific claim and asked to identify (1) key figures or tables that support the claim and (2) methodological details.While employing a supervised approach to train models on task-specific data is a prevailing strategy for both subtasks, such an approach is not feasible for low-resource domains.Therefore, this paper introduces data-free systems supported by Large Language Models.We propose systems based on GPT-4o and GPT-4-turbo for each task.The experimental results reveal the zero-shot capabilities of GPT-4* in both tasks. 
2024.sdp-1.31 diff --git a/data/xml/2024.sighan.xml b/data/xml/2024.sighan.xml index 9473af2b69..6f7c09ca83 100644 --- a/data/xml/2024.sighan.xml +++ b/data/xml/2024.sighan.xml @@ -37,7 +37,7 @@ ZihanWang Liuxz2@chinatelecom.cnLiuxz2@chinatelecom.cnNA Liusx14@chinatelecom.cnLiusx14@chinatelecom.cnNA - YitongYao中国电信数字智能科技分公司 + YitongYao中国电信数字智能科技分公司 Huangyy121@chinatelecom.cnHuangyy121@chinatelecom.cnNA LiMengxiang ZhongjiangHe @@ -45,7 +45,7 @@ Pulw@chinatelecom.cnPulw@chinatelecom.cnNA Xuhn@chinatelecom.cnXuhn@chinatelecom.cnNA ChaoWang - ShuangyongSong + ShuangyongSong 10-20 In this paper, we present TeleChat, a collection of large language models (LLMs) with parameters of 7 billion and 12 billion. TeleChat is initially pretrained on an extensive corpus containing a diverse collection of texts from both English and Chinese languages, encompassing trillions of tokens. Subsequently, the model undergoes fine-tuning to align with human preferences, following a detailed methodology that we describe. We evaluate the performance of TeleChat on various tasks, including general dialogue generation, language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves state-of-the-art performance to other open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat-7B and TeleChat-12B, along with code and a portion of our filtered high-quality pretraining data, to the public community. 
2024.sighan-1.2 @@ -54,7 +54,7 @@ Few-shot Question Generation for Reading Comprehension YinPoonCity University of Hong Kong - John Sie YuenLeeCity University of Hong Kong + John Sie YuenLeeCity University of Hong Kong Yu YanLamHong Kong Metropolitan University Wing LamSuenHong Kong Metropolitan University Elsie Li ChenOngHong Kong Metropolitan University @@ -67,7 +67,7 @@ Adversarial Learning for Multi-Lingual Entity Linking BingbingWang - BinLiang + BinLiang ZhixinBai YongzhuoMaHarbin Institute of Technology, Shenzhen 28-35 @@ -77,7 +77,7 @@ Incremental pre-training from smaller language models - HanZhang + HanZhang HuiWang RuifengXuHarbin Institute of Technology 36-44 @@ -87,8 +87,8 @@ Holistic Exploration on Universal Decompositional Semantic Parsing: Architecture, Data Augmentation, and <fixed-case>LLM</fixed-case> Paradigm - HexuanDengHarbin Institute of Technology, Shenzhen - XinZhangHarbin Institute of Technology, Shenzhen + HexuanDengHarbin Institute of Technology, Shenzhen + XinZhangHarbin Institute of Technology, Shenzhen MeishanZhangHarbin Institute of Technology (Shenzhen), China and Tianjin University, China XueboLiuHarbin Institute of Technolgy, Shenzhen MinZhangHarbin Institute of Technology, Shenzhen @@ -101,7 +101,7 @@ Who Responded to Whom: The Joint Effects of Latent Topics and Discourse in Conversation Structure LuJiFudan University LeiChen - JingLiThe Hong Kong Polytechnic University + JingLiThe Hong Kong Polytechnic University ZhongyuWeiFudan University QiZhangFudan University XuanjingHuangFudan University @@ -112,9 +112,9 @@ <fixed-case>C</fixed-case>antonese Natural Language Processing in the Transformers Era - RongXiangHong Kong Polytechnic University + RongXiangHong Kong Polytechnic University MingLiao - JingLiThe Hong Kong Polytechnic University + JingLiThe Hong Kong Polytechnic University 69-79 Despite being spoken by a large population of speakers worldwide, Cantonese is under-resourced in terms of the data scale and diversity compared 
to other major languages. This limitation has excluded it from the current “pre-training and fine-tuning” paradigm that is dominated by Transformer architectures.In this paper, we provide a comprehensive review on the existing resources and methodologies for Cantonese Natural Language Processing, covering the recent progress in language understanding, text generation and development of language models.We finally discuss two aspects of the Cantonese language that could make it potentially challenging even for state-of-the-art architectures: colloquialism and multilinguality. 2024.sighan-1.8 @@ -124,7 +124,7 @@ Auto-<fixed-case>ACE</fixed-case>: An Automatic Answer Correctness Evaluation Method for Conversational Question Answering ZhixinBai BingbingWang - BinLiang + BinLiang RuifengXuHarbin Institute of Technology 80-87 Conversational question answering aims to respond to questions based on relevant contexts and previous question-answer history. Existing studies typically use ground-truth answers in history, leading to the inconsistency between the training and inference phases. However, in real-world scenarios, progress in question answering can only be made using predicted answers. Since not all predicted answers are correct, indiscriminately using all predicted answers for training introduces noise into the model. To tackle these challenges, we propose an automatic answer correctness evaluation method named **Auto-ACE**. Specifically, we first construct an Att-BERT model which employs attention weight to the BERT model, so as to bridge the relation between the current question and the question-answer pair in history. Furthermore, to reduce the interference of the irrelevant information in the predicted answer, A-Scorer, an answer scorer is designed to evaluate the confidence of the predicted answer. We conduct a series of experiments on QuAC and CoQA datasets, and the results demonstrate the effectiveness and practicality of our proposed Auto-ACE framework. 
@@ -134,11 +134,11 @@ <fixed-case>TMAK</fixed-case>-Plus at <fixed-case>SIGHAN</fixed-case>-2024 dim<fixed-case>ABSA</fixed-case> Task: Multi-Agent Collaboration for Transparent and Rational Sentiment Analysis XinKangTokushima University - ZhifeiZhangTongji University + ZhifeiZhangTongji University JiazhengZhouTokushima University YunongWuDataa Robotics (Chengdu Branch) XuefengShiNantong University - KazuyukiMatsumotoTokushima University + KazuyukiMatsumotoTokushima University 88-95 The TMAK-Plus team proposes a Multi-Agent Collaboration (MAC) model for the dimensional Aspect-Based Sentiment Analysis (dimABSA) task at SIGHAN-2024. The MAC model leverages Neuro-Symbolic AI to solve dimABSA transparently and rationally through symbolic message exchanges among generative AI agents. These agents collaborate on aspect detection, opinion detection, aspect classification, and intensity estimation. We created 8 sentiment intensity agents with distinct character traits to mimic diverse sentiment perceptions and average their outputs. The AI agents received clear instructions and 20 training examples to ensure task understanding. Our results suggest that the MAC model is effective in solving the dimABSA task and offers a transparent and rational approach to understanding the solution process. 2024.sighan-1.10 @@ -147,9 +147,9 @@ <fixed-case>YNU</fixed-case>-<fixed-case>HPCC</fixed-case> at <fixed-case>SIGHAN</fixed-case>-2024 dim<fixed-case>ABSA</fixed-case> Task: Using <fixed-case>PLM</fixed-case>s with a Joint Learning Strategy for Dimensional Intensity Prediction Wangzehui@stu.ynu.edu.cnWangzehui@stu.ynu.edu.cnNA - YouZhang + YouZhang JinWangYunnan University - DanXu + DanXu XuejieZhangYunnan University 96-101 The dimensional approach can represent more fine-grained emotional information than discrete affective states. 
In this paper, a pretrained language model (PLM) with a joint learning strategy is proposed for the SIGHAN-2024 shared task on Chinese dimensional aspect-based sentiment analysis (dimABSA), which requires submitted models to provide fine-grained multi-dimensional (Valance and Arousal) intensity predictions for given aspects of a review. The proposed model consists of three parts: an input layer that concatenates both given aspect terms and input sentences; a Chinese PLM encoder that generates aspect-specific review representation; and separate linear predictors that jointly predict Valence and Arousal sentiment intensities. Moreover, we merge simplified and traditional Chinese training data for data augmentation. Our systems ranked 2nd place out of 5 participants in subtask 1-intensity prediction. The code is publicly available at https://github.com/WZH5127/2024_subtask1_intensity_prediction. @@ -159,7 +159,7 @@ <fixed-case>CCIIPL</fixed-case>ab at <fixed-case>SIGHAN</fixed-case>-2024 dim<fixed-case>ABSA</fixed-case> Task: Contrastive Learning-Enhanced Span-based Framework for <fixed-case>C</fixed-case>hinese Dimensional Aspect-Based Sentiment Analysis ZeliangTong - WeiWeiHuazhong University of Science and Technology + WeiWeiHuazhong University of Science and Technology 102-111 This paper describes our system and findings for SIGHAN-2024 Shared Task Chinese Dimensional Aspect-Based Sentiment Analysis (dimABSA). Our team CCIIPLab proposes an Contrastive Learning-Enhanced Span-based (CL-Span) framework to boost the performance of extracting triplets/quadruples and predicting sentiment intensity. We first employ a span-based framework that integrates contextual representations and incorporates rotary position embedding. This approach fully considers the relational information of entire aspect and opinion terms, and enhancing the model’s understanding of the associations between tokens. 
Additionally, we utilize contrastive learning to predict sentiment intensities in the valence-arousal dimensions with greater precision. To improve the generalization ability of the model, additional datasets are used to assist training. Experiments have validated the effectiveness of our approach. In the official test results, our system ranked 2nd among the three subtasks. 2024.sighan-1.12 @@ -171,7 +171,7 @@ HanjieZhao XingrenWang ShanhongLiu - YuxiangJia + YuxiangJia HongyingZan 112-120 The DimABSA task requires fine-grained sentiment intensity prediction for restaurant reviews, including scores for Valence and Arousal dimensions for each Aspect Term. In this study, we propose a Coarse-to-Fine In-context Learning (CFICL) method based on the Baichuan2-7B model for the DimABSA task in the SIGHAN 2024 workshop. Our method improves prediction accuracy through a two-stage optimization process. In the first stage, we use fixed in-context examples and prompt templates to enhance the model’s sentiment recognition capability and provide initial predictions for the test data. In the second stage, we encode the Opinion field using BERT and select the most similar training data as new in-context examples based on similarity. These examples include the Opinion field and its scores, as well as related opinion words and their average scores. By filtering for sentiment polarity, we ensure that the examples are consistent with the test data. Our method significantly improves prediction accuracy and consistency by effectively utilizing training data and optimizing in-context examples, as validated by experimental results. 
@@ -182,7 +182,7 @@ <fixed-case>JN</fixed-case>-<fixed-case>NLP</fixed-case> at <fixed-case>SIGHAN</fixed-case>-2024 dim<fixed-case>ABSA</fixed-case> Task: Extraction of Sentiment Intensity Quadruples Based on Paraphrase Generation YunfanJiang Liutianci@stu.jiangnan.edu.cnLiutianci@stu.jiangnan.edu.cnNA - Heng-yangLuJiangnan University + Heng-yangLuJiangnan University 121-126 Aspect-based sentiment analysis(ABSA) is a fine-grained sentiment analysis task, which aims to extract multiple specific sentiment elements from text. The current aspect-based sentiment analysis task mainly involves four basic elements: aspect term, aspect category, opinion term, and sentiment polarity. With the development of ABSA, methods for predicting the four sentiment elements are gradually increasing. However, traditional ABSA usually only distinguishes between “positive”, “negative”, or “neutral”attitudes when judging sentiment polarity, and this simplified classification method makes it difficult to highlight the sentimentintensity of different reviews. SIGHAN 2024 provides a more challenging evaluation task, the Chinese dimensional ABSA shared task (dimABSA), which replaces the traditional sentiment polarity judgment task with a dataset in a multidimensional space with continuous sentiment intensity scores, including valence and arousal. Continuous sentiment intensity scores can obtain more detailed emotional information. In this task, we propose a new paraphrase generation paradigm that uses generative questioning in an end-to-end manner to predict sentiment intensity quadruples, which can fully utilize semantic information and reduce propagation errors in the pipeline approach. 
2024.sighan-1.14 @@ -191,7 +191,7 @@ <fixed-case>DS</fixed-case>-Group at <fixed-case>SIGHAN</fixed-case>-2024 dim<fixed-case>ABSA</fixed-case> Task: Constructing In-context Learning Structure for Dimensional Aspect-Based Sentiment Analysis Ling-angMengBeijing Institute of Technology - TianyuZhaoBeijing Institute of Technology + TianyuZhaoBeijing Institute of Technology DaweiSongBeijing Institute of Technology and Open University 127-132 Aspect-Based Sentiment Analysis (ABSA) is an important subtask in Natural Language Processing (NLP). More recent research within ABSA have consistently focused on conducting more precise sentiment analysis on aspects, i.e., dimensional Aspect-Based Sentiment Analysis (dimABSA). However, previous approaches have not systematically explored the use of Large Language Models (LLMs) in dimABSA. To fill the gap, we propose a novel In-Context Learning (ICL) structure with a novel aspect-aware ICL example selection method, to enhance the performance of LLMs in dimABSA. Experiments show that our proposed ICL structure significantly improves the fine-grained sentiment analysis abilities of LLMs. @@ -200,13 +200,13 @@ Fine-tuning after Prompting: an Explainable Way for Classification - ZezhongWang + ZezhongWang LuyaoYe - HongruWang + HongruWang BoyangXue YimingDu - BinLiang - Kam-FaiWong + BinLiang + Kam-FaiWong 133-142 Prompting is an alternative approach for utilizing pre-trained language models (PLMs) in classification tasks. In contrast to fine-tuning, prompting is more understandable for humans because it utilizes natural language to interact with the PLM, but it often falls short in terms of accuracy. While current research primarily focuses on enhancing the performance of prompting methods to compete with fine-tuning, we believe that these two approaches are not mutually exclusive, each having its strengths and weaknesses. 
In our study, we depart from the competitive view of prompting versus fine-tuning and instead combine them, introducing a novel method called F&P. This approach enables us to harness the advantages of Fine-tuning for accuracy and the explainability of Prompting simultaneously. Specifically, we reformulate the sample into a prompt and subsequently fine-tune a linear classifier on top of the PLM. Following this, we extract verbalizers according to the weight of this classifier. During the inference phase, we reformulate the sample in the same way and query the PLM. The PLM generates a word, which is then subject to a dictionary lookup by the verbalizer to obtain the prediction. Experiments show that keeping only 30 keywords for each class can achieve comparable performance as fine-tuning. On the other hand, both the prompt and verbalizers are constructed in natural language, making them fully understandable to humans. Hence, the F&P method offers an effective and transparent way to employ a PLM for classification tasks. 2024.sighan-1.16 @@ -225,13 +225,13 @@ <fixed-case>P</fixed-case>er<fixed-case>LTQA</fixed-case>: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering YimingDu - HongruWang + HongruWang ZhengyiZhao - BinLiang - BaojunWang - WanjunZhong - ZezhongWang - Kam-FaiWong + BinLiang + BaojunWang + WanjunZhong + ZezhongWang + Kam-FaiWong 152-164 In conversational AI, effectively employing long-term memory improves personalized and consistent response generation. Existing work only concentrated on a single type of long-term memory, such as preferences, dialogue history, or social relationships, overlooking their interaction in real-world contexts. 
To this end, inspired by the concept of semantic memory and episodic memory from cognitive psychology, we create a new and more comprehensive Chinese dataset, coined as PerLTQA, in which world knowledge, profiles, social relationships, events, and dialogues are considered to leverage the interaction between different types of long-term memory for question answering (QA) in conversation. Further, based on PerLTQA, we propose a novel framework for memory integration in QA, consisting of three subtasks: Memory Classification, Memory Retrieval, and Memory Fusion, which provides a comprehensive paradigm for memory modeling, enabling consistent and personalized memory utilization. This essentially allows the exploitation of more accurate memory information for better responses in QA. We evaluate this framework using five LLMs and three retrievers. Experimental results demonstrate the importance of personal long-term memory in the QA task 2024.sighan-1.18 @@ -239,10 +239,10 @@ Overview of the <fixed-case>SIGHAN</fixed-case> 2024 shared task for <fixed-case>C</fixed-case>hinese dimensional aspect-based sentiment analysis - Lung-HaoLeeNational Yang Ming Chiao Tung University + Lung-HaoLeeNational Yang Ming Chiao Tung University Liang-ChihYuYuan Ze University SugeWang - JianLiaoShanxi University + JianLiaoShanxi University 165-174 This paper describes the SIGHAN-2024 shared task for Chinese dimensional aspect-based sentiment analysis (ABSA), including task description, data preparation, performance metrics, and evaluation results. Compared to representing affective states as several discrete classes (i.e., sentiment polarity), the dimensional approach represents affective states as continuous numerical values (called sentiment intensity) in the valence-arousal space, providing more fine-grained affective states. 
Therefore, we organized a dimensional ABSA (shorted dimABSA) shared task, comprising three subtasks: 1) intensity prediction, 2) triplet extraction, and 3) quadruple extraction, receiving a total of 214 submissions from 61 registered participants during evaluation phase. A total of eleven teams provided selected submissions for each subtask and seven teams submitted technical reports for the subtasks. This shared task demonstrates current NLP techniques for dealing with Chinese dimensional ABSA. All data sets with gold standards and evaluation scripts used in this shared task are publicly available for future research. 2024.sighan-1.19 diff --git a/data/xml/2024.sigturk.xml b/data/xml/2024.sigturk.xml index aab492d13a..b1665d46c1 100644 --- a/data/xml/2024.sigturk.xml +++ b/data/xml/2024.sigturk.xml @@ -55,8 +55,8 @@ A coreference corpus of <fixed-case>T</fixed-case>urkish situated dialogs - FarukBüyüktekinMiddle East Technical University - UmutÖzgeMiddle East Technical University + FarukBüyüktekinMiddle East Technical University + UmutÖzgeMiddle East Technical University 42-52 The paper introduces a publicly available corpus of Turkish situated dialogs annotated for coreference. We developed an annotation scheme for coreference annotation in Turkish, a language with pro-drop and rich agglutinating morphology. The annotation scheme is tailored for these aspects of the language, making it potentially applicable to similar languages. The corpus comprises 60 dialogs containing in total 3900 sentences, 18360 words, and 6120 mentions. 2024.sigturk-1.4 @@ -76,7 +76,7 @@ Towards a Clean Text Corpus for <fixed-case>O</fixed-case>ttoman <fixed-case>T</fixed-case>urkish FatihKaragöz BeratDoğan - Şaziye BetülÖzateşBoğaziçi University + Şaziye BetülÖzateşBoğaziçi University 62-70 Ottoman Turkish, as a historical variant of modern Turkish, suffers from a scarcity of available corpora and NLP models. 
This paper outlines our pioneering endeavors to address this gap by constructing a clean text corpus of Ottoman Turkish materials. We detail the challenges encountered in this process and offer potential solutions. Additionally, we present a case study wherein the created corpus is employed in continual pre-training of BERTurk, followed by evaluation of the model’s performance on the named entity recognition task for Ottoman Turkish. Preliminary experimental results suggest the effectiveness of our corpus in adapting existing models developed for modern Turkish to historical Turkish. 2024.sigturk-1.6 @@ -98,7 +98,7 @@ Do <fixed-case>LLM</fixed-case>s Speak <fixed-case>K</fixed-case>azakh? A Pilot Evaluation of Seven Models AkylbekMaxutovInstitute of Smart Systems and Artificial Intelligence AyanMyrzakhmet - PavelBraslavskiNazarbayev University + PavelBraslavskiNazarbayev University 81-91 We conducted a systematic evaluation of seven large language models (LLMs) on tasks in Kazakh, a Turkic language spoken by approximately 13 million native speakers in Kazakhstan and abroad. We used six datasets corresponding to different tasks – questions answering, causal reasoning, middle school math problems, machine translation, and spelling correction. Three of the datasets were prepared for this study. As expected, the quality of the LLMs on the Kazakh tasks is lower than on the parallel English tasks. GPT-4 shows the best results, followed by Gemini and . In general, LLMs perform better on classification tasks and struggle with generative tasks. Our results provide valuable insights into the applicability of currently available LLMs for Kazakh. We made the data collected for this study publicly available: https://github.com/akylbekmaxutov/LLM-eval-using-Kazakh. 
2024.sigturk-1.8 @@ -107,10 +107,10 @@ Intelligent Tutor to Support Teaching and Learning of <fixed-case>T</fixed-case>atar AlsuZakirova - JueHouUniversity of Helsinki + JueHouUniversity of Helsinki AnisiaKatinskaia - Anh-DucVuUniversity of Helsinki, University of Helsinki - RomanYangarberUniversity of Helsinki + Anh-DucVuUniversity of Helsinki, University of Helsinki + RomanYangarberUniversity of Helsinki 92-101 This paper presents our work on tools to support the Tatar language, using Revita, a web-based Intelligent Tutoring System for language teaching and learning. The system allows the users — teachers and learners — to upload arbitrary authentic texts, and automatically creates exercises based on these texts that engage the learners in active production of language. It provides graduated feedback when they make mistakes, and performs continuous assessment, based on which the system selects exercises for the learners at the appropriate level. The assessment also helps the students maintain their learning pace, and helps the teachers to monitor their progress.The paper describes the functionality currently implemented for Tatar, which enables learners — who possess basic proficiency beyond the beginner level — to improve their competency, using texts of their choice as learning content. Support for Tatar is being developed to increase public interest in learning the language of this important regional minority, as well as to to provide tools for improving fluency to “heritage speakers” — those who have substantial passive competency, but lack active fluency and need support for regular practice. 
2024.sigturk-1.9 diff --git a/data/xml/2024.smm4h.xml b/data/xml/2024.smm4h.xml index 0a6e1a943f..aef65584f2 100644 --- a/data/xml/2024.smm4h.xml +++ b/data/xml/2024.smm4h.xml @@ -19,10 +19,10 @@ <fixed-case>T</fixed-case>hang<fixed-case>DLU</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Encoder-decoder models for classifying text data on social disorders in children and adolescents - ThangTa + ThangTa AbuRahman LotfollahNajjarUniversity of Nebraska at Omaha - AlexanderGelbukhInstituto Politécnico Nacional + AlexanderGelbukhInstituto Politécnico Nacional 1-4 This paper describes our participation in Task 3 and Task 5 of the #SMM4H (Social Media Mining for Health) 2024 Workshop, explicitly targeting the classification challenges within tweet data. Task 3 is a multi-class classification task centered on tweets discussing the impact of outdoor environments on symptoms of social anxiety. Task 5 involves a binary classification task focusing on tweets reporting medical disorders in children. We applied transfer learning from pre-trained encoder-decoder models such as BART-base and T5-small to identify the labels of a set of given tweets. We also presented some data augmentation methods to see their impact on the model performance. Finally, the systems obtained the best F1 score of 0.627 in Task 3 and the best F1 score of 0.841 in Task 5 2024.smm4h-1.1 @@ -40,8 +40,8 @@ <fixed-case>DILAB</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: <fixed-case>R</fixed-case>o<fixed-case>BERT</fixed-case>a Ensemble for Identifying Children’s Medical Disorders in <fixed-case>E</fixed-case>nglish Tweets - Azmine ToushikWasi - SheikhRahman + Azmine ToushikWasi + SheikhRahman 10-12 This paper details our system developed for the 9th Social Media Mining for Health Research and Applications Workshop (SMM4H 2024), addressing Task 5 focused on binary classification of English tweets reporting children’s medical disorders. 
Our objective was to enhance the detection of tweets related to children’s medical issues. To do this, we use various pre-trained language models, like RoBERTa and BERT. We fine-tuned these models on the task-specific dataset, adjusting model layers and hyperparameters in an attempt to optimize performance. As we observe unstable fluctuations in performance metrics during training, we implement an ensemble approach that combines predictions from different learning epochs. Our model achieves promising results, with the best-performing configuration achieving F1 score of 93.8% on the validation set and 89.8% on the test set. 2024.smm4h-1.3 @@ -49,8 +49,8 @@ <fixed-case>DILAB</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Analyzing Social Anxiety Effects through Context-Aware Transfer Learning on <fixed-case>R</fixed-case>eddit Data - SheikhRahman - Azmine ToushikWasi + SheikhRahman + Azmine ToushikWasi 13-16 This paper illustrates the system we design for Task 3 of the 9th Social Media Mining for Health (SMM4H 2024) shared tasks. The task presents posts made on the Reddit social media platform, specifically the *r/SocialAnxiety* subreddit, along with one or more outdoor activities as pre-determined keywords for each post. The task then requires each post to be categorized as either one of *positive*, *negative*, *no effect*, or *not outdoor activity* based on what effect the keyword(s) have on social anxiety. Our approach focuses on fine-tuning pre-trained language models to classify the posts. Additionally, we use fuzzy string matching to select only the text around the given keywords so that the model only has to focus on the contextual sentiment associated with the keywords. Using this system, our peak score is 0.65 macro-F1 on the validation set and 0.654 on test set. 
2024.smm4h-1.4 @@ -59,7 +59,7 @@ Dolomites@#<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Helping <fixed-case>LLM</fixed-case>s “Know The Drill” in Low-Resource Settings - A Study on Social Media Posts GiulianoTortoreto - Seyed MahedMousavi + Seyed MahedMousavi 17-22 The amount of data to fine-tune LLMs plays a crucial role in the performance of these models in downstream tasks. Consequently, it is not straightforward to deploy these models in low-resource settings. In this work, we investigate two new multi-task learning data augmentation approaches for fine-tuning LLMs when little data is available: “In-domain Augmentation” of the training data and extracting “Drills” as smaller tasks from the target dataset. We evaluate the proposed approaches in three natural language processing settings in the context of SMM4H 2024 competition tasks: multi-class classification, entity recognition, and information extraction. The results show that both techniques improve the performance of the models in all three settings, suggesting a positive impact from the knowledge learned in multi-task training to perform the target task. 2024.smm4h-1.5 @@ -68,7 +68,7 @@ <fixed-case>RIGA</fixed-case> at <fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case>-2024 Task 1: Enhancing <fixed-case>ADE</fixed-case> discovery with <fixed-case>GPT</fixed-case>-4 EduardsMukans - GuntisBarzdinsUniversity of Latvia + GuntisBarzdinsUniversity of Latvia 23-27 The following is a description of the RIGA team’s submissions for the SMM4H-2024 Task 1: Extraction and normalization of adverse drug events (ADEs) in English tweets. Our approach focuses on utilizing Large Language Models (LLMs) to generate data that enhances the fine-tuning of classification and Named Entity Recognition (NER) models. Our solution significantly outperforms mean and median submissions of other teams. 
The efficacy of our ADE extraction from tweets is comparable to the current state-of-the-art solution, established as the task baseline. The code for our method is available on GitHub (https://github.com/emukans/smm4h2024-riga) 2024.smm4h-1.6 @@ -111,7 +111,7 @@ <fixed-case>UTR</fixed-case>ad-<fixed-case>NLP</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Why <fixed-case>LLM</fixed-case>-Generated Texts Fail to Improve Text Classification Models YosukeYamagishi - YutaNakamuraThe University of Tokyo + YutaNakamuraThe University of Tokyo 42-47 In this paper, we present our approach to addressing the binary classification tasks, Tasks 5 and 6, as part of the Social Media Mining for Health (SMM4H) text classification challenge. Both tasks involved working with imbalanced datasets that featured a scarcity of positive examples. To mitigate this imbalance, we employed a Large Language Model to generate synthetic texts with positive labels, aiming to augment the training data for our text classification models. Unfortunately, this method did not significantly improve model performance. Through clustering analysis using text embeddings, we discovered that the generated texts significantly lacked diversity compared to the raw data. This finding highlights the challenges of using synthetic text generation for enhancing model efficacy in real-world applications, specifically in the context of health-related social media data. 
2024.smm4h-1.10 @@ -119,7 +119,7 @@ <fixed-case>HBUT</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024 Task1: Extraction and Normalization of Adverse Drug Events with a Large Language Model - YuanzhiKeHubei University of Technology + YuanzhiKeHubei University of Technology HanboJin XinyunWuHubei University of Technology CaiquanXiongHubei University of Technology @@ -141,7 +141,7 @@ <fixed-case>HBUT</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024 Task2: Cross-lingual Few-shot Medical Entity Extraction using a Large Language Model - YuanzhiKeHubei University of Technology + YuanzhiKeHubei University of Technology ZhangjuYin XinyunWuHubei University of Technology CaiquanXiongHubei University of Technology @@ -154,10 +154,10 @@ <fixed-case>PCIC</fixed-case> at <fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Enhancing <fixed-case>R</fixed-case>eddit Post Classification on Social Anxiety Using Transformer Models and Advanced Loss Functions LeonHecht VictorPozos - HelenaGomez AdornoInstituto de Investigaciones en Matemáticas Aplicadas y en Sistemas - UNAM + HelenaGomez AdornoInstituto de Investigaciones en Matemáticas Aplicadas y en Sistemas - UNAM GibranFuentes-Pineda - GerardoSierraUniversidad Nacional Autónoma de México - GemmaBel-EnguixUniversidad Nacional Autónoma de México + GerardoSierraUniversidad Nacional Autónoma de México + GemmaBel-EnguixUniversidad Nacional Autónoma de México 63-66 We present our approach to solving the task of identifying the effect of outdoor activities on social anxiety based on reddit posts. We employed state-of-the-art transformer models enhanced with a combination of advanced loss functions. Data augmentation techniques were also used to address class imbalance within the training set. Our method achieved a macro-averaged F1-score of 0.655 on the test data, surpassing the workshop’s mean F1-Score of 0.519. 
These findings suggest that integrating weighted loss functions improves the performance of transformer models in classifying unbalanced text data, while data augmentation can improve the model’s ability to generalize. 2024.smm4h-1.14 @@ -165,7 +165,7 @@ Transformers at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Identification of Tweets Reporting Children’s Medical Disorders And Effects of Outdoor Spaces on Social Anxiety Symptoms on <fixed-case>R</fixed-case>eddit Using <fixed-case>R</fixed-case>o<fixed-case>BERT</fixed-case>a - KritiSinghal + KritiSinghal JatinBedi 67-70 With the widespread increase in the use of social media platforms such as Twitter, Instagram, and Reddit, people are sharing their views on various topics. They have become more vocal on these platforms about their views and opinions on the medical challenges they are facing. This data is a valuable asset of medical insights in the study and research of healthcare. This paper describes our adoption of transformer-based approaches for tasks 3 and 5. For both tasks, we fine-tuned large RoBERTa, a BERT-based architecture, and achieved a highest F1 score of 0.413 and 0.900 in tasks 3 and 5, respectively. @@ -186,13 +186,13 @@ <fixed-case>P</fixed-case>olyu<fixed-case>CBS</fixed-case> at <fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: <fixed-case>LLM</fixed-case>-based Medical Disorder and Adverse Drug Event Detection with Low-rank Adaptation - ZhaiYu + ZhaiYu XiaoyiBao - EmmanueleChersoniThe Hong Kong Polytechnic University - BeatricePortelli + EmmanueleChersoniThe Hong Kong Polytechnic University + BeatricePortelli SophiaLeeHong Kong Polytechnic University - JinghangGuHong Kong Polytechnic University - Chu-RenHuang + JinghangGuHong Kong Polytechnic University + Chu-RenHuang 74-78 This is the demonstration of systems and results of our team’s participation in the Social Medical Mining for Health (SMM4H) 2024 Shared Task. 
Our team participated in two tasks: Task 1 and Task 5. Task 5 requires the detection of tweet sentences that claim children’s medical disorders from certain users. Task 1 needs teams to extract and normalize Adverse Drug Event terms in the tweet sentence. The team selected several Pre-trained Language Models and generative Large Language Models to meet the requirements. Strategies to improve the performance include cloze test, prompt engineering, Low Rank Adaptation etc. The test result of our system has an F1 score of 0.935, Precision of 0.954 and Recall of 0.917 in Task 5 and an overall F1 score of 0.08 in Task 1. 2024.smm4h-1.17 @@ -203,7 +203,7 @@ HarikaAbburiInternational Institute of Information Technology Hyderabad NirmalaPudotaDeloitte BalajiVeeramaniDeloitte - EdwardBowen + EdwardBowen SanmitraBhattacharya 79-82 The advent of Large Language Models (LLMs) such as Generative Pre-trained Transformers (GPT-4) mark a transformative era in Natural Language Generation (NLG). These models demonstrate the ability to generate coherent text that closely resembles human-authored content. They are easily accessible and have become invaluable tools in handling various text-based tasks, such as data annotation, report generation, and question answering. In this paper, we investigate GPT-4’s ability to discern between data it has annotated and data annotated by humans, specifically within the context of tweets in the medical domain. Through experimental analysis, we observe GPT-4 outperform other state-of-the-art models. The dataset used in this study was provided by the SMM4H (Social Media Mining for Health Research and Applications) shared task. Our model achieved an accuracy of 0.51, securing a second rank in the shared task. 
@@ -213,9 +213,9 @@ <fixed-case>IMS</fixed-case>_medic<fixed-case>ALY</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Detecting Impacts of Outdoor Spaces on Social Anxiety with Data Augmented Ensembling AmelieWuehrlUniversity of Stuttgart, Universität Stuttgart - LynnGreschnerOtto-Friedrich Universität Bamberg + LynnGreschnerOtto-Friedrich Universität Bamberg YarikMenchaca Resendiz - RomanKlingerOtto-Friedrich Universität Bamberg + RomanKlingerOtto-Friedrich Universität Bamberg 83-87 Many individuals affected by Social Anxiety Disorder turn to social media platforms to share their experiences and seek advice. This includes discussing the potential benefits of engaging with outdoor environments. As part of #SMM4H 2024, Shared Task 3 focuses on classifying the effects of outdoor spaces on social anxiety symptoms in Reddit posts. In our contribution to the task, we explore the effectiveness of domain-specific models (trained on social media data – SocBERT) against general domain models (trained on diverse datasets – BERT, RoBERTa, GPT-3.5) in predicting the sentiment related to outdoor spaces. Further, we assess the benefits of augmenting sparse human-labeled data with synthetic training instances and evaluate the complementary strengths of domain-specific and general classifiers using an ensemble model. Our results show that (1) fine-tuning small, domain-specific models generally outperforms large general language models in most cases. Only one large language model (GPT-4) exhibits performance comparable to the fine-tuned models (52% F1). Further, we find that (2) synthetic data does improve the performance of fine-tuned models in some cases, and (3) models do not appear to complement each other in our ensemble setup. 
2024.smm4h-1.19 @@ -223,7 +223,7 @@ 1024m at <fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Tasks 3, 5 & 6 - Self Reported Health Text Classification through Ensembles - RamKadiyala + RamKadiyala M.v.p.Rao 88-94 Social media is a great source of data for users reporting information and regarding their health and how various things have had an effect on them. This paper presents various approaches using Transformers and Large Language Models and their ensembles, their performance along with advantages and drawbacks for various tasks of SMM4H’24 - Classifying texts on impact of nature and outdoor spaces on the author’s mental health (Task 3), Binary classification of tweets reporting their children’s health disorders like Asthma, Autism, ADHD and Speech disorder (task 5), Binary classification of users self-reporting their age (task 6). @@ -232,9 +232,9 @@ Experimenting with Transformer-based and Large Language Models for Classifying Effects of Outdoor Spaces on Social Anxiety in Social Media Data - FalwahAlhamed - JuliaIveQueen Mary, University of London - LuciaSpeciaImperial College London + FalwahAlhamed + JuliaIveQueen Mary, University of London + LuciaSpeciaImperial College London 95-97 Social Anxiety Disorder (SAD) is a common condition, affecting a significant portion of the population. While research suggests spending time in nature can alleviate anxiety, the specific impact on SAD remains unclear. This study explores the relationship between discussions of outdoor spaces and social anxiety on social media. We leverage transformer-based and large language models (LLMs) to analyze a social media dataset focused on SAD. We developed three methods for the task of predicting the effects of outdoor spaces on SAD in social media. A two-stage pipeline classifier achieved the best performance of our submissions with results exceeding baseline performance. 
2024.smm4h-1.21 @@ -312,7 +312,7 @@ <fixed-case>UKYNLP</fixed-case>@<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case>2024: Language Model Methods for Health Entity Tagging and Classification on Social Media (Tasks 4 & 5) - MotasemObeidat + MotasemObeidat VinuEkanayakeUniversity of Kentucky Md Sultan AlNahianUniversity of Kentucky RamakanthKavuluruUniversity of Kentucky @@ -323,11 +323,11 @@ <fixed-case>LHS</fixed-case>712_<fixed-case>ADEN</fixed-case>ot<fixed-case>G</fixed-case>ood at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024 Task 1: Deep-<fixed-case>LLMADE</fixed-case>miner: A deep learning and <fixed-case>LLM</fixed-case> pharmacovigilance pipeline for extraction and normalization of adverse drug event mentions on <fixed-case>T</fixed-case>witter - YifanZheng - JunGong + YifanZheng + JunGong ShushunRenUniversity of Michigan DaltonSimancek - V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor + V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor 130-132 Adverse drug events (ADEs) pose major public health risks, with traditional reporting systems often failing to capture them. Our proposed pipeline, called Deep-LLMADEminer, used natural language processing approaches to tackle this issue for #SMM4H 2024 shared task 1. Using annotated tweets, we built a three part pipeline: RoBERTa for classification, GPT-4-turbo for span extraction, and BioBERT for normalization. Our models achieved F1-scores of 0.838, 0.306, and 0.354, respectively, offering a novel system for Task 1 and similar pharmacovigilance tasks. 
2024.smm4h-1.30 @@ -353,7 +353,7 @@ <fixed-case>KUL</fixed-case>@<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case>2024: Optimizing Text Classification with Quality-Assured Augmentation Strategies SumamFrancisKU Leuven, KU Leuven - Marie-FrancineMoensKU Leuven, KU Leuven + Marie-FrancineMoensKU Leuven, KU Leuven 142-145 This paper presents our models for the Social Media Mining for Health 2024 shared task, specifically Task 5, which involves classifying tweets reporting a child with childhood disorders (annotated as “1”) versus those merely mentioning a disorder (annotated as “0”). We utilized a classification model enhanced with diverse textual and language model-based augmentations. To ensure quality, we used semantic similarity, perplexity, and lexical diversity as evaluation metrics. Combining supervised contrastive learning and cross-entropy-based learning, our best model, incorporating R-drop and various LM generation-based augmentations, achieved an impressive F1 score of 0.9230 on the test set, surpassing the task mean and median scores. 2024.smm4h-1.33 @@ -364,7 +364,7 @@ ValeriaFraga NehaNair DaltonSimancek - V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor + V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor 146-148 This paper summarizes our participation in the Shared Task 4 of #SMM4H 2024. Task 4 was a named entity recognition (NER) task identifying clinical and social impacts of non-medical substance use in English Reddit posts. We employed the Bidirectional Encoder Representations from Transformers (BERT) model to complete this task. Our team achieved an F1-score of 0.892 on a validation set and a relaxed F1-score of 0.191 on the test set. 
2024.smm4h-1.34 @@ -375,7 +375,7 @@ HafizhYusuf DavidBelmonteUniversity of Michigan DaltonSimancek - V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor + V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor 149-152 The goal of Social Media Mining for Health (#SMM4H) 2024 Task 7 was to train a machine learning model that is able to distinguish between annotations made by humans and those made by a Large Language Model (LLM). The dataset consisted of tweets originating from #SMM4H 2023 Task 3, wherein the objective was to extract COVID-19 symptoms in Latin-American Spanish tweets. Due to the lack of additional annotated tweets for classification, we reframed the task using the available tweets and their corresponding human or machine annotator labels to explore differences between the two subsets of tweets. We conducted an exploratory data analysis and trained a BERT-based classifier to identify sampling biases between the two subsets. The exploratory data analysis found no significant differences between the samples and our best classifier achieved a precision of 0.52 and a recall of 0.51, indicating near-random performance. This confirms the lack of sampling biases between the two sets of tweets and is thus a valid dataset for a task designed to assess the authorship of annotations by humans versus machines. 2024.smm4h-1.35 @@ -385,7 +385,7 @@ <fixed-case>TL</fixed-case>ab at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024: Retrieval-Augmented Generation for <fixed-case>ADE</fixed-case> Extraction and Normalization JacobBerkowitzCedars-Sinai Medical Center ApoorvaSrinivasanCedars-Sinai Medical Center - JoseCortinaCedars Sinai + JoseCortinaCedars Sinai NicholasTatonetti1Cedars-Sinai Medical Center 153-157 SMM4H 2024 Task 1 is focused on the identification of standardized Adverse Drug Events (ADEs) in tweets. 
We introduce a novel Retrieval-Augmented Generation (RAG) method, leveraging the capabilities of Llama 3, GPT-4, and the SFR-embedding-mistral model, along with few-shot prompting techniques, to map colloquial tweet language to MedDRA Preferred Terms (PTs) without relying on extensive training datasets. Our method achieved competitive performance, with an F1 score of 0.359 in the normalization task and 0.392 in the named entity recognition (NER) task. Notably, our model demonstrated robustness in identifying previously unseen MedDRA PTs (F1=0.363) greatly surpassing the median task score of 0.141 for such terms. @@ -394,9 +394,9 @@ <fixed-case>BIT</fixed-case>@<fixed-case>UA</fixed-case> at #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024 Tasks 1 and 5: finding adverse drug events and children’s medical disorders in <fixed-case>E</fixed-case>nglish tweets - LuisAfonsoUniversidade de Aveiro + LuisAfonsoUniversidade de Aveiro JoãoAlmeidaUniversidade de Aveiro - RuiAntunesUniversidade de Aveiro + RuiAntunesUniversidade de Aveiro JoséOliveiraUniversidade de Aveiro 158-162 In this paper we present our proposed systems, for Tasks 1 and 5 of the #SMM4H-2024 shared task (Social Media Mining for Health), responsible for identifying health-related aspects in English social media text. Task 1 consisted of identifying text spans mentioning adverse drug events and linking them to unique identifiers from the medical terminology MedDRA, whereas in Task 5 the aim was to distinguish tweets that report a user having a child with a medical disorder from tweets that merely mention a disorder. For Task 1, our system, composed of a pre-trained RoBERTa model and a random forest classifier, achieved 0.397 and 0.295 entity recognition and normalization F1-scores respectively. In Task 5, we obtained a 0.840 F1-score using a pre-trained BERT model. 
@@ -405,7 +405,7 @@ <fixed-case>FORCE</fixed-case>: A Benchmark Dataset for Foodborne Disease Outbreak and Recall Event Extraction from News - SudeshnaJanaTata Consultancy Services Limited, India + SudeshnaJanaTata Consultancy Services Limited, India ManjiraSinhaIndian Institute of Technology Kharagpur TirthankarDasgupta 163-169 @@ -415,17 +415,17 @@ Overview of #<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case> 2024 – Task 2: Cross-Lingual Few-Shot Relation Extraction for Pharmacovigilance in <fixed-case>F</fixed-case>rench, <fixed-case>G</fixed-case>erman, and <fixed-case>J</fixed-case>apanese - LisaRaithelTechnische Universität Berlin - PhilippeThomasGerman Research Center for AI + LisaRaithelTechnische Universität Berlin + PhilippeThomasGerman Research Center for AI BhuvaneshVerma RolandRollerGerman Research Center for AI Hui-SyuanYehLIMSI-CNRS / Université Paris-Sud - ShuntaroYadaNara Institute of Science and Technology, Japan + ShuntaroYadaNara Institute of Science and Technology, Japan CyrilGrouinCNRS - ShokoWakamiyaNara Institute of Science and Technology - EijiAramakiNara Institute of Science and Technology, Japan + ShokoWakamiyaNara Institute of Science and Technology + EijiAramakiNara Institute of Science and Technology, Japan SebastianMöller - PierreZweigenbaumLISN, CNRS, Université Paris-Saclay + PierreZweigenbaumLISN, CNRS, Université Paris-Saclay 170-182 This paper provides an overview of Task 2 from the Social Media Mining for Health 2024 shared task (#SMM4H 2024), which focused on Named Entity Recognition (NER, Subtask 2a) and the joint task of NER and Relation Extraction (RE, Subtask 2b) for detecting adverse drug reactions (ADRs) in German, Japanese, and French texts written by patients. Participants were challenged with a few-shot learning scenario, necessitating models that can effectively generalize from limited annotated examples. 
Despite the diverse strategies employed by the participants, the overall performance across submissions from three teams highlighted significant challenges. The results underscored the complexity of extracting entities and relations in multi-lingual contexts, especially from the noisy and informal nature of user-generated content. Further research is required to develop robust systems capable of accurately identifying and associating ADR-related information in low-resource and multilingual settings. 2024.smm4h-1.39 @@ -435,25 +435,25 @@ Overview of the 9th Social Media Mining for Health Applications (#<fixed-case>SMM</fixed-case>4<fixed-case>H</fixed-case>) Shared Tasks at <fixed-case>ACL</fixed-case> 2024 – Large Language Models and Generalizability for Social Media <fixed-case>NLP</fixed-case> DongfangXuCedars-Sinai Medical Center GuillermoGarciaCedars-Sinai Medical Center - LisaRaithelTechnische Universität Berlin - PhilippeThomasGerman Research Center for AI + LisaRaithelTechnische Universität Berlin + PhilippeThomasGerman Research Center for AI RolandRollerGerman Research Center for AI - EijiAramakiNara Institute of Science and Technology, Japan - ShokoWakamiyaNara Institute of Science and Technology - ShuntaroYadaNara Institute of Science and Technology, Japan - PierreZweigenbaumLISN, CNRS, Université Paris-Saclay + EijiAramakiNara Institute of Science and Technology, Japan + ShokoWakamiyaNara Institute of Science and Technology + ShuntaroYadaNara Institute of Science and Technology, Japan + PierreZweigenbaumLISN, CNRS, Université Paris-Saclay KarenO’ConnorUniversity of Pennsylvania, University of Pennsylvania SaiSamineniCedars-Sinai Medical Center SophiaHernandezUniversity of Pittsburgh, Pittsburgh YaoGe - SwatiRajwal - SudeshnaDasEmory University - AbeedSarkerEmory University + SwatiRajwal + SudeshnaDasEmory University + AbeedSarkerEmory University AriKleinUniversity of Pennsylvania - AnaSchmidtF. Hoffmann-La Roche AG + AnaSchmidtF. 
Hoffmann-La Roche AG VishakhaSharmaRoche Diagnostics - RaulRodriguez-EstebanF. Hoffmann-La Roche Ltd - JuanBandaStanford University + RaulRodriguez-EstebanF. Hoffmann-La Roche Ltd + JuanBandaStanford University IvanAmaroCedars-Sinai Medical Center DavyWeissenbacher GracielaGonzalez-HernandezCedars-Sinai Medical Center diff --git a/data/xml/2024.splurobonlp.xml b/data/xml/2024.splurobonlp.xml index 34a41129e6..f8e9d107f5 100644 --- a/data/xml/2024.splurobonlp.xml +++ b/data/xml/2024.splurobonlp.xml @@ -35,8 +35,8 @@ Learning Communication Policies for Different Follower Behaviors in a Collaborative Reference Game PhilippSadlerUniversity of Potsdam - SherzodHakimovUniversität Potsdam - DavidSchlangenUniversity of Potsdam + SherzodHakimovUniversität Potsdam + DavidSchlangenUniversity of Potsdam 17-29 In this work, we evaluate the adaptability of neural agents towards assumed partner behaviors in a collaborative reference game. In this game, success is achieved when a knowledgeable guide can verbally lead a follower to the selection of a specific puzzle piece among several distractors. We frame this language grounding and coordination task as a reinforcement learning problem and measure to which extent a common reinforcement training algorithm (PPO) is able to produce neural agents (the guides) that perform well with various heuristic follower behaviors that vary along the dimensions of confidence and autonomy. We experiment with a learning signal that in addition to the goal condition also respects an assumed communicative effort. Our results indicate that this novel ingredient leads to communicative strategies that are less verbose (staying silent in some of the steps) and that with respect to that the guide’s strategies indeed adapt to the partner’s level of confidence and autonomy. 
2024.splurobonlp-1.2 @@ -47,7 +47,7 @@ YoshikoKawabataNA MaiOmuraNational Institute for Japanese Language and Linguistics HikariKonishiNA - MasayukiAsaharaNational Institute for Japanese Language and Linguistics, Japan + MasayukiAsaharaNational Institute for Japanese Language and Linguistics, Japan JohaneTakeuchiNA 30-35 We constructed a database of Japanese expressions based on route information. Using 20 maps as stimuli, we requested descriptions of routes between two points on each map from 40 individuals per route, collecting 1600 route information reference expressions. We determined whether the expressions were based solely on relative reference expressions by using landmarks on the maps. In cases in which only relative reference expressions were used, we labeled the presence or absence of information regarding the starting point, waypoints, and destination. Additionally, we collected clarity ratings for each expression using a survey. diff --git a/data/xml/2024.teachingnlp.xml b/data/xml/2024.teachingnlp.xml index 82ca8540fd..c971612c92 100644 --- a/data/xml/2024.teachingnlp.xml +++ b/data/xml/2024.teachingnlp.xml @@ -24,7 +24,7 @@ Documenting the Unwritten Curriculum of Student Research - ShomirWilsonPennsylvania State University + ShomirWilsonPennsylvania State University 1-3 Graduate and undergraduate student researchers in natural language processing (NLP) often need mentoring to learn the norms of research. While methodological and technical knowledge are essential, there is also a “hidden curriculum” of experiential knowledge about topics like work strategies, common obstacles, collaboration, conferences, and scholarly writing. As a professor, I have written a set of guides that cover typically unwritten customs and procedures for academic research. I share them with advisees to help them understand research norms and to help us focus on their specific questions and interests. 
This paper describes these guides, which are freely accessible on the web (https://shomir.net/advice), and I provide recommendations to faculty who are interested in creating similar materials for their advisees. 2024.teachingnlp-1.1 @@ -32,7 +32,7 @@ Example-Driven Course Slides on Natural Language Processing Concepts - NataliePardeUniversity of Illinois Chicago + NataliePardeUniversity of Illinois Chicago 4-6 Natural language processing (NLP) is a fast-paced field and a popular course topic in many undergraduate and graduate programs. This paper presents a comprehensive suite of example-driven course slides covering NLP concepts, ranging from fundamental building blocks to modern state-of-the-art approaches. In contributing these slides, I hope to alleviate burden for those starting out as faculty or in need of course material updates. The slides are publicly available for external use and are updated regularly to incorporate new advancements. 2024.teachingnlp-1.2 @@ -40,8 +40,8 @@ Industry vs Academia: Running a Course on Transformers in Two Setups - IrinaNikishina - MariaTikhonovaHigher School of Economics + IrinaNikishina + MariaTikhonovaHigher School of Economics ViktoriiaChekalina AlexeyZaytsevBIMSA ArtemVazhentsevSkolkovo Institute of Science and Technology and Artificial Intelligence Research Institute @@ -54,7 +54,7 @@ Striking a Balance between Classical and Deep Learning Approaches in Natural Language Processing Pedagogy AdityaJoshiUNSW - JakeRenzella + JakeRenzella PushpakBhattacharyyaIndian Institute of Technology, Bombay, Dhirubhai Ambani Institute Of Information and Communication Technology SauravJha XiangyuZhang @@ -65,7 +65,7 @@ Co-Creational Teaching of Natural Language Processing - JohnMcCraeNational University of Ireland Galway + JohnMcCraeNational University of Ireland Galway 33-42 Traditional lectures have poorer outcomes compared to active learning methodologies, yet many natural language processing classes in higher education still follow 
this outdated methodology. In this paper, we present, co-creational teaching, a methodology that encourages partnership between staff and lecturers and show how this can be applied to teach natural language processing. As a fast-moving and dynamic area of study with high interest from students, natural language processing is an ideal subject for innovative teaching methodologies to improve student outcomes. We detail our experience with teaching natural language processing through partnership with students and provide detailed descriptions of methodologies that can be used by others in their teaching, including considerations of diverse student populations. 2024.teachingnlp-1.5 @@ -73,13 +73,13 @@ Collaborative Development of Modular Open Source Educational Resources for Natural Language Processing - MatthiasAßenmacherLudwig-Maximilians-Universität München + MatthiasAßenmacherLudwig-Maximilians-Universität München AndreasStephan LeonieWeissweilerLMU Munich - ErionÇanoUniversität Paderborn and Universität Vienna + ErionÇanoUniversität Paderborn and Universität Vienna IngoZieglerCopenhagen University MarwinHärttrich - BerndBischlLMU + BerndBischlLMU BenjaminRothUniversität Vienna ChristianHeumannLudwig-Maximilians-Universität München HinrichSchütze @@ -90,11 +90,11 @@ From Hate Speech to Societal Empowerment: A Pedagogical Journey Through Computational Thinking and <fixed-case>NLP</fixed-case> for High School Students - Alessandra TeresaCignarellaaequa-tech srl + Alessandra TeresaCignarellaaequa-tech srl ElisaChierchiello ChiaraFerrando - SimonaFrendaUniversity of Turin - Soda MaremLo + SimonaFrendaUniversity of Turin + Soda MaremLo AndreaMarra 54-65 The teaching laboratory we have created integrates methodologies to address the topic of hate speech on social media among students while fostering computational thinking and AI education for societal impact. 
We provide a foundational understanding of hate speech and introduce computational concepts using matrices, bag of words, and practical exercises in platforms like Colaboratory. Additionally, we emphasize the application of AI, particularly in NLP, to address real-world challenges. Through retrospective evaluation, we assess the efficacy of our approach, aiming to empower students as proactive contributors to societal betterment. With this paper we present an overview of the laboratory’s structure, the primary materials used, and insights gleaned from six editions conducted to the present date. @@ -103,7 +103,7 @@ Tightly Coupled Worksheets and Homework Assignments for <fixed-case>NLP</fixed-case> - LauraBiesterMiddlebury College + LauraBiesterMiddlebury College WinstonWuUniversity of Hawaii at Hilo 66-68 In natural language processing courses, students often struggle to debug their code. In this paper, we present three homework assignments that are tightly coupled with in-class worksheets. The worksheets allow students to confirm their understanding of the algorithms on paper before trying to write code. Then, as students complete the coding portion of the assignments, the worksheets aid students in the debugging process as test cases for the code, allowing students to seamlessly compare their results to those from the correct execution of the algorithm. 
@@ -112,13 +112,13 @@ Teaching <fixed-case>LLM</fixed-case>s at <fixed-case>C</fixed-case>harles <fixed-case>U</fixed-case>niversity: Assignments and Activities - JindřichHelclEdinburgh University, University of Edinburgh - ZdeněkKasner - OndřejDušekCharles University, Prague - TomaszLimisiewiczCharles University Prague + JindřichHelclEdinburgh University, University of Edinburgh + ZdeněkKasner + OndřejDušekCharles University, Prague + TomaszLimisiewiczCharles University Prague DominikMacháčekCharles University TomášMusilCharles University, Prague - JindřichLibovickýCharles University Prague + JindřichLibovickýCharles University Prague 69-72 This paper presents teaching materials, particularly assignments and ideas for classroom activities, from a new course on large language models. The assignments include experiments with LLM inference for weather report generation and machine translation. The classroom activities include class quizzes, focused research on downstream tasks and datasets, and an interactive “best paper” session aimed at reading and comprehension of research papers. 2024.teachingnlp-1.9 @@ -126,7 +126,7 @@ Empowering the Future with Multilinguality and Language Diversity - En-Shiun AnnieLee + En-Shiun AnnieLee KoseiUemura Syed MekaelWasti MasonShipton @@ -138,7 +138,7 @@ A Course Shared Task on Evaluating <fixed-case>LLM</fixed-case> Output for Clinical Questions YufangHouTechnische Universität Darmstadt and IBM Research Ireland - Thy ThyTran + Thy ThyTran Doan Nam LongVu YiwenCaoTechnical University of Darmstadt KaiLiTechnical University of Darmstadt @@ -159,7 +159,7 @@ Teaching Natural Language Processing in Law School - DanielBraunUniversity of Twente + DanielBraunUniversity of Twente 85-90 Fuelled by technical advances, the interest in Natural Language Processing in the legal domain has rapidly increased over the last months and years. 
The design, usage, and testing of domain-specific systems, but also assessing these systems from a legal perspective, needs competencies at the intersection of law and Natural Language Processing. While the demand for such competencies is high among students, only a few law schools, particularly in Europe, teach such competencies. In this paper, we present the design for a Natural Language Processing course for postgraduate law students that is based on the principle of constructive alignment and has proven to be successful over the last three years. 2024.teachingnlp-1.13 @@ -175,7 +175,7 @@ <fixed-case>BELT</fixed-case>: Building Endangered Language Technology - MichaelGinnUniversity of Colorado at Boulder + MichaelGinnUniversity of Colorado at Boulder DavidSaavedra-Beltrán CamiloRobayoUniversidad Nacional de Colombia AlexisPalmerUniversity of Colorado, Boulder @@ -187,7 +187,7 @@ Training an <fixed-case>NLP</fixed-case> Scholar at a Small Liberal Arts College: A Backwards Designed Course Proposal GrushaPrasadColgate University - ForrestDavisColgate University + ForrestDavisColgate University 105-118 The rapid growth in natural language processing (NLP) over the last couple years has generated student interest and excitement in learning more about the field. In this paper, we present two types of students that NLP courses might want to train. First, an “NLP engineer” who is able to flexibly design, build and apply new technologies in NLP for a wide range of tasks. Second, an “NLP scholar” who is able to pose, refine and answer questions in NLP and how it relates to the society, while also learning to effectively communicate these answers to a broader audience. While these two types of skills are not mutually exclusive — NLP engineers should be able to think critically, and NLP scholars should be able to build systems — we think that courses can differ in the balance of these skills.
As educators at Small Liberal Arts Colleges, the strengths of our students and our institution favors an approach that is better suited to train NLP scholars. In this paper we articulate what kinds of skills an NLP scholar should have, and then adopt a backwards design to propose course components that can aid the acquisition of these skills. 2024.teachingnlp-1.16 @@ -198,7 +198,7 @@ AriaRayBrown JuliusSteuer MariusMosbachMcGill University and Mila - Quebec Artificial Intelligence Institute - DietrichKlakowSaarland University + DietrichKlakowSaarland University 119-127 We present a novel tool designed for teaching and interfacing the information-theoretic modeling abilities of large language models. The Surprisal Toolkit allows students from diverse linguistic and programming backgrounds to learn about measures of information theory and natural language processing (NLP) through an online interactive tool. In addition, the interface provides a valuable research mechanism for obtaining measures of surprisal. We implement the toolkit as part of a classroom tutorial in three different learning scenarios and discuss the overall receptive student feedback. We suggest this toolkit and similar applications as resourceful supplements to instruction in NLP topics, especially for the purpose of balancing conceptual understanding with technical instruction, grounding abstract topics, and engaging students with varying coding abilities. 
2024.teachingnlp-1.17 diff --git a/data/xml/2024.textgraphs.xml b/data/xml/2024.textgraphs.xml index 7b7b8f9dd9..42e97c8610 100644 --- a/data/xml/2024.textgraphs.xml +++ b/data/xml/2024.textgraphs.xml @@ -27,10 +27,10 @@ Learning Human Action Representations from Temporal Context in Lifestyle Vlogs - OanaIgnat - SantiagoCastroUniversity of Michigan + OanaIgnat + SantiagoCastroUniversity of Michigan WeijiLiTesla - RadaMihalceaUniversity of Michigan + RadaMihalceaUniversity of Michigan 1-18 We address the task of human action representation and show how the approach to generating word representations based on co-occurrence can be adapted to generate human action representations by analyzing their co-occurrence in videos. To this end, we formalize the new task of human action co-occurrence identification in online videos, i.e., determine whether two human actions are likely to co-occur in the same interval of time. We create and make publicly available the Co-Act (Action Co-occurrence) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains.
2024.textgraphs-1.1 @@ -38,12 +38,12 @@ <fixed-case>C</fixed-case>on<fixed-case>G</fixed-case>ra<fixed-case>T</fixed-case>: Self-Supervised Contrastive Pretraining for Joint Graph and Text Embeddings - WilliamBrannonMassachusetts Institute of Technology + WilliamBrannonMassachusetts Institute of Technology WonjuneKangMassachusetts Institute of Technology SuyashFulayMassachusetts Institute of Technology HangJiang BrandonRoyMassachusetts Institute of Technology and Brown University - DebRoyMassachusetts Institute of Technology + DebRoyMassachusetts Institute of Technology JadKabbaraMassachusetts Institute of Technology 19-39 Learning on text-attributed graphs (TAGs), in which nodes are associated with one or more texts, has been the subject of much recent work. However, most approaches tend to make strong assumptions about the downstream task of interest, are reliant on hand-labeled data, or fail to equally balance the importance of both text and graph representations. In this work, we propose Contrastive Graph-Text pretraining (ConGraT), a general, self-supervised approach for jointly learning separate representations of texts and nodes in a TAG. Our method trains a language model (LM) and a graph neural network (GNN) to align their representations in a common latent space using a batch-wise contrastive learning objective inspired by CLIP. We further propose an extension to the CLIP objective that leverages graph structure to incorporate information about inter-node similarity. Extensive experiments demonstrate that ConGraT outperforms baselines on various downstream tasks, including node and text category classification, link prediction, and language modeling. Finally, we present an application of our method to community detection in social graphs, which enables finding more textually grounded communities, rather than purely graph-based ones. 
@@ -53,7 +53,7 @@ Uniform Meaning Representation Parsing as a Pipelined Approach JayeolChunBrandeis University - NianwenXueBrandeis University + NianwenXueBrandeis University 40-52 Uniform Meaning Representation (UMR) is the next phase of semantic formalism following Abstract Meaning Representation (AMR), with added focus on inter-sentential relations allowing the representational scope of UMR to cover a full document. This, in turn, greatly increases the complexity of its parsing task with the additional requirement of capturing document-level linguistic phenomena such as coreference, modal and temporal dependencies. In order to establish a strong baseline despite the small size of recently released UMR v1.0 corpus, we introduce a pipeline model that does not require any training. At the core of our method is a two-track strategy of obtaining UMR’s sentence and document graphs separately, with the document-level triples being compiled at the token level and the sentence graph being converted from AMR graphs. By leveraging alignment between AMR and its sentence, we are able to generate the first automatic English UMR parses. 2024.textgraphs-1.3 @@ -61,12 +61,12 @@ Financial Product Ontology Population with Large Language Models - ChanatipSaetiaKasikorn Business Technology Group + ChanatipSaetiaKasikorn Business Technology Group JirathaPhruetthiset TawunratChalothornKASIKORN Business-Technology Group MonchaiLertsutthiwong - SupawatTaerungruangChiang Mai University - PakpoomBuabthongNakhon Ratchasima Rajabhat University + SupawatTaerungruangChiang Mai University + PakpoomBuabthongNakhon Ratchasima Rajabhat University 53-60 Ontology population, which aims to extract structured data to enrich domain-specific ontologies from unstructured text, typically faces challenges in terms of data scarcity and linguistic complexity, particularly in specialized fields such as retail banking. 
In this study, we investigate the application of large language models (LLMs) to populate domain-specific ontologies of retail banking products from Thai corporate documents. We compare traditional span-based approaches to LLMs-based generative methods, with different prompting techniques. Our findings reveal that while span-based methods struggle with data scarcity and the complex linguistic structure, LLMs-based generative approaches substantially outperform, achieving a 61.05% F1 score, with the most improvement coming from providing examples in the prompts. This improvement highlights the potential of LLMs for ontology population tasks, offering a scalable and efficient solution for structured information extraction, especially in low-resource language settings. 2024.textgraphs-1.4 @@ -86,7 +86,7 @@ Towards Understanding Attention-based Reasoning through Graph Structures in Medical Codes Classification NoonGoldstein - SaadullahAmin + SaadullahAmin GünterNeumannGerman Research Center for AI 78-92 A common approach to automatically assigning diagnostic and procedural clinical codes to health records is to solve the task as a multi-label classification problem. Difficulties associated with this task stem from domain knowledge requirements, long document texts, large and imbalanced label space, reflecting the breadth and dependencies between medical diagnoses and procedures. Decisions in the healthcare domain also need to demonstrate sound reasoning, both when they are correct and when they are erroneous. Existing works address some of these challenges by incorporating external knowledge, which can be encoded into a graph-structured format. Incorporating graph structures on the output label space or between the input document and output label spaces have shown promising results in medical codes classification. Limited focus has been put on utilizing graph-based representation on the input document space.
To partially bridge this gap, we represent clinical texts as graph-structured data through the UMLS Metathesaurus; we explore implicit graph representation through pre-trained knowledge graph embeddings and explicit domain-knowledge guided encoding of document concepts and relational information through graph neural networks. Our findings highlight the benefits of pre-trained knowledge graph embeddings in understanding model’s attention-based reasoning. In contrast, transparent domain knowledge guidance in graph encoder approaches is overshadowed by performance loss. Our qualitative analysis identifies limitations that contribute to prediction errors. @@ -96,8 +96,8 @@ Leveraging Graph Structures to Detect Hallucinations in Large Language Models NoaNonkes - SergeiAgaronian - EvangelosKanoulasUniversity of Amsterdam and University of Amsterdam + SergeiAgaronian + EvangelosKanoulasUniversity of Amsterdam and University of Amsterdam RoxanaPetcuUniversity of Amsterdam and University of Amsterdam 93-104 Large language models are extensively applied across a wide range of tasks, such as customer support, content creation, educational tutoring, and providing financial guidance. However, a well-known drawback is their predisposition to generate hallucinations. This damages the trustworthiness of the information these models provide, impacting decision-making and user confidence. We propose a method to detect hallucinations by looking at the structure of the latent space and finding associations within hallucinated and non-hallucinated generations. We create a graph structure that connects generations that lie closely in the embedding space. Moreover, we employ a Graph Attention Network which utilizes message passing to aggregate information from neighboring nodes and assigns varying degrees of importance to each neighbor based on their relevance. 
Our findings show that 1) there exists a structure in the latent space that differentiates between hallucinated and non-hallucinated generations, 2) Graph Attention Networks can learn this structure and generalize it to unseen generations, and 3) the robustness of our method is enhanced when incorporating contrastive learning. When evaluated against evidence-based benchmarks, our model performs similarly without access to search-based methods. @@ -118,18 +118,18 @@ <fixed-case>T</fixed-case>ext<fixed-case>G</fixed-case>raphs 2024 Shared Task on Text-Graph Representations for Knowledge Graph Question Answering AndreySakhovskiyKazan Federal University MikhailSalnikovSkolkovo Institute of Science and Technology - IrinaNikishina - AidaUsmanovaLeuphana Universität Lüneburg + IrinaNikishina + AidaUsmanovaLeuphana Universität Lüneburg AngelieKraftUniversität Hamburg CedricMöllerUniversität Hamburg DebayanBanerjeeUniversität Hamburg - JunboHuangUniversität Hamburg - LongquanJiangUniversität Hamburg - RanaAbdullahUniversität Hamburg + JunboHuangUniversität Hamburg + LongquanJiangUniversität Hamburg + RanaAbdullahUniversität Hamburg XiYanUniversität Hamburg - DmitryUstalovJetBrains - ElenaTutubalinaKazan Federal University - RicardoUsbeckLeuphana Universität Lüneburg + DmitryUstalovJetBrains + ElenaTutubalinaKazan Federal University + RicardoUsbeckLeuphana Universität Lüneburg AlexanderPanchenkoSkoltech 116-125 This paper describes the results of the Knowledge Graph Question Answering (KGQA) shared task that was co-located with the TextGraphs 2024 workshop. In this task, given a textual question and a list of entities with the corresponding KG subgraphs, the participating system should choose the entity that correctly answers the question. Our competition attracted thirty teams, four of which outperformed our strong ChatGPT-based zero-shot baseline. 
In this paper, we overview the participating systems and analyze their performance according to a large-scale automatic evaluation. To the best of our knowledge, this is the first competition aimed at the KGQA problem using the interaction between large language models (LLMs) and knowledge graphs. @@ -149,13 +149,13 @@ <fixed-case>HW</fixed-case>-<fixed-case>TSC</fixed-case> at <fixed-case>T</fixed-case>ext<fixed-case>G</fixed-case>raphs-17 Shared Task: Enhancing Inference Capabilities of <fixed-case>LLM</fixed-case>s with Knowledge Graphs - WeiTangHuawei Technologies Ltd. - XiaosongQiaoHuawei Technologies Ltd. + WeiTangHuawei Technologies Ltd. + XiaosongQiaoHuawei Technologies Ltd. XiaofengZhaoHuawei Technologies Ltd. MinZhangHuawei Technologies Ltd. ChangSu YuangLiHuawei Technologies Ltd. - YingluLiHuawei Technologies Ltd. + YingluLiHuawei Technologies Ltd. YilunLiu FeiyuYao ShiminTaoHuawei Technologies Ltd. @@ -182,9 +182,9 @@ VishnudevKuruvanthodiInternational Business Machines MohabElkarefInternational Business Machines ShinnosukeTanakaInternational Business Machines - JamesBarry - GeethMel - CampbellWatsonInternational Business Machines + JamesBarry + GeethMel + CampbellWatsonInternational Business Machines 142-148 This paper presents the approach of the NLPeople team for the Text-Graph Representations for KGQA Shared Task at TextGraphs-17. The task involved selecting an answer for a given question from a list of candidate entities. We show that prompting Large Language models (LLMs) to break down a natural language question into a series of sub-questions, allows models to understand complex questions. The LLMs arrive at the final answer by answering the intermediate questions using their internal knowledge and without needing additional context. Our approach to the task uses an ensemble of prompting strategies to guide how LLMs interpret various types of questions. 
Our submission achieves an F1 score of 85.90, ranking 1st among the other participants in the task. 2024.textgraphs-1.13 @@ -193,7 +193,7 @@ Skoltech at <fixed-case>T</fixed-case>ext<fixed-case>G</fixed-case>raphs-17 Shared Task: Finding <fixed-case>GPT</fixed-case>-4 Prompting Strategies for Multiple Choice Questions MariaLysyuk - PavelBraslavskiNazarbayev University + PavelBraslavskiNazarbayev University 149-153 In this paper, we present our solution to the TextGraphs-17 Shared Task on Text-Graph Representations for KGQA. GPT-4 alone, with chain-of-thought reasoning and a given set of answers, achieves an F1 score of 0.78. By employing subgraph size as a feature, Wikidata answer description as an additional context, and question rephrasing technique, we further strengthen this result. These tricks help to answer questions that were not initially answered and to eliminate irrelevant, identical answers. We have managed to achieve an F1 score of 0.83 and took 2nd place, improving the score by 0.05 over the baseline. An open implementation of our method is available on GitHub. 
2024.textgraphs-1.14 @@ -201,7 +201,7 @@ <fixed-case>J</fixed-case>elly<fixed-case>B</fixed-case>ell at <fixed-case>T</fixed-case>ext<fixed-case>G</fixed-case>raphs-17 Shared Task: Fusing Large Language Models with External Knowledge for Enhanced Question Answering - JuliaBelikova + JuliaBelikova EvegeniyBeliakin VasilyKonovalov 154-160 diff --git a/data/xml/2024.wassa.xml b/data/xml/2024.wassa.xml index cb7c31920a..daf55ccd51 100644 --- a/data/xml/2024.wassa.xml +++ b/data/xml/2024.wassa.xml @@ -24,7 +24,7 @@ Enhanced Financial Sentiment Analysis and Trading Strategy Development Using Large Language Models KemalKirtacUniversity College London, University of London - GuidoGermanoUniversity College London, University of London + GuidoGermanoUniversity College London, University of London 1-10 This study examines a novel methodology for enhanced financial sentiment analysis and trading strategy development using large language models (LLMs) such as OPT, BERT, FinBERT, LLAMA 3, and RoBERTa. Utilizing a dataset of 965,375 U.S. financial news articles from 2010 to 2023, our research demonstrates that the GPT-3-based OPT significantly outperforms other models, achieving a prediction accuracy of 74.4% for stock market returns. Our findings reveal that the advanced capabilities of LLMs, particularly OPT, surpass traditional sentiment analysis methods such as the Loughran-McDonald dictionary model in predicting and explaining stock returns. For instance, a self-financing strategy based on OPT scores achieves a Sharpe ratio of 3.05 over our sample period, compared to a Sharpe ratio of 1.23 for the strategy based on the dictionary model. This study highlights the superior performance of LLMs in financial sentiment analysis, encouraging further research into integrating artificial intelligence and LLMs in financial markets. 
2024.wassa-1.1 @@ -34,7 +34,7 @@ <fixed-case>SEC</fixed-case>: Context-Aware Metric Learning for Efficient Emotion Recognition in Conversation BarbaraGendronUniversity of Lorraine - GaëlGuibonUniversity of Lorraine + GaëlGuibonUniversity of Lorraine 11-22 The advent of deep learning models has made a considerable contribution to the achievement of Emotion Recognition in Conversation (ERC). However, this task still remains an important challenge due to the plurality and subjectivity of human emotions. Previous work on ERC provides predictive models using mostly graph-based conversation representations. In this work, we propose a way to model the conversational context that we incorporate into a metric learning training strategy, with a two-step process. This allows us to perform ERC in a flexible classification scenario and end up with a lightweight yet efficient model. Using metric learning through a Siamese Network architecture, we achieve 57.71 in macro F1 score for emotion classification in conversation on DailyDialog dataset, which outperforms the related work. This state-of-the-art result is promising in terms of the use of metric learning for emotion recognition, yet perfectible compared to the micro F1 score obtained. 2024.wassa-1.2 @@ -43,9 +43,9 @@ Modeling Complex Interactions in Long Documents for Aspect-Based Sentiment Analysis - ZehongYan - WynneHsuNational University of Singapore - Mong-LiLeeNational University of Singapore + ZehongYan + WynneHsuNational University of Singapore + Mong-LiLeeNational University of Singapore DavidBartram-Shaw 23-34 The growing number of online articles and reviews necessitates innovative techniques for document-level aspect-based sentiment analysis. Capturing the context in which an aspect is mentioned is crucial. Existing models have focused on relatively short reviews and may fail to consider distant contextual information. 
This is especially so in longer documents where an aspect may be referred to in multiple ways across dispersed sentences. This work introduces a hierarchical Transformer-based architecture that encodes information at different level of granularities with attention aggregation mechanisms to learn the local and global aspect-specific document representations. For empirical validation, we curate two datasets of long documents: one on social issues, and another covering various topics involving trust-related issues. Experimental results show that the proposed architecture outperforms state-of-the-art methods for document-level aspect-based sentiment classification. We also demonstrate the potential applicability of our approach for long document trust prediction. @@ -55,9 +55,9 @@ Hierarchical Adversarial Correction to Mitigate Identity Term Bias in Toxicity Detection - JohannesSchäferUniversität Hildesheim + JohannesSchäferUniversität Hildesheim UlrichHeidUniversität Hildesheim and Universität Stuttgart - RomanKlingerOtto-Friedrich Universität Bamberg + RomanKlingerOtto-Friedrich Universität Bamberg 35-51 Corpora that are the fundament for toxicity detection contain such expressions typically directed against a target individual or group, e.g., people of a specific gender or ethnicity. Prior work has shown that the target identity mention can constitute a confounding variable. As an example, a model might learn that Christians are always mentioned in the context of hate speech. This misguided focus can lead to a limited generalization to newly emerging targets that are not found in the training data. In this paper, we hypothesize and subsequently show that this issue can be mitigated by considering targets on different levels of specificity. We distinguish levels of (1) the existence of a target, (2) a class (e.g., that the target is a religious group), or (3) a specific target group (e.g., Christians or Muslims). 
We define a target label hierarchy based on these three levels and then exploit this hierarchy in an adversarial correction for the lowest level (i.e. (3)) while maintaining some basic target features. This approach does not lower the toxicity detection performance but increases the generalization to targets not being available at training time. 2024.wassa-1.4 @@ -76,9 +76,9 @@ <fixed-case>LL</fixed-case>a<fixed-case>MA</fixed-case>-Based Models for Aspect-Based Sentiment Analysis - JakubŠmídUniversity of West Bohemia + JakubŠmídUniversity of West Bohemia PavelPribanUniversity of West Bohemia - PavelKralUniversity of West Bohemia + PavelKralUniversity of West Bohemia 63-70 While large language models (LLMs) show promise for various tasks, their performance in compound aspect-based sentiment analysis (ABSA) tasks lags behind fine-tuned models. However, the potential of LLMs fine-tuned for ABSA remains unexplored. This paper examines the capabilities of open-source LLMs fine-tuned for ABSA, focusing on LLaMA-based models. We evaluate the performance across four tasks and eight English datasets, finding that the fine-tuned Orca 2 model surpasses state-of-the-art results in all tasks. However, all models struggle in zero-shot and few-shot scenarios compared to fully fine-tuned ones. Additionally, we conduct error analysis to identify challenges faced by fine-tuned models. 2024.wassa-1.6 @@ -99,7 +99,7 @@ Entity-Level Sentiment: More than the Sum of Its Parts EgilRønningstad - RomanKlingerOtto-Friedrich Universität Bamberg + RomanKlingerOtto-Friedrich Universität Bamberg LiljaØvrelidDept. 
of Informatics, University of Oslo ErikVelldalUniversity of Oslo 84-96 @@ -130,8 +130,8 @@ Know Thine Enemy: Adaptive Attacks on Misinformation Detection Using Reinforcement Learning - PiotrPrzybyła - EuanMcGillUniversitat Pompeu Fabra + PiotrPrzybyła + EuanMcGillUniversitat Pompeu Fabra HoracioSaggionUniversitat Pompeu Fabra and Universitat Pompeu Fabra 125-140 We present XARELLO: a generator of adversarial examples for testing the robustness of text classifiers based on reinforcement learning. Our solution is adaptive, it learns from previous successes and failures in order to better adjust to the vulnerabilities of the attacked model. This reflects the behaviour of a persistent and experienced attacker, which are common in the misinformation-spreading environment. We evaluate our approach using several victim classifiers and credibility-assessment tasks, showing it generates better-quality examples with less queries, and is especially effective against the modern LLMs. We also perform a qualitative analysis to understand the language patterns in the misinformation text that play a role in the attacks. 
@@ -154,11 +154,11 @@ Guiding Sentiment Analysis with Hierarchical Text Clustering: Analyzing the <fixed-case>G</fixed-case>erman <fixed-case>X</fixed-case>/<fixed-case>T</fixed-case>witter Discourse on Face Masks in the 2020 <fixed-case>COVID</fixed-case>-19 Pandemic SilvanWehrli - ChisomEzekannaghaRobert Koch Institute - GeorgesHattabCentre for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch-Institute - TamaraBoenderVrije Universiteit Amsterdam - BertArnrichHasso Plattner Institute - ChristopherIrrgangRobert Koch Institute + ChisomEzekannaghaRobert Koch Institute + GeorgesHattabCentre for Artificial Intelligence in Public Health Research (ZKI-PH), Robert Koch-Institute + TamaraBoenderVrije Universiteit Amsterdam + BertArnrichHasso Plattner Institute + ChristopherIrrgangRobert Koch Institute 153-167 Social media are a critical component of the information ecosystem during public health crises. Understanding the public discourse is essential for effective communication and misinformation mitigation. Computational methods can aid these efforts through online social listening. We combined hierarchical text clustering and sentiment analysis to examine the face mask-wearing discourse in Germany during the COVID-19 pandemic using a dataset of 353,420 German X (formerly Twitter) posts from 2020. For sentiment analysis, we annotated a subsample of the data to train a neural network for classifying the sentiments of posts (neutral, negative, or positive). In combination with clustering, this approach uncovered sentiment patterns of different topics and their subtopics, reflecting the online public response to mask mandates in Germany. We show that our approach can be used to examine long-term narratives and sentiment dynamics and to identify specific topics that explain peaks of interest in the social media discourse. 
2024.wassa-1.13 @@ -167,7 +167,7 @@ Emotion Identification for <fixed-case>F</fixed-case>rench in Written Texts: Considering Modes of Emotion Expression as a Step Towards Text Complexity Analysis - AlineÉtienne + AlineÉtienne DelphineBattistelliUniversité Paris Nanterre GwénoléLecorvéOrange 168-185 @@ -178,9 +178,9 @@ Comparing Tools for Sentiment Analysis of <fixed-case>D</fixed-case>anish Literature from Hymns to Fairy Tales: Low-Resource Language and Domain Challenges - PascaleFeldkamp - JanKostkanAarhus University - EaOvergaard + PascaleFeldkamp + JanKostkanAarhus University + EaOvergaard MiaJacobsen YuriBizzoni 186-199 @@ -203,7 +203,7 @@ Subjectivity Detection in <fixed-case>E</fixed-case>nglish News using Large Language Models MohammadShokri VivekSharmaCUNY John Jay College of Criminal Justice and The Graduate Center, CUNY - ElenaFilatovaCUNY City Tech + ElenaFilatovaCUNY City Tech ShwetaJainCUNY John Jay College of Criminal Justice SarahLevitanCUNY Hunter College 215-226 @@ -214,10 +214,10 @@ Monitoring Depression Severity and Symptoms in User-Generated Content: An Annotation Scheme and Guidelines - FalwahAlhamed + FalwahAlhamed RebeccaBendayan - JuliaIveQueen Mary, University of London - LuciaSpeciaImperial College London + JuliaIveQueen Mary, University of London + LuciaSpeciaImperial College London 227-233 Depression is a highly prevalent condition recognized by the World Health Organization as a leading contributor to global disability. Many people suffering from depression express their thoughts and feelings using social media, which thus becomes a source of data for research in this domain. However, existing annotation schemes tailored to studying depression symptoms in social media data remain limited. Reliable and valid annotation guidelines are crucial for accurately measuring mental health conditions for those studies. 
This paper addresses this gap by presenting a novel depression annotation scheme and guidelines for detecting depression symptoms and their severity in social media text. Our approach leverages validated depression questionnaires and incorporates the expertise of psychologists and psychiatrists during scheme refinement. The resulting annotation scheme achieves high inter-rater agreement, demonstrating its potential for suitable depression assessment in social media contexts. 2024.wassa-1.18 @@ -227,7 +227,7 @@ <fixed-case>R</fixed-case>ide<fixed-case>KE</fixed-case>: Leveraging Low-resource <fixed-case>T</fixed-case>witter User-generated Content for Sentiment and Emotion Detection on Code-switched <fixed-case>RHS</fixed-case> Dataset. NaomeEtori - MariaGiniUniversity of Minnesota , Twin Ciities + MariaGiniUniversity of Minnesota , Twin Ciities 234-249 Social media has become a crucial open-access platform enabling individuals to freely express opinions and share experiences. These platforms contain user-generated content facilitating instantaneous communication and feedback. However, leveraging low-resource language data from Twitter can be challenging due to the scarcity and poor quality of content with significant variations in language use, such as slang and code-switching. Automatically identifying tweets in low-resource languages can also be challenging because Twitter primarily supports high-resource languages; low-resource languages often lack robust linguistic and contextual support. This paper analyzes Kenyan code-switched data from Twitter using four transformer-based pretrained models for sentiment and emotion classification tasks using supervised and semi-supervised methods. We detail the methodology behind data collection, the annotation procedure, and the challenges encountered during the data curation phase. 
Our results show that XLM-R outperforms other models; for sentiment analysis, XLM-R supervised model achieves the highest accuracy (69.2%) and F1 score (66.1%), XLM-R semi-supervised (67.2% accuracy, 64.1% F1 score). In emotion analysis, DistilBERT supervised leads in accuracy (59.8%) and F1 score (31%), mBERT semi-supervised (accuracy 59% and F1 score 26.5%). AfriBERTa models show the lowest accuracy and F1 scores. This indicates that the semi-supervised method’s performance is constrained by the small labeled dataset. 2024.wassa-1.19 @@ -236,11 +236,11 @@ <fixed-case>POL</fixed-case>ygraph: <fixed-case>P</fixed-case>olish Fake News Dataset - DanielDzienisiewiczAdam Mickiewicz University of Poznan - FilipGralińskiAdam Mickiewicz University, Adam Mickiewicz University, Applica.ai and Applica.ai - PiotrJabłoński - MarekKubisAdam Mickiewicz University of Poznan - PawełSkórzewskiAdam Mickiewicz University of Poznan + DanielDzienisiewiczAdam Mickiewicz University of Poznan + FilipGralińskiAdam Mickiewicz University, Adam Mickiewicz University, Applica.ai and Applica.ai + PiotrJabłoński + MarekKubisAdam Mickiewicz University of Poznan + PawełSkórzewskiAdam Mickiewicz University of Poznan PiotrWierzchonAdam mickiewicz University 250-263 This paper presents the POLygraph dataset, a unique resource for fake news detection in Polish. The dataset, created by an interdisciplinary team, is composed of two parts: the “fake-or-not” dataset with 11,360 pairs of news articles (identified by their URLs) and corresponding labels, and the “fake-they-say” dataset with 5,082 news articles (identified by their URLs) and tweets commenting on them. Unlike existing datasets, POLygraph encompasses a variety of approaches from source literature, providing a comprehensive resource for fake news detection. The data was collected through manual annotation by expert and non-expert annotators.
The project also developed a software tool that uses advanced machine learning techniques to analyze the data and determine content authenticity. The tool and dataset are expected to benefit various entities, from public sector institutions to publishers and fact-checking organizations. Further dataset exploration will foster fake news detection and potentially stimulate the implementation of similar models in other languages. The paper focuses on the creation and composition of the dataset, so it does not include a detailed evaluation of the software tool for content authenticity analysis, which is planned at a later stage of the project. @@ -261,7 +261,7 @@ Impact of Decoding Methods on Human Alignment of Conversational <fixed-case>LLM</fixed-case>s ShazFurniturewala - KokilJaidkaNational University of Singapore + KokilJaidkaNational University of Singapore YashvardhanSharmaBITS Pilani 273-279 To be included into chatbot systems, Large language models (LLMs) must be aligned with human conversational conventions. However, being trained mainly on web-scraped data gives existing LLMs a voice closer to informational text than actual human speech. In this paper, we examine the effect of decoding methods on the alignment between LLM-generated and human conversations, including Beam Search, Top K Sampling, and Nucleus Sampling. We present new measures of alignment in substance, style, and psychometric orientation, and experiment with two conversation datasets. Our results provide subtle insights: better alignment is attributed to fewer beams in Beam Search and lower values of P in Nucleus Sampling. We also find that task-oriented and open-ended datasets perform differently in terms of alignment, indicating the significance of taking into account the context of the interaction. 
@@ -274,8 +274,8 @@ NaoyaFujikawa Quang ToanNguyen KazuhiroIto - ShokoWakamiya - EijiAramaki + ShokoWakamiya + EijiAramaki 280-293 Loneliness, a significant public health concern, is closely connected to both physical and mental well-being. Hence, detection and intervention for individuals experiencing loneliness are crucial. Identifying loneliness in text is straightforward when it is explicitly stated but challenging when it is implicit. Detecting implicit loneliness requires a manually annotated dataset because whereas explicit loneliness can be detected using keywords, implicit loneliness cannot be. However, there are no freely available datasets with clear annotation guidelines for implicit loneliness. In this study, we construct a freely accessible Japanese loneliness dataset with annotation guidelines grounded in the psychological definition of loneliness. This dataset covers loneliness intensity and the contributing factors of loneliness. We train two models to classify whether loneliness is expressed and the intensity of loneliness. The model classifying loneliness versus non-loneliness achieves an F1-score of 0.833, but the model for identifying the intensity of loneliness has a low F1-score of 0.400, which is likely due to label imbalance and a shortage of a certain label in the dataset. We validate performance in another domain, specifically X (formerly Twitter), and observe a decrease. In addition, we propose improvement suggestions for domain adaptation. 
2024.wassa-1.23 @@ -286,12 +286,12 @@ Estimation of Happiness Changes through Longitudinal Analysis of Employees’ Texts JunkoHayashi KazuhiroItoNara Institute of Science and Technology, Japan - MasaeManabeKyoto University - YasushiWatanabeKyoto University, Tokyo Institute of Technology + MasaeManabeKyoto University + YasushiWatanabeKyoto University, Tokyo Institute of Technology MasatakaNakayamaKyoto University YukikoUchida - ShokoWakamiyaNara Institute of Science and Technology - EijiAramakiNara Institute of Science and Technology, Japan + ShokoWakamiyaNara Institute of Science and Technology + EijiAramakiNara Institute of Science and Technology, Japan 294-304 Measuring happiness as a determinant of well-being is increasingly recognized as crucial. While previous studies have utilized free-text descriptions to estimate happiness on a broad scale, limited research has focused on tracking individual fluctuations in happiness over time owing to the challenges associated with longitudinal data collection. This study addresses this issue by obtaining longitudinal data from two workplaces over two and six months respectively.Subsequently, the data is used to construct a happiness estimation model and assess individual happiness levels.Evaluation of the model performance using correlation coefficients shows variability in the correlation values among individuals.Notably, the model performs satisfactorily in estimating 9 of the 11 users’ happiness scores, with a correlation coefficient of 0.4 or higher. To investigate the factors affecting the model performance, we examine the relationship between the model performance and variables such as sentence length, lexical diversity, and personality traits. Correlations are observed between these features and model performance. 2024.wassa-1.24 @@ -300,8 +300,8 @@ Subjectivity Theory vs. 
Speaker Intuitions: Explaining the Results of a Subjectivity Regressor Trained on Native Speaker Judgements - ElenaSavinovaRadboud University - JetHoek + ElenaSavinovaRadboud University + JetHoek 305-315 In this paper, we address the issue of explainability in a transformer-based subjectivity regressor trained on native English speakers’ judgements. The main goal of this work is to test how the regressor’s predictions, and therefore native speakers’ intuitions, relate to theoretical accounts of subjectivity. We approach this goal using two methods: a top-down manual selection of theoretically defined subjectivity features and a bottom-up extraction of top subjective and objective features using the LIME explanation method. The explainability of the subjectivity regressor is evaluated on a British news dataset containing sentences taken from social media news posts and from articles on the websites of the same news outlets. Both methods provide converging evidence that theoretically defined subjectivity features, such as emoji, evaluative adjectives, exclamations, questions, intensifiers, and first person pronouns, are prominent predictors of subjectivity scores. Thus, our findings show that the predictions of the regressor, and therefore native speakers’ perceptions of subjectivity, align with subjectivity theory. However, an additional comparison of the effects of different subjectivity features in author text and the text of cited sources reveals that the distinction between author and source subjectivity might not be as salient for naïve speakers as it is in the theory. 2024.wassa-1.25 @@ -312,8 +312,8 @@ Comparing Pre-trained Human Language Models: Is it Better with Human Context as Groups, Individual Traits, or Both? NikitaSoni NiranjanBalasubramanianState University of New York, Stony Brook - H. AndrewSchwartzStony Brook University (SUNY) - DirkHovyBocconi University + H. 
AndrewSchwartzStony Brook University (SUNY) + DirkHovyBocconi University 316-328 Pre-trained language models consider the context of neighboring words and documents but lack any author context of the human generating the text. However, language depends on the author’s states, traits, social, situational, and environmental attributes, collectively referred to as human context (Soni et al., 2024). Human-centered natural language processing requires incorporating human context into language models. Currently, two methods exist: pre-training with 1) group-wise attributes (e.g., over-45-year-olds) or 2) individual traits. Group attributes are simple but coarse — not all 45-year-olds write the same way — while individual traits allow for more personalized representations, but require more complex modeling and data. It is unclear which approach benefits what tasks. We compare pre-training models with human context via 1) group attributes, 2) individual users, and 3) a combined approach on five user- and document-level tasks. Our results show that there is no best approach, but that human-centered language modeling holds avenues for different methods. 2024.wassa-1.26 @@ -346,7 +346,7 @@ KemalKurniawanUniversity of Melbourne MeladelMisticaThe University of Melbourne TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne - Jey HanLauThe University of Melbourne + Jey HanLauThe University of Melbourne 362-368 This paper explores the task of automatic prediction of text spans in a legal problem description that support a legal area label. We use a corpus of problem descriptions written by laypeople in English that is annotated by practising lawyers. Inherent subjectivity exists in our task because legal area categorisation is a complex task, and lawyers often have different views on a problem. Experiments show that training on majority-voted spans outperforms training on disaggregated ones. 
2024.wassa-1.29 @@ -378,7 +378,7 @@ Chinchunmei at <fixed-case>WASSA</fixed-case> 2024 Empathy and Personality Shared Task: Boosting <fixed-case>LLM</fixed-case>’s Prediction with Role-play Augmentation and Contrastive Reasoning Calibration TianLi - NicolayRusnachenko + NicolayRusnachenko HuizhiLiangNewcastle University, UK 385-392 This paper presents the Chinchunmei team’s contributions to the WASSA2024 Shared-Task 1: Empathy Detection and Emotion Classification. We participated in Tracks 1, 2, and 3 to predict empathetic scores based on dialogue, article, and essay content. We choose Llama3-8b-instruct as our base model. We developed three supervised fine-tuning schemes: standard prediction, role-play, and contrastive prediction, along with an innovative scoring calibration method called Contrastive Reasoning Calibration during inference. Pearson Correlation was used as the evaluation metric across all tracks. For Track 1, we achieved 0.43 on the devset and 0.17 on the testset. For Track 2 emotion, empathy, and polarity labels, we obtained 0.64, 0.66, and 0.79 on the devset and 0.61, 0.68, and 0.58 on the testset. For Track 3 empathy and distress labels, we got 0.64 and 0.56 on the devset and 0.33 and 0.35 on the testset. @@ -411,7 +411,7 @@ Empaths at <fixed-case>WASSA</fixed-case> 2024 Empathy and Personality Shared Task: Turn-Level Empathy Prediction Using Psychological Indicators ShazFurniturewala - KokilJaidkaNational University of Singapore + KokilJaidkaNational University of Singapore 404-411 For the WASSA 2024 Empathy and Personality Prediction Shared Task, we propose a novel turn-level empathy detection method that decomposes empathy into six psychological indicators: Emotional Language, Perspective-Taking, Sympathy and Compassion, Extroversion, Openness, and Agreeableness. 
A pipeline of text enrichment using a Large Language Model (LLM) followed by DeBERTA fine-tuning demonstrates a significant improvement in the Pearson Correlation Coefficient and F1 scores for empathy detection, highlighting the effectiveness of our approach. Our system officially ranked 7th at the CONV-turn track. 2024.wassa-1.35 @@ -454,7 +454,7 @@ HuiyuYang LitingHuang TianLi - NicolayRusnachenko + NicolayRusnachenko HuizhiLiangNewcastle University, UK 430-434 This paper presents our participation to the WASSA 2024 Shared Task on Empathy Detection and Emotion Classification and Personality Detection in Interactions. We focus on Track 2: Empathy and Emotion Prediction in Conversations Turns (CONV-turn), which consists of predicting the perceived empathy, emotion polarity and emotion intensity at turn level in a conversation. In the method, we conduct BERT and DeBERTa based finetuning, implement the CombinedLoss which consists of a structured contrastive loss and Pearson loss, adopt adversarial training using Fast Gradient Method (FGM). This method achieved Pearson correlation of 0.581 for Emotion,0.644 for Emotional Polarity and 0.544 for Empathy on the test set, with the average value of 0.590 which ranked 4th among all teams. After submission to WASSA 2024 competition, we further introduced the segmented mix-up for data augmentation, boosting for ensemble and regression experiments, which yield even better results: 0.6521 for Emotion, 0.7376 for EmotionalPolarity, 0.6326 for Empathy in Pearson correlation on the development set. The implementation and fine-tuned models are publicly-available at https://github.com/hyy-33/hyy33-WASSA-2024-Track-2. 
@@ -476,8 +476,8 @@ <fixed-case>E</fixed-case>mpathetic<fixed-case>FIG</fixed-case> at <fixed-case>WASSA</fixed-case> 2024 Empathy and Personality Shared Task: Predicting Empathy and Emotion in Conversations with Figurative Language GyeongeunLee ZhuWangUniversity of Illinois at Chicago - Sathya N.RaviUniversity of Illinois, Chicago - NataliePardeUniversity of Illinois Chicago + Sathya N.RaviUniversity of Illinois, Chicago + NataliePardeUniversity of Illinois Chicago 441-447 Recent research highlights the importance of figurative language as a tool for amplifying emotional impact. In this paper, we dive deeper into this phenomenon and outline our methods for Track 1, Empathy Prediction in Conversations (CONV-dialog) and Track 2, Empathy and Emotion Prediction in Conversation Turns (CONV-turn) of the WASSA 2024 shared task. We leveraged transformer-based large language models augmented with figurative language prompts, specifically idioms, metaphors and hyperbole, that were selected and trained for each track to optimize system performance. For Track 1, we observed that a fine-tuned BERT with metaphor and hyperbole features outperformed other models on the development set. For Track 2, DeBERTa, with different combinations of figurative language prompts, performed well for different prediction tasks. Our method provides a novel framework for understanding how figurative language influences emotional perception in conversational contexts. Our system officially ranked 4th in the 1st track and 3rd in the 2nd track. 
2024.wassa-1.41 @@ -486,9 +486,9 @@ <fixed-case>C</fixed-case>on<fixed-case>T</fixed-case>ext at <fixed-case>WASSA</fixed-case> 2024 Empathy and Personality Shared Task: History-Dependent Embedding Utterance Representations for Empathy and Emotion Prediction in Conversations - PatríciaPereiraInstituto Superior Técnico - HelenaMonizUniversidade de Lisboa - Joao PauloCarvalhoInstituto Superior Técnico and INESC-ID + PatríciaPereiraInstituto Superior Técnico + HelenaMonizUniversidade de Lisboa + Joao PauloCarvalhoInstituto Superior Técnico and INESC-ID 448-453 Empathy and emotion prediction are key components in the development of effective and empathetic agents, amongst several other applications. The WASSA shared task on empathy empathy and emotion prediction in interactions presents an opportunity to benchmark approaches to these tasks.Appropriately selecting and representing the historical context is crucial in the modelling of empathy and emotion in conversations. In our submissions, we model empathy, emotion polarity and emotion intensity of each utterance in a conversation by feeding the utterance to be classified together with its conversational context, i.e., a certain number of previous conversational turns, as input to an encoder Pre-trained Language Model (PLM), to which we append a regression head for prediction. We also model perceived counterparty empathy of each interlocutor by feeding all utterances from the conversation and a token identifying the interlocutor for which we are predicting the empathy. Our system officially ranked 1st at the CONV-turn track and 2nd at the CONV-dialog track. 
2024.wassa-1.42 From dba0dc202494d683cb9be6f81d113015c9c35834 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Wed, 1 Oct 2025 11:46:41 -0400 Subject: [PATCH 6/7] Add ORCID iDs for 2025.gem-1 --- data/xml/2025.gem.xml | 206 +++++++++++++++++++++--------------------- 1 file changed, 103 insertions(+), 103 deletions(-) diff --git a/data/xml/2025.gem.xml b/data/xml/2025.gem.xml index b7accde96b..ec2c73ef79 100644 --- a/data/xml/2025.gem.xml +++ b/data/xml/2025.gem.xml @@ -40,9 +40,9 @@ Psycholinguistic Word Features: a New Approach for the Evaluation of <fixed-case>LLM</fixed-case>s Alignment with Humans - JavierCondeUniversidad Politécnica de Madrid - Miguel GonzálezSaizUniversidad Politécnica de Madrid - MaríaGranduryUniversidad Politécnica de Madrid and Universidad Nacional de Educación a Distancia + JavierCondeUniversidad Politécnica de Madrid + Miguel GonzálezSaizUniversidad Politécnica de Madrid + MaríaGranduryUniversidad Politécnica de Madrid and Universidad Nacional de Educación a Distancia PedroReviriego GonzaloMartínezUniversidad Carlos III de Madrid MarcBrysbaertUniversiteit Gent @@ -53,7 +53,7 @@ Spatial Representation of Large Language Models in 2<fixed-case>D</fixed-case> Scene - WenyaWuWenyaWu + WenyaWuWenyaWu WeihongDengBeijing University of Post and Telecommunication 18-29 Spatial representations are fundamental to human cognition, as understanding spatial relationships between objects is essential in daily life. Language serves as an indispensable tool for communicating spatial information, creating a close connection between spatial representations and spatial language. Large language models (LLMs), theoretically, possess spatial cognition due to their proficiency in natural language processing. This study examines the spatial representations of LLMs by employing traditional spatial tasks used in human experiments and comparing the models’ performance to that of humans. 
The results indicate that LLMs resemble humans in selecting spatial prepositions to describe spatial relationships and exhibit a preference for vertically oriented spatial terms. However, the human tendency to better represent locations along specific axes is absent in the performance of LLMs. This finding suggests that, although spatial language is closely linked to spatial representations, the two are not entirely equivalent. @@ -64,9 +64,9 @@ The Fellowship of the <fixed-case>LLM</fixed-case>s: Multi-Model Workflows for Synthetic Preference Optimization Dataset Generation SameeArif SualehaFaridUniversity of Michigan - Ann Arbor - Abdul HameedAzeemi - AwaisAtharEuropean Bioinformatics Institute - European Molecular Biology Laboratory (EMBL-EBI) - Agha AliRazaLahore University of Management Sciences + Abdul HameedAzeemi + AwaisAtharEuropean Bioinformatics Institute - European Molecular Biology Laboratory (EMBL-EBI) + Agha AliRazaLahore University of Management Sciences 30-45 This paper presents a novel methodology for generating synthetic Preference Optimization (PO) datasets using multi-model workflows. We evaluate the effectiveness and potential of these workflows in automating and enhancing the dataset generation process. PO dataset generation requires two modules: (1) \textit{response evaluation}, and (2) \textit{response generation}. In the \textit{response evaluation} module, the responses from Large Language Models (LLMs) are evaluated and ranked - a task typically carried out by human annotators that we automate using LLMs. We assess the response evaluation module in a 2 step process. In step 1, we assess LLMs as evaluators using three distinct prompting strategies. In step 2, we apply the winning prompting strategy to compare the performance of LLM-as-a-Judge, LLMs-as-a-Jury, and LLM Debate. Our evaluation shows that GPT-4o-as-a-Judge is more consistent across all datasets. 
For the \textit{response generation} module, we use the identified LLM evaluator configuration and compare different configurations of the LLM Feedback Loop. We use the win rate to determine the best multi-model configuration for generation. Experimenting with various configurations, we find that the LLM Feedback Loop, with Llama as the generator and Gemma as the reviewer, achieves a notable 71.8% and 73.8% win rate over single-model Llama and Gemma, respectively. After identifying the best configurations for both modules, we generate our PO datasets using the above pipeline. 2025.gem-1.4 @@ -82,7 +82,7 @@ Constantin MarcSeibold Kaleb ESmith JulianFriedrich - JensKleesiek + JensKleesiek 46-59 Large Language Models (LLMs) hold significant potential for improving healthcare applications, with biomedically adapted models promising enhanced performance on medical tasks. However, the effectiveness of biomedical domain adaptation for clinical tasks remains uncertain. In this study, we conduct a direct comparison of 12 biomedically adapted models and their general-domain base counterparts across six clinical tasks. Our results reveal that 11 out of 12 biomedical models exhibit performance declines, challenging prior findings that reported positive effects of biomedical adaptation. Notably, previous positive results primarily relied on multiple-choice evaluations, which may not reflect performance in real-world clinical applications. To promote reproducibility and further research, we open-source our evaluation pipeline, providing a resource for the development of models with practical benefits in healthcare settings. 
2025.gem-1.5 @@ -90,8 +90,8 @@ <fixed-case>HEDS</fixed-case> 3.0: The Human Evaluation Data Sheet Version 3.0 - AnyaBelzDublin City University - CraigThomsonDublin City University and University of Aberdeen + AnyaBelzDublin City University + CraigThomsonDublin City University and University of Aberdeen 60-81 This paper presents a new version of the Human Evaluation Datasheet (HEDS), numbered 3.0 This update is the result of our experience using HEDS in the context of numerous recent human evaluation experiments, including reproduction studies, and of feedback collected from other researchers. Our main overall goal was to improve clarity, and to enable users to complete the datasheet more consistently and comparably. The HEDS 3.0 package consists of the digital data sheet, documentation, and code for exporting completed data sheets as latex files, all available from the HEDS 3.0 GitHub. 2025.gem-1.6 @@ -101,8 +101,8 @@ <fixed-case>ARGENT</fixed-case>: Automatic Reference-free Evaluation for Open-Ended Text Generation without Source Inputs XinyueZhang AgatheZecevic - SebastianZeki - AngusRobertsKing’s College London, University of London + SebastianZeki + AngusRobertsKing’s College London, University of London 82-98 With increased accessibility of machine-generated texts, the need for their evaluation has also grown. There are broadly two types of text generation tasks. In open-ended generation tasks (OGTs), the model generates de novo text without any input on which to base it, such as story generation. In reflective generation tasks (RGTs), the model output is generated to reflect an input sequence, such as in machine translation. There are many studies on RGT evaluation, where the metrics typically compare one or more gold-standard references to the model output. Evaluation of OGTs has received less attention and is more challenging: since the task does not aim to reflect an input, there are usually no reference texts. 
In this paper, we propose a new perspective that unifies OGT evaluation with RGT evaluation, based on which we develop an automatic, reference-free generative text evaluation model (ARGENT), and review previous literature from this perspective. Our experiments demonstrate the effectiveness of these methods across informal, formal, and domain-specific texts. We conduct a meta-evaluation to compare existing and proposed metrics, finding that our approach aligns more closely with human judgement. 2025.gem-1.8 @@ -111,9 +111,9 @@ Are <fixed-case>LLM</fixed-case>s (Really) Ideological? An <fixed-case>IRT</fixed-case>-based Analysis and Alignment Tool for Perceived Socio-Economic Bias in <fixed-case>LLM</fixed-case>s JasminWachter - MichaelRadloffAlpen-Adria Universität Klagenfurt + MichaelRadloffAlpen-Adria Universität Klagenfurt MajaSmolej - KatharinaKinder-KurlandaAlpen-Adria Universität Klagenfurt + KatharinaKinder-KurlandaAlpen-Adria Universität Klagenfurt 99-120 We introduce an Item Response Theory (IRT)-based framework to detect and quantify ideological bias in large language models (LLMs) without relying on subjective human judgments. Unlike prior work, our two-stage approach distinguishes between response avoidance and expressed bias by modeling ‘Prefer Not to Answer’ (PNA) behaviors and calibrating ideological leanings based on open-ended responses. We fine-tune two LLM families to represent liberal and conservative baselines, and validate our approach using a 105-item ideological test inventory. Our results show that off-the-shelve LLMs frequently avoid engagement with ideological prompts, calling into question previous claims of partisan bias. This framework provides a statistically grounded and scalable tool for LLM alignment and fairness assessment. The general methodolody can also be applied to other forms of bias and languages. 
2025.gem-1.9 @@ -122,7 +122,7 @@ Knockout <fixed-case>LLM</fixed-case> Assessment: Using Large Language Models for Evaluations through Iterative Pairwise Comparisons Isik BaranSandanKarlsruher Institut für Technologie - Tu AnhDinhKarlsruher Institut für Technologie + Tu AnhDinhKarlsruher Institut für Technologie JanNiehues 121-128 Large Language Models (LLMs) have shown to be effective evaluators across various domains such as machine translations or the scientific domain. Current LLM-as-a-Judge approaches rely mostly on individual assessments or a single round of pairwise assessments, preventing the judge LLM from developing a global ranking perspective.To address this, we present Knockout Assessment, an LLM-as-a-Judge method using a knockout tournament system with iterative pairwise comparisons. Experiments across three LLMs on two datasets show that knockout assessment improves scoring accuracy, increasing Pearson correlation with expert evaluations by 0.07 on average for university-level exam scoring and machine translation evaluations, aligning LLM assessments more closely with human scoring. @@ -132,8 +132,8 @@ Free-text Rationale Generation under Readability Level Control Yi-ShengHsuUniversität Potsdam - NilsFeldhus - SherzodHakimovUniversität Potsdam + NilsFeldhus + SherzodHakimovUniversität Potsdam 129-150 Free-text rationales justify model decisions in natural language and thus become likable and accessible among approaches to explanation across many tasks. However, their effectiveness can be hindered by misinterpretation and hallucination. As a perturbation test, we investigate how large language models (LLMs) perform rationale generation under the effects of readability level control, i.e., being prompted for an explanation targeting a specific expertise level, such as sixth grade or college. 
We find that explanations are adaptable to such instruction, though the observed distinction between readability levels does not fully match the defined complexity scores according to traditional readability metrics. Furthermore, the generated rationales tend to feature medium level complexity, which correlates with the measured quality using automatic metrics. Finally, our human annotators confirm a generally satisfactory impression on rationales at all readability levels, with high-school-level readability being most commonly perceived and favored. 2025.gem-1.11 @@ -151,10 +151,10 @@ Can <fixed-case>LLM</fixed-case>s Detect Intrinsic Hallucinations in Paraphrasing and Machine Translation? EvangeliaGogoulouRISE Research Institutes of Sweden - ShorouqZahraUppsala University and RISE Research Institutes of Sweden AB + ShorouqZahraUppsala University and RISE Research Institutes of Sweden AB LianeGuillouAveni - LuiseDürlichUppsala University - JoakimNivreUppsala University + LuiseDürlichUppsala University + JoakimNivreUppsala University 161-177 A frequently observed problem with LLMs is their tendency to generate output that is nonsensical, illogical, or factually incorrect, often referred to broadly as “hallucination”. Building on the recently proposed HalluciGen task for hallucination detection and generation, we evaluate a suite of open-access LLMs on their ability to detect intrinsic hallucinations in two conditional generation tasks: translation and paraphrasing. We study how model performance varies across tasks and languages and we investigate the impact of model size, instruction-tuning, and prompt choice. We find that performance varies across models but is consistent across prompts. Finally, we find that NLI models perform comparably well, suggesting that LLM-based detectors are not the only viable option for this specific task. 
2025.gem-1.13 @@ -162,7 +162,7 @@ Evaluating <fixed-case>LLM</fixed-case>s with Multiple Problems at once - ZhengxiangWangState University of New York at Stony Brook + ZhengxiangWangState University of New York at Stony Brook JordanKodnerState University of New York, Stony Brook OwenRambowStony Brook University 178-199 @@ -196,10 +196,10 @@ Measure only what is measurable: towards conversation requirements for evaluating task-oriented dialogue systems - EmielVan MiltenburgTilburg University + EmielVan MiltenburgTilburg University AnouckBraggaarTilburg University EmmelynCroesNA - FlorianKunnemanUtrecht University + FlorianKunnemanUtrecht University ChristineLiebrechtTilburg University GabriellaMartijnNA 231-238 @@ -220,10 +220,10 @@ Are Bias Evaluation Methods Biased ? - LinaBerrayana - SeanRooneyInternational Business Machines - LuisGarcés-EriceInternational Business Machines - IoanaGiurgiuInternational Business Machines + LinaBerrayana + SeanRooneyInternational Business Machines + LuisGarcés-EriceInternational Business Machines + IoanaGiurgiuInternational Business Machines 249-261 The creation of benchmarksto evaluate the safety of Large Language Models is one of the key activities within the trusted AI community. These benchmarks allow models to be compared for different aspects of safety such as toxicity, bias, harmful behavior etc. Independent benchmarks adopt different approacheswith distinct data sets and evaluation methods. We investigate how robust such benchmarks are by using different approachesto rank a set of representative models for bias andcompare how similar are the overall rankings. We show that different but widely used bias evaluations methods result in disparate model rankings. We conclude with recommendations for the community in the usage of such benchmarks. 
2025.gem-1.22 @@ -231,9 +231,9 @@ <fixed-case>IRS</fixed-case>um: One Model to Rule Summarization and Retrieval - SotaroTakeshitaUniversität Mannheim - Simone PaoloPonzettoUniversität Mannheim - KaiEckertMannheim University of Applied Sciences + SotaroTakeshitaUniversität Mannheim + Simone PaoloPonzettoUniversität Mannheim + KaiEckertMannheim University of Applied Sciences 262-275 Applications that store a large number of documents often have summarization and retrieval functionalities to help users digest large amounts of information efficiently. Currently, such systems need to run two task-specific models, for summarization and retrieval, redundantly on the same set of documents. An efficient approach to amend this redundancy would be to reuse hidden representations produced during the summary generation for retrieval. However, our experiment shows that existing models, including recent large language models, do not produce retrieval-friendly embeddings during summarization due to a lack of a contrastive objective during their training. To this end, we introduce a simple, cost-effective training strategy which integrates a contrastive objective into standard summarization training without requiring additional annotations. We empirically show that our model can perform on par or even outperform in some cases compared to the combination of two task-specific models while improving throughput and FLOPs by up to 17% and 20%, respectively. 2025.gem-1.23 @@ -242,7 +242,7 @@ Modeling the One-to-Many Property in Open-Domain Dialogue with <fixed-case>LLM</fixed-case>s Jing YangLee - Kong AikLeeHong Kong Polytechnic University + Kong AikLeeHong Kong Polytechnic University Woon-SengGanNA 276-290 Open-domain Dialogue (OD) exhibits a one-to-many (o2m) property, whereby multiple appropriate responses exist for a single dialogue context. 
Despite prior research showing that modeling this property boosts response diversity, most modern LLM-based dialogue agents do not explicitly do so. In this work, we model the o2m property of OD in LLMs by decomposing OD generation into two key tasks: Multi-Response Generation (MRG) and Preference-based Selection (PS), which entail generating a set of n semantically and lexically diverse high-quality responses for a given dialogue context, followed by selecting a single response based on human preference, respectively. To facilitate MRG and PS, we introduce o2mDial, a dialogue corpus explicitly designed to capture the o2m property by featuring multiple plausible responses for each context. Leveraging o2mDial, we propose new in-context learning and instruction-tuning strategies, as well as novel evaluation metrics for MRG, alongside a model-based approach for PS. Empirical results demonstrate that applying the proposed two-stage framework to smaller LLMs for OD generation enhances overall response diversity while maintaining contextual coherence, improving response quality by up to 90%, bringing them closer to the performance of larger models. @@ -262,9 +262,9 @@ Metric assessment protocol in the context of answer fluctuation on <fixed-case>MCQ</fixed-case> tasks EkaterinaGoliakova XavierRenard - Marie-JeanneLesotSorbonne Université + Marie-JeanneLesotSorbonne Université ThibaultLaugelLIP6, Sorbonne Université/CNRS and AXA - ChristopheMarsalaLIP6 + ChristopheMarsalaLIP6 MarcinDetynieckiAXA, CNRS and LIP6 302-319 Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. 
We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and the answer changing, even when computed without any additional prompt variants. Highest association on the protocol is demonstrated by a novel metric, worst accuracy. @@ -274,7 +274,7 @@ (Towards) Scalable Reliable Automated Evaluation with Large Language Models BertilBraun - MartinForell + MartinForell 320-336 Evaluating the quality and relevance of textual outputs from Large Language Models (LLMs) remains challenging and resource-intensive.Existing automated metrics often fail to capture the complexity and variability inherent in LLM-generated outputs.Moreover, these metrics typically rely on explicit reference standards, limiting their use mostly to domains with objective benchmarks.This work introduces a novel evaluation framework designed to approximate expert-level assessments of LLM-generated content.The proposed method employs pairwise comparisons of outputs by multiple LLMs, reducing biases from individual models.An Elo rating system is used to generate stable and interpretable rankings.Adjustable agreement thresholds—from full unanimity to majority voting—allow flexible control over evaluation confidence and coverage.The method’s effectiveness is demonstrated through evaluating competency profiles extracted from scientific abstracts.Preliminary results show that automatically derived rankings correlate well with expert judgments, significantly reducing the need for extensive human intervention.By offering a scalable, consistent, and domain-agnostic evaluation layer, the framework supports more efficient and reliable quality assessments of LLM outputs across diverse applications. 
2025.gem-1.28 @@ -283,7 +283,7 @@ Clustering Zero-Shot Uncertainty Estimations to Assess <fixed-case>LLM</fixed-case> Response Accuracy for Yes/No <fixed-case>Q</fixed-case>&<fixed-case>A</fixed-case> Christopher T.Franck - AmyVennos + AmyVennos W. GrahamMuellerLeidos DanielDakotaLeidos and Indiana University 337-353 @@ -294,7 +294,7 @@ Using <fixed-case>LLM</fixed-case> Judgements for Sanity Checking Results and Reproducibility of Human Evaluations in <fixed-case>NLP</fixed-case> RudaliHuidrom - AnyaBelzDublin City University + AnyaBelzDublin City University 354-365 Human-like evaluation by LLMs of NLP systems is currently attracting a lot of interest, and correlations with human reference evaluations are often remarkably strong. However, this is not always the case, for unclear reasons which means that without also meta-evaluating against human evaluations (incurring the very cost automatic evaluation is intended to avoid), we don’t know if an LLM-as-judge evaluation is reliable or not. In this paper, we explore a type of evaluation scenario where this may not matter, because it comes with a built-in reliability check. We apply different LLM-as-judge methods to sets of three comparable human evaluations: (i) an original human evaluation, and (ii) two reproductions of it which produce contradicting reproducibility results. We find that in each case, the different LLM-as-judge methods (i) strongly agree with each other, and (ii) strongly agree with the results of one reproduction, while strongly disagreeing with the other. In combination, we take this to mean that a set of LLMs can be used to sanity check contradictory reproducibility results if the LLMs agree with each other, and the agreement of the LLMs with one set of results, and the disagreement with the other, are both strong. 
2025.gem-1.30 @@ -306,7 +306,7 @@ SriramVenkatapathy MohitBansalUniversity of North Carolina at Chapel Hill NanyunPengUniversity of California, Los Angeles - Haw-ShiuanChangDepartment of Computer Science, University of Massachusetts at Amherst + Haw-ShiuanChangDepartment of Computer Science, University of Massachusetts at Amherst 366-384 Evaluating creative text such as human-written stories using language models has always been a challenging task – owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (Wei et al., 2023) (CoT) generates free-text explanations that help guide a model’s predictions and Self-Consistency (Wang et al., 2022) (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating ‘fluent-looking’ explanations vs. actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose Chain-of-Keywords (CoKe), which generates a sequence of keywords before generating a free-text rationale, that guide the rating prediction of our evaluation language model. Then, we generate a diverse set of such keywords, and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reach human-level performance and significantly outperform GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically less # of parameters. 
2025.gem-1.31 @@ -324,7 +324,7 @@ Győző ZijianYangHungarian Research Centre for Linguistics EnikőHéjaHungarian Research Centre for Linguistics TamásVáradiNyelvtudományi Kutatóközpont - GáborPrószékyHungarian Research Centre for Linguistics, Pazmany Peter Catholic University and MorphoLogic + GáborPrószékyHungarian Research Centre for Linguistics, Pazmany Peter Catholic University and MorphoLogic 385-403 In this study, we introduce the Hungarian Generative Model Evaluation (HuGME) benchmark, a new framework designed to assess the linguistic proficiency of large language models (LLMs) in Hungarian. HuGME evaluates models across a diverse set of linguistic and reasoning skills, including bias, toxicity, faithfulness, relevance, summarization, prompt alignment, readability, spelling, grammaticality, and domain-specific knowledge through tasks like TruthfulQA and MMLU. We applied HuGME to a range of Hungarian LLMs, including those developed in-house as well as several publicly available models that claim Hungarian language proficiency. This paper presents the comparative results of these evaluations, shedding light on the capabilities of current LLMs in processing the Hungarian language. Through our analysis, we aim to both showcase the current state of Hungarian linguistic processing in LLMs and provide a foundational resource for future advancements in the field. 
2025.gem-1.32 @@ -332,8 +332,8 @@ Judging the Judges: Evaluating Alignment and Vulnerabilities in <fixed-case>LLM</fixed-case>s-as-Judges - Aman SinghThakur - KartikChoudhary + Aman SinghThakur + KartikChoudhary Venkat SrinikRamayapally SankaranVaidyanathanDepartment of Computer Science, University of Massachusetts at Amherst DieuwkeHupkesFacebook @@ -346,8 +346,8 @@ Analyzing the Sensitivity of Vision Language Models in Visual Question Answering MonikaShah SudarshanBalaji - SomdebSarkhelAdobe Research - SanoritaDeyUniversity of Maryland, Baltimore County + SomdebSarkhelAdobe Research + SanoritaDeyUniversity of Maryland, Baltimore County DeepakVenugopalUniversity of Memphis 431-438 We can think of Visual Question Answering as a (multimodal) conversation between a human and an AI system. Here, we explore the sensitivity of Vision Language Models (VLMs) through the lens of cooperative principles of conversation proposed by Grice. Specifically, even when Grice’s maxims of conversation are flouted, humans typically do not have much difficulty in understanding the conversation even though it requires more cognitive effort. Here, we study if VLMs are capable of handling violations to Grice’s maxims in a manner that is similar to humans. Specifically, we add modifiers to human-crafted questions and analyze the response of VLMs to these modifiers. We use three state-of-the-art VLMs in our study, namely, GPT-4o, Claude-3.5-Sonnet and Gemini-1.5-Flash on questions from the VQA v2.0 dataset. Our initial results seem to indicate that the performance of VLMs consistently diminish with the addition of modifiers which indicates our approach as a promising direction to understand the limitations of VLMs. 
@@ -356,12 +356,12 @@ Investigating the Robustness of Retrieval-Augmented Generation at the Query Level - SezenPerçinTechnische Universität München - XinSuIntel + SezenPerçinTechnische Universität München + XinSuIntel Qutub ShaSyed PhillipHowardThoughtworks AlekseiKuvshinovTechnical University Munich - LeoSchwinnTechnical University of Munich + LeoSchwinnTechnical University of Munich Kay-UlrichSchollIntel 439-457 Large language models (LLMs) are very costly and inefficient to update with new information. To address this limitation, retrieval-augmented generation (RAG) has been proposed as a solution that dynamically incorporates external knowledge during inference, improving factual consistency and reducing hallucinations. Despite its promise, RAG systems face practical challenges-most notably, a strong dependence on the quality of the input query for accurate retrieval. In this paper, we investigate the sensitivity of different components in the RAG pipeline to various types of query perturbations. Our analysis reveals that the performance of commonly used retrievers can degrade significantly even under minor query variations. We study each module in isolation as well as their combined effect in an end-to-end question answering setting, using both general-domain and domain-specific datasets. Additionally, we propose an evaluation framework to systematically assess the query-level robustness of RAG pipelines and offer actionable recommendations for practitioners based on the results of more than 1092 experiments we performed. 
@@ -376,7 +376,7 @@ OmidGhahroodiSharif University of Technology SomayehBakhshaei ArashAmini - RezaKazemiSharif University of Technology, Sharif University of Technology + RezaKazemiSharif University of Technology, Sharif University of Technology Mahdieh SoleymaniBaghshah 458-470 This paper presents a comprehensive evaluation framework for aligning Persian Large Language Models (LLMs) with critical ethical dimensions, including safety, fairness, and social norms. It addresses the gaps in existing LLM evaluation frameworks by adapting them to Persian linguistic and cultural contexts. This benchmark creates three types of Persian-language benchmarks: (i) translated data, (ii) new data generated synthetically, and (iii) new naturally collected data. We translate Anthropic Red Teaming data, AdvBench, HarmBench, and DecodingTrust into Persian. Furthermore, we create ProhibiBench-fa, SafeBench-fa, FairBench-fa, and SocialBench-fa as new datasets to address harmful and prohibited content in indigenous culture. Moreover, we collect extensive dataset as GuardBench-fa to consider Persian cultural norms. By combining these datasets, our work establishes a unified framework for evaluating Persian LLMs, offering a new approach to culturally grounded alignment evaluation. A systematic evaluation of Persian LLMs is performed across the three alignment aspects: safety (avoiding harmful content), fairness (mitigating biases), and social norms (adhering to culturally accepted behaviors). We present a publicly available leaderboard that benchmarks Persian LLMs with respect to safety, fairness, and social norms. 
@@ -392,8 +392,8 @@ BurakAytanNA BusraTufanNA AbdullahTopraksoyNA - EsraDarıcıMiddle East Technical University - CagriToramanMETU, Middle East Technical University + EsraDarıcıMiddle East Technical University + CagriToramanMETU, Middle East Technical University 471-487 The reliance on translated or adapted datasets from English or multilingual resources introduces challenges regarding linguistic and cultural suitability. This study addresses the need for robust and culturally appropriate benchmarks by evaluating the quality of 17 commonly used Turkish benchmark datasets. Using a comprehensive framework that assesses six criteria, both human and LLM-judge annotators provide detailed evaluations to identify dataset strengths and shortcomings.Our results reveal that 70% of the benchmark datasets fail to meet our heuristic quality standards. The correctness of the usage of technical terms is the strongest criterion, but 85% of the criteria are not satisfied in the examined datasets. Although LLM judges demonstrate potential, they are less effective than human annotators, particularly in understanding cultural common sense knowledge and interpreting fluent, unambiguous text. GPT-4o has stronger labeling capabilities for grammatical and technical tasks, while Llama3.3-70B excels at correctness and cultural knowledge evaluation. Our findings emphasize the urgent need for more rigorous quality control in creating and adapting datasets for low-resource languages. 2025.gem-1.41 @@ -401,8 +401,8 @@ Big Escape Benchmark: Evaluating Human-Like Reasoning in Language Models via Real-World Escape Room Challenges - ZinanTangShanghai Artificial Intelligence Laboratory - QiYaoSun + ZinanTangShanghai Artificial Intelligence Laboratory + QiYaoSun 488-503 Large Language Models (LLMs) have recently demonstrated remarkable reasoning capabilities across a wide range of tasks. 
While many benchmarks have been developed on specific academic subjects, coding, or constrained visual tasks, they often fail to fully capture the breadth, diversity, and dynamic nature of real-world human reasoning. Further, the creation of high-quality, complex multimodal reasoning benchmarks typically requires significant manual effort and expert annotation, which is costly and time-consuming.To address these limitations, we introduce Big Escape Bench, a novel multimodal reasoning benchmark derived from popular reality shows and television programs. Big Escape Bench leverages unique characteristics of TV content, providing a rich source of challenging and realistic multimodal reasoning problems. Key advantages include: questions guaranteed to be human-solvable and of moderate difficulty; problems reflecting diverse, real-world scenarios and knowledge domains; high inherent quality due to content generated by professional program teams.Notably, we develop an automated pipeline to construct the data from these programs into a standardized benchmark format, significantly reducing the manual effort compared to traditional dataset construction. We have conducted extensive experiments to evaluate state-of-the-art (SOTA) LLMs and Multimodal Large Language Models (MLLMs) on Big Escape Bench. Our results reveal a surprising performance gap: while the questions are easily solved by human viewers (about 60% in accuracy), the performance of even the most advanced models (best 40.50% in accuracy) is significantly lower than human-level accuracy. This highlights that despite recent progress, MLLMs still face substantial challenges in robustly performing the kind of diverse, dynamic, and context-dependent reasoning that is trivial for humans in routine situations. Big Escape Bench serves as a valuable tool for identifying current limitations of MLLMs and fostering future research towards more human-like multimodal reasoning. 
2025.gem-1.42 @@ -410,7 +410,7 @@ Event-based evaluation of abstractive news summarization - HuilingYou + HuilingYou SamiaTouilebUniversity of Bergen LiljaØvrelidDept. of Informatics, University of Oslo ErikVelldalUniversity of Oslo @@ -432,9 +432,9 @@ <fixed-case>P</fixed-case>apers<fixed-case>P</fixed-case>lease: A Benchmark for Evaluating Motivational Values of Large Language Models Based on <fixed-case>ERG</fixed-case> Theory JunhoMyungKorea Advanced Institute of Science and Technology - Yeon SuParkKorea Advanced Institute of Science & Technology + Yeon SuParkKorea Advanced Institute of Science & Technology SunwooKimKorea Advanced Institute of Science & Technology - ShinYoo + ShinYoo AliceOhGoogle and Korea Advanced Institute of Science and Technology 522-531 Evaluating the performance and biases of large language models (LLMs) through role-playing scenarios is becoming increasingly common, as LLMs often exhibit biased behaviors in these contexts. Building on this line of research, we introduce PapersPlease, a benchmark consisting of 3,700 moral dilemmas designed to investigate LLMs’ decision-making in prioritizing various levels of human needs. In our setup, LLMs act as immigration inspectors deciding whether to approve or deny entry based on the short narratives of people. These narratives are constructed using the Existence, Relatedness, and Growth (ERG) theory, which categorizes human needs into three hierarchical levels. Our analysis of six LLMs reveals statistically significant patterns in decision-making, suggesting that LLMs encode implicit preferences. Additionally, our evaluation of the impact of incorporating social identities into the narratives shows varying responsiveness based on both motivational needs and identity cues, with some models exhibiting higher denial rates for marginalized identities. All data is publicly available at https://github.com/yeonsuuuu28/papers-please. 
@@ -444,7 +444,7 @@ Shallow Preference Signals: Large Language Model Aligns <fixed-case>E</fixed-case>ven Better with Truncated Data? XuanQiTsinghua University, Tsinghua University - JiahaoQiuPrinceton University + JiahaoQiuPrinceton University XinzheJuan YueWuPrinceton University MengdiWangPrinceton University @@ -455,8 +455,8 @@ Improving Large Language Model Confidence Estimates using Extractive Rationales for Classification - Jane Arlethdela Cruz - IrisHendrickx + Jane Arlethdela Cruz + IrisHendrickx MarthaLarson 549-560 The adoption of large language models (LLMs) in high-stake scenarios continues to be a challenge due to lack of effective confidence calibration. Although LLMs are capable of providing convincing self-explanations and verbalizing confidence in NLP tasks, they tend to exhibit overconfidence when using generative or free-text rationales (e.g. Chain-of-Thought), where reasoning steps tend to lack verifiable grounding.In this paper, we investigate whether adding explanations in the form of extractive rationales –snippets of the input text that directly support the predictions, can improve the confidence calibration of LLMs in classification tasks.We examine two approaches for integrating these rationales: (1) a one-stage rationale-generation with prediction and (2) a two-stage rationale-guided confidence calibration.We evaluate these approaches on a disaster tweet classification task using four different off-the-shelf LLMs. Our results show that extracting rationales both before and after prediction can improve the confidence estimates of the LLMs. Furthermore, we find that replacing valid extractive rationales with irrelevant ones significantly lowers model confidence, highlighting the importance of rationale quality.This simple yet effective method improves LLM verbalized confidence and reduces overconfidence in possible hallucination. 
@@ -465,7 +465,7 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um #0729-04: Human Evaluation Reproduction Report for “<fixed-case>M</fixed-case>em<fixed-case>S</fixed-case>um: Extractive Summarization of Long Documents Using Multi-Step Episodic <fixed-case>M</fixed-case>arkov Decision Processes” - SimeonJunkerUniversität Bielefeld + SimeonJunkerUniversität Bielefeld 561-567 Human evaluation is indispensable in natural language processing (NLP), as automatic metrics are known to not always align well with human judgments.However, the reproducibility of human evaluations can be problematic since results are susceptible to many factors, the details of which are often missing from the respective works.As part of the ReproHum project, this work aims to reproduce the human evaluation of a single criterion in the paper “MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes” (Gu et al, 2022).The results of our reproduction differ noticeably from those of the original study. To explain this discrepancy, we discuss differences in the experimental setup, as well as more general characteristics of the selected domain and the generated summaries. 2025.gem-1.50 @@ -482,7 +482,7 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um #0031-01: Reproducing the Human Evaluation of Readability from “It is <fixed-case>AI</fixed-case>’s Turn to Ask Humans a Question” - DanielBraunUnversity of Marburg + DanielBraunUnversity of Marburg 576-582 The reproducibility of results is the foundation on which scientific credibility is built. In Natural Language Processing (NLP) research, human evaluation is often seen as the gold standard of evaluation. This paper presents the reproduction of a human evaluation of a Natural Language Generation (NLG) system that generates pairs of questions and answers based on children’s stories that was originally conducted by Yao et al. (2022). 
Specifically, it replicates the evaluation of readability, one of the most commonly evaluated criteria for NLG systems. The results of the reproduction are aligned with the original findings and all major claims of the original paper are confirmed. 2025.gem-1.52 @@ -490,7 +490,7 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um #0033-05: Human Evaluation of Factuality from A Multidisciplinary Perspective - Andra-MariaFlorescuUniversity of Bucharest + Andra-MariaFlorescuUniversity of Bucharest MariusMicluța-CâmpeanuUniversity of Bucharest Stefana ArinaTabuscaUniversity of Bucharest Liviu PDinuUniversity of Bucharest @@ -501,8 +501,8 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um: #0744-02: Investigating the Reproducibility of Semantic Preservation Human Evaluations - MohammadArvan - NataliePardeUniversity of Illinois at Chicago + MohammadArvan + NataliePardeUniversity of Illinois at Chicago 590-600 Reproducibility remains a fundamental challenge for human evaluation in Natural Language Processing (NLP), particularly due to the inherent subjectivity and variability of human judgments. This paper presents a reproduction study of the human evaluation protocol introduced by Hosking and Lapata (2021), which assesses semantic preservation in paraphrase generation models. By faithfully reproducing the original experiment with careful adaptation and applying the Quantified Reproducibility Assessment framework (Belz and Thomson, 2024a; Belz, 2022), we demonstrate strong agreement with the original findings, confirming the semantic preservation ranking among four paraphrase models. Our analyses reveal moderate inter-annotator agreement and low variability in key results, underscoring a good degree of reproducibility despite practical deviations in participant recruitment and platform. These findings highlight the feasibility and challenges of reproducing human evaluation studies in NLP. 
We discuss implications for improving methodological rigor, transparent reporting, and standardized protocols to bolster reproducibility in future human evaluations. The data and analysis scripts are publicly available to support ongoing community efforts toward reproducible evaluation in NLP and beyond. 2025.gem-1.54 @@ -510,10 +510,10 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um #0669-08: Reproducing Sentiment Transfer Evaluation - KristýnaOnderková, Charles University Prague - MateuszLangoCharles University and Poznan University of Technology + KristýnaOnderková, Charles University Prague + MateuszLangoCharles University and Poznan University of Technology PatríciaSchmidtová - OndrejDusekCharles University, Prague + OndrejDusekCharles University, Prague 601-608 We describe a reproduction of a human annotation experiment that was performed to evaluate the effectiveness of text style transfer systems (Reif et al., 2021). Despite our efforts to closely imitate the conditions of the original study, the results obtained differ significantly from those in the original study. We performed a statistical analysis of the results obtained, discussed the sources of these discrepancies in the study design, and quantified reproducibility. The reproduction followed the common approach to reproduction adopted by the ReproHum project. 2025.gem-1.55 @@ -521,9 +521,9 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um #0067-01: A Reproduction of the Evaluation of Cross-Lingual Summarization - SupryadiTianjin University - ChuangLiuNational Supercomputing Center in Tianjin - DeyiXiongTianjin University + SupryadiTianjin University + ChuangLiuNational Supercomputing Center in Tianjin + DeyiXiongTianjin University 609-614 Human evaluation is crucial as it offers a nuanced understanding that automated metrics often miss. By reproducing human evaluation, we can gain a better understanding of the original results. 
This paper is part of the ReproHum project, where our goal is to reproduce human evaluations from previous studies. We report the reproduction results of the human evaluation of cross-lingual summarization conducted by (CITATION). By comparing the original and reproduction studies, we find that our overall evaluation findings are largely consistent with those of the previous study. However, there are notable differences in evaluation scores between the two studies for certain model outputs. These discrepancies highlight the importance of carefully selecting evaluation methodologies and human annotators. 2025.gem-1.56 @@ -531,8 +531,8 @@ <fixed-case>R</fixed-case>epro<fixed-case>H</fixed-case>um #0729-04: Partial reproduction of the human evaluation of the <fixed-case>M</fixed-case>em<fixed-case>S</fixed-case>um and <fixed-case>N</fixed-case>eu<fixed-case>S</fixed-case>um summarisation systems - SimonMille - MichelaLorandiDublin City University + SimonMille + MichelaLorandiDublin City University 615-621 In this paper, we present our reproduction of part of the human evaluation originally carried out by Gu et al. (2022), as part of Track B of ReproNLP 2025. Four human annotators were asked to rank two candidate summaries according to their overall quality, given a reference summary shown alongside the two candidate summaries at evaluation time. We describe the original experiment and provide details about the steps we followed to carry out the reproduction experiment, including the implementation of some missing pieces of code. Our results, in particular the high coefficients of variation and low inter-annotator agreement, suggest a low level of reproducibility in the original experiment despite identical pairwise ranks. However, given the very small sample size (two systems, one rating), we remain cautious about drawing definitive conclusions. 
2025.gem-1.57 @@ -541,7 +541,7 @@ Curse of bilinguality: Evaluating monolingual and bilingual language models on <fixed-case>C</fixed-case>hinese linguistic benchmarks YuwenZhouUniversity of Groningen - YevgenMatusevychUniversity of Groningen + YevgenMatusevychUniversity of Groningen 622-630 We investigate cross-lingual transfer effects in large language models (LLMs) trained on two high-resource languages, English and Chinese. Four monolingual Chinese and four bilingual English–Chinese models are evaluated on two Chinese linguistic benchmarks. The monolingual models consistently outperform the bilingual ones on 12 out of 55 tasks, while the reverse is true for only 4 tasks, highlighting the prevalence of negative (rather than positive) transfer from English to Chinese. Additionally, we carry out a feature attribution analysis in a monolingual and a bilingual model, showing that the differences in their performance may be explained by more predictable attribution patterns in the monolingual model. Our findings have implications for the ongoing effort of training bilingual LLMs. 2025.gem-1.58 @@ -554,7 +554,7 @@ JulianRodemann MeimingweiLi ChristianHeumannLudwig-Maximilians-Universität München - MatthiasAßenmacherLudwig-Maximilians-Universität München + MatthiasAßenmacherLudwig-Maximilians-Universität München 631-654 Open-ended text generation has become a prominent task in natural language processing due to the rise of powerful (large) language models. However, evaluating the quality of these models and the employed decoding strategies remains challenging due to trade-offs among widely used metrics such as coherence, diversity, and perplexity. This paper addresses the specific problem of multicriteria evaluation for open-ended text generation, proposing novel methods for both relative and absolute rankings of decoding methods. 
Specifically, we employ benchmarking approaches based on partial orderings and present a new summary metric to balance existing automatic indicators, providing a more holistic evaluation of text generation quality. Our experiments demonstrate that the proposed approaches offer a robust way to compare decoding strategies and serve as valuable tools to guide model selection for open-ended text generation tasks. We suggest future directions for improving evaluation methodologies in text generation and make our code, datasets, and models publicly available. 2025.gem-1.59 @@ -562,10 +562,10 @@ Bridging the <fixed-case>LLM</fixed-case> Accessibility Divide? Performance, Fairness, and Cost of Closed versus Open <fixed-case>LLM</fixed-case>s for Automated Essay Scoring - KeziaOketch + KeziaOketch John P.LalorUniversity of Notre Dame - YiYangHong Kong University of Science and Technology - AhmedAbbasiUniversity of Notre Dame + YiYangHong Kong University of Science and Technology + AhmedAbbasiUniversity of Notre Dame 655-669 Closed large language models (LLMs) such as GPT-4 have set state-of-the-art results across a number of NLP tasks and have become central to NLP and machine learning (ML)-driven solutions. Closed LLMs’ performance and wide adoption has sparked considerable debate about their accessibility in terms of availability, cost, and transparency. In this study, we perform a rigorous comparative analysis of eleven leading LLMs, spanning closed, open, and open-source LLM ecosystems, across text assessment and generation within automated essay scoring, as well as a separate evaluation on abstractive text summarization to examine generalization. Our findings reveal that for few-shot learning-based assessment of human generated essays, open LLMs such as Llama 3 and Qwen 2.5 perform comparably to GPT-4 in terms of predictive performance, with no significant differences in disparate impact scores when considering age- or race-related fairness. 
For summarization, we find that open models also match GPT-4 in ROUGE and METEOR scores on the CNN/DailyMail benchmark, both in zero- and few-shot settings. Moreover, Llama 3 offers a substantial cost advantage, being up to 37 times more cost-efficient than GPT-4. For generative tasks, we find that essays generated by top open LLMs are comparable to closed LLMs in terms of their semantic composition/embeddings and ML assessed scores. Our findings challenge the dominance of closed LLMs and highlight the democratizing potential of open LLMs, suggesting they can effectively bridge accessibility divides while maintaining competitive performance and fairness. 2025.gem-1.60 @@ -590,9 +590,9 @@ Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models - SherzodHakimovUniversität Potsdam + SherzodHakimovUniversität Potsdam LaraPfennigschmidtNA - DavidSchlangenUniversity of Potsdam + DavidSchlangenUniversity of Potsdam 728-740 This study utilizes the game Codenames as a benchmarking tool to evaluate large language models (LLMs) with respect to specific linguistic and cognitive skills. LLMs play each side of the game, where one side generates a clue word covering several target words and the other guesses those target words. We designed various experiments by controlling the choice of words (abstract vs. concrete words, ambiguous vs. monosemic) or the opponent (programmed to be faster or slower in revealing words). Recent commercial and open-weight models were compared side-by-side to find out factors affecting their performance. The evaluation reveals details about their strategies, challenging cases, and limitations of LLMs. 
2025.gem-1.63 @@ -602,7 +602,7 @@ Evaluating Intermediate Reasoning of Code-Assisted Large Language Models for Mathematics Zena AlKhaliliUniversität des Saarlandes NickHowellUniversität des Saarlandes - DietrichKlakow + DietrichKlakow 741-758 Assisting LLMs with code generation improved their performance on mathematical reasoning tasks. However, the evaluation of code-assisted LLMs is generally restricted to execution correctness, lacking a rigorous evaluation of their generated programs. In this work, we bridge this gap by conducting an in-depth analysis of code-assisted LLMs generated programs in response to math reasoning tasks, with a focus on evaluating the soundness of the underlying reasoning processes. For this purpose, we assess the generations of five LLMs, on several math datasets, both manually and automatically, and propose a taxonomy of generated programs based on their logical soundness. Our findings show that the capabilities of models significantly impact the logic implemented to solve the problem. Closed-source LLMs ground their programs in mathematical concepts, whereas open-source models often resort to unsound reasoning, relying on memorized information and exhaustive searches. Furthermore, increasing the difficulty of problems decreases sound generations for all models, revealing a critical shortcoming of LLMs on complex mathematics, contrary to what accuracy metrics suggest. Our work highlights the need for more holistic evaluations of code-assisted LLMs beyond execution accuracy metrics, toward a better understanding of LLMs’ limits in the math domain.
2025.gem-1.64 @@ -612,8 +612,8 @@ From Calculation to Adjudication: Examining <fixed-case>LLM</fixed-case> Judges on Mathematical Reasoning Tasks AndreasStephan DaweiZhuAmazon - MatthiasAßenmacherLudwig-Maximilians-Universität München - XiaoyuShenAmazon + MatthiasAßenmacherLudwig-Maximilians-Universität München + XiaoyuShenAmazon BenjaminRothUniversität Vienna 759-773 To reduce the need for human annotations, large language models (LLMs) have been proposed as judges of the quality of other candidate models. The performance of LLM judges is typically evaluated by measuring the correlation with human judgments on generative tasks such as summarization or machine translation. In contrast, we study LLM judges on mathematical reasoning tasks. These tasks require multi-step reasoning, and the correctness of their solutions is verifiable, enabling a more objective evaluation. We perform a detailed performance analysis and find that easy samples are easy to judge, and difficult samples are difficult to judge. Our analysis uncovers a strong correlation between judgment performance and the candidate model task performance, indicating that judges tend to favor higher-quality models even if their answer is incorrect. As a consequence, we test whether we can predict the behavior of LLM judges using simple features such as part-of-speech tags and find that we can correctly predict 70%-75% of judgments. We conclude this study by analyzing practical use cases, showing that LLM judges consistently detect the on-average better model but largely fail if we use them to improve task performance. 
@@ -624,8 +624,8 @@ <fixed-case>P</fixed-case>ersona<fixed-case>T</fixed-case>win: A Multi-Tier Prompt Conditioning Framework for Generating and Evaluating Personalized Digital Twins SihanChenCMU, Carnegie Mellon University John P.LalorUniversity of Notre Dame - YiYangHong Kong University of Science and Technology - AhmedAbbasiUniversity of Notre Dame + YiYangHong Kong University of Science and Technology + AhmedAbbasiUniversity of Notre Dame 774-788 While large language models (LLMs) afford new possibilities for user modeling and approximation of human behaviors, they often fail to capture the multidimensional nuances of individual users. In this work, we introduce \texttt{PersonaTwin}, a multi-tier prompt conditioning framework that builds adaptive digital twins by integrating demographic, behavioral, and psychometric data. Using a comprehensive data set in the healthcare context of more than 8,500 individuals, we systematically benchmark \texttt{PersonaTwin} against standard LLM outputs, and our rigorous evaluation unites state-of-the-art text similarity metrics with dedicated demographic parity assessments, ensuring that generated responses remain accurate and unbiased. Experimental results show that our framework produces simulation fidelity on par with oracle settings. Moreover, downstream models trained on persona-twins approximate models trained on individuals in terms of prediction and fairness metrics across both GPT-4o-based and Llama-based models. Together, these findings underscore the potential for LLM digital twin-based approaches in producing realistic and emotionally nuanced user simulations, offering a powerful tool for personalized digital user modeling and behavior analysis. 
2025.gem-1.66 @@ -633,10 +633,10 @@ Coreference as an indicator of context scope in multimodal narrative - NikolaiIlinykhGöteborg University - ShalomLappin + NikolaiIlinykhGöteborg University + ShalomLappin Asad B.SayeedUniversity of Gothenburg - SharidLoáicigaUniversity of Gothenburg, Sweden + SharidLoáicigaUniversity of Gothenburg, Sweden 789-807 We demonstrate that large multimodal language models differ substantially from humans in the distribution of coreferential expressions in a visual storytelling task. We introduce a number of metrics to quantify the characteristics of coreferential patterns in both human- and machine-written texts. Humans distribute coreferential expressions in a way that maintains consistency across texts and images, interleaving references to different entities in a highly varied way. Machines are less able to track mixed references, despite achieving perceived improvements in generation quality. Materials, metrics, and code for our study are available at https://github.com/GU-CLASP/coreference-context-scope. 2025.gem-1.67 @@ -644,8 +644,8 @@ <fixed-case>PATCH</fixed-case>! <fixed-case>P</fixed-case>sychometrics-<fixed-case>A</fixed-case>ssis<fixed-case>T</fixed-case>ed <fixed-case>B</fixed-case>en<fixed-case>CH</fixed-case>marking of Large Language Models against Human Populations: A Case Study of Proficiency in 8th Grade Mathematics - QixiangFang - DanielOberskiUtrecht University + QixiangFang + DanielOberskiUtrecht University DongNguyenUtrecht University 808-823 Many existing benchmarks of large (multimodal) language models (LLMs) focus on measuring LLMs’ academic proficiency, often with also an interest in comparing model performance with human test takers’. 
While such benchmarks have proven key to the development of LLMs, they suffer from several limitations, including questionable measurement quality (e.g., Do they measure what they are supposed to in a reliable way?), lack of quality assessment on the item level (e.g., Are some items more important or difficult than others?) and unclear human population reference (e.g., To whom can the model be compared?). In response to these challenges, we propose leveraging knowledge from psychometrics—a field dedicated to the measurement of latent variables like academic proficiency—into LLM benchmarking. We make four primary contributions. First, we reflect on current LLM benchmark developments and contrast them with psychometrics-based test development. Second, we introduce PATCH: a novel framework for Psychometrics-AssisTed benCHmarking of LLMs. PATCH addresses the aforementioned limitations. In particular, PATCH enables valid comparison between LLMs and human populations. Third, we demonstrate PATCH by measuring several LLMs’ proficiency in 8th grade mathematics against 56 human populations. We show that adopting a psychometrics-based approach yields evaluation outcomes that diverge from those based on current benchmarking practices. Fourth, we release 4 high-quality datasets to support measuring and comparing LLM proficiency in grade school mathematics and science with human populations. @@ -655,7 +655,7 @@ <fixed-case>MCQF</fixed-case>ormat<fixed-case>B</fixed-case>ench: Robustness Tests for Multiple-Choice Questions HirooTakizawaGraduate University for Advanced Studies - SakuSugawaraNational Institute of Informatics + SakuSugawaraNational Institute of Informatics AkikoAizawaNational Institute of Informatics 824-846 Multiple-choice questions (MCQs) are often used to evaluate large language models (LLMs). They measure LLMs’ general common sense and reasoning abilities, as well as their knowledge in specific domains such as law and medicine.
However, the robustness of LLMs to various question formats in MCQs has not been thoroughly evaluated. While there are studies on the sensitivity of LLMs to input variations, research into their responsiveness to different question formats is still limited. In this study, we propose a method to construct tasks to comprehensively evaluate the robustness against format changes of MCQs by decomposing the answering process into several steps. Using this dataset, we evaluate nine LLMs, such as Llama3-70B and Mixtral-8x7B. We find the lack of robustness to differences in the format of MCQs. It is crucial to consider whether the format of MCQs influences their evaluation scores when assessing LLMs using MCQ datasets. @@ -666,10 +666,10 @@ (Dis)improved?! How Simplified Language Affects Large Language Model Performance across Languages MiriamAnschützTechnische Universität München AnastasiyaDamaratskaya - Chaeeun JoyLee - ArthurSchmalz + Chaeeun JoyLee + ArthurSchmalz EdoardoMosca - GeorgGrohTechnical University Munich + GeorgGrohTechnical University Munich 847-861 Simplified language enhances the accessibility and human understanding of texts. However, whether it also benefits large language models (LLMs) remains underexplored. This paper extensively studies whether LLM performance improves on simplified data compared to its original counterpart. Our experiments span six datasets and nine automatic simplification systems across three languages. We show that English models, including GPT-4o-mini, show a weak generalization and exhibit a significant performance drop on simplified data. This introduces an intriguing paradox: simplified data is helpful for humans but not for LLMs. At the same time, the performance in non-English languages sometimes improves, depending on the task and quality of the simplifier. Our findings offer a comprehensive view of the impact of simplified language on LLM performance and uncover severe implications for people depending on simple language. 
2025.gem-1.70 @@ -688,10 +688,10 @@ Finance Language Model Evaluation (<fixed-case>FL</fixed-case>a<fixed-case>ME</fixed-case>) GlennMatlin - MikaOkamotoGeorgia Tech Research Institute and Georgia Institute of Technology + MikaOkamotoGeorgia Tech Research Institute and Georgia Institute of Technology HuzaifaPardawalaGeorgia Institute of Technology YangYang - SudheerChavaGeorgia Institute of Technology + SudheerChavaGeorgia Institute of Technology 880-926 Language Models (LMs) have demonstrated impressive capabilities with core Natural Language Processing (NLP) tasks. The effectiveness of LMs for highly specialized knowledge-intensive tasks in finance remains difficult to assess due to major gaps in the methodologies of existing evaluation frameworks, which have caused an erroneous belief in a far lower bound of LMs’ performance on common Finance NLP (FinNLP) tasks. To demonstrate the potential of LMs for these FinNLP tasks, we present the first holistic benchmarking suite for Financial Language Model Evaluation (FLaME). We are the first research paper to comprehensively study LMs against ‘reasoning-reinforced’ LMs, with an empirical study of 23 foundation LMs over 20 core NLP tasks in finance. We open-source our framework software along with all data and results. 2025.gem-1.72 @@ -700,7 +700,7 @@ s<fixed-case>P</fixed-case>hin<fixed-case>X</fixed-case>: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting SanchitAhujaResearch, Microsoft - KumarTanmay + KumarTanmay Hardik HansrajbhaiChauhanMicrosoft BarunPatraMicrosoft KritiAggarwalHippocraticAI @@ -718,10 +718,10 @@ Single- vs. Dual-Prompt Dialogue Generation with <fixed-case>LLM</fixed-case>s for Job Interviews in Human Resources - JoachimDe BaerUniversiteit Gent + JoachimDe BaerUniversiteit Gent A. 
SezaDoğruözGhent University - ThomasDemeesterGhent University - imec - ChrisDevelderUniversiteit Gent + ThomasDemeesterGhent University - imec + ChrisDevelderUniversiteit Gent 947-957 Optimizing language models for use in conversational agents requires large quantities of example dialogues. Increasingly, these dialogues are synthetically generated by using powerful large language models (LLMs), especially in domains where obtaining authentic human data is challenging. One such domain is human resources (HR). In this context, we compare two LLM-based dialogue generation methods for producing HR job interviews, and assess which method generates higher-quality dialogues, i.e., those more difficult to distinguish from genuine human discourse. The first method uses a single prompt to generate the complete interview dialog. The second method uses two agents that converse with each other. To evaluate dialogue quality under each method, we ask a judge LLM to determine whether AI was used for interview generation, using pairwise interview comparisons. We empirically find that, at the expense of a sixfold increase in token count, interviews generated with the dual-prompt method achieve a win rate 2 to 10 times higher than those generated with the single-prompt method. This difference remains consistent regardless of whether GPT-4o or Llama 3.3 70B is used for either interview generation or quality judging. 2025.gem-1.74 @@ -730,8 +730,8 @@ Natural Language Counterfactual Explanations in Financial Text Classification: A Comparison of Generators and Evaluation Metrics KarolDobiczek - PatrickAltmeyer - Cynthia C. S.LiemDelft University of Technology + PatrickAltmeyer + Cynthia C. S.LiemDelft University of Technology 958-972 The use of large language model (LLM) classifiers in finance and other high-stakes domains calls for a high level of trustworthiness and explainability. 
We focus on counterfactual explanations (CE), a form of explainable AI that explains a model’s output by proposing an alternative to the original input that changes the classification. We use three types of CE generators for LLM classifiers and assess the quality of their explanations on a recent dataset consisting of central bank communications. We compare the generators using a selection of quantitative and qualitative metrics. Our findings suggest that non-expert and expert evaluators prefer CE methods that apply minimal changes; however, the methods we analyze might not handle the domain-specific vocabulary well enough to generate plausible explanations. We discuss shortcomings in the choice of evaluation metrics in the literature on text CE generators and propose refined definitions of the fluency and plausibility qualitative metrics. 2025.gem-1.75 @@ -748,12 +748,12 @@ <fixed-case>U</fixed-case>-<fixed-case>MATH</fixed-case>: A University-Level Benchmark for Evaluating Mathematical Skills in Large Language Models - KonstantinChernyshev + KonstantinChernyshev VitaliyPolshkovToloka AI VladStepanovGradarius (Castle Point Learning Systems) AlexMyasnikov EkaterinaArtemovaToloka AI - AlexeiMiasnikovStevens Institute of Technology + AlexeiMiasnikovStevens Institute of Technology SergeiTilgaToloka AI 974-1001 Current evaluations of mathematical skills in Large Language Models are constrained by benchmarks lacking scope, particularly for multi-modal problems — frequently relying on school-level, niche Olympiad-style, simple quiz-format, or relatively small datasets. To address this, we introduce **U-MATH**, a novel benchmark comprising **1,100** unpublished open-ended university-level problems sourced from current US curricula, with **20%** incorporating visual elements.
Given the free-form nature of U-MATH problems, we employ LLM judges for solution evaluation and release \boldsymbol{\mu}**-MATH**, a meta-evaluation benchmark composed of **1,084** U-MATH-derived tasks enabling precise assessment of these judges. Benchmarking leading LLMs reveals marked limitations in multi-modal reasoning, with maximum accuracy reaching 93.1% on textual tasks but only 58.5% on visual ones. Furthermore, solution judgment proves challenging, requiring the most advanced models to achieve meaningfully high performance, even still peaking at an imperfect F1-score of 90.1%. @@ -762,9 +762,9 @@ The 2025 <fixed-case>R</fixed-case>epro<fixed-case>NLP</fixed-case> Shared Task on Reproducibility of Evaluations in <fixed-case>NLP</fixed-case>: Overview and Results - AnyaBelzDublin City University - CraigThomsonDublin City University and University of Aberdeen - JavierGonzález CorbelleUniversidad de Santiago de Compostela + AnyaBelzDublin City University + CraigThomsonDublin City University and University of Aberdeen + JavierGonzález CorbelleUniversidad de Santiago de Compostela MaloRuelle 1002-1016 This paper presents an overview of, and the results from, the 2025 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’25) which followed on from four previous shared tasks on reproducibility of evaluations, ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of the topic across the two fields. We describe the ReproNLP’25 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results, including for the first time additional, ‘sanity-check’ evaluations by LLMs.
From 8c3bed7530fc596c847a161da1964437f0c637c6 Mon Sep 17 00:00:00 2001 From: Matt Post Date: Wed, 1 Oct 2025 11:51:04 -0400 Subject: [PATCH 7/7] Add small note to match_names --- bin/ingest_orcids.py | 15 ++++++++++++++- 1 file changed, 14 insertions(+), 1 deletion(-) diff --git a/bin/ingest_orcids.py b/bin/ingest_orcids.py index ab4e38b9a7..f8900e93f7 100755 --- a/bin/ingest_orcids.py +++ b/bin/ingest_orcids.py @@ -159,7 +159,20 @@ def get_author_name_xml(author_xml): continue def match_names(yaml_name_tuple, xml_name_tuple): - """Match a YAML name tuple to the XML name tuple""" + """Match a YAML name tuple to the XML name tuple. + + Basic sanity check on name matching: we ensure that the YAML last name + ends the XML string that concatenates names in both directions. + + e.g., + + YAML: "Post" + XML: ("Matt Post", "Post Matt") + match: True + + We do both directions because of issues with Chinese names which have inconsistent + conventions. + """ xml_name_forward = f"{xml_name_tuple[0]} {xml_name_tuple[1]}" xml_name_reverse = f"{xml_name_tuple[1]} {xml_name_tuple[0]}"
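The check described in the new `match_names` docstring can be sketched roughly as follows. This is an illustration only, not the code in `bin/ingest_orcids.py`: the patch shows just the two f-strings, so the function name `match_names_sketch`, its signature (a plain last-name string rather than a YAML name tuple), the empty-name guard, and the "either direction is enough" rule are all assumptions read off the docstring example.

```python
def match_names_sketch(yaml_last_name, xml_name_tuple):
    """Hypothetical illustration of the docstring's sanity check.

    Assumes xml_name_tuple is a (first, last) pair and that the YAML side
    contributes only a last-name string; the real function may differ.
    """
    if not yaml_last_name:
        # Assumed guard: an empty string would trivially "end" any name.
        return False

    first, last = xml_name_tuple
    # Concatenate the XML name parts in both orders, as the patch does.
    xml_name_forward = f"{first} {last}"
    xml_name_reverse = f"{last} {first}"

    # Assumed rule: accept if the YAML last name ends either ordering.
    return (
        xml_name_forward.endswith(yaml_last_name)
        or xml_name_reverse.endswith(yaml_last_name)
    )


if __name__ == "__main__":
    # Mirrors the docstring example: YAML "Post" vs. XML ("Matt", "Post").
    print(match_names_sketch("Post", ("Matt", "Post")))
```

Checking both orderings means a name stored family-name-first in the XML (the Chinese-name case the docstring mentions) still passes. A plain suffix test is deliberately loose: under these assumptions a surname that is merely the tail of a longer one would also be accepted.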