2025
- (19 September 2025)
Paper accepted by EMNLP 2025 findings – HARE: an entity and relation centric evaluation framework for histopathology reports – at https://arxiv.org/abs/2507.09097
Large Vision-Language Models (LVLMs) have demonstrated promising performance in chest X-ray (CXR) analysis. To enhance human-computer interaction, several studies have incorporated radiologists’ eye gaze, typically through heatmaps or textual prompts. However, these methods often overlook the sequential order of eye movements, which could provide valuable insights by highlighting both the areas of interest and the order in which they are examined. In this work, we propose a novel approach called RadEyeVideo that integrates radiologists’ eye-fixation data as a video sequence, capturing both the temporal and spatial dynamics of their gaze. We evaluate this method in CXR report generation and disease diagnosis using three general-domain, open-source LVLMs with video input capabilities. When prompted with eye-gaze videos, model performance improves by up to 24.6% in the report generation task and on average 15.2% across both tasks using scaled evaluation metrics. Notably, RadEyeVideo enabled an open-domain LVLM, LLaVA-OneVision, to surpass task-specific medical LVLMs such as MAIRA-2 and CheXagent, which were trained on large-scale chest X-ray data. This work highlights that domain experts’ knowledge (eye-gaze information in this case), when effectively integrated with LVLMs, can significantly enhance general-domain models’ capabilities in clinical tasks. RadEyeVideo is a step toward a scalable, human-centered approach to utilizing LVLMs in medical image analytics.
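The core idea of RadEyeVideo is to turn an ordered list of fixations into a sequence of frames, so the model sees both where and in what order the radiologist looked. The sketch below is purely illustrative and assumes fixations are given as (x, y) pixel coordinates; the paper's actual rendering may differ.

```python
import numpy as np

def fixations_to_frames(image, fixations, radius=15):
    """Render one frame per fixation, highlighting the fixated region.

    image: 2-D grayscale array (H, W); fixations: list of (x, y) points,
    in the order the radiologist examined them. Illustrative sketch only;
    the exact rendering used by RadEyeVideo may differ.
    """
    h, w = image.shape
    yy, xx = np.mgrid[0:h, 0:w]
    frames = []
    for fx, fy in fixations:
        mask = (xx - fx) ** 2 + (yy - fy) ** 2 <= radius ** 2
        frame = image * 0.3          # dim the whole image
        frame[mask] = image[mask]    # restore full brightness at the fixation
        frames.append(frame)
    return np.stack(frames)          # (num_fixations, H, W) "video"

# Toy example: a 100x100 image and three fixations in examination order
video = fixations_to_frames(np.ones((100, 100)), [(20, 30), (50, 50), (80, 70)])
print(video.shape)  # (3, 100, 100)
```

The resulting frame sequence can then be passed to any LVLM that accepts video input alongside the original CXR.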
Read the full paper at https://arxiv.org/abs/2507.09097.
- (8 August 2025)
Tokenizer work published – Infusing clinical knowledge into language models by subword optimisation and embedding initialisation – now published in Computers in Biology and Medicine.
This study introduces a novel knowledge-enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, K-Tokeniser builds global representations of tokens from domain-specific concepts (e.g., drugs, diseases) drawn from ontologies such as UMLS or from task-specific corpora. At training or inference time, word- and sentence-level optimisation selects the optimal token representations. The study also proposes an embedding initialisation approach for new tokens, eliminating the need to pre-train the language models. A model built with K-Tokeniser achieves a notable 13% increase in Micro F1 score for automated clinical coding, and requires only 50% of the training data for concept extraction and less than 20% for automated coding to outperform the baseline ClinicalBERT model.
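A common way to initialise the embedding of a newly added token without further pre-training is to average the embeddings of the subwords the base tokeniser would split it into. The sketch below illustrates that general idea; the exact K-Tokeniser procedure is described in the paper and may differ, and all names here are illustrative.

```python
import numpy as np

def init_new_token_embedding(new_token, base_tokenize, embedding_matrix, vocab):
    """Initialise a new (e.g. clinical) token's embedding as the mean of the
    embeddings of the subwords the base tokeniser splits it into.

    Illustrative sketch of subword-mean initialisation; not the exact
    K-Tokeniser algorithm.
    """
    subwords = base_tokenize(new_token)               # e.g. ["hyper", "##tension"]
    ids = [vocab[s] for s in subwords if s in vocab]
    if not ids:
        # fall back to the mean of the whole matrix for unknown pieces
        return embedding_matrix.mean(axis=0)
    return embedding_matrix[ids].mean(axis=0)

# Toy example with a 4-token vocabulary and 3-dimensional embeddings
vocab = {"hyper": 0, "##tension": 1, "[UNK]": 2, "the": 3}
emb = np.array([[1.0, 0.0, 0.0],
                [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0],
                [1.0, 1.0, 1.0]])
vec = init_new_token_embedding("hypertension",
                               lambda w: ["hyper", "##tension"],
                               emb, vocab)
print(vec)  # [0.5 0.5 0. ]
```

Because the new token starts near the region of embedding space its subwords already occupy, the model can use it immediately without a costly pre-training pass.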
Read the full paper at https://doi.org/10.1016/j.compbiomed.2025.110747.
- (28 May 2025)
Paper accepted by ACL 2025 findings – Look & Mark - Leveraging Radiologist Eye Fixations and Bounding Boxes in Multimodal Large Language Models for Chest X-ray Report Generation – at https://arxiv.org/abs/2505.22222
This study introduces Look & Mark (L&M), a novel approach to radiology report generation that integrates radiologist fixation cues (Look) with bounding box annotations (Mark) to guide multimodal Large Language Models (LLMs). By combining these complementary grounding strategies, L&M significantly improves the clinical relevance of generated reports, reduces hallucinations, and enhances model alignment with real-world diagnostic workflows. Importantly, L&M achieves these gains without requiring extensive fine-tuning, leveraging in-context learning to adapt both general-purpose and domain-specific models alike.
Our experiments demonstrate that L&M significantly boosts performance across both lexical and clinical evaluation metrics, with the largest gains observed in clinical metrics such as RaTEScore and RadGraph-XL. For instance, CXR-LLaVA achieved a 1.2% improvement in overall metrics (A.AVG) compared to baseline prompting, while LLaVA-Med demonstrated a remarkable 9.2% boost. General-purpose models also benefited significantly, with LLaVA-OV achieving an 87.3% clinical average (C.AVG), the highest among all tested models, even surpassing domain-specific models trained explicitly for chest X-ray report generation.
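Since L&M relies on in-context learning rather than fine-tuning, the grounding signals are injected at prompt time. The sketch below shows one hypothetical way such a prompt could be assembled; the actual template in the paper may differ, and all field names here are illustrative assumptions.

```python
def build_lm_prompt(fixation_regions, boxes, examples):
    """Assemble a hypothetical Look & Mark style prompt that grounds the model
    with fixation cues ("Look") and bounding-box annotations ("Mark"), plus a
    few in-context examples instead of fine-tuning. Illustrative sketch only.
    """
    lines = ["You are assisting with chest X-ray report generation."]
    # In-context examples stand in for task-specific fine-tuning
    for i, (img_desc, report) in enumerate(examples, 1):
        lines.append(f"Example {i}: {img_desc} -> {report}")
    # "Look": fixated regions in the order the radiologist examined them
    lines.append("Look (fixated regions, in order): " + ", ".join(fixation_regions))
    # "Mark": bounding boxes as x1,y1,x2,y2 pixel coordinates
    lines.append("Mark (bounding boxes x1,y1,x2,y2): " +
                 "; ".join(",".join(map(str, b)) for b in boxes))
    lines.append("Generate the radiology report for the current image.")
    return "\n".join(lines)

prompt = build_lm_prompt(["right lower lobe", "cardiac silhouette"],
                         [(10, 20, 80, 90)],
                         [("clear lungs", "No acute findings.")])
print(prompt)
```

The same prompt scaffold works for both general-purpose and domain-specific LVLMs, which is what lets L&M improve both families without retraining.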
Read the full paper at https://arxiv.org/abs/2505.22222.
- (28 May 2025)
Paper accepted by ACL 2025 findings – BioHopR - A Benchmark for Multi-Hop, Multi-Answer Reasoning in Biomedicine – at https://arxiv.org/abs/2505.22240
This paper introduces BioHopR, a benchmark for evaluating multi-hop, multi-answer reasoning in the biomedical domain. Built on the PrimeKG knowledge graph, BioHopR captures the complexity of real-world biomedical queries through one-to-many and many-to-many relationships, rigorously assessing reasoning over 1-hop and 2-hop tasks. Evaluation results highlight that O3-mini, a proprietary model with a reasoning step, outperforms open-source models, including biomedical models such as HuatuoGPT-o1. Across all models, the performance drop from 1-hop to 2-hop tasks underscores the difficulty of aligning intermediate reasoning steps, especially in bridging entities.
Read the full paper at https://arxiv.org/abs/2505.22240.
- (12 May 2025)
Paper accepted by Journal of Medical Internet Research – Evaluation and Bias Analysis of Large Language Models in Generating Synthetic Electronic Health Records - Comparative Study – at https://doi.org/10.2196/65317
Larger models, such as Yi-34B, Qwen-14B, and Llama 2 13B, showed improved performance in generating more comprehensive EHRs, as reflected in higher EPS values. However, this increased performance was accompanied by a notable escalation in both gender and racial biases, highlighting a performance-bias trade-off. The study identified four key findings: (1) as model size increased, EHR generation improved, but demographic biases also became more pronounced; (2) biases were observed across all models, not just the larger ones; (3) gender bias closely aligned with real-world disease prevalence, while racial bias was evident in only a subset of diseases; and (4) racial biases varied, with some diseases showing overrepresentation of White or Black populations and underrepresentation of Hispanic and Asian groups. These findings underline the need for effective bias mitigation strategies and the development of benchmarks to ensure fairness in artificial intelligence applications for health care.
Read the full paper at https://doi.org/10.2196/65317.
Acknowledgement: Logo designed by Yuchen Wu