all updates • KnowLab

All updates along the years

2024

(20 June 2024) New preprint - Infusing clinical knowledge into tokenisers for language models - on arXiv:2406.14312

This study introduces a novel knowledge enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like Unified Medical Language System or the training data of the task related corpus. At training or inference stage, sentence level localised context will be utilised for choosing the optimal global token representation to realise the semantic-based tokenisation. To avoid pretraining using the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments is conducted on four real-world datasets for evaluating K-Tokeniser in a wide range of clinical text analytics tasks including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task with 13% increase on Micro F1 score. Furthermore, K-Tokeniser also shows significant capacities in facilitating quicker convergence of language models. Specifically, using K-Tokeniser, the language models would only require 50% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task and less than 20% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
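
For readers unfamiliar with the Micro F1 metric mentioned above: it pools true positives, false positives and false negatives over all codes before computing precision and recall. The following is a minimal Python sketch of that calculation only. The function name `micro_f1` and the toy code sets are illustrative assumptions and are not taken from the paper.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F1 over documents, each labelled with a set of codes."""
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        tp += len(gold_codes & pred_codes)   # codes correctly predicted
        fp += len(pred_codes - gold_codes)   # codes predicted but not in gold
        fn += len(gold_codes - pred_codes)   # gold codes that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with made-up ICD-style codes (not data from the study).
gold = [{"I10", "E11"}, {"J45"}]
predicted = [{"I10"}, {"J45", "E11"}]
score = micro_f1(gold, predicted)
```

Because counts are pooled before averaging, frequent codes weigh more heavily than rare ones in this metric.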

Read it at arXiv:2406.14312.
(18 June 2024) Paper accepted by MICCAI 2024 - Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns - on arxiv

Yunsoo and folk’s paper on utilising eye gaze data for chest X-ray analysis has now been accepted by MICCAI 2024.

Read the preprint version at arXiv:2404.02370.
(5 June 2024) Paper accepted by NAACL 2024 Workshop Clinical NLP - Chain-of-Though (CoT) prompting strategies for medical error detection and correction - on arXiv:2406.09103

Zhaolong and folk’s work has been accepted by NAACL 2024 Workshop Clinical NLP. This paper describes our submission to the MEDIQA-CORR 2024 shared task for automatically detecting and correcting medical errors in clinical notes. We report results for three methods of few-shot In-Context Learning (ICL) augmented with Chain-of-Thought (CoT) and reason prompts using a large language model (LLM). In the first method, we manually analyse a subset of the training and validation datasets to infer three CoT prompts by examining error types in the clinical notes. In the second method, we utilise the training dataset to prompt the LLM to deduce reasons about their correctness or incorrectness. The constructed CoTs and reasons are then augmented with ICL examples to solve the tasks of error detection, span identification, and error correction. Finally, we combine the two methods using a rule-based ensemble method. Across the three sub-tasks, our ensemble method achieves a ranking of 3rd for both sub-task 1 and 2, while securing 7th place in sub-task 3 among all submissions.

Read it at arXiv:2406.09103.
(5 June 2024) New preprint - RadBARTsum - Domain Specific Adaption of Denoising Sequence-to-Sequence Models for Abstractive Radiology Report Summarization - on arXiv:2406.03062

Radiology report summarization is a crucial task that can help doctors quickly identify clinically significant findings without the need to review detailed sections of reports. This study proposes RadBARTsum, a domain-specific and ontology facilitated adaptation of the BART model for abstractive radiology report summarization. The approach involves two main steps: 1) re-training the BART model on a large corpus of radiology reports using a novel entity masking strategy to improve biomedical domain knowledge learning, and 2) fine-tuning the model for the summarization task using the Findings and Background sections to predict the Impression section. Experiments are conducted using different masking strategies. Results show that the re-training process with domain knowledge facilitated masking improves performance consistently across various settings. This work contributes a domain-specific generative language model for radiology report summarization and a method for utilising medical knowledge to realise entity masking language modelling. The proposed approach demonstrates a promising direction of enhancing the efficiency of language models by deepening their understanding of clinical knowledge in radiology reports.
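
The entity masking idea in step 1 can be sketched roughly as follows. This is a minimal illustration, assuming entity spans have already been located by some ontology-based tagger. The `mask_entities` helper, the `<mask>` token string and the example sentence are hypothetical and are not the authors' implementation.

```python
def mask_entities(tokens, entity_spans, mask_token="<mask>"):
    """Replace every token inside an entity span with mask_token.

    entity_spans is a list of (start, end) token indices, end exclusive.
    The input list is left unchanged; a masked copy is returned.
    """
    masked = list(tokens)
    for start, end in entity_spans:
        for i in range(start, end):
            masked[i] = mask_token
    return masked


tokens = "There is a small left pleural effusion .".split()
# Hypothetical span covering "pleural effusion" (token indices 5 and 6).
masked = mask_entities(tokens, [(5, 7)])
```

During re-training, the model would be asked to reconstruct the original tokens from a masked sequence like this one, so that the learning signal concentrates on domain entities rather than on random words.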

Read it at arXiv:2406.03062.
(20 May 2024) Our survey paper titled A Unified Review of Deep Learning for Automated Medical Coding has just been accepted by ACM Surveys

Automated medical coding, an essential task for healthcare operation and delivery, makes unstructured data manageable by predicting medical codes from clinical documents. Recent advances in deep learning and natural language processing have been widely applied to this task. However, deep learning-based medical coding lacks a unified view of the design of neural network architectures. This review proposes a unified framework to provide a general understanding of the building blocks of medical coding models and summarizes recent advanced models under the proposed framework. Our unified framework decomposes medical coding into four main components, i.e., encoder modules for text feature extraction, mechanisms for building deep encoder architectures, decoder modules for transforming hidden representations into medical codes, and the usage of auxiliary information. Finally, we introduce the benchmarks and real-world usage and discuss key research challenges and future directions.

Read the paper at doi:10.1145/3664615.
(16 May 2024) New preprint - Retrieving and Refining - A Hybrid Framework with Large Language Models for Rare Disease Identification - on arxiv

The infrequency and heterogeneity of clinical presentations in rare diseases often lead to underdiagnosis and their exclusion from structured datasets. This necessitates the utilization of unstructured text data for comprehensive analysis. However, the manual identification from clinical reports is an arduous and intrinsically subjective task. This study proposes a novel hybrid approach that synergistically combines a traditional dictionary-based natural language processing (NLP) tool with the powerful capabilities of large language models (LLMs) to enhance the identification of rare diseases from unstructured clinical notes. We comprehensively evaluate various prompting strategies on six large language models (LLMs) of varying sizes and domains (general and medical). This evaluation encompasses zero-shot, few-shot, and retrieval-augmented generation (RAG) techniques to enhance the LLMs’ ability to reason about and understand contextual information in patient reports. The results demonstrate effectiveness in rare disease identification, highlighting the potential for identifying underdiagnosed patients from clinical notes.

Read it at arXiv:2405.10440.
(15 May 2024) News - Our research has been cited by World Health Organization and a news outlet

Our work “Automated Clinical Coding - What, Why, and Where We Are.” npj Digital Medicine 5, 159 (2022) has been cited by a policy paper from the World Health Organization, Road traffic death coding quality in the WHO Mortality Database, and a news article, Despite AI advancements, human oversight remains essential.
(1 May 2024) Dr Honghan Wu joins the University of Glasgow as a Professor of Health Informatics and Data Science at the School of Health and Wellbeing. Honghan continues to hold an honorary associate professorship at University College London to continue his funded research projects and supervisions.
(5 April 2024) New preprint - Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns - on arxiv

Recent advancements in Computer Assisted Diagnosis have shown promising performance in medical imaging tasks, particularly in chest X-ray analysis. However, the interaction between these models and radiologists has been primarily limited to input images. This work proposes a novel approach to enhance human-computer interaction in chest X-ray analysis using Vision-Language Models (VLMs) enhanced with radiologists’ attention by incorporating eye gaze data alongside textual prompts. Our approach leverages heatmaps generated from eye gaze data, overlaying them onto medical images to highlight areas of intense radiologist’s focus during chest X-ray evaluation. We evaluate this methodology in tasks such as visual question answering, chest X-ray report automation, error detection, and differential diagnosis. Our results demonstrate the inclusion of eye gaze information significantly enhances the accuracy of chest X-ray analysis. Also, the impact of eye gaze on fine-tuning was confirmed as it outperformed other medical VLMs in all tasks except visual question answering. This work marks the potential of leveraging both the VLM’s capabilities and the radiologist’s domain knowledge to improve the capabilities of AI models in medical imaging, paving a novel way for Computer Assisted Diagnosis with a human-centred AI.

Read it at arXiv:2404.02370.
(19 February 2024) New preprint - Enhancing Patient Outcome Prediction through Deep Learning with Sequential Diagnosis Codes from structural EHR - A systematic review - on researchgate

Dr Tuankasfee Hama led a systematic review to identify and summarise existing deep learning studies predicting patient outcome using sequences of diagnosis codes, as a key part of their predictors. Additionally, this study also investigates the challenge of generalisability and explainability of the predictive models.

Briefly, the main conclusion is that the application of deep learning to sequences of diagnosis codes has demonstrated remarkable promise in predicting patient outcomes. Using multiple types of features and integrating time intervals were found to improve the predictive performance. Addressing challenges related to generalisation and explainability will be instrumental in unlocking the full potential of DL for enhancing healthcare outcomes and patient care.

Read it here.
(15 February 2024) Tiny paper - Hallucination Benchmark in Medical Visual Question Answering accepted by ICLR 2024

Many congratulations to Jinge Wu and Yunsoo Kim, PhD students at KnowLab, on the acceptance of a paper to ICLR 2024 - one of the top AI/ML conferences! Not just an acceptance but also a super positive review from the area chair - “This particular work is worth presenting at a notable level, as it introduces a dataset that people in the field should be aware of – it is a substantial contribution that can spur further advancements in the VLM field.”

Read it at arXiv:2401.05827.
(11 January 2024) New preprint - Hallucination Benchmark in Medical Visual Question Answering - on arxiv

The recent success of large language and vision models on vision question answering (VQA), particularly their applications in medicine (Med-VQA), has shown a great potential of realizing effective visual assistants for healthcare. However, these models are not extensively tested on the hallucination phenomenon in clinical settings. Here, we created a hallucination benchmark of medical images paired with question-answer sets and conducted a comprehensive evaluation of the state-of-the-art models. The study provides an in-depth analysis of current models’ limitations and reveals the effectiveness of various prompting strategies.

Read it at arXiv:2401.05827.

2023

(20 December 2023) New preprint - Benchmarking and Analyzing In-context Learning, Fine-tuning and Supervised Learning for Biomedical Knowledge Curation - a focused study on chemical entities of biological interest - on arxiv

We had a task to implement an automated approach for knowledge curation for a biomedical ontology - ChEBI (Chemical Entities of Biological Interest). We asked ourselves the above question and decided to compare and analyze three NLP paradigms for curation tasks - in-context learning, fine tuning, and supervised learning. We broke down the general question into four specific questions. After comprehensive experiments and analysis on 3 GPT models (including gpt4), one domain specific PubmedBERT, 6 embedding models for supervised learning in 15 experiment setups on >1.8m triples, we believe we obtained good evidence for answering them properly.

Read it at arXiv:2312.12989 or this short LinkedIn post.
(20 December 2023) New preprint - Exploring Multimodal Large Language Models for Radiology Report Error-checking - on arxiv

Given all the exciting developments of generative AI and foundation models, in the context of radiology, Jinge Wu and Yunsoo Kim set out to ask the question “can these models be good assistants in spotting errors in radiology reports by cross-checking radiographs?” To answer this question, they conducted a study on using multimodal large language models (LLMs) for assisting radiologists to check errors in their reports. 1,000 reports with “synthetic errors” were created using two real-world Chest X-ray datasets. Two types of tasks were introduced - binary (is there an error) vs multiclass (what types of errors) classifications.

Read it at arXiv:2312.13103 or this short LinkedIn post.
(7 December 2023) New paper - Applying contrastive pre-training for depression and anxiety risk prediction in type 2 diabetes patients based on heterogeneous electronic health records - a primary healthcare case study - published on JAMIA

Due to heterogeneity and limited medical data in primary healthcare services (PHS) in China, assessing the psychological risk of type 2 diabetes mellitus (T2DM) patients in PHS is difficult. Using unsupervised contrastive pre-training, we proposed a deep learning framework named depression and anxiety prediction (DAP) to predict depression and anxiety in T2DM patients. The main aim was to use good quality EHR data (with 85,085 T2DM in-patients) from a secondary care service provider, the First Affiliated Hospital of Nanjing Medical University and pre-train a foundation model with transfer learning capabilities via unsupervised contrastive learning. This was then further fine-tuned for depression prediction using 149,596 T2DM patients’ EHRs in the Nanjing Health Information Platform (NHIP). Experiments showed the approach had great utilities in predicting post-discharge depression and anxiety in T2DM patients at PHS, with much higher performances compared with baseline models.

Read it at DOI:10.1093/jamia/ocad228
(23 November 2023) New paper - Term-BLAST-Like Alignment Tool for Concept Recognition in Noisy Clinical Texts - published on Bioinformatics

Texts from electronic health records (EHRs) frequently contain spelling errors, abbreviations, and other non-standard ways of representing clinical concepts. Here, we present a method inspired by the BLAST algorithm for biosequence alignment that screens texts for potential matches on the basis of matching k-mer counts and scores candidates based on conformance to typical patterns of spelling errors derived from 2.9 million clinical notes. Our method, the Term-BLAST-like alignment tool (TBLAT) leverages a gold standard corpus for typographical errors to implement a sequence alignment-inspired method for efficient entity linkage. We present a comprehensive experimental comparison of TBLAT with five widely-used tools. Experimental results show an increase of 10% in recall on scientific publications and 20% increase in recall on EHR records (when compared against the next best method), hence supporting a significant enhancement of the entity linking task. The method can be used stand-alone or as a complement to existing approaches.
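
To give a feel for the k-mer screening step described above, here is a minimal Python sketch. It covers only the screening idea (counting shared character k-mers between a term and candidate strings), not the scoring against spelling-error patterns, and it is not the TBLAT implementation. The function names, the choice of k = 3 and the threshold are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(text, k=3):
    """Count overlapping character k-mers in a lower-cased string."""
    text = text.lower()
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def shared_kmers(term, candidate, k=3):
    """Number of k-mers (with multiplicity) common to term and candidate."""
    overlap = kmer_counts(term, k) & kmer_counts(candidate, k)
    return sum(overlap.values())


def screen(term, candidates, k=3, min_shared=3):
    """Keep candidates that share at least min_shared k-mers with term."""
    return [c for c in candidates if shared_kmers(term, c, k) >= min_shared]


# Two misspellings of "pneumonia" and one unrelated word (toy data).
hits = screen("pneumonia", ["pnuemonia", "pneumonai", "anaemia"])
```

In this toy run the two misspellings pass the screen while the unrelated word is dropped. A second stage would then score the surviving candidates more carefully.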

Read it at DOI:10.1093/bioinformatics/btad716
(27 October 2023) Two papers published on Frontiers in Digital Health

[1] Casey, Arlene, Emma Davidson, Claire Grover, Richard Tobin, Andreas Grivas, Huayu Zhang, Patrick Schrempf, Alison Q. O’Neil, Liam Lee, Michael Walsh, Freya Pellie, Karen Ferguson, Vera Cvoro, Honghan Wu, Heather Whalley, Grant Mair, William Whiteley, Beatrice Alex. “Understanding the performance and reliability of NLP tools - a comparison of four NLP tools predicting stroke phenotypes in radiology reports.” Frontiers in Digital Health 5 (2023), 1184919. Read it here

[2] Zhang, Huayu, Arlene Casey, Imane Guellil, Víctor Suárez-Paniagua, Clare Macrae, Charis Marwick, Honghan Wu, Bruce Guthrie, and Beatrice Alex. “FLAP - A framework for linking free-text addresses to the Ordnance Survey Unique Property Reference Number database.” Frontiers in Digital Health 5 (2023), 1186208. Read it here
(25 September 2023) Three new starters at KnowLab!

Yunsoo Kim is a new PhD student working on multimodal large language models in health data. His research interests also include applications of the models in diagnosis and prognosis of neurodiseases such as dementia.

Yusuf Abdulle is a Research Assistant based at the Institute of Health Informatics at University College London. He is currently working on using Graph Neural Networks and Knowledge Graphs to work on early diagnosis of rare neurodegenerative diseases.

Yue Gao is a PhD student from Beijing University of Posts and Telecommunications. Funded by CSC, Yue is visiting KnowLab for one year doing research on human in the loop AI models for automated clinical coding.
(11 September 2023) New grant! KnowLab is awarded £649,218 by MRC for Quantifying and Mitigating Bias affecting and induced by AI in Medicine!

Artificial Intelligence (AI) has demonstrated exciting potential in improving healthcare. However, these technologies come with a big caveat. They do not work effectively for minority groups. A recent study published in Science shows a widely used AI tool in the US concludes Black patients are healthier than equally sick Whites. Using this tool, a health system would favour White people when allocating resources, such as hospital beds. AI models like this would do more harm than good for health equity. Funded by the Medical Research Council, KnowLab is leading a 30-month research project focusing on using data science and machine learning to quantify and mitigate data embedded and AI induced bias and inequality. Clearly, this is a challenge too grand to be tackled by a single institute. We will be working closely with BHF Data Science Centre, University of Edinburgh, University of Birmingham, Nanjing Medical University (China), and wider communities including Health Data Research UK, the Alan Turing Institute and beyond.

Check the Project Page at UKRI
(23 July 2023) Hard exudate plays an important role in grading diabetic retinopathy (DR) as a critical indicator. Therefore, the accurate segmentation of hard exudates is of clinical importance. However, the percentage of hard exudates in the whole fundus image is relatively small, their shapes are often irregular and their contrast is usually not high enough. Hence, they are prone to misclassification, e.g., as part of the optic disc structure or cotton wool spots, which results in low segmentation accuracy and efficiency. This paper proposes a novel neural network, RMCA U-net, to accurately segment hard exudates in fundus images. The network features a U-shape framework combined with a residual structure to obtain the subtle features of hard exudate. A multi-scale feature fusion (MSFF) module and an improved channel attention (CA) module are designed and included to effectively segment sparse small lesions. The proposed method has been trained and evaluated on three data sets - IDRID, Kaggle and one local data set. Experiments indicate that the proposed RMCA U-net outperforms the other convolutional neural networks. Compared with U-net, the proposed method improves PR-MAP by 6% on the IDRID dataset, Recall by 10% on the Kaggle dataset and F1-score by 20% on the local dataset.

Read it at DOI:10.1016/j.eswa.2023.120987
(15 July 2023) This paper presents our contribution to the RadSum23 shared task organized as part of the BioNLP 2023. We compared state-of-the-art generative language models in generating high-quality summaries from radiology reports. A two-stage fine-tuning approach was introduced for utilizing knowledge learnt from different datasets. We evaluated the performance of our method using a variety of metrics, including BLEU, ROUGE, bertscore, CheXbert, and RadGraph. Our results revealed the potentials of different models in summarizing radiology reports and demonstrated the effectiveness of the two-stage fine-tuning approach. We also discussed the limitations and future directions of our work, highlighting the need for better understanding the architecture design’s effect and optimal way of fine-tuning accordingly in automatic clinical summarizations.

Read it at DOI:10.18653/v1/2023.bionlp-1.54
(5 May 2023) Ontology-driven and weakly supervised rare disease identification from clinical notes published on BMC Medical Informatics and Decision Making

Superb work from Dr Hang Dong and colleagues, demonstrating how weakly supervised NLP + Ontology techniques can greatly facilitate the identification of rare disease mentions from electronic health records with >90% accuracy. This uses training data that need no human annotations!

Read it at DOI:10.1186/s12911-023-02181-9
(5 May 2023) New paper titled Prediction of disease comorbidity using explainable artificial intelligence and machine learning techniques - A systematic review published on International Journal of Medical Informatics

Mohanad M. Alsaleh - a PhD student at UCL - did a great systematic review on explainable AI methods for predicting comorbidity from electronic health records. It finds “(a) The use of explainable artificial intelligence (XAI) can improve predictions of comorbidities by providing a transparent understanding of the reasoning behind predictions and helping healthcare providers make informed decisions. (b) There is a great potential to uncover novel disease associations and better understand the mechanisms of diseases by integrating genetic and electronic health record (EHR) data, leading to improved quality of care and earlier diagnoses. (c) The use of AI in healthcare can improve patient outcomes and reduce healthcare costs by identifying disease risks and making personalised treatment plans.”

Read it at DOI:10.1016/j.ijmedinf.2023.105088
(7 April 2023) New paper - a systematic review on antidepressant and antipsychotic drug prescribing and diabetes outcomes

As part of her PhD research, Charlotte led this systematic review to investigate the association between antidepressant or antipsychotic drug prescribing and type 2 diabetes outcomes. It concludes - Studies of antidepressant and antipsychotic drug prescribing in relation to diabetes outcomes are scarce, with shortcomings and mixed findings. Until further evidence is available, people with diabetes prescribed antidepressants and antipsychotics should receive monitoring and appropriate treatment of risk factors and screening for complications as recommended in general diabetes guidelines.

Read it at DOI:10.1016/j.diabres.2023.110649
(20 March 2023) Workshop paper on “Ontology-driven Self-supervision for Adverse Childhood Experiences Identification using Social Media Datasets” now published!

Adverse Childhood Experiences (ACEs) are defined as a collection of highly stressful, and potentially traumatic, events or circumstances that occur throughout childhood and/or adolescence. They have been shown to be associated with increased risks of mental health diseases or other abnormal behaviours in later life. In this paper, Jinge and colleagues present an ontology-driven self-supervised approach (deriving concept embeddings using an auto-encoder from baseline NLP results) for producing a publicly available resource that would support large-scale machine learning (e.g., training transformer based large language models) on social media corpora. Jinge presented this paper in summer 2022 at the 1st Workshop on Scarce Data in Artificial Intelligence for Healthcare, which was held with IJCAI 2022 in Vienna. Check - Paper, Github Repo
(7 March 2023) npj Digital Medicine’s editorial on automating clinical coding echoes our perspective

Recently, npj Digital Medicine’s editor Dr Kvedar and colleagues have published a great editorial on automating clinical coding (link), pointing out the main challenges at both technological and implementation levels: clinical documents are redundant and complex; code sets like the ICD-10 are rapidly evolving; training sets do not cover all codes; and the logic and rules of coding decisions are hard to capture. Great to see our perspectives on the automated coding research challenges and future directions echoed in the editorial!
(21 February 2023) New paper - The impact of inconsistent human annotations on AI driven clinical decision making now published by npj Digital Medicine

Annotation inconsistencies commonly occur when even highly experienced clinical experts annotate the same phenomenon (e.g., medical image, diagnostics, or prognostic status), due to inherent expert bias, judgements, and slips, among other factors. While their existence is relatively well-known, the implications of such inconsistencies are largely understudied in real-world settings.

Aneeta Sylolypavan did her MSc with us addressing this hugely important research question using real-world ICU datasets with annotated data from 11 ICU consultants. The results suggest that (a) there may not always be a “super expert” in acute clinical settings; and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for determining consensus achieves optimal models in most cases.

Read the paper from here

(23 January 2023)

KnowLab is funded by NIHR/HDR UK to use health data science + machine learning for addressing the NHS winter pressure - Using rare disease phenotype models to identify people at risk of COVID-19 related adverse outcome

KnowLab is proud to be part of the 16 projects funded by HDR UK and NIHR which will use data-driven approaches to pin-point pressures in the health care system, understand their causes and develop ways to overcome or avoid them. Particularly, we will use machine learning and rare disease phenotype models to uncover much-needed information on the added risks of severe COVID-19 in people who are clinically more vulnerable and come from disadvantaged socioeconomic backgrounds. This can then inform policy responses to provide better management and treatment for these most vulnerable groups who might have been overlooked. The team are well placed to derive quick actionable findings for the winter pressures as they have been working with CVD-COVID-UK/COVID-IMPACT on rare diseases since October 2021.

HDR UK Press Release on the funded projects

HDR UK News on this project

Herald Scotland News

2022

(21 December 2022) Our UK’s clinical NLP landscaping (a survey) paper is now published with npj Digital Medicine

Aiming to survey the landscape of Clinical NLP in the UK, we used a relatively unconventional approach - we started by finding all relevant funded projects and extracting their interlinked information. We then conducted community analysis and a literature review. We described WHO (key players of funders, universities, companies, researchers), WHAT (techs, applications, disease areas, clinical questions, datasets), WHERE (the community developments, tech trends & maturity), GAPS (barriers to unleashing the full power of NLP in health). While on the community level we focused on the UK, analyses and discussions on the research, tech and developments went beyond the country boundary. In particular, we compared tech, data, regulatory similarities and differences of the US and the UK. This is one of the key outputs of the HDR UK funded National Text Analytics Project.

Read it at DOI:10.1038/s41746-022-00730-6
(20 December 2022) KnowLab co-edits a new cross journal collection with BMC Series on Ethics of Artificial Intelligence in Health and Medicine.

As the implementation of artificial intelligence (AI)-based innovations in health and care services becomes more and more common, it is increasingly pressing to address the ethical challenges associated with AI in healthcare and to find appropriate solutions. In the cross-journal BMC collection Ethics of Artificial Intelligence in Health and Medicine, we urge the research communities, industry, policy makers and other stakeholders to join forces in tackling the grand challenges of realising ethical and fair AI in health and medicine. Check our blog article with BMC Series on the topic for what & why. Please spread the word and contribute to the collection; the current deadline is 31 Oct 2023.
(22 November 2022) KnowLab is awarded £5,000 from UCL Global Engagement Funding.

The funding is to extend and deepen our collaboration with iris.ai - a Norway based start-up behind the award-winning AI engine for scientific text understanding. The funded project is titled - Towards self-updatable knowledge base for evidence based medicine - join force with iris.ai (Norway) and beyond.
(27 October 2022) The Alan Turing Health Equity Interest Group - https://www.turing.ac.uk/research/interest-groups/health-equity, co-organised by KnowLab and colleagues, is now officially online!

In an era where AI is expected to improve our daily life - particularly in health, “How can we ensure that developments and applications of data science and AI improve everyone’s health?” This is a pressing and very challenging question. Please join forces with a multidisciplinary group to form a formidable synergistic force tackling one of the biggest challenges of AI in medicine.
(22 October 2022) Dr Hang Dong’s great perspective piece on automated coding using NLP and knowledge-driven approaches has now been published on npj Digital Medicine. The work illustrates how NLP and AI can help improve the efficiency of clinical coding in healthcare - i.e., assign ICD/SNOMED codes to hospital visits, which currently is a very inefficient/erroneous process in NHS and, for that matter, in many other health systems across the world.
(5 October 2022) Study on The Impact of Inconsistent Human Annotations on AI driven Clinical Decision Making.

Annotation inconsistencies commonly occur when even highly experienced clinical experts annotate the same phenomenon (e.g., medical image, diagnostics, or prognostic status), due to inherent expert bias, judgements, and slips, among other factors. While their existence is relatively well-known, the implications of such inconsistencies are largely understudied in real-world settings.

Aneeta Sylolypavan did her MSc with us addressing this hugely important research question using real-world ICU datasets with annotated data from 11 ICU consultants. The results suggest that (a) there may not always be a “super expert” in acute clinical settings; and (b) standard consensus seeking (such as majority vote) consistently leads to suboptimal models. Further analysis, however, suggests that assessing annotation learnability and using only ‘learnable’ annotated datasets for determining consensus achieves optimal models in most cases.

The manuscript is now under review with npj Digital Medicine and the preprint is available at doi:10.21203/rs.3.rs-1937575/v1.
(18 August 2022) Our systematic review titled “Artificial intelligence models for predicting cardiovascular diseases in people with type 2 diabetes - a systematic review”, led by Minhong Wang, has been accepted by Intelligence-Based Medicine. This study identified and reviewed existing AI models for predicting risk of cardiovascular diseases in people with type 2 diabetes. We found that AI approaches have the potential to achieve more accurate predictions than risk scores developed using conventional methods. However, none of the reviewed models is directly reusable or reproducible, due to incomplete reporting and lack of transparency. Clinically, none of the AI models includes interventions that may affect risks, such as medications and lifestyle changes. There were no indications in the studies on whether the prediction models might be able to adapt to include these factors.
(29 July 2022) Our collaboration study titled “Prediction of Five-Year Cardiovascular Disease Risk in People with Type 2 Diabetes Mellitus - Derivation in Nanjing, China and External Validation in Scotland, UK”, led by Cheng Wan from Nanjing Medical University, has been published by Global Heart. This study shows it is feasible to generate a risk prediction model using routinely collected Chinese hospital data. This indicates there is great potential to use large-scale and relatively easily accessible routine data to identify those at risk of CVD and help significantly improve CVD prevention in people with diabetes.
(11 July 2022) Our health inequality studies - one led by Isabel Straw and one with Minhong, Aneeta and Prof Sarah Wild (University of Edinburgh) - are featured in a science piece on i news.
(16 June 2022) Our collaboration study titled “Spine-GFlow - A Hybrid Learning Framework for Robust Multi-tissue Segmentation in Lumbar MRI without Manual Annotation”, led by Dr Teng Zhang from Hong Kong University, has been accepted by Computerized Medical Imaging and Graphics. Results of this study show that our method, without requiring manual annotation, has achieved a segmentation performance comparable to a model trained with full supervision (mean Dice 0.914 vs 0.916).
(10 June 2022) Our work, titled COVID-19 trajectories among 57 million adults in England - a cohort study using electronic health records, is now out with Lancet Digital Health. Our analyses illustrate the wide spectrum of disease trajectories as shown by differences in incidence, survival, and clinical pathways. We have provided a modular analytical framework that can be used to monitor the impact of the pandemic and generate evidence of clinical and policy relevance using multiple EHR sources.
(2 May 2022) Our work Quantifying Health Inequalities Induced by Data and AI Models has been accepted by IJCAI-ECAI2022 ‘AI for Good track’. This work introduced a generic allocation-deterioration framework for detecting and quantifying AI induced inequality. Extensive experiments were carried out to quantify health inequalities (a) embedded in two real-world ICU datasets of HiRID and MIMIC III; (b) induced by AI models trained for two resource allocation scenarios. The results showed that compared to men, women had up to 33% poorer deterioration in markers of prognosis when admitted to HiRID ICUs. All four AI models assessed were shown to induce significant inequalities (2.45% to 43.2%) for non-White compared to White patients. The models exacerbated data embedded inequalities significantly in 3 out of 8 assessments, one of which was >9 times worse. preprint, slides, recording, repo.
(26 April 2022) Study led by Isabel Straw, Investigating for bias in healthcare algorithms - a sex-stratified analysis of supervised machine learning models in liver disease prediction, demonstrates a previously unobserved sex disparity present in published machine learning models. It suggests “To ensure sex-based inequalities do not manifest in medical AI, an evaluation of demographic performance disparities must be integrated into model development.” The work has been published in BMJ Health & Care Informatics.
(22 April 2022) Dr Honghan Wu joined the editorial board of BMC Digital Health. BMC Digital Health considers research on all aspects of the development and implementation of digital technology in both medicine and public health, such as mobile health applications, virtual healthcare and wearable technology, as well as the role of social media and other communications technology in digital health.
(25 March 2022) Study led by Huayu, Increased COVID-19 mortality rate in rare disease patients - a retrospective cohort study in participants of the Genomics England 100,000 Genomes project, has shown that rare disease patients in the Genomics England cohort, especially those affected by neurological and neurodevelopmental disorders, had increased risk of COVID-19 related death during the first wave of the pandemic in the UK. This work has now been accepted by Orphanet Journal of Rare Diseases.
(20 March 2022) Clinical coding is the task of transforming medical information in a patient’s health records into structured codes like ICD-10 for diagnosis, which is a cognitively demanding, time-consuming, and error-prone task. In this preprint, titled Automated Clinical Coding - What, Why, and Where We Are?, Hang introduces the idea of automated clinical coding and summarises its challenges from the perspective of Artificial Intelligence (AI) and Natural Language Processing (NLP), based on the literature, our project experience over the past two and a half years (late 2019 - early 2022), and discussions with clinical coding experts in Scotland and the UK.
(18 January 2022) KnowLab was awarded an enabling grant (£29k) from the British Council to strengthen academic exchanges and deepen our collaborations with two Nanjing based universities, Nanjing Medical University (Prof Yun Liu’s group) and Southeast University (Dr Xiang Zhang’s group). On the UCL side, in addition to KnowLab colleagues, we have Prof Paul Taylor and Dr Holger Kunz. Our research will focus on Artificial Intelligence in Medicine - tackling challenges of low generalisability and health inequality. This will involve both teaching and research activities.
(15 January 2022) Great work by Shaoxiong Ji and colleagues from Aalto University on reviewing automated coding from free-text clinical notes using a unified/abstract architecture view - now on arXiv. Great that KnowLab is part of this.
(13 January 2022) Minhong’s project “COOLNeo-an automated COOLing therapy for NEOnates” has been awarded £9,960.00 in funding from the ACCELERATE Innovation Team Challenge, financially supported through the Wellcome Trust Translational Partnership Award linked to the Translational Research Office.
(8 January 2022) New members! A very warm welcome to Xuezhe Wang and Zhaolong Wu, who are joining KnowLab to do their MSc projects. Both are MSc students based in the Institute of Health Informatics, UCL. Xuezhe will be working on graph neural networks and Zhaolong will be doing clinical natural language processing.

2021

(3 December 2021) Led by Dr Rebecca Bendayan at King’s College London, our work Investigating the Association between Physical Health Comorbidities and Disability in Individuals with Severe Mental Illness is now accepted by European Psychiatry. We found individuals with Severe Mental Illness and musculoskeletal, skin/dermatological, respiratory, endocrine, neurological, haematological or circulatory disorders are at higher risk of disability compared to those who do not have those comorbidities. There is a great and urgent need to provide targeted prevention and intervention programs for these vulnerable people.
(28 November 2021) Huayu’s work on rare disease is now out with the Lancet as a conference abstract. Common conditions are widely recognised as risk factors for COVID-19 death, BUT effects of Rare Diseases are largely unknown. This study on Genomics England data shows significantly increased mortality risks (OR 3·47) among rare disease individuals.
(23 November 2021) New members! A very warm welcome to Yun-Hsuan Chang and Hengrui Zhang, who are joining KnowLab to do their MSc projects. Both are MSc students based in the Institute of Health Informatics, UCL. Yun will be working on Parkinson’s Disease Modelling using multimodal data and Hengrui will work on deep learning models for automated coding from discharge summaries. Both projects are exciting!
(9 October 2021) Great to know that Emma and Claire’s work on COVID subtype identification has been featured on Health Data Research UK’s website as a case study. Health Data Research UK is the UK’s national institute of health data science.
(5 October 2021) NLP of radiology reports has wide applications. However, the current literature has suboptimal reporting quality. This impedes comparison, reproducibility, and replication. Check our systematic review on reporting quality of NLP on radiology reports on BMC Medical Imaging.
(26 September 2021) We are pleased to announce the grand opening of our KnowLab Blog. We aim to share our research from time to time in lay terms. This is to reach out to the general public and explain what we are doing and why we are doing it.
(6 September 2021) KnowLab is thrilled to be part of a new £3.9m NIHR Research Collaboration on Artificial Intelligence and Multimorbidity, called AIM-CISC. We will lead the objective 4 work in England - use machine learning and natural language processing on multimodal health data for better understanding of disease clusters.
(18 August 2021) Great news - Dr Honghan Wu has become a Turing Fellow at The Alan Turing Institute, UK’s national institute for data science and artificial intelligence.
(1 August 2021) Dr Minhong Wang takes up a new position as a research fellow at IHI, UCL (Top 5 in the world for Public Health according to Shanghai Ranking 2021) to work on exciting projects on health data using NLP + ML! Congratulations, Minhong!
(28 July 2021) A warm welcome to Nickil Maveli who will be working with Dr Hang Dong on the automated medical coding project. Nickil will focus on (a) tackling the shortcomings of BERT models that only deal with 512 tokens or fewer; (b) utilising multiple documents in the task.
(15 July 2021) Paper accepted by EMBC 2021, titled “Rare Disease Identification from Clinical Notes with Ontologies and Weak Supervision”. arXiv:2105.01995
(9 July 2021) Dr Honghan Wu joined the editorial board of BMC Medical Informatics and Decision Making.
(7 July 2021) Exciting news - Dr Honghan Wu has been promoted to Associate Professor at UCL! (effective 1 October 2021).
(5 July 2021) We are recruiting a Research Fellow in Health Data Research to be based at IHI, UCL. Part of the role will be conducting exciting collaborations with iris.ai on The AI Chemist project. Please apply here.
(15 June 2021) COVID-19 subtyping work has now been accepted by AMIA 2021 Annual Symposium. What a great work, Emma and Claire! Both are joint first authors and first year PhD students of HDR UK, Turing Institute and Wellcome CDT, who worked with KnowLab on the COVID-19 project. Axes of Prognosis - Identifying Subtypes of COVID-19 Outcomes. medRxiv. Github Repo
(15 June 2021) Led by Zina, our work “A Knowledge Distillation Ensemble Framework for Predicting Short and Long-term Hospitalisation Outcomes from Electronic Health Records Data” has been published in the IEEE Journal of Biomedical and Health Informatics.
(8 June 2021) Our paper “Developing automated methods for disease subtyping in UK Biobank - an exemplar study on stroke” has been accepted by BMC Medical Informatics and Decision Making. This is our first work to combine NLP + reusable domain knowledge (encoded as rules) to derive sub-phenotypes (specific conditions like intracerebral hemorrhage stroke).
(17 May 2021) Our paper “A Systematic Review of Natural Language Processing Applied to Radiology Reports” has been accepted by BMC Medical Informatics and Decision Making. Preprint arXiv. Well done, Arlene!
(5 May 2021) Our work with Dr Adam Levine on Pathology NLP has been accepted for oral presentation at 13th Joint meeting of the BDIAP and The Pathological Society on 6-8 July 2021, titled Natural Language Processing for the Automated Extraction of Tumour Immunohistochemical Profiles from Diagnostic Histopathology Reports. What a great start to work on Pathology NLP!
(21 April 2021) Hang’s work - “Weakly supervised entity linking and ontology matching to enrich patients’ rare disease coding” has been accepted to the 2021 virtual workshop on Personal Health Knowledge Graphs. Well done, Hang!
(12 April 2021) Honghan is invited to speak at Personal Knowledge Graph Workshop 2021 about our work and thoughts on knowledge graph and particularly the “personal” aspect in health data science.
(24 March 2021) Axes of Prognosis - Identifying Subtypes of COVID-19 Outcomes - a work led by Emma and Claire now on medRxiv. Both are first year PhD students of HDR UK, Turing Institute and Wellcome CDT. Great work, Emma and Claire! Github Repo
(25 February 2021) Our paper “Explainable Automated Coding of Clinical Notes using Hierarchical Label-wise Attention Networks and Label Embedding Initialisation” has been accepted by Journal of Biomedical Informatics. Well done, Hang!
(25 February 2021) Our recent work of using deep learning to automate diagnosis coding (ICD) from discharge summaries. Explainable Automated Coding of Clinical Notes using Hierarchical Label-wise Attention Networks and Label Embedding Initialisation, preprint, GitHub Repo
(20 February 2021) Invited talk from Jose Manuel Gomez-Perez “On the Role of Knowledge Graphs and Language Models in Machine Understanding of Scientific Documents”.
(1 February 2021) Excited to kick off our project with NHS Lothian to use Natural Language Processing + Machine learning to automatically triage patients in long-waiting-list (due to the impact of COVID-19) of dermatology. This study is funded by Data-Driven Innovation.
(1 February 2021) Many congratulations to Minhong for having successfully passed her viva with minor corrections! Huge achievement, Dr Wang!
(21 January 2021) Our paper “Evaluation of NEWS2 for predicting severe COVID outcome” is now published in BMC Medicine. Key findings - NEWS2 had poor-to-moderate discrimination for severe outcome (ICU/death) at 14 days, but discrimination improved when blood and physiological parameters were added. Mentions in News Stories - dailymail / eurekalert / KCL
(4 January 2021) Our paper “Benchmarking network-based gene prioritization methods for cerebral small vessel disease” has been accepted by Briefings in Bioinformatics.

2020

(16 December 2020) Evaluation and Improvement of the National Early Warning Score (NEWS2) for COVID-19 - a multi-hospital study. MedRxiv, accepted by BMC Medicine now.
(10 December 2020) Excess deaths in people with cardiovascular diseases during the COVID-19 pandemic, MedRxiv, now accepted by European Journal of Preventive Cardiology.
(11 November 2020) Our ensemble learning for COVID-19 work has been accepted by Journal of the American Medical Informatics Association. This study synergises seven multinational prediction models to realise a robust and high-performing prediction model. This is the first work to use ensemble learning for risk prediction of COVID-19 and the validation cohorts are one of the most diverse international COVID-19 datasets (4 cohorts with mortality rates 2.4-45%). The ensemble model consistently outperformed any single model in all aspects validated. DOI:10.1093/jamia/ocaa295, GitHub Repo
(16 October 2020) Many congratulations to Victor and Hang, who have each been awarded £1,000 by The iTPA Translational Innovation Competition. Details
(23 September 2020) We are hiring! Two posts (£33,797 - £40,322) in NLP / Health Data research. Deadline 14 Oct 2020, based in Usher, Edinburgh Medical School.
(12 September 2020) Our study uses ensemble learning to synergise seven multinational prediction models to realise a robust and high-performing prediction model. This is the first work to use ensemble learning for risk prediction of COVID-19 and the validation cohorts are one of the most diverse international COVID-19 datasets (4 cohorts with mortality rates 2.4-45%). The ensemble model consistently outperformed any single model in all aspects validated. preprint, GitHub Repo
(11 August 2020) Great news - Huayu Zhang’s project, titled “Towards data-driven fine management of COVID-19 hospitalization risk for rare-disease patients” is awarded £1,000 by The iTPA Translational Innovation Competition. Quite an achievement for an early stage postdoctoral researcher!
(11 June 2020) Our study shows “Adding age and a minimal set of blood parameters to NEWS2 improves the detection of patients likely to develop severe COVID-19 outcomes” - “Evaluation and Improvement of the National Early Warning Score (NEWS2) for COVID-19 - a multi-hospital study”. MedRxiv.
(11 June 2020) Our study shows “CVD services have dramatically reduced across countries, leading to potential (probably avoidable) excess mortality during and after the COVID-19 pandemic” - “Excess deaths in people with cardiovascular diseases during the COVID-19 pandemic”. MedRxiv.
(3 May 2020) Our COVID-19 risk prediction preprint out on Medrxiv - “Risk prediction for poor outcome and death in hospital in-patients with COVID-19 - derivation in Wuhan, China and external validation in London, UK”. MedRxiv.
(1 May 2020) Dr Honghan Wu started his new job as a lecturer in health informatics at IHI, UCL. He will continue his personal fellowship project at both the University of Edinburgh and UCL.
(27 February 2020) Paper accepted by HealTAC 2020 - “Identifying physical health comorbidities in a cohort of individuals with severe mental illness - An application of SemEHR”
(21 February 2020) Paper accepted by ECAI 2020 - “Modeling Rare Interactions in Time Series Data Through Qualitative Change - Application to Outcome Prediction in Intensive Care Units”
(27 January 2020) Delighted to co-develop an NLP work package in the exciting Advanced Care Research Centre programme, a £20m investment dedicated to the field of ageing and care.
(20 January 2020) Great to have a short visit to Department of Orthopaedics and Traumatology, Hong Kong University, discussing the exciting opportunity of personalised pain prediction after spine treatments using multimodal data (free text + imaging).
(17 January 2020) Our paper - “On Classifying Sepsis Heterogeneity in the ICU - Insight Using Machine Learning” has been published by JAMIA. https://doi.org/10.1093/jamia/ocz211
(14 January 2020) Delighted to have a kick-off meeting with Edinburgh Innovations team for our project “Towards an AI-driven Health Informatics Platform for supporting clinical decision making in Scotland – a pilot study in NHS Lothian” funded by Wellcome iTPA 2019.

2019

(6 December 2019) Knowledge driven phenotyping on medrxiv now - an automated approach to translating phenotypes defined in domain vocabularies into queries executable on heterogeneous and distributed health datasets.
(28 November 2019) Using SemEHR on EHRs to answer an important clinical question – “Association of physical health multimorbidity with mortality in people with schizophrenia spectrum disorders - Using a novel semantic search system that captures physical diseases in electronic patient records” has been accepted by Schizophrenia Research. DOI:10.1016/j.schres.2019.10.061
(11 November 2019) Great to know our paper - “Semantic computational analysis of anticoagulation use in atrial fibrillation from real world data” has been accepted by PLoS One. Preprint
(22 October 2019) Our first NLP transfer learning paper for identifying phenotype mentions has been accepted by JMIR Medical Informatics, a relatively new journal (started in 2013) with an inaugural impact factor of 3.188.
(20 July 2019) Delighted to know our proposal for HDR UK NLP Implementation Project has been awarded as one of the 3 National Implementation Projects. Look forward to working on this exciting UK-wide collaboration.
(5 April 2019) The 4th International Workshop on Knowledge Discovery in Healthcare Data will be with IJCAI2019 in Macao, China. Website, submission and dates.
(13 January 2019) Delighted to know our “Sprint” project proposal – “Building the Knowledge Graph for UK Healthcare Data Science” is awarded by HDR UK as part of Digital Innovation Hub Programme and one of the ten innovative data solutions to prove the potential of health data to transform lives

2018

(19 October 2018) Thrilled to be invited to give a talk about our CogStack EHR platform in China’s National Centre for Cardiovascular Diseases. Great to learn their excellent infrastructures, research and datasets; and grand vision! Look forward to the first CogStack deployment in China for supporting EHR based research.
(15 February 2018) Proudly begin an MRC/Rutherford Fund Fellowship of HDR UK hosted by the Centre for Medical Informatics of the University of Edinburgh. My research focuses on “Deriving an actionable patient phenome from healthcare data”.
(10 February 2018) An application paper describing our SemEHR toolkit has been accepted by JAMIA, titled “SemEHR - A General-purpose Semantic Search System to Surface Semantic Data from Clinical Notes for Tailored Care, Trial Recruitment and Clinical Research”.

2017

(27 November 2017) Our work on using knowledge graph techniques to predict adverse drug reactions has been published by Scientific Reports.
(5 July 2017) Our data harmonisation and search toolkit for EHR – CogStack is mentioned in Annual Report of the Chief Medical Officer 2016 by the UK Government.