RadSearch, a Semantic Search Model for Accurate Radiology Report Retrieval with Large Language Model Integration

Based on the analysis of 29 730 and 13 958 MRI or CT reports (internal and external test sets, respectively), this study highlights the value of a semantic search model for retrieving reports that contain relevant information and for improving the diagnostic accuracy of large language models.

Background : Current radiology report search tools are limited to keyword searches, which lack semantic understanding of underlying clinical conditions and are prone to false positives. Semantic search models address this issue, but their development requires scalable methods for generating radiology-specific training data.

Purpose : To develop a scalable method for training semantic search models for radiology reports and to evaluate a model, RadSearch, trained using this method.

Materials and Methods : In this retrospective study, a scalable method for generating training examples for semantic search was applied to CT and MRI reports generated between December 2021 and January 2022, and was used to train the model RadSearch. RadSearch performance was evaluated using four internal test sets (including one subset) and one external test set from another large tertiary medical center, including chest, abdomen, and head CT reports generated between December 2015 and June 2023. Performance was evaluated for findings-to-impression matching, retrieving reports with the same examination type, retrieving reports relevant to free-text queries, and improving the ability of a large language model (LLM) (Llama 3.1 8B Instruct) to provide accurate diagnoses from report finding descriptions. RadSearch performance was compared with that of other embedding models specialized for symmetric (All MPNet Base) and asymmetric (MS MARCO DistilBERT Base) semantic search and a state-of-the-art semantic search model (GTE-large). A reference set of 100 diagnoses with common radiologic descriptions was used for the LLM evaluation. Findings-to-impression matching and free-text query accuracy P values were calculated using χ² and McNemar tests.
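The retrieval step that embedding models such as those compared above typically rely on can be sketched as a cosine-similarity ranking. This is a minimal illustration only, not the RadSearch implementation: the function names (`cosine`, `rank_reports`), the report identifiers, and the three-dimensional toy vectors are all hypothetical, and a real system would obtain its vectors from a trained embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_reports(query_vec, report_vecs, top_k=3):
    """Return the ids of the top_k reports most similar to the query vector."""
    scored = [(cosine(query_vec, vec), report_id)
              for report_id, vec in report_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [report_id for _, report_id in scored[:top_k]]


# Hypothetical toy embeddings; real embeddings have hundreds of dimensions.
REPORTS = {
    "chest_ct_a": [0.9, 0.1, 0.0],
    "head_ct_b": [0.0, 0.2, 0.9],
    "abdomen_ct_c": [0.1, 0.9, 0.1],
}
QUERY = [0.8, 0.2, 0.1]  # hypothetical embedding of a free-text query

if __name__ == "__main__":
    print(rank_reports(QUERY, REPORTS, top_k=2))
```

In this sketch the query is matched by vector proximity rather than by shared keywords, which is the general property that distinguishes semantic search from keyword search.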

Results : The training set included 16 690 reports; the internal test sets included 13 598, 6178, and 9954 reports; and the external test set included 13 958 reports. For simulated free-text clinical queries, RadSearch successfully retrieved reports containing the specified findings for 83.0% (498 of 600) of reports and matching location for 89.8% (521 of 580) of reports, outperforming GTE-large, with performance at 65.7% (394 of 600; P < .001) and 58.8% (341 of 580; P < .001), respectively. For 100 report finding descriptions, the baseline accuracy of Llama 3.1 8B Instruct in providing the correct diagnosis without any embedding model search assistance was 30% (30 of 100), improving to 61% (61 of 100) with RadSearch integration (P < .001), which outperformed GTE-large integration (47% [47 of 100]; P = .03).
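The percentages and totals reported above can be checked from the stated counts. The short sketch below only redoes the arithmetic; the helper name `pct` is hypothetical, and no P values are recomputed because the paired per-query outcomes are not given in the abstract.

```python
def pct(numerator, denominator):
    """Percentage rounded to one decimal place, as reported in the abstract."""
    return round(100 * numerator / denominator, 1)


# Counts taken directly from the Results paragraph.
radsearch_findings = pct(498, 600)   # reports containing the specified findings
radsearch_location = pct(521, 580)   # reports with matching location
gte_findings = pct(394, 600)
gte_location = pct(341, 580)

# The three internal test sets add up to the 29 730 reports cited in the summary.
internal_total = 13598 + 6178 + 9954

# Absolute gain in LLM diagnostic accuracy, in percentage points.
llm_gain_radsearch = 61 - 30
llm_gain_gte = 47 - 30

if __name__ == "__main__":
    print(radsearch_findings, radsearch_location, gte_findings, gte_location)
    print(internal_total, llm_gain_radsearch, llm_gain_gte)
```

The counts are consistent with the reported percentages, and the gain in LLM accuracy is 31 percentage points with RadSearch versus 17 with GTE-large.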

Conclusion : A semantic search model trained with scalable methods achieved state-of-the-art performance in retrieving reports with relevant findings and improved LLM diagnostic accuracy.

Radiology, open-access article, 2025
