General procedure workflow

Extending PubMed searches to ClinicalTrials.gov through a machine learning approach for systematic reviews

Corrado Lanera, Clara Minto, Abhinav Sharma, Dario Gregori, Paola Berchialla, Ileana Baldi

November 2018

DOI GitHub repository View Journal Article

Abstract

Objectives: Despite their essential role in collecting and organizing published medical literature, indexed search engines are unable to cover all relevant knowledge. Hence, current literature recommends the inclusion of clinical trial registries in systematic reviews (SRs). This study aims to provide an automated approach to extend a search on PubMed to the ClinicalTrials.gov database, relying on text mining and machine learning techniques.

Study Design and Setting: The procedure starts from a literature search on PubMed. Next, it trains a classifier that can identify documents with a comparable word characterization in the ClinicalTrials.gov clinical trial repository. Fourteen SRs, covering a broad range of health conditions, are used as case studies for external validation. A cross-validated support-vector machine (SVM) model is used as the classifier.

Results: The sensitivity was 100% in all SRs except one (87.5%), and the specificity ranged from 97.2% to 99.9%. The ability of the instrument to distinguish on-topic from off-topic articles, measured as the area under the receiver operating characteristic curve, ranged from 93.4% to 99.9%.

Conclusion: The proposed machine learning instrument has the potential to help researchers identify relevant studies in the SR process by reducing workload, without losing sensitivity and at a small price in terms of specificity.
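For readers less familiar with the metrics reported in the abstract, the following minimal Python sketch shows how sensitivity and specificity are computed from a classifier's confusion counts. The counts are hypothetical, chosen only so that the arithmetic lands on figures of the same magnitude as those reported; they are not taken from the paper or its data, and the function names are illustrative.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Share of truly relevant records that the classifier flags as relevant."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Share of truly irrelevant records that the classifier discards."""
    return tn / (tn + fp)


# Hypothetical counts for a single systematic review (NOT from the paper):
# 8 relevant registry records, of which 7 are flagged and 1 is missed;
# 10,000 irrelevant records, of which 280 are wrongly flagged.
tp, fn = 7, 1
tn, fp = 9720, 280

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)

print(f"sensitivity = {sens:.1%}")
print(f"specificity = {spec:.1%}")
```

With these assumed counts, one missed record out of eight relevant ones is enough to bring sensitivity down to 87.5%, while a specificity of 97.2% still leaves 280 off-topic records for manual screening.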

Type

Journal article

Publication

Journal of Clinical Epidemiology, 103

The main data are too large to be included in an R package or in a GitHub repository. Click the Dataset button above to find a folder named non_git_nor_build_derived_data/ (2.86 GB), which includes the full data used. A description of its content is available on the GitHub project's homepage (Code button).

Corrado Lanera

Research Fellow (RTD-A), data scientist, and trainer