How To Leverage Machine Learning & Deep Learning For Natural Language Processing In Clinical Trials

Machine learning (ML) and deep learning (DL) are transforming natural language processing (NLP). A key use case is medical coding. ML models can significantly reduce the manual effort this process requires and make auto-coding tools more lightweight. They can also streamline dictionary updates and final coding reviews through direct coding and automatic query preparation.

So, how can we most effectively leverage ML and DL for NLP, and what outcomes can we expect to see?

The Current Medical Coding Process

Typically, the medical coding workflow involves a rules-based automated coding tool, followed by two lines of manual review by human coders. In their simplest form (i.e., without a synonym list), auto-coding tools typically code 50–60% of simple input terms successfully. Adding a synonym library raises this rate, but building and maintaining such a library is labor intensive.
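As a rough illustration of why the synonym list matters, here is a minimal sketch of a rules-based auto-coder. The dictionary entries, codes, and synonyms are made up for the example and do not come from any real coding dictionary; the function names are hypothetical too.

```python
from typing import Optional

# Hypothetical mini-dictionary: normalized term -> code (illustrative values only).
DICTIONARY = {
    "headache": "T001",
    "nausea": "T002",
    "fruit allergy": "T003",
}

# Hypothetical synonym list that a coding team would have to maintain by hand.
SYNONYMS = {
    "head ache": "headache",
    "feeling sick": "nausea",
}


def normalize(term: str) -> str:
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(term.lower().split())


def auto_code(verbatim: str, use_synonyms: bool = True) -> Optional[str]:
    """Return a code on an exact match, or None to send the term to manual review."""
    term = normalize(verbatim)
    if use_synonyms:
        term = SYNONYMS.get(term, term)
    return DICTIONARY.get(term)


print(auto_code("Head ache"))                      # matched through the synonym list
print(auto_code("Head ache", use_synonyms=False))  # no exact match without it
```

Every variant the tool should recognize has to be added to the synonym list by someone, which is where the maintenance cost comes from.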

NLP Techniques

There are three main groups of NLP techniques, each with its own level of complexity, strengths, and weaknesses:

Symbolic methods: rule-based parsing or approximate string matching
ML: support vector machines or logistic regression
DL: recurrent neural networks or large language models
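To make the symbolic group concrete, here is a small sketch of approximate string matching using Python's standard difflib module. The dictionary terms and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib
from typing import Optional

# Hypothetical dictionary terms, for illustration only.
DICTIONARY_TERMS = ["headache", "nausea", "fruit allergy", "suspected covid-19"]


def fuzzy_match(verbatim: str, cutoff: float = 0.8) -> Optional[str]:
    """Return the closest dictionary term by character similarity, or None."""
    query = verbatim.lower().strip()
    matches = difflib.get_close_matches(query, DICTIONARY_TERMS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(fuzzy_match("headach"))                      # a typo is close enough to match
print(fuzzy_match("probable covid-19 infection"))  # same concept, different wording: None
```

String similarity handles misspellings well, but it has no notion of meaning, so a reworded term that expresses the same concept falls below the cutoff.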

When choosing the right NLP technique, it’s essential to understand the specific challenges of any use case. For medical coding, these include complex vocabulary, new words, and many possible outputs.

While DL is a subset of ML, the two have different capabilities and limitations. Classical ML typically requires heavier data preparation. DL requires less data preparation but uses a more complex model and therefore needs more training data.

ML for Medical Coding

ML can streamline the current medical coding process by performing a first-line coding review and removing the need for a synonym list. In this model, a simplified rules-based coding tool handles the easier terms. The remaining verbatim terms are then sent to the ML model, which finds the best dictionary match. A second-line medical coder then reviews this recommendation.
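The workflow described above can be sketched as a two-stage pipeline. This is only an illustration: the dictionary terms are made up, and `model_suggest` is a hypothetical stand-in that ranks entries by character-trigram cosine similarity, not a trained ML model. In practice a trained model would take its place.

```python
from collections import Counter
from math import sqrt

# Hypothetical dictionary terms, for illustration only.
DICTIONARY_TERMS = ["headache", "nausea", "fruit allergy", "suspected covid-19"]


def rules_based_code(verbatim):
    """Stage 1: simplified rules-based tool, exact match after normalization."""
    term = " ".join(verbatim.lower().split())
    return term if term in DICTIONARY_TERMS else None


def trigram_vector(text):
    """Count character trigrams of the padded, lowercased text."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def model_suggest(verbatim):
    """Stage 2 stand-in (assumption): return the most similar dictionary term."""
    query = trigram_vector(verbatim)
    return max(DICTIONARY_TERMS, key=lambda t: cosine(query, trigram_vector(t)))


def code_term(verbatim):
    """Route a verbatim term through the rules first, then the model."""
    exact = rules_based_code(verbatim)
    if exact is not None:
        return {"term": exact, "source": "rules"}
    # Model suggestions are recommendations for the second-line coder to review.
    return {"term": model_suggest(verbatim), "source": "model"}


print(code_term("Headache"))
print(code_term("probable covid-19 infection"))
```

The point of the design is the routing: no synonym list appears anywhere, and everything the rules cannot match exactly becomes a model recommendation for the second-line coder.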

To yield accurate suggestions, the ML model needs to read the input and dictionary terms, understand their meaning, and retrieve the correct entry. This is an NLP task, and one that ML is well suited to in medical coding.

DL for Medical Coding

DL can address the challenges of medical coding. The model’s flexibility allows it to adapt and learn relationships across words within the input term. This means it doesn’t rely on assumptions that might be wrong for the specific case and removes the need to implement organization-specific scenarios.

Leveraging semantics allows the model to select the right dictionary entry from many choices and to deal with the high variability among input terms that express the same concept. Thanks to this technique, and others like transfer learning, it’s also possible to include a priori medical knowledge. This makes the model better overall for medical coding and allows it to handle terms never seen in training data.

For example, given the input term ‘probable covid-19 infection’, the model correctly codes it as ‘suspected covid-19’. It ignores ‘infection’ to focus on ‘covid-19’ and understands that ‘probable’ is similar to ‘suspected’ in this context. Another example is ‘honeydew melon allergy’ being correctly coded as ‘fruit allergy’. While the dictionary doesn’t have a more precise entry, the DL model can use its prior knowledge that honeydew melon is a fruit.

Instead of outputting only one dictionary entry, the solution can also suggest several entries to review together with a confidence score. This can be used to bring attention to terms that have a low confidence score.
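A minimal sketch of how ranked suggestions with confidence scores might be produced from a model's raw scores. The source does not say how confidence is computed; the softmax normalization, the 0.7 attention threshold, and the example scores are all assumptions made for illustration.

```python
import math


def rank_suggestions(raw_scores, k=3):
    """Convert raw per-entry scores to softmax confidences and keep the top k."""
    peak = max(raw_scores.values())  # subtract the max for numerical stability
    exps = {term: math.exp(score - peak) for term, score in raw_scores.items()}
    total = sum(exps.values())
    ranked = sorted(
        ((term, value / total) for term, value in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


def needs_attention(suggestions, threshold=0.7):
    """Flag a term whose best suggestion falls below the confidence threshold."""
    return suggestions[0][1] < threshold


# Hypothetical raw scores a model might assign to dictionary entries.
scores = {"suspected covid-19": 4.0, "covid-19": 2.5, "viral infection": 1.0, "headache": -2.0}
top = rank_suggestions(scores)
for term, confidence in top:
    print(f"{term}: {confidence:.2f}")
print("flag for attention:", needs_attention(top))
```

Showing the runner-up entries alongside the top one gives the reviewer alternatives to choose from when the model is unsure.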

Using a DL solution, we have typically achieved higher than 90% accuracy for both adverse event and medication coding. This demonstrates that DL is suitable for medical coding.

The Future of DL & ML in Medical Coding

There are two further areas of exploration for ML and DL in medical coding: query detection and direct coding of high-confidence terms.

ML can reduce the effort needed in the quality control process by directly raising queries instead of just auto-coding the input terms. There’s also the opportunity to directly code terms that receive a high confidence score, as outlined above. This could be achieved by establishing a confidence threshold above which terms can be directly coded and approved without manual review.
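One way such a threshold could be chosen is from previously reviewed terms: pick the lowest confidence cut-off at which the terms above it were coded accurately enough. The sketch below assumes a list of (confidence, was_correct) pairs from past reviews and a target accuracy; both the data and the function names are hypothetical.

```python
def pick_threshold(validated, target_accuracy=0.99):
    """Return the lowest cut-off whose accepted terms meet the target accuracy.

    `validated` is a list of (confidence, was_correct) pairs from reviewed terms.
    Returns None if no cut-off reaches the target.
    """
    best = None
    for cutoff in sorted({conf for conf, _ in validated}, reverse=True):
        accepted = [ok for conf, ok in validated if conf >= cutoff]
        accuracy = sum(accepted) / len(accepted)
        if accuracy >= target_accuracy:
            best = cutoff
    return best


def route(confidence, threshold):
    """Directly code high-confidence terms; send everything else to a reviewer."""
    if threshold is not None and confidence >= threshold:
        return "direct_code"
    return "manual_review"


# Hypothetical review history: (model confidence, did the reviewer agree?).
history = [(0.99, True), (0.97, True), (0.95, True), (0.90, False), (0.85, True), (0.60, False)]
threshold = pick_threshold(history)
print(threshold, route(0.96, threshold), route(0.70, threshold))
```

A stricter target accuracy pushes the threshold up and sends more terms to manual review, so the choice is a trade-off between effort saved and risk accepted.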

These opportunities and the efficiencies already demonstrated make a compelling case for leveraging ML and DL for NLP in medical coding. Want to learn more? Contact us to continue exploring ML and DL in clinical data management.
