Romina Mendez: Employing AWS Comprehend Medical for Medical Data Extraction in Healthcare Analytics

Romina Mendez

A Step-by-Step Guide to Using Entities, RxNorm and SNOMED CT

The goal of this tutorial is to provide a guide on how to use Amazon Comprehend Medical for identifying medical entities and extracting information, including RxNorm codes, SNOMED CT concepts, and other attributes. We will cover the following key topics:

🏷️ Dataset: We will use a dataset from Kaggle related to the USMLE® Step 2 Clinical Skills examination.
🏷️ Data Extraction and Preparation: Preparation of the dataset to extract entities.
🏷️ Real-Time Processing with AWS Comprehend Medical: Explore how to use Amazon Comprehend Medical in real-time to analyze and extract medical information from unstructured text data.
🏷️ Batch Processing with AWS Comprehend Medical: Discover how to set up and execute batch processing jobs with Amazon Comprehend Medical.

🩺 Amazon Comprehend Medical

Amazon Comprehend Medical is a service designed to extract information from unstructured medical texts using natural language processing model (NLP) while ensuring compliance with HIPAA requirements. This service provide the following outputs: * Entities: Key medical elements identified in the text, such as medications, diagnoses, symptoms, and procedures. * RxNorm Codes: These codes are derived from a medical ontology that provides normalized names for medications and drugs, ensuring consistent identification and categorization of medication-related information. * SNOMED CT: This code set originates from a comprehensive medical ontology that represents clinical concepts such as diseases, procedures, and diagnoses, facilitating precise and interoperable health data.

At the time of writing this article, only English texts can be processed usign this service.

🛡️ What is HIPAA?

The HIPAA (Health Insurance Portability and Accountability Act) privacy rule sets national standards for the protection of individually identifiable health information in the United States.

This refers to data, including demographic information, that relates to: * The individual’s past, present, or future physical or mental health or condition. * The provision of healthcare to the individual. * The past, present, or future payment for healthcare provided to the individual, and that identifies the individual or can reasonably be used to identify them, where this includes common identifiers such as name, address, date of birth, and Social Security number.

📚 What are the vocabularies?

“Vocabularies” refer to structured sets of standardized terms and codes used to capture, classify, and analyze patient data. These include controlled vocabularies, terminologies, hierarchies, and ontologies, and are essential for interoperability between healthcare systems, enabling data exchange and facilitating global research. This practice dates back to the 1660s, as shown in the image below.

“Medical vocabularies date back to the Bills of Mortality in medieval London to manage outbreaks of plague and other diseases.” The Book Of Ohdsi

Image source: The Book Of Ohdsi

🩺 AWS Comprehend Medical: Methods for Data Extraction

After understanding the importance of medical vocabularies, we can explore how AWS Comprehend Medical leverages these vocabularies to extract and standardize medical data.

In the following sections, we will describe the specific methods used by AWS Comprehend Medical to process and analyze medical texts.

🩺 AWS Comprehend Medical: Detect Entities

The detect_entities_v2 method from AWS Comprehend Medical identifies and classifies various categories of medical information within a text. Below is an image illustrating the categories detected by this method.

For each of these classes, not only are the categories to which the entity belongs detected, but also other key values. These values include:

Type: The specific type of entity within a category.
Attribute: Relevant information about the entity, such as the dosage of a medication.
Trait: Additional aspects that Amazon Comprehend Medical understands about an entity based on context, such as the NEGATION trait if a medication is not being administered to the patient.

Below, you can see the additional data that can be obtained for each category.

🩺 AWS Comprehend Medical: RxNorm

RxNorm is a standardized medical ontology that provides normalized names for clinical medications, and also It serves as a comprehensive resource for identifying and categorizing drugs and their various forms. RxNorm links these standardized names to many other drug vocabularies, ensuring consistency and interoperability across different healthcare systems.

Below is an example with a medication and the related concepts in RxNorm.

🩺 AWS Comprehend Medical: SNOMED CT (Clinical Terms)

SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) is a comprehensive multilingual health terminology system. It provides a standardized set of codes, concepts, and synonyms to represent clinical information, including diseases, procedures, and diagnoses.

SNOMED CT facilitates semantic interoperability by allowing mapping between different health vocabularies, such as ICD-9 and ICD-10.

🩺 Dataset

For this tutorial, we will use a dataset from Kaggle that is associated with the USMLE® Step 2 Clinical Skills examination, this licensing exam evaluates the examinee’s ability to recognize pertinent clinical facts during interactions with standardized patients.

We will use select medical notes from this dataset to process and analyze the results obtained using the AWS Comprehend Medical service.

source: NBME - Score Clinical Patient Notes

🔧 Prerequisites

To complete this tutorial, you need to meet the following prerequisites:

AWS Credentials: You must configure the AWS_ACCESS_KEY and AWS_SECRET_KEY credentials. These are crucial for authenticating and authorizing access to AWS services.
S3 Bucket: Create an S3 bucket to store your data. In this example, we will use a bucket named dev-medical-notes located in the us-east-1 region.
Permissions: Check the IAM folder in this repository for the necessary policies and permissions to apply.

⚠️ If you’re not familiar with creating AWS credentials or setting up an S3 bucket, you can follow this guide: Create a Bucket.

🔐 AWS Credentials

For this tutorial, AWS credentials (AWS_ACCESS_KEY and AWS_SECRET_KEY) are required, these credentials are essential for authenticating and authorizing access to AWS services and they can be generated using the IAM (Identity and Access Management) service.

⚠️ Remember to keep your credentials secure and avoid sharing them to prevent unauthorized access to your AWS account.

🔐 IAM Policies and Role

For this tutorial, you need a role and a user with specific policies applied. In the GitHub repository, you’ll find a folder containing the policies that need to be applied.

📦 AWS Libraries

The main libraries we will use are:

boto3: This library allows us to connect programmatically to Amazon Web Services (AWS) services.
awswrangler: This open-source Python library integrates pandas with AWS, enabling seamless data manipulation and analysis within AWS services.

🔧 Configuration Setup

To manage AWS credentials, we will use the python-dotenv library to handle environment variables. You need to create a file named .env in the root of the project and configure your AWS credentials there. Below, you will find the format for the file.

File Name: .env

AWS_SECRET_KEY='mySecretKey'
AWS_ACCESS_KEY='myAccessKey'
AWS_ROLE='arn:aws:iam::xxxx:role/role-name'

⚠️ Considerations

To simplify this tutorial and reduce the complexity of implementing a solution, two classes were created, which are as follows:

📦S3bucket Class

To simplify the explanation of this tutorial and manage the files stored in an S3 bucket, I have created a class named S3bucket. This class will enable us to perform various common operations such as listing the files in a bucket, writing a JSON file, writing a Parquet file, and reading a JSON file.

📦ComprehendMedical Class

To make it easier to use AWS Comprehend Medical and create DataFrames from the processed data, I have developed a class named ComprehendMedical. This class is designed to streamline interactions with the service’s methods, including detect_entities_v2, infer_rx_norm, and infer_snomed_ct. Below are the primary methods of this class and their functionalities:

get_entities: This method uses the detect_entities_v2 function from AWS Comprehend Medical to identify medical entities in a given text.
get_rxnorm: This method employs the infer_rx_norm function to extract medication-related information from the text.
get_snomed: This method uses the infer_snomedct function to identify and obtain information related to standardized medical terms in the SNOMED CT system.

⚒️ Methods to Generate DataFrames Each of the above methods also has a version that returns the results in DataFrame format using pandas. These DataFrames are then saved in Parquet format, which is efficient for storage and querying, and facilitates integration with other data processing tools. The Parquet files are stored in a new 📁 folder named “stage” within the same Amazon S3 bucket.

# libraries for data processing
import json, os,io,re,uuid
from tqdm import tqdm
from datetime import datetime
from pprint import pprint
import pandas as pd 
import numpy as np

# libraries for loading environment variables
from dotenv import load_dotenv

# aws libraries
import boto3
import awswrangler as wr

# libraries for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib as mpl

load_dotenv()
AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY")
AWS_SECRET_KEY = os.getenv("AWS_SECRET_KEY")
AWS_ROLE = os.getenv("AWS_ROLE")
AWS_REGION_NAME = 'us-east-1'
BUCKET_NAME = 'dev-medical-notes'

Here we are creating an object from the class called S3bucket and you can see the complete code in the following link of this class

# create an object of the class S3bucket for the bucket 'dev-medical-notes'
s3 = S3bucket(BUCKET_NAME, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION_NAME)

🟣 Data Extraction and Preparation

The first step in the process is to extract a subset of medical notes and upload them to Amazon S3 in JSON {} format. To facilitate the organization and management of these files, they will be stored in a 📁 folder named “raw,” which will be preceded by a 📅 date prefix (dt).

The 📁 “raw” folder will serve as the container for the original, unprocessed files, while the date prefix will help classify and manage the files based on when they were uploaded.

data = pd.read_csv('data/patient_notes.csv')
data['pn_history_len']=data['pn_history'].str.len()

# Plotting the distribution of the number of characters in patient notes
mpl.rcParams['font.family'] = 'serif'
plt.figure(figsize=(12,4))
sns.histplot(data['pn_history_len'], color=sns.color_palette('pastel')[4])
plt.title('Distribution number of characters in patient notes')
plt.xlabel('Number of characters')
plt.ylabel('Frequency')
plt.show()

🔍 Analysis of Clinical Note Lengths
In summary, there is significant variability in the length of clinical notes. However, most notes typically fall within a certain range. Below are the key points and future considerations for this analysis:

📊 Distribution of Note Lengths
- The majority of notes are between 800 and 1000 characters.
- Some notes are shorter, with less than 200 characters.
🧹 Data Cleaning
- It is important to ensure that the notes do not contain unnecessary characters, such as repetitions or sequences of special symbols. It is recommended to perform a preliminary cleaning of the notes to remove these characters before processing.
- It is recommended to perform a preliminary cleaning of the notes to remove these characters before processing.
🔍 Future Research Questions
Although some of these questions cannot be answered with our dataset, these are some questions we could consider for analyzing a similar dataset:
- Type of Patients: What types of patients have the shortest and longest notes?
- Relation to Severity: Is there a relationship between the length of the note and the severity of the patient's condition?
- Temporal Evolution: How has the average length of the notes changed over time?

Additionally, since AWS Comprehend Medical processes notes up to 10,000 characters, performing this analysis is ideal for optimizing the usage of this service.

Selecting Random Notes

# seelcting random notes to test the function
random_notes = [42141, 39049, 40593, 38851, 41068, 39457, 39152, 39665, 37830, 41717]

# selecting the notes from the data
data_test = data.reset_index().rename({'index':'id'},axis=1).loc[random_notes,:].to_dict('records')
# see the first 3 notes
data_test[:3]

[{'id': 42141,
  'pn_num': 95330,
  'case_num': 9,
  'pn_history': 'Ms. Madden is a 20 yo female presenting w/ the worst HA of her life, unlike anything that she has had before. It is a dull + constant pain, it has gotten progressively worse, started yesterday morn. It is a diffuse pain felt around her head and is nonpulsating. She has photophobia but no phonophobia, has nausea, and vomited 3x yesterday. No sick contacts. Felt warm earlier. No chills, fatigue, CP, SOB, abd pain, or rashes. No sx before the onset of this HA. Ibuprofen, tylenol, sleep have not helped. Walking + bending over makes the pain worse. She has had HA before once or twice a yr but they are usually very mild. \r\nMeds: OCPs\r\nFH: mother w/ migraines, dad w/ HPL\r\nSocial alcohol use, 3 or 4 marijuana joints per week, no tobacco use\r\nPMH: none significant',
  'pn_history_len': 765},
 {'id': 39049,
  'pn_num': 92131,
  'case_num': 9,
  'pn_history': '20 yo F, c/o headaches.\r\n- started yesterday, right after she woke up. 8/10, dull, constant headache, getting worse. \r\n- nothing makes it better. exacerbated by leaning forward and walking.\r\n- nausea and vomiting. vomited 3 times, green fluids, no blood. \r\n- photophobia. \r\n- mild fever. \r\nROS: none except above. occasional headaches.\r\nPMH: none. Meds: OCP. All: NKDA.\r\nPSH: none. \r\nFH: mother - migrain. father-  hyperlipidemia. \r\nSH: sexually active with boyfriend, using condoms and OCP. \r\nnot smoking, drinkes 2-3 a week. smoking marijuana 3-4 times a week.',
  'pn_history_len': 562},
 {'id': 40593,
  'pn_num': 93723,
  'case_num': 9,
  'pn_history': 'A 20 yo female presents to the clinic with c/o a headache since yesterday. Pt states headache is constant, progressive, and diffuse. Associaed with nausea, vomiting , and decreased appetite. Pain is 10/10 in intensity, non-radiating. Patient states she also felt warmer yesterday. Pt states muscle aches and runny nose.. Patient denies cough, chst pain, adbominal pain. Pt denies recent travel.\r\n\r\nAllergies: none\r\nmeds: OCPs\r\nPMHx:/PShx/hopsi: none\r\nFamily hx:amily hx (mother-migraine headache)\r\nsocial hx: \r\n',
  'pn_history_len': 511}]

# write the data to the s3 bucket
for record in tqdm(data_test):
    dt =f"dt={datetime.now().strftime('%Y%m%d')}"
    record_file_name = f"medical_record_noteId_{record['id']}.json"
    s3.write_s3_json(data=record, filename=f"raw/{dt}/{record_file_name}")

100%|██████████| 10/10 [00:02<00:00,  3.48it/s]

🧹 Retrieval and Cleaning of Notes
We will retrieve note 42141 from the S3 bucket, specifically from the folder “raw”. Using these data, we will use the re module to replace the characters and , which correspond to line breaks and tabs.
Next, we will review the dictionary with the note and proceed to modify these characters in the retrieved text.

# Read the data from the s3 bucket 
note_id = 42141
note = s3.read_s3_json(f'raw/dt=20240804/medical_record_noteId_{note_id}.json')
pprint(note)
note_clean = re.sub(r'[\n\r\t]', ' ',note['pn_history'])

{'case_num': 9,
 'id': 42141,
 'pn_history': 'Ms. Madden is a 20 yo female presenting w/ the worst HA of her life, unlike anything that she has had before. It is a dull + constant pain, it has gotten progressively worse, started yesterday morn. It is a diffuse pain felt around her head and is nonpulsating. She has photophobia but no phonophobia, has nausea, and vomited 3x yesterday. No sick contacts. Felt warm earlier. No chills, fatigue, CP, SOB, abd pain, or rashes. No sx before the onset of this HA. Ibuprofen, tylenol, sleep have not helped. Walking + bending over makes the pain worse. She has had HA before once or twice a yr but they are usually very mild.   Meds: OCPs  FH: mother w/ migraines, dad w/ HPL  Social alcohol use, 3 or 4 marijuana joints per week, no tobacco use  PMH: none significant',
 'pn_history_len': 765,
 'pn_num': 95330}

🟣 Real-Time Processing with AWS Comprehend Medical

Here we are creating an object from the class called ComprehendMedical and you can see the complete code in the following link of this class

# create an object of the class ComprehendMedical to use the comprehend medical service
aws_comprehendMedical = ComprehendMedical(
                        aws_region_name=AWS_REGION_NAME,
                        aws_access_key=AWS_ACCESS_KEY,
                        aws_secret_access=AWS_SECRET_KEY)

🩺 AWS Comprehend Medical: Entities

# get the entities from the note 
tmp_entities = aws_comprehendMedical.get_entities_dataframe(text=note_clean)
# With the function get_entities_dataframe we can get the mapped and unmapped entities
mapped_df, unmapped_df = tmp_entities
mapped_df.head(3)

unmapped_df.head(3)

# write the mapped and unmapped entities to the s3 bucket
dt =f"dt={datetime.now().strftime('%Y%m%d')}"
s3.write_s3_parquet(data=mapped_df, filename=f'stage/{dt}/entites/mapped_entities_noteId_{note_id}.parquet')
s3.write_s3_parquet(data=unmapped_df, filename=f'stage/{dt}/entites/unmapped_entities_noteId_{note_id}.parquet')

🩺 AWS Comprehend Medical: RxNorm

# get the rxnorm entities from the note
rxnorm_entities = aws_comprehendMedical.get_rxnorm_dataframe(text=note_clean)
rxnorm_entities.head()

# Some of the entities are mapped to RxNorm
for item in rxnorm_entities['RxNormConcepts'][0]:
    pprint(item)

{'Code': '5640', 'Description': 'ibuprofen', 'Score': 0.9967318773269653}
{'Code': '10255', 'Description': 'suprofen', 'Score': 0.5894578695297241}
{'Code': '4331', 'Description': 'fenoprofen', 'Score': 0.5856923460960388}
{'Code': '1312748', 'Description': 'truprofen', 'Score': 0.574164867401123}
{'Code': '17387', 'Description': 'alminoprofen', 'Score': 0.5531540513038635}

# Write the RxNorm entities to the s3 bucket
dt =f"dt={datetime.now().strftime('%Y%m%d')}"
s3.write_s3_parquet(data=rxnorm_entities, filename=f'stage/{dt}/rxnorm/rxnorm_entities_noteId_{note_id}.parquet')

🩺 AWS Comprehend Medical: SNOMED CT (Clinical Terms)

# create a dataframe with the snomed entities
snomed_ct_entities = aws_comprehendMedical.get_snomed_dataframe(text=note_clean)
snomed_ct_entities.head()

# Write the snomed-ct entities to the s3 bucket
dt =f"dt={datetime.now().strftime('%Y%m%d')}"
s3.write_s3_parquet(data=snomed_ct_entities, filename=f'stage/{dt}/snomed-ct/snomed_ct_noteId_{note_id}.parquet')

🟣 Batch Processing with AWS Comprehend Medical

To perform batch processing, you’ll first need to store the notes as individual txt files in an S3 bucket. These files will be processed, and the results will be saved in a new folder named output within the same bucket.

To view this section of the tutorial, you can check out my GitHub repository linked below.

If you find this useful, please leave a star ⭐️ and follow me to receive notifications of new articles. This will help me grow in the tech community and create more content.

{% github r0mymendez/aws-comprehend-medical %}

📚 References

[1] Amazon Comprehend Medical, Amazon Web Services, URL: https://aws.amazon.com/es/comprehend/medical/
[2] Summary of the HIPAA Privacy Rule, U.S Deparment of health human services, URL: https://www.hhs.gov/hipaa/for-professionals/privacy/laws-regulations/index.html
[3] Boto3 1.34.153 documentation, Amazon Web Services, URL: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html
[4] AWS SDK for pandas (awswrangler),AWS Professional Service open source , URL: https://aws-sdk-pandas.readthedocs.io/en/stable/
[5] NBME - Score Clinical Patient Notes,Kaggle, URL: https://www.kaggle.com/c/nbme-score-clinical-patient-notes/data?select=patient_notes.csv
[6] Detect entities (Version 2), Amazon Web Services, URL: https://docs.aws.amazon.com/comprehend-medical/latest/dev/textanalysis-entitiesv2.html
[7] RxNorm, National Library of Medicie, URL: https://www.nlm.nih.gov/research/umls/rxnorm/index.html
[8] Overview of SNOMED CT, National Library of Medicie, URL: https://www.nlm.nih.gov/healthit/snomedct/snomed_overview.html
[9] The Book Of Ohdsi, Ohdsi, URL: https://ohdsi.github.io/TheBookOfOhdsi/

📚 Other references:

- Image preview reference: [Image by jcomp on Freepik]

Comment on this article

Employing AWS Comprehend Medical for Medical Data Extraction in Healthcare Analytics