A Step-by-Step Guide to Using Entities, RxNorm and SNOMED CT
The goal of this tutorial is to provide a guide on how to use Amazon Comprehend Medical for identifying medical entities and extracting information, including RxNorm codes, SNOMED CT concepts, and other attributes. We will cover the following key topics:
Amazon Comprehend Medical is a service designed to extract information from unstructured medical text using natural language processing (NLP) models while ensuring compliance with HIPAA requirements. This service provides the following outputs:
* Entities: Key medical elements identified in the text, such as medications, diagnoses, symptoms, and procedures.
* RxNorm Codes: These codes are derived from a medical ontology that provides normalized names for medications and drugs, ensuring consistent identification and categorization of medication-related information.
* SNOMED CT: This code set originates from a comprehensive medical ontology that represents clinical concepts such as diseases, procedures, and diagnoses, facilitating precise and interoperable health data.
At the time of writing this article, only English texts can be processed using this service.
The HIPAA (Health Insurance Portability and Accountability Act) privacy rule sets national standards for the protection of individually identifiable health information in the United States.
This refers to data, including demographic information, that relates to:
* The individual’s past, present, or future physical or mental health or condition.
* The provision of healthcare to the individual.
* The past, present, or future payment for healthcare provided to the individual.
The data must also identify the individual, or be reasonably usable to identify them. This includes common identifiers such as name, address, date of birth, and Social Security number.
“Vocabularies” refer to structured sets of standardized terms and codes used to capture, classify, and analyze patient data. These include controlled vocabularies, terminologies, hierarchies, and ontologies, and are essential for interoperability between healthcare systems, enabling data exchange and facilitating global research. This practice dates back to the 1660s, as shown in the image below.
“Medical vocabularies date back to the Bills of Mortality in medieval London to manage outbreaks of plague and other diseases.” The Book Of Ohdsi
Image source: The Book Of Ohdsi
After understanding the importance of medical vocabularies, we can explore how AWS Comprehend Medical leverages these vocabularies to extract and standardize medical data.
In the following sections, we will describe the specific methods used by AWS Comprehend Medical to process and analyze medical texts.
The detect_entities_v2 method from AWS Comprehend Medical identifies and classifies various categories of medical information within a text. Below is an image illustrating the categories detected by this method.
For each of these classes, not only are the categories to which the entity belongs detected, but also other key values. These values include:
Below, you can see the additional data that can be obtained for each category.
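As a rough illustration of the output described above, the sketch below flattens a detect_entities_v2 response into one row per entity. The field names (Entities, Category, Type, Score, Traits, Attributes) follow the boto3 response shape; the helper names and the sample response are made up for illustration and are not part of the ComprehendMedical class used later in this tutorial.

```python
def flatten_entities(response):
    """Turn a detect_entities_v2 response into a list of flat dicts."""
    rows = []
    for entity in response.get("Entities", []):
        rows.append({
            "text": entity["Text"],
            "category": entity["Category"],
            "type": entity["Type"],
            "score": entity["Score"],
            # Traits (for example NEGATION or SYMPTOM) qualify the entity
            "traits": [t["Name"] for t in entity.get("Traits", [])],
            # Attributes are related sub-entities (for example a dosage)
            "attributes": [a["Text"] for a in entity.get("Attributes", [])],
        })
    return rows


def detect_entities(text, region_name="us-east-1"):
    """Call the service; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so flatten_entities works without boto3
    client = boto3.client("comprehendmedical", region_name=region_name)
    return client.detect_entities_v2(Text=text)


# Hand-written sample in the shape of a real response (values are illustrative)
sample_response = {
    "Entities": [
        {"Id": 0, "Text": "nausea", "Category": "MEDICAL_CONDITION",
         "Type": "DX_NAME", "Score": 0.98, "BeginOffset": 0, "EndOffset": 6,
         "Traits": [{"Name": "SYMPTOM", "Score": 0.95}], "Attributes": []},
    ]
}
rows = flatten_entities(sample_response)
print(rows)
```

With real credentials, flatten_entities(detect_entities(note_text)) would produce the same structure for an actual note.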
RxNorm is a standardized medical ontology that provides normalized names for clinical medications. It also serves as a comprehensive resource for identifying and categorizing drugs and their various forms. RxNorm links these standardized names to many other drug vocabularies, ensuring consistency and interoperability across different healthcare systems.
Below is an example with a medication and the related concepts in RxNorm.
SNOMED CT (Systematized Nomenclature of Medicine – Clinical Terms) is a comprehensive multilingual health terminology system. It provides a standardized set of codes, concepts, and synonyms to represent clinical information, including diseases, procedures, and diagnoses.
SNOMED CT facilitates semantic interoperability by allowing mapping between different health vocabularies, such as ICD-9 and ICD-10.
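Both ontology-linking operations return, for each detected entity, a list of candidate concepts with a confidence score. The sketch below shows one possible way to keep only the best candidate per entity. The keys RxNormConcepts and SNOMEDCTConcepts follow the boto3 response shape for infer_rx_norm and infer_snomed_ct; the function name top_concepts is hypothetical, and the SNOMED CT sample score is invented for illustration.

```python
def top_concepts(response, concepts_key):
    """Return (entity text, code, description, score) for the best concept of each entity."""
    results = []
    for entity in response.get("Entities", []):
        concepts = entity.get(concepts_key, [])
        if not concepts:
            continue  # entity detected but not linked to the ontology
        best = max(concepts, key=lambda c: c["Score"])
        results.append((entity["Text"], best["Code"], best["Description"], best["Score"]))
    return results


# In practice the responses would come from:
#   client = boto3.client("comprehendmedical")
#   client.infer_rx_norm(Text=...)    -> concepts under "RxNormConcepts"
#   client.infer_snomed_ct(Text=...)  -> concepts under "SNOMEDCTConcepts"
rx_sample = {
    "Entities": [
        {"Text": "Ibuprofen", "RxNormConcepts": [
            {"Code": "10255", "Description": "suprofen", "Score": 0.5894578695297241},
            {"Code": "5640", "Description": "ibuprofen", "Score": 0.9967318773269653},
        ]},
        {"Text": "OCPs", "RxNormConcepts": []},
    ]
}
snomed_sample = {
    "Entities": [
        {"Text": "headache", "SNOMEDCTConcepts": [
            {"Code": "25064002", "Description": "Headache (finding)", "Score": 0.9},
        ]},
    ]
}

rx_best = top_concepts(rx_sample, "RxNormConcepts")
snomed_best = top_concepts(snomed_sample, "SNOMEDCTConcepts")
print(rx_best)
print(snomed_best)
```

Taking only the top candidate is a simplification; keeping the full list lets you review lower-scoring alternatives when the best score is not convincing.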
For this tutorial, we will use a dataset from Kaggle that is associated with the USMLE® Step 2 Clinical Skills examination. This licensing exam evaluates the examinee’s ability to recognize pertinent clinical facts during interactions with standardized patients.
We will use selected medical notes from this dataset to process and analyze the results obtained using the AWS Comprehend Medical service.
source: NBME - Score Clinical Patient Notes
To complete this tutorial, you need to meet the following prerequisites:
* AWS_ACCESS_KEY and AWS_SECRET_KEY credentials. These are crucial for authenticating and authorizing access to AWS services.
* An S3 bucket named dev-medical-notes located in the us-east-1 region.
⚠️ If you’re not familiar with creating AWS credentials or setting up an S3 bucket, you can follow this guide: Create a Bucket.
🔐 AWS Credentials
For this tutorial, AWS credentials (AWS_ACCESS_KEY and AWS_SECRET_KEY) are required. They are essential for authenticating and authorizing access to AWS services, and they can be generated using the IAM (Identity and Access Management) service.
⚠️ Remember to keep your credentials secure and avoid sharing them to prevent unauthorized access to your AWS account.
🔐 IAM Policies and Role
For this tutorial, you need a role and a user with specific policies applied. In the GitHub repository, you’ll find a folder containing the policies that need to be applied.
📦 AWS Libraries
The main libraries we will use are:
🔧 Configuration Setup
To manage AWS credentials, we will use the python-dotenv library to handle environment variables. You need to create a file named .env in the root of the project and configure your AWS credentials there. Below, you will find the format for the file.
File Name: .env
AWS_SECRET_KEY='mySecretKey'
AWS_ACCESS_KEY='myAccessKey'
AWS_ROLE='arn:aws:iam::xxxx:role/role-name'
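A missing or empty variable in the .env file tends to surface later as a confusing authentication error, so it can help to check the configuration up front. The following is a minimal sketch, assuming the three variable names shown above; the helper name missing_env_vars is hypothetical and not part of the tutorial's repository.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY", "AWS_SECRET_KEY", "AWS_ROLE")


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"AWS_ACCESS_KEY": "myAccessKey", "AWS_SECRET_KEY": ""}
missing = missing_env_vars(example_env)
print(missing)  # ['AWS_SECRET_KEY', 'AWS_ROLE']
```

In a real run you would call load_dotenv() first and then missing_env_vars() with no arguments, stopping early if the returned list is not empty.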
⚠️ Considerations
To simplify this tutorial and reduce the complexity of implementing a solution, two classes were created, which are as follows:
📦S3bucket Class
To simplify the explanation of this tutorial and manage the files stored in an S3 bucket, I have created a class named S3bucket. This class will enable us to perform various common operations such as listing the files in a bucket, writing a JSON file, writing a Parquet file, and reading a JSON file.
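To give an idea of what such a wrapper might look like, here is a simplified sketch. It is not the author's S3bucket implementation: the class name S3BucketSketch, its constructor, and the in-memory fake client are assumptions made for illustration. Only put_object, get_object, and list_objects_v2 are real boto3 S3 client calls.

```python
import io
import json


class S3BucketSketch:
    """Hypothetical, simplified stand-in for the S3bucket class."""

    def __init__(self, bucket_name, client):
        # `client` is expected to behave like boto3.client("s3")
        self.bucket_name = bucket_name
        self.client = client

    def write_s3_json(self, data, filename):
        body = json.dumps(data).encode("utf-8")
        self.client.put_object(Bucket=self.bucket_name, Key=filename, Body=body)

    def read_s3_json(self, filename):
        obj = self.client.get_object(Bucket=self.bucket_name, Key=filename)
        return json.loads(obj["Body"].read().decode("utf-8"))

    def list_files(self, prefix=""):
        # Note: a single list_objects_v2 call returns at most 1000 keys
        resp = self.client.list_objects_v2(Bucket=self.bucket_name, Prefix=prefix)
        return [item["Key"] for item in resp.get("Contents", [])]


class InMemoryS3Client:
    """Tiny fake that mimics the three S3 calls used above (for local experiments)."""

    def __init__(self):
        self.objects = {}

    def put_object(self, Bucket, Key, Body):
        self.objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(self.objects[(Bucket, Key)])}

    def list_objects_v2(self, Bucket, Prefix=""):
        keys = sorted(k for (b, k) in self.objects if b == Bucket and k.startswith(Prefix))
        return {"Contents": [{"Key": k} for k in keys]} if keys else {}


s3_demo = S3BucketSketch("dev-medical-notes", InMemoryS3Client())
s3_demo.write_s3_json({"id": 42141}, "raw/dt=20240804/medical_record_noteId_42141.json")
print(s3_demo.list_files("raw/"))
print(s3_demo.read_s3_json("raw/dt=20240804/medical_record_noteId_42141.json"))
```

Passing the client in, rather than creating it inside the class, is a design choice of this sketch: it lets you try the logic locally with the fake client and swap in boto3.client("s3") when working against a real bucket.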
📦ComprehendMedical Class
To make it easier to use AWS Comprehend Medical and create DataFrames from the processed data, I have developed a class named ComprehendMedical. This class is designed to streamline interactions with the service’s methods, including detect_entities_v2, infer_rx_norm, and infer_snomed_ct. Below are the primary methods of this class and their functionalities:
⚒️ Methods to Generate DataFrames
Each of the above methods also has a version that returns the results in DataFrame format using pandas. These DataFrames are then saved in Parquet format, which is efficient for storage and querying, and facilitates integration with other data processing tools. The Parquet files are stored in a new 📁 folder named “stage” within the same Amazon S3 bucket.
# libraries for data processing
import json, os,io,re,uuid
from tqdm import tqdm
from datetime import datetime
from pprint import pprint
import pandas as pd
import numpy as np
# libraries for loading environment variables
from dotenv import load_dotenv
# aws libraries
import boto3
import awswrangler as wr
# libraries for data visualization
import seaborn as sns
from matplotlib import pyplot as plt
import matplotlib as mpl
load_dotenv()

AWS_ACCESS_KEY = os.getenv("AWS_ACCESS_KEY")
AWS_SECRET_KEY = os.getenv("AWS_SECRET_KEY")
AWS_ROLE = os.getenv("AWS_ROLE")
AWS_REGION_NAME = 'us-east-1'
BUCKET_NAME = 'dev-medical-notes'
Here we create an object of the S3bucket class. The complete code for this class is available in the GitHub repository.
# create an object of the class S3bucket for the bucket 'dev-medical-notes'
s3 = S3bucket(BUCKET_NAME, AWS_ACCESS_KEY, AWS_SECRET_KEY, AWS_REGION_NAME)
The first step in the process is to extract a subset of medical notes and upload them to Amazon S3 in JSON {} format. To facilitate the organization and management of these files, they will be stored in a 📁 folder named “raw,” which will be preceded by a 📅 date prefix (dt).
The 📁 “raw” folder will serve as the container for the original, unprocessed files, while the date prefix will help classify and manage the files based on when they were uploaded.
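The folder layout described above can be expressed as a small key-building helper. This is only a sketch: the function name raw_key is hypothetical, and the file-name pattern is the one used by the upload loop in this tutorial.

```python
from datetime import date


def raw_key(note_id, day):
    """Build the S3 key for an unprocessed note, grouped by upload date."""
    return f"raw/dt={day.strftime('%Y%m%d')}/medical_record_noteId_{note_id}.json"


example_key = raw_key(42141, date(2024, 8, 4))
print(example_key)  # raw/dt=20240804/medical_record_noteId_42141.json
```

Writing the date as dt=YYYYMMDD keeps all notes uploaded on the same day under one prefix, which makes it easy to list or reprocess a single day's files.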
data = pd.read_csv('data/patient_notes.csv')
data['pn_history_len'] = data['pn_history'].str.len()

# Plotting the distribution of the number of characters in patient notes
mpl.rcParams['font.family'] = 'serif'
plt.figure(figsize=(12,4))
sns.histplot(data['pn_history_len'], color=sns.color_palette('pastel')[4])
plt.title('Distribution number of characters in patient notes')
plt.xlabel('Number of characters')
plt.ylabel('Frequency')
plt.show()
🔍 Analysis of Clinical Note Lengths
In summary, there is significant variability in the length of clinical notes.
However, most notes typically fall within a certain range.
Below are the key points and future considerations for this analysis:
📊 Distribution of Note Lengths
* […] 800 and 1000 characters.
* […] less than 200 characters.
🧹 Data Cleaning
* […] remove these characters before processing.
🔍 Future Research Questions
Although some of these questions cannot be answered with our dataset, these are some questions we could consider for analyzing a similar dataset:
* […] types of patients have the shortest and longest notes?
* […] the severity of the patient's condition?
* […] changed over time?
Additionally, since AWS Comprehend Medical processes notes up to 10,000 characters, performing this analysis is ideal for optimizing the usage of this service.
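If a note turns out to be longer than the service accepts, one option is to split it before sending it. The sketch below is an assumption-laden example rather than part of the tutorial's code: the helper name split_note is hypothetical, the 10,000 default simply mirrors the figure quoted above (the actual quota may differ per operation, so check the service documentation), and sentences are detected with a naive regular expression.

```python
import re


def split_note(text, max_chars=10_000):
    """Split text into chunks of at most max_chars, breaking at sentence ends when possible."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut hard
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


demo_chunks = split_note("A b. C d. E f.", max_chars=9)
print(demo_chunks)  # ['A b. C d.', 'E f.']
```

Splitting at sentence boundaries keeps each entity inside one chunk in most cases, but character offsets returned by the service are then relative to the chunk, not to the original note.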
Selecting Random Notes
# selecting random notes to test the function
random_notes = [42141, 39049, 40593, 38851, 41068, 39457, 39152, 39665, 37830, 41717]

# selecting the notes from the data
data_test = data.reset_index().rename({'index':'id'},axis=1).loc[random_notes,:].to_dict('records')
# see the first 3 notes
data_test[:3]
[{'id': 42141,
'pn_num': 95330,
'case_num': 9,
'pn_history': 'Ms. Madden is a 20 yo female presenting w/ the worst HA of her life, unlike anything that she has had before. It is a dull + constant pain, it has gotten progressively worse, started yesterday morn. It is a diffuse pain felt around her head and is nonpulsating. She has photophobia but no phonophobia, has nausea, and vomited 3x yesterday. No sick contacts. Felt warm earlier. No chills, fatigue, CP, SOB, abd pain, or rashes. No sx before the onset of this HA. Ibuprofen, tylenol, sleep have not helped. Walking + bending over makes the pain worse. She has had HA before once or twice a yr but they are usually very mild. \r\nMeds: OCPs\r\nFH: mother w/ migraines, dad w/ HPL\r\nSocial alcohol use, 3 or 4 marijuana joints per week, no tobacco use\r\nPMH: none significant',
'pn_history_len': 765},
{'id': 39049,
'pn_num': 92131,
'case_num': 9,
'pn_history': '20 yo F, c/o headaches.\r\n- started yesterday, right after she woke up. 8/10, dull, constant headache, getting worse. \r\n- nothing makes it better. exacerbated by leaning forward and walking.\r\n- nausea and vomiting. vomited 3 times, green fluids, no blood. \r\n- photophobia. \r\n- mild fever. \r\nROS: none except above. occasional headaches.\r\nPMH: none. Meds: OCP. All: NKDA.\r\nPSH: none. \r\nFH: mother - migrain. father- hyperlipidemia. \r\nSH: sexually active with boyfriend, using condoms and OCP. \r\nnot smoking, drinkes 2-3 a week. smoking marijuana 3-4 times a week.',
'pn_history_len': 562},
{'id': 40593,
'pn_num': 93723,
'case_num': 9,
'pn_history': 'A 20 yo female presents to the clinic with c/o a headache since yesterday. Pt states headache is constant, progressive, and diffuse. Associaed with nausea, vomiting , and decreased appetite. Pain is 10/10 in intensity, non-radiating. Patient states she also felt warmer yesterday. Pt states muscle aches and runny nose.. Patient denies cough, chst pain, adbominal pain. Pt denies recent travel.\r\n\r\nAllergies: none\r\nmeds: OCPs\r\nPMHx:/PShx/hopsi: none\r\nFamily hx:amily hx (mother-migraine headache)\r\nsocial hx: \r\n',
'pn_history_len': 511}]
# write the data to the s3 bucket
for record in tqdm(data_test):
    dt = f"dt={datetime.now().strftime('%Y%m%d')}"
    record_file_name = f"medical_record_noteId_{record['id']}.json"
    s3.write_s3_json(data=record, filename=f"raw/{dt}/{record_file_name}")
100%|██████████| 10/10 [00:02<00:00, 3.48it/s]
🧹 Retrieval and Cleaning of Notes
We will retrieve note 42141 from the S3 bucket, specifically from the folder “raw”. We will then use the re module to replace the characters \n, \r, and \t, which correspond to line breaks and tabs, with spaces.
Next, we will review the dictionary with the note and proceed to modify these characters in the retrieved text.
# Read the data from the s3 bucket
note_id = 42141
note = s3.read_s3_json(f'raw/dt=20240804/medical_record_noteId_{note_id}.json')

pprint(note)
note_clean = re.sub(r'[\n\r\t]', ' ', note['pn_history'])
{'case_num': 9,
'id': 42141,
'pn_history': 'Ms. Madden is a 20 yo female presenting w/ the worst HA of her life, unlike anything that she has had before. It is a dull + constant pain, it has gotten progressively worse, started yesterday morn. It is a diffuse pain felt around her head and is nonpulsating. She has photophobia but no phonophobia, has nausea, and vomited 3x yesterday. No sick contacts. Felt warm earlier. No chills, fatigue, CP, SOB, abd pain, or rashes. No sx before the onset of this HA. Ibuprofen, tylenol, sleep have not helped. Walking + bending over makes the pain worse. She has had HA before once or twice a yr but they are usually very mild. Meds: OCPs FH: mother w/ migraines, dad w/ HPL Social alcohol use, 3 or 4 marijuana joints per week, no tobacco use PMH: none significant',
'pn_history_len': 765,
'pn_num': 95330}
Here we create an object of the ComprehendMedical class. The complete code for this class is available in the GitHub repository.
# create an object of the class ComprehendMedical to use the comprehend medical service
aws_comprehendMedical = ComprehendMedical(
    aws_region_name=AWS_REGION_NAME,
    aws_access_key=AWS_ACCESS_KEY,
    aws_secret_access=AWS_SECRET_KEY)
🩺 AWS Comprehend Medical: Entities
# get the entities from the note
tmp_entities = aws_comprehendMedical.get_entities_dataframe(text=note_clean)
# With the function get_entities_dataframe we can get the mapped and unmapped entities
mapped_df, unmapped_df = tmp_entities
mapped_df.head(3)
unmapped_df.head(3)
# write the mapped and unmapped entities to the s3 bucket
dt = f"dt={datetime.now().strftime('%Y%m%d')}"
s3.write_s3_parquet(data=mapped_df, filename=f'stage/{dt}/entites/mapped_entities_noteId_{note_id}.parquet')
s3.write_s3_parquet(data=unmapped_df, filename=f'stage/{dt}/entites/unmapped_entities_noteId_{note_id}.parquet')
🩺 AWS Comprehend Medical: RxNorm
# get the rxnorm entities from the note
rxnorm_entities = aws_comprehendMedical.get_rxnorm_dataframe(text=note_clean)
rxnorm_entities.head()
# Some of the entities are mapped to RxNorm
for item in rxnorm_entities['RxNormConcepts'][0]:
    pprint(item)
{'Code': '5640', 'Description': 'ibuprofen', 'Score': 0.9967318773269653}
{'Code': '10255', 'Description': 'suprofen', 'Score': 0.5894578695297241}
{'Code': '4331', 'Description': 'fenoprofen', 'Score': 0.5856923460960388}
{'Code': '1312748', 'Description': 'truprofen', 'Score': 0.574164867401123}
{'Code': '17387', 'Description': 'alminoprofen', 'Score': 0.5531540513038635}
# Write the RxNorm entities to the s3 bucket
dt = f"dt={datetime.now().strftime('%Y%m%d')}"
s3.write_s3_parquet(data=rxnorm_entities, filename=f'stage/{dt}/rxnorm/rxnorm_entities_noteId_{note_id}.parquet')
🩺 AWS Comprehend Medical: SNOMED CT (Clinical Terms)
# create a dataframe with the snomed entities
snomed_ct_entities = aws_comprehendMedical.get_snomed_dataframe(text=note_clean)
snomed_ct_entities.head()
# Write the snomed-ct entities to the s3 bucket
dt = f"dt={datetime.now().strftime('%Y%m%d')}"
s3.write_s3_parquet(data=snomed_ct_entities, filename=f'stage/{dt}/snomed-ct/snomed_ct_noteId_{note_id}.parquet')
To perform batch processing, you’ll first need to store the notes as individual txt files in an S3 bucket. These files will be processed, and the results will be saved in a new folder named output within the same bucket.
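As a rough sketch of what starting such a job could involve, the example below assembles the arguments for the boto3 start_entities_detection_v2_job call. The input folder name input/, the job name, and both helper functions are assumptions made for illustration; the role ARN is the placeholder from the .env file, and the repository may organize this step differently.

```python
def build_batch_job_args(bucket, input_prefix, output_prefix, role_arn, job_name):
    """Assemble keyword arguments for start_entities_detection_v2_job."""
    return {
        "InputDataConfig": {"S3Bucket": bucket, "S3Key": input_prefix},
        "OutputDataConfig": {"S3Bucket": bucket, "S3Key": output_prefix},
        # Role that allows the service to read and write the bucket
        "DataAccessRoleArn": role_arn,
        "JobName": job_name,
        "LanguageCode": "en",
    }


def start_batch_job(job_args, region_name="us-east-1"):
    """Start the job and return its id; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the builder above works without boto3
    client = boto3.client("comprehendmedical", region_name=region_name)
    return client.start_entities_detection_v2_job(**job_args)["JobId"]


job_args = build_batch_job_args(
    bucket="dev-medical-notes",
    input_prefix="input/",    # hypothetical folder holding the txt files
    output_prefix="output/",  # results folder within the same bucket
    role_arn="arn:aws:iam::xxxx:role/role-name",  # placeholder, as in .env
    job_name="medical-notes-entities",            # hypothetical job name
)
print(job_args)
```

The job runs asynchronously, so the returned id would then be used to poll the job status until the results appear in the output folder.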
To view this section of the tutorial, you can check out my GitHub repository linked below.
If you find this useful, please leave a star ⭐️ and follow me to receive notifications of new articles. This will help me grow in the tech community and create more content.
{% github r0mymendez/aws-comprehend-medical %}
- Image preview reference: [Image by jcomp on Freepik]
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Mendez (2024, Aug. 7). Romina Mendez: Employing AWS Comprehend Medical for Medical Data Extraction in Healthcare Analytics. Retrieved from https://r0mymendez.github.io/posts_en/2024-08-07-employing-aws-comprehend-medical-for-medical-data-extraction-in-healthcare-analytics/
BibTeX citation
@misc{mendez2024employing,
  author = {Mendez, Romina},
  title = {Romina Mendez: Employing AWS Comprehend Medical for Medical Data Extraction in Healthcare Analytics},
  url = {https://r0mymendez.github.io/posts_en/2024-08-07-employing-aws-comprehend-medical-for-medical-data-extraction-in-healthcare-analytics/},
  year = {2024}
}