Using text mining in R for a pharmaceutical product search through provincial formularies in Canada

From time to time I have to check whether a given pharmaceutical product is reimbursed on a public formulary in Canada. The usual routine is to go to each provincial formulary web site and search either with an online query tool or in a PDF printout of the formulary. With one or two drug products, it is easy to search across 10 different formularies. For a portfolio of 10 products, however, the search becomes so time consuming that I would rather check only the Ontario and Quebec agencies (which probably cover about two-thirds of the Canadian public). And how about searching for 100+ products?

The online tool for the Ontario public formulary allows searching across many fields, and the output table can be easily copied to a spreadsheet. British Columbia's Pharmacare provides a snapshot of its database in txt format. However, some provinces (e.g. Quebec, Saskatchewan, Manitoba, Nova Scotia) only publish a PDF document. Manually searching it for every product is a daunting task. Of course, there are commercial providers of such information, but what about getting the information for free?

A few months ago I met Avner Levy, who introduced me to the idea of text mining. Facing a long and boring week (or weeks) of sifting through PDF formularies for products, I explored the idea of scripting the task in the R statistical environment. Through trial and error, I found a working solution. The script creates a list of the products of interest that are listed on a public formulary, using openly available information in PDF format. The following text is the actual working script, with comments.

###########
## SCRIPT##
###########

# Install libraries (uncomment and run once)
# install.packages('tau')
# install.packages('tm')

# Load libraries
library(utils)
library(tau)
library(tm)

# Many formularies index products with Health Canada's Drug Identification Numbers (DINs)
# Cut and paste DINs of interest from a spreadsheet
DINs<-c(
"2333856
2333872
2333864
2314940
2182815
2182874")

# Some formularies use only brand names and no DINs

BRANDs<-c("PROSCAR
ASMANEX
CELESTONE
ANDRIOL
COZAAR
HYZAAR
OLMETEC
PRINIVIL
PRINZIDE")

# split the chunk of DINs into substrings
DINs<-unlist(strsplit(DINs, "\n"))
BRANDs<-unlist(strsplit(BRANDs, "\n"))
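
# Optional sketch, separate from the search itself.
# Health Canada DINs are eight digits long, and a spreadsheet tends to drop the
# leading zero, as in the list above. If the formulary prints the full eight
# digits (an assumption to check against the PDF), zero-padding the search
# terms makes a false match inside a longer number less likely.
# demo_dins and demo_dins8 are hypothetical names used only for this illustration.
demo_dins <- c("2333856", "2182815")
demo_dins8 <- sprintf("%08d", as.integer(demo_dins))
print(demo_dins8)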

# Visually inspect the output
cat("DINs of interest:", DINs, "\n")
cat("BRANDs of interest:", BRANDs, "\n")

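# Create a dictionary from the DINs of interest
# The dictionary could instead be made of brand names
# A plain character vector serves as the dictionary; Dictionary() comes from
# older tm releases and may be missing from a current installation
dict <- DINs
# dict <- BRANDs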

# Create a txt file with the 'pdftotext -layout *.pdf' command in a shell on Linux
# Windows and Mac OS X have other tools to create a text output
# The -layout option is useful if there is only one line per product in the PDF (as in Quebec's Liste de medicament)
# The -raw option is useful if there are two products per line in the PDF (as in the Manitoba formulary file sdr.pdf)
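
# Optional sketch: before calling pdftotext, it may be worth checking that the
# tool is actually installed. Sys.which() returns an empty string when a
# program is not found on the PATH. pdftotext_path is a hypothetical name
# used only for this check.
pdftotext_path <- Sys.which("pdftotext")
if (!nzchar(pdftotext_path)) {
  warning("pdftotext was not found on the PATH; the conversion step will fail")
}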

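# With no output name given, pdftotext writes formulary.txt next to the input PDF
pdftotext_options <- "-layout -eol unix"
system(paste("pdftotext", pdftotext_options, shQuote("/home/autofocus/formulary.pdf")))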
ramq<-readLines("/home/autofocus/formulary.txt")

# Collapse tabs and runs of spaces into a single space,
# then strip leading and trailing space in each line
ramq1 <- trimws(gsub("\\s+", " ", ramq))

# Check the size of the file
object.size(ramq1)

# Format all lines or words as vectors
# I started working with words but then discovered that lines work better
ramq1.wrds <- as.vector(unlist(ramq1))
# ramq1.wrds <- as.vector(unlist(strsplit(ramq1, " ")))

# Search for DINs in the document
# Every line that contains a DIN of interest is appended to ramqDINs
ramqDINs <- c("Listed DINs")

for (j in seq_along(dict)) {
  for (i in seq_along(ramq1.wrds)) {
    if (grepl(dict[j], ramq1.wrds[i], ignore.case = TRUE))
      ramqDINs <- c(ramqDINs, ramq1.wrds[i])
  }
}
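
# Optional sketch: the same search can be written without explicit loops,
# which may be faster on a long formulary. grep() is applied once per search
# term over all lines at once. Unlike the loop above, this version uses
# fixed = TRUE (a literal, case-sensitive match) and drops duplicate lines.
# toy_lines, toy_dins and toy_listed are made-up illustration data, not
# taken from any real formulary.
toy_lines <- c("02333856 DRUG A 10 mg",
               "09999999 DRUG B 5 mg",
               "02182815 DRUG C 50 mg")
toy_dins <- c("2333856", "1111111", "2182815")
toy_hits <- lapply(toy_dins, function(d) grep(d, toy_lines, fixed = TRUE, value = TRUE))
toy_listed <- unique(unlist(toy_hits))
print(toy_listed)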

# Find DINs not appearing in the document
# A DIN is "not listed" if none of the matched lines in ramqDINs contains it
no_ramqDINs <- c("Not listed DINs")
for (j in seq_along(dict)) {
  if (!any(grepl(dict[j], ramqDINs, ignore.case = TRUE)))
    no_ramqDINs <- c(no_ramqDINs, dict[j])
}

# Summary information about the dictionary and output
summary(dict)
summary(ramqDINs)

# Write output to a text file
# Text file can be imported into a spreadsheet

fileConn<-file("/home/user/Formulary_DINs.txt")
writeLines(ramqDINs, fileConn)
close(fileConn)
file.show("/home/user/Formulary_DINs.txt")

fileConn<-file("/home/user/NotListed_DINs.txt")
writeLines(no_ramqDINs, fileConn)
close(fileConn)
file.show("/home/user/NotListed_DINs.txt")
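
# Optional sketch: since the output is meant for a spreadsheet, one table with
# a listed / not listed flag per search term, saved as CSV, may be easier to
# import than two text files. The data below are made up for illustration, and
# demo_lines, demo_terms, demo_status and demo_csv are hypothetical names.
# The file goes to a temporary location rather than a real path.
demo_lines <- c("02333856 DRUG A 10 mg",
                "09999999 DRUG B 5 mg",
                "02182815 DRUG C 50 mg")
demo_terms <- c("2333856", "1111111", "2182815")
demo_status <- data.frame(
  DIN = demo_terms,
  listed = vapply(demo_terms,
                  function(d) any(grepl(d, demo_lines, fixed = TRUE)),
                  logical(1), USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
demo_csv <- tempfile(fileext = ".csv")
write.csv(demo_status, demo_csv, row.names = FALSE)
print(demo_status)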