Import XML data in R

It is hard to say how I ended up on a website for orphan diseases. The site http://www.orphadata.org provides several files with orphan disease data (epidemiological, clinical and genetic) in XML format. XML data are easy to convert to a spreadsheet with MS Excel on Windows, but LibreOffice on Linux kept returning an error message. Another option would be to write an XSLT stylesheet to transform the XML into another format. Luckily, there is also an R package for importing XML files directly.

INSTALL LIBRARIES
(1) install a few system packages in Ubuntu 12.04 from the command line
sudo apt-get install libxml2-dev libcurl4-openssl-dev
(2) install the package from the R console
install.packages("scrapeR")

The following script shows the basic commands, with comments

## LOAD LIBRARIES IN R
library(scrapeR)
# scrapeR will also load three additional packages: XML, RCurl, and bitops
# these were installed automatically as dependencies of 'scrapeR'
# In hindsight, every command used here comes from the 'XML' package, but 'RCurl' and 'scrapeR' might be useful in the future

## PARSING XML FILE
# Download the file from http://www.orphadata.org/data/xml/en_product1.xml and edit its location in the next command;
# Alternatively, parse it directly from the web:
# doc <- xmlTreeParse("http://www.orphadata.org/data/xml/en_product1.xml", useInternal = TRUE)
doc<-xmlTreeParse("en_product1.xml", useInternal = TRUE)
top<-xmlRoot(doc)

# Open the en_product1.xml file in a plain text editor just to see its structure
# The top element is 'JDBOR', then 'DisorderList', then all the 'Disorder' instances
# Each 'Disorder' instance has 6 child nodes
xmlSApply(top[["DisorderList"]], xmlSize)
min(xmlSApply(top[["DisorderList"]], xmlSize))
max(xmlSApply(top[["DisorderList"]], xmlSize))
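To see the child element names without opening the file in an editor, xmlName() can be applied to a node and to its children. A minimal sketch on a toy document with the same JDBOR/DisorderList/Disorder nesting (the toy content and the ExpertLink URL are invented for illustration; with the real file, use the parsed 'top' instead):

```r
library(XML)

# A toy stand-in with the same nesting as en_product1.xml
txt <- '<JDBOR><DisorderList><Disorder><OrphaNumber>558</OrphaNumber><Name>Marfan syndrome</Name><ExpertLink>http://example.org/558</ExpertLink></Disorder></DisorderList></JDBOR>'
toy <- xmlTreeParse(txt, asText = TRUE, useInternal = TRUE)

first <- xmlRoot(toy)[["DisorderList"]][[1]]
xmlName(first)              # name of the node itself: "Disorder"
xmlSApply(first, xmlName)   # names of its child nodes
xmlSize(first)              # number of child nodes
```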

## EXTRACTING PARTIAL DATA FROM XML FILE
# Create a table with 3 columns (the number of extracted elements) and 6403 rows (the number of 'Disorder' instances)
t <- 3  # number of extracted data elements
n <- xmlSize(top[["DisorderList"]])  # number of diseases
Table1 <- matrix(NA, nrow = n, ncol = t)

# Extract all 'Disorder' nodes and pull the values from the subnodes of interest
nodes <- getNodeSet(top, "//Disorder")
for (i in 1:n) {
  inode <- nodes[[i]]
  Table1[i, 1] <- xpathSApply(inode, "OrphaNumber", xmlValue)
  Table1[i, 2] <- xpathSApply(inode, "ExpertLink", xmlValue)
  Table1[i, 3] <- xpathSApply(inode, "Name", xmlValue)
}
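The loop above can also be written without explicit indexing: xpathSApply accepts the whole document and an absolute path, returning all matches at once. A sketch of the same extraction, demonstrated on a self-contained toy document (the second disorder is a made-up placeholder; with the real file, pass 'doc' and add the ExpertLink column):

```r
library(XML)

txt <- '<JDBOR><DisorderList><Disorder><OrphaNumber>558</OrphaNumber><Name>Marfan syndrome</Name></Disorder><Disorder><OrphaNumber>999</OrphaNumber><Name>Placeholder disease</Name></Disorder></DisorderList></JDBOR>'
doc2 <- xmlTreeParse(txt, asText = TRUE, useInternal = TRUE)

# One XPath call per column; each returns a character vector of all matches
Table <- data.frame(
  OrphaNumber = xpathSApply(doc2, "//Disorder/OrphaNumber", xmlValue),
  Name        = xpathSApply(doc2, "//Disorder/Name", xmlValue),
  stringsAsFactors = FALSE
)
```

Note that this shortcut only lines up correctly when every 'Disorder' has exactly one of each element; with missing or repeated subnodes the columns would shift, which is why the explicit loop is safer for messy files.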

## REPEAT THE ABOVE WITH ANOTHER FILE FROM ORPHADATA.ORG
doc<-xmlTreeParse("en_product2.xml", useInternal = TRUE)
top<-xmlRoot(doc)
# Let's extract these data into a separate table; the datasets can be joined later on 'OrphaNumber'
t <- 4
n <- xmlSize(top[["DisorderList"]])
Table2 <- matrix(NA, nrow = n, ncol = t)
# Extract the disease number, class of prevalence, average age of onset, and average age of death
nodes <- getNodeSet(top, "//Disorder")
for (i in 1:n) {
  inode <- nodes[[i]]
  Table2[i, 1] <- xpathSApply(inode, "OrphaNumber", xmlValue)
  Table2[i, 2] <- xpathSApply(inode, "ClassOfPrevalence", xmlValue)
  Table2[i, 3] <- xpathSApply(inode, "AverageAgeOfOnset", xmlValue)
  Table2[i, 4] <- xpathSApply(inode, "AverageAgeOfDeath", xmlValue)
}
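Joining the two extractions on 'OrphaNumber' can be done with base R's merge() once the matrices are converted to data frames. A sketch with toy values (the column names and the toy rows are my own labels, not from the files):

```r
# Stand-ins for the converted Table1 and Table2
d1 <- data.frame(OrphaNumber = c("558", "999"),
                 Name        = c("Marfan syndrome", "Placeholder disease"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(OrphaNumber       = c("558", "777"),
                 AverageAgeOfOnset = c("Adult", "Childhood"),
                 stringsAsFactors = FALSE)

# Inner join on the shared identifier; all.x = TRUE would instead keep
# disorders that have no epidemiological record
joined <- merge(d1, d2, by = "OrphaNumber")
```

With the real matrices, something like `colnames(Table1) <- c("OrphaNumber", "ExpertLink", "Name")` followed by `merge(as.data.frame(Table1), as.data.frame(Table2), by = "OrphaNumber")` should do the same.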

## NESTED DATA AND MISSING ELEMENTS
# Additional data are available in these two XML files
# I am still working on code to extract repeated elements and to handle missing ones
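For the missing-element problem, one common pattern is to test the length of the XPath result before assigning, and fall back to NA. A minimal sketch on a toy document in which the second 'Disorder' deliberately lacks a 'Name' (the helper function getValue is my own, not from the XML package):

```r
library(XML)

txt <- '<JDBOR><DisorderList><Disorder><OrphaNumber>558</OrphaNumber><Name>Marfan syndrome</Name></Disorder><Disorder><OrphaNumber>999</OrphaNumber></Disorder></DisorderList></JDBOR>'
doc3 <- xmlTreeParse(txt, asText = TRUE, useInternal = TRUE)
nodes <- getNodeSet(doc3, "//Disorder")

# Return a child's value, or NA when the child is absent;
# v[1] keeps only the first match if the element is repeated
getValue <- function(node, path) {
  v <- xpathSApply(node, path, xmlValue)
  if (length(v) == 0) NA_character_ else v[1]
}

name_col <- sapply(nodes, getValue, path = "Name")
```

For repeated elements, xpathSApply returns all matches, so replacing `v[1]` with `paste(v, collapse = "; ")` is one way to flatten them into a single cell.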