The Swiss HIV Cohort Study (SHCS) is an ongoing research project on HIV-infected adults. It brings together 7 outpatient clinics and 10 laboratories, involving almost all researchers engaged in patient-oriented HIV research in Switzerland. The main objective of the SHCS is to perform and facilitate a wide range of timely, high-quality research projects through its national network of collaborators and through interdisciplinary collaboration with any interested partner. The SHCS aims to collect patient-related data systematically and to provide the corresponding infrastructure, in order to facilitate research and improve the high standard of care of HIV-infected patients in Switzerland~\citep{shcs-intro,shcs-obj}.
\subsection{State of the Protocol}
Currently, the genomic and non-genomic data of each patient in the study are stored in the SHCS data centers anonymized but in plaintext. The susceptibility tests are performed on plaintext values and reported to the clinician in PDF reports. Hence the current configuration does not provide any privacy to SHCS patients: as shown by several research contributions \citep{lin04}, standard anonymization techniques for genomic data are not sufficient to guarantee individuals' privacy. Our goal is to deploy our improved scheme at the SHCS to provide efficient susceptibility tests while protecting the privacy of the patients and MUs.
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Constraints
\section{Constraints}
The privacy of medical data, and especially of genomic data, is a highly important topic for medical data storage providers. As the SHCS stores genomic and non-genomic data on its servers, many security mechanisms are in place to restrict access to the data. In consequence, obtaining direct access to the databases storing the data, or access to our own application running within the SHCS infrastructure, was not possible in the time frame of this project. After discussions with the SHCS data center administrators, we decided that the client application would access the SHCS infrastructure using web services. Web services are easily deployable and their message format is precisely defined using Web Services Description Language (WSDL) files. Furthermore, they are independent of platform, programming language and library, which is a great advantage as heterogeneous devices will access the services. The detailed mitigations to the constraints presented below result from discussions with the SHCS data center administrators.
\subsection{Storage Infrastructure}
Direct access to the databases at the SHCS data centers could not be obtained within the short time frame of this project.\par
An available workaround is to deploy a web service, located inside the SHCS firewall, which queries the databases directly. As permission to deploy a web service could easily be granted, we chose this approach. The SHCS setup is described in \reffig{fig:ws-request-manager}.
In our protocol, the SPU needs to partially decrypt some encrypted data and send it back to the MU. In the SHCS configuration, this could not be achieved trivially, as a service running on the back-end systems within the data center cannot establish communications outside of the firewall.\par
One proposed solution is for the web service and the server to communicate using databases as incoming and outgoing message queues. The SHCS setup is described in \reffig{fig:ws-decrypt}. On a new request, the web service inserts it into the \lstinline{Request} database. The server periodically fetches the pending requests from the \lstinline{Request} database. As every request is characterized by a \lstinline{patientId}, the server retrieves the corresponding secret key share from the \lstinline{Cryptographic keys} database. It then partially decrypts the ciphertexts and inserts the serialized results into the \lstinline{Reply} database. The web service also periodically fetches the results from the \lstinline{Reply} database and forwards the reply to the client.
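The queue-based exchange can be sketched as follows. This is only an illustration under simplifying assumptions: the three databases are replaced by in-memory collections, the partial decryption is a placeholder that merely tags its input, and all class, method and field names are hypothetical rather than taken from the project code.

\begin{lstlisting}[language=Java]
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

public class QueueSketch {
    // In-memory stand-ins for the Request, Reply and Cryptographic keys databases.
    static final Queue<String[]> requestDb = new ArrayDeque<>();   // {patientId, ciphertext}
    static final Queue<String[]> replyDb = new ArrayDeque<>();     // {patientId, partial result}
    static final Map<String, String> keyShareDb = new HashMap<>(); // patientId -> key share

    // Web service side: enqueue an incoming client request.
    static void submit(String patientId, String ciphertext) {
        requestDb.add(new String[] {patientId, ciphertext});
    }

    // Placeholder for the real partial decryption; it only tags its input.
    static String partialDecrypt(String keyShare, String ciphertext) {
        return "partial(" + keyShare + "," + ciphertext + ")";
    }

    // Server side: one periodic run drains all pending requests
    // and returns the number of requests it handled.
    static int serverRun() {
        int processed = 0;
        String[] request;
        while ((request = requestDb.poll()) != null) {
            String keyShare = keyShareDb.get(request[0]);
            replyDb.add(new String[] {request[0], partialDecrypt(keyShare, request[1])});
            processed++;
        }
        return processed;
    }

    public static void main(String[] args) {
        keyShareDb.put("patient-1", "share-1");
        submit("patient-1", "ct-A");
        submit("patient-1", "ct-B");
        System.out.println(serverRun());       // requests handled in this run
        System.out.println(replyDb.peek()[1]); // first reply to be forwarded to the client
    }
}
\end{lstlisting}

In this sketch a single server run handles every request that accumulated since the previous run, which is why batching several items into one request costs no additional polling periods.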
Due to SHCS policy and system constraints, the server cannot be run more often than every 2 minutes. In consequence, the expected delay for a request to be processed is 1 minute, and 2 minutes in the worst case. This hard constraint influences the design of our protocol and is discussed in particular in \refsec{sec:mod-all-tests}.
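As a rough sanity check of the expected delay, assume that a request arrives at a time uniformly distributed within the polling period $T = 2$ minutes and that the processing time itself is negligible (both are simplifying assumptions, not measurements). The waiting time $W$ until the next server run then satisfies
\[
\mathrm{E}[W] = \frac{1}{T}\int_0^T w\,\mathrm{d}w = \frac{T}{2} = 1~\mathrm{min},
\qquad
\max W = T = 2~\mathrm{min}.
\]
The additional polling of the \lstinline{Reply} database by the web service is not included in this estimate.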
\begin{figure}
\centering
\caption{Decrypt Web Service behind SHCS firewall with Request and Reply database queues, Server App running on back-end system and Cryptographic Key Share database.}
\end{figure}
The client infrastructure is also subject to strict policies. In our initial setup, the client ran a local MySQL database to store cryptographic keys as well as the SNP weights of each test. MySQL is not installed by default on SHCS clients, and installing new software is again subject to strict policies. The mitigation we found is to use the already available Microsoft Access database instead of MySQL. As we use only very simple database statements, the syntax for Microsoft Access does not differ from the one used with MySQL.
To limit the complexity of the system, the protocol was modified to reduce the number of calls to the web services at SHCS and hence the processing delay of a susceptibility test.\par
We can distinguish two types of operations. First, we need to retrieve the genomic and non-genomic material from the databases at the SPU. Second, we need to decrypt the result of a test and the values of the HLAs. We therefore define two services, \verb#Request Manager# and \verb#Decrypt#. The \verb#Request Manager# service retrieves the SNPs, CFs and HLAs from the databases, given the SNP markers and CF identifiers. The \verb#Decrypt# service partially decrypts a list of encrypted test results as well as the values of the HLAs.\par
Furthermore, we wanted to make it possible to protect the intellectual property of the SNP weights during the susceptibility test computation. Computing the encrypted test result at the MU instead of the SPU allows us to hide the SNP weights from the SPU.\par
\begin{figure}
\centering
\input{sections/diagrams/protocol_diagram.tex}
\caption{Sequence diagram of modified protocol between MU and SPU to compute susceptibility tests}
\label{protocol-mu-spu-new}
\end{figure}
\begin{enumerate}
\item\lstinline{getMaterial( materialRequest )}: ask for the material required for a susceptibility test. \lstinline{materialRequest} is composed of a \lstinline{List(SNP_marker)} and of a \lstinline{List(CF_id)}.
\item\verb#return materialReply#: return the material required for a susceptibility test. \verb#materialReply# is composed of a \verb#List(SNP)#, a \verb#List(CF)# and a \verb#List(HLA)#.
\item\lstinline{decrypt( decryptRequest )}: ask for decryption of encrypted test results. \lstinline{decryptRequest} is composed of a \lstinline{List(EncryptedTestResult)} and a \lstinline{List(HLA)}.
\item\lstinline{return decryptReply}: return the partially decrypted results and HLAs. \lstinline{decryptReply} is composed of a \lstinline{List(PartialDecryptedTestResult)} and a \lstinline{List(PartialDecryptedHLA)}.
\end{enumerate}
The old protocol is shown in \reffig{fig:protocol-mu-spu-old}. Some key modifications to the protocol are described and discussed in \refsec{sec:prot-mod-discussion}.
\subsubsection{Pack the HLAs with SNPs and CFs}\label{sec:mod-pack-hla-snp-cf}
Because database queries and decryption are handled by separate services, the HLAs are automatically packaged with the answer to the \lstinline{getMaterial} call to the \lstinline{Request Manager} service.\par
This modification slightly increases the communication overhead to the \lstinline{Decrypt} service, as the HLAs have to be sent along with the encrypted test results. However, this slight overhead greatly simplifies the deployment of the solution, as it avoids the need for a third service which would query the database for HLAs and partially decrypt them.
%--
\subsubsection{Query all SNPs for all Potential Tests}\label{sec:mod-all-tests}
In some particular cases, more complex tests need to be run if the first results are not conclusive (e.g. EFV PHARMACOKINETICS\_CYP2B6 or HIV Progression Test). The previous version of the protocol ran the first test and, depending on the decrypted test result, possibly ran subsequent tests. Due to the delay of the \lstinline{Decrypt} service, this protocol is no longer satisfactory. The proposed solution is to get the required material for all potential tests at once. This again implies a communication overhead between the client and all services, but avoids stacking multiple decryption delays, which would impact the end user.
%--
\subsubsection{Tradeoff between Communication Overhead and Delay}
As described in \refsec{sec:mod-pack-hla-snp-cf} and \refsec{sec:mod-all-tests}, our modifications involve communication overhead. We explain here why we accept this overhead in exchange for a lower delay.\par
As described in \refsec{sec:constraint-backend-infra}, every test has an expected delay of 1 minute, and of 2 minutes in the worst case. Running nested tests (e.g. EFV PHARMACOKINETICS\_CYP2B6 or HIV Progression Test) sequentially would involve stacking multiple 2-minute delays, which would deteriorate the end-user experience of the application.\par
Of the 19 currently available tests, two (EFV PHARMACOKINETICS\_CYP2B6 and HIV Progression Test) may require nested test calls, with up to 2 subsequent calls. In the worst case, a test needs up to 21 SNPs and up to 6 CFs (both for Coronary Artery Disease, CAD). Taking a conservative approach, the communication overhead resulting from the 27 encoded ciphertexts is roughly $27\cdot270\byte=7290\byte$ (taking the size of one encoded ciphertext under a strong elliptic curve in the sense of \refsec{sec:perf-analysis}). Consequently, for the 3 additional tests, we have an overhead of $3\cdot7290\byte=21870\byte$, which is modest compared with a delay of at least 5 minutes to decrypt the results of the tests through the \lstinline{Decrypt} web service.
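One way to recover the 5-minute figure is the following back-of-the-envelope estimate. It rests on an assumption of ours: each follow-up request is submitted right after the previous reply arrives, hence just after a server run, and therefore waits almost a full polling period $T = 2$ minutes. For a first call followed by 2 subsequent calls, this gives
\[
\underbrace{\frac{T}{2}}_{\mathrm{first\ call}} + \underbrace{2\cdot T}_{\mathrm{2\ subsequent\ calls}} = 1~\mathrm{min} + 4~\mathrm{min} = 5~\mathrm{min},
\]
whereas a single batched request keeps the expected delay at $T/2 = 1$ minute, at the price of the $21870\byte$ of additional traffic computed above.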
% Web Service Limitations
\subsection{Web Service Limitations}
In order to facilitate the deployment of the web service at SHCS, we decided to simplify the service as much as possible. Also, as we need a local web service both to test the code and to keep the possibility of running our application locally, we needed to choose a technology which is standard, easily deployable, and requires little development.\par
Contrary to the common trend of using RESTful web services (representational state transfer), we decided to use plain stateless web services, as this simplifies the deployment of the solution at SHCS, whereas a RESTful design would not bring any benefit in our setting.\par
In a first phase, Document Type Definitions (DTD) have been designed for the requests and replies of both services. This allows us to agree with the SHCS IT administrators on the message format for both web services. The full description of these specifications can be found in \refappendix{sec:spec-messages-format}. In a second phase, a WSDL (Web Services Description Language) description will have to be written (or generated) for proper compatibility between the web services running at SHCS and possibly at EPFL.
\subsection{Client Databases Migration to Microsoft Access}
Clients use a local database to store basic patient demographic information, cryptographic key shares and information on test weights.
In our previous application, clients used local MySQL databases. However, as discussed in \refsec{sec:constraint-client-infra}, MySQL cannot be deployed in a reasonable time. We decided to migrate to Microsoft Access databases, as Microsoft Access is already deployed on the clients.
The implementation of the solution using Microsoft Access is described in \refsec{sec:impl-migr-access}.
\subsection{SPU Database Migration to Oracle}
The initial server application used MySQL databases to store genomic data, non-genomic data and cryptographic keys. The SHCS databases, however, run on Oracle. Implementation details are described in \refsec{sec:impl-migr-oracle}.
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Implementation
\section{Implementation}
The initial code was separated over three independent projects:
\begin{enumerate}
\item\lstinline{PPPClient}: client application run by the MU,
\item\lstinline{PPPServer}: server application run by the SPU,
\item\lstinline{PPPCertifiedInstitution}: certified institution run by the CI.
\end{enumerate}
% Preparation work
\subsection{Preparation Work}
Preparatory work was needed in order to replace various modules of the current solution. A first issue was that the cryptographic, database, message and other classes were duplicated over the \lstinline{PPPClient}, \lstinline{PPPServer} and \lstinline{PPPCertifiedInstitution} applications.\par
A fourth project \lstinline{PPPCommons} has been created in order to group all similar classes within one project containing all helper, network communication, database communication, cryptographic libraries and message classes.\par
Grouping all these common classes within one project required generalizing some classes, and extending these within the specific end-user projects. An example is the database communication module which was in large parts duplicated over all 3 projects with only little local modifications to accommodate specialized database queries. Modularizing this component will allow a faster transition to Oracle databases for the server application as well as for the Microsoft Access transition for the client application.\par
Grouping the cryptographic classes was not trivial. The initial Paillier cryptographic classes had a weak API: the level of abstraction was too low to ensure that library calls would be used correctly, e.g. methods returned \lstinline{BigInteger} arrays instead of packaging ciphertexts, keys, and related objects within distinct classes. Parameters of the Paillier scheme were also scattered throughout the user interface and logic source code. The cryptographic scheme was rewritten so that it provides a set of operations on data (encrypt, decrypt, partially decrypt, add and scale) without the caller handling \lstinline{BigInteger} objects directly.\par
Another effort was made to better modularize the execution of the tests, as a large portion of the code was again duplicated within the same class. The resulting code is organized into packages, including:
\begin{itemize}
\item Genomic logic for test computation and analysis: \lstinline{framework.*} package,
\item Input and output classes for PDF reader/writer and network communication: \lstinline{io.*} package,
\item User interface: \lstinline{ui.*} package,
\item Helper functions such as property file parser: \lstinline{utils.*} package.
\end{itemize}
This preparation work helped the author both to better understand the logic of the program and to lay solid foundations for further extensions of the project, while greatly facilitating the modifications and improvements described above. On that basis, the modifications to the protocol described in \refsec{sec:prot-modif} and discussed in \refsec{sec:prot-mod-discussion} were fairly straightforward to put in place.
% Replacement of Paillier by ECCEG Module
\subsection{Replacement of Paillier by ECCEG Module}
Replacing the Paillier module by the ECCEG module required various adaptations. The initial Paillier module mostly returned lists of \lstinline{java.math.BigInteger} objects. It did not provide any interface to abstract the notion of ciphertext, partially decrypted ciphertext or cryptographic key share. Adding or scaling ciphertexts also had to be done by manipulating \lstinline{java.math.BigInteger} objects directly. Many bugs and security flaws stem from an inadequate use of cryptographic libraries. An effort was therefore made to write a high-level interface hiding the underlying algorithms and computations while keeping the module easy to use.\par
On the deployment side, all the tables storing encrypted data or secret key shares had to be modified. They originally stored \lstinline{java.math.BigInteger} objects inside \lstinline{varchar} columns in the databases. In order to reduce the storage space, we decided to store the bytes of the keys and ciphertexts directly inside \lstinline{blob} (binary large object) columns.
% Migration to Microsoft Access
\subsection{Migration to Microsoft Access}\label{sec:impl-migr-access}
Microsoft Access does not provide native BLOB objects as in MySQL. In order to store the cryptographic keys at the client, we need to serialize them to characters using base64 encoding. We will then store them in regular \lstinline{varchar} objects.\par
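A minimal sketch of this serialization is given below. It assumes Java 8 or later, where \lstinline{java.util.Base64} is available; the class and method names are illustrative and not taken from the project code.

\begin{lstlisting}[language=Java]
import java.util.Arrays;
import java.util.Base64;

public class KeyShareCodec {
    // Encode raw key-share bytes as a base64 string that fits a varchar column.
    static String toVarchar(byte[] keyShare) {
        return Base64.getEncoder().encodeToString(keyShare);
    }

    // Decode the stored base64 string back into the original bytes.
    static byte[] fromVarchar(String stored) {
        return Base64.getDecoder().decode(stored);
    }

    public static void main(String[] args) {
        byte[] share = {0x01, 0x23, (byte) 0xAB, (byte) 0xFF}; // dummy bytes, not a real key
        String stored = toVarchar(share);
        System.out.println(stored);
        // The round trip must give back exactly the same bytes.
        System.out.println(Arrays.equals(share, fromVarchar(stored)));
    }
}
\end{lstlisting}

Base64 expands the data by roughly one third, which is the price paid for storing binary material in a character column.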
Up to Java 7, the usual way to communicate with MS Access databases was to use the JDBC-ODBC (Java Database Connectivity - Open Database Connectivity) bridge. However, this bridge will be removed in Java 8 \citep{jdbc-odbc-remove}, as it was platform dependent. A mitigation is to use a third-party JDBC driver such as the UCanAccess library, which implements simple transactions and basic functionalities for MS Access databases \citep{lib-ucanaccess}.
% Migration to Oracle
\subsection{Migration to Oracle}\label{sec:impl-migr-oracle}
Migrating the server application to Oracle databases requires little modification. The syntax of the requests does not change. We only need to replace the current database driver by the Java Database Connectivity (JDBC) driver provided by Oracle.
% Development of a Web Service
\subsection{Development of the Web Services}
Developing web services in Java opens the realm of Java Enterprise Edition (EE). We decided to develop our local web services with GlassFish Server, which is delivered with Java EE, is integrated in all development tools, and provides efficient and reliable tools to quickly develop web services \citep{glassfish-doc}. It is also well documented and supported, as it is officially sponsored by Oracle. This appeared to be a more robust solution than other packages such as the Apache Web Services projects \citep{apache-ws}.
%%%%%%%%%%%%%%%%%%%%%%%%%%
% Limitations
\section{Limitations}
We present here limitations of and issues with the current scheme, and suggest improvements that could be made in future work.
The current application leaks SNP markers and CF identifiers to the SPU. A mitigation is to encrypt the identifiers using a deterministic symmetric encryption scheme. The symmetric key would be distributed to the MU by the Key Manager, along with the MU's secret key share. However, this falls outside the scope of this project and is left to an extension of the current work.
% Timing attacks
\subsection{Timing Attacks}
Assume that we perform marker obfuscation as described in \refsec{sec:depl-marker-obf}. The real markers are then not leaked, but the number of SNPs requested still leaks information on which test is performed. Similarly, some practitioner A could need to run only one test whereas practitioner B would run many more tests. An eavesdropper could infer whether the patient visits practitioner A or B by observing the size of the request.\par
A mitigation to this issue could be to use dummy requests or to pad the requested SNPs with dummy SNPs. The padding should be deterministic: if a practitioner runs the same test again with different dummy SNPs, an attacker could easily infer which markers are relevant to the test by intersecting the requests. The same idea holds for tests using clinical factors.
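The following sketch illustrates what deterministic padding could look like. It is not part of the current implementation: the class name, the marker names and the choice of deriving the dummies from the test identifier through a seeded \lstinline{java.util.Random} are all assumptions made for illustration, and a real deployment would need a keyed, cryptographically sound derivation instead.

\begin{lstlisting}[language=Java]
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class DeterministicPadding {
    // Pad the real markers of a test up to targetSize with dummies taken from pool.
    // The dummies depend only on testId (and on the fixed order of pool),
    // so repeating the same test always produces the same request.
    static List<String> pad(String testId, List<String> real,
                            List<String> pool, int targetSize) {
        List<String> candidates = new ArrayList<>(pool);
        candidates.removeAll(real); // never duplicate a real marker
        // Illustrative seed only: not cryptographically secure.
        Collections.shuffle(candidates, new Random(testId.hashCode()));
        List<String> request = new ArrayList<>(real);
        for (String dummy : candidates) {
            if (request.size() >= targetSize) {
                break;
            }
            request.add(dummy);
        }
        Collections.sort(request); // position must not reveal which entries are real
        return request;
    }

    public static void main(String[] args) {
        // Hypothetical marker names, used only for this example.
        List<String> pool = Arrays.asList("m1", "m2", "m3", "m4", "m5", "m6");
        List<String> real = Arrays.asList("m2", "m5");
        List<String> first = pad("testA", real, pool, 4);
        List<String> second = pad("testA", real, pool, 4);
        System.out.println(first.equals(second)); // identical padding on both runs
        System.out.println(first.size());
    }
}
\end{lstlisting}

Padding every request to a common size additionally hides the number of SNPs, at the cost of retrieving and transferring material that is never used.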
% Message level security
\subsection{Message Level Security}
It is not clear yet which options are available for securing the communication between the web services and the clients. Typical technologies for secure communication with web services use Security Assertion Markup Language (SAML), Kerberos or X.509 security token formats. It is usually preferable to rely on standard and tested protocols instead of reinventing the wheel with home-made cryptography. This usually reduces the attack surface of the system.\par
Another possibility would be to use a standard secure channel protocol such as SSL for the communication with the web service. This seems to be feasible, but a detailed analysis of the Android API has to be conducted to see whether it would be fully compatible.\par
Further discussions with SHCS need to be conducted in order to see which technical solutions are possible to deploy on their system.
% Key management
\subsection{Key Management}
In this work, we did not consider the key distribution and management by the Key Master. We artificially assumed that the initial communication used to transfer the key shares was reliable and secure, e.g. protected with a previously distributed symmetric key. To summarize, the client needs to receive a secret key share for material decryption and a symmetric key for identifiers. The server, on the other hand, needs one key share for material decryption.\par
Another important aspect of key management is the revocation of the keys as well as the enforcement of access controls that would not jeopardize the efficiency of reliability tests.