Health Intelligence: Individualized Care by Detecting Subpopulations from Patient Data

Principal Investigator Luis Perez-Breva

Co-investigators Charles Cooney, Dame Fiona Murray, Tomaso Poggio

Project Website http://isquared.mit.edu.ezproxy.canberra.edu.au/research_detail/16/

Imagine a health system that uses data already collected to rank your treatment options by how well they have benefited a population similar to you, and that continues to learn from the collective experience of real-world patients and doctors to simplify the approval of new drugs and improve our understanding of human disease.

This research project is funded in part by the Center for Biomedical Innovation (CBI).

The promise of personalized medicine implied in the health system we envision assumes a healthcare paradigm built from the singularities of patients (genetic, environmental, behavioral), and requires understanding healthcare in terms of its capacity to generate and act on patient-specific information. This system may be within our reach today thanks to two developments: progress in computational learning techniques that achieve similar feats of customization on the web (search by keyword and similarity, ranking algorithms, visual detection and recognition, collaborative filtering, recommender systems, etc.), and the vast amounts of individual health data collected over the last several years.

In clinical studies alone, over 35 million people were profiled in the 94,000 studies received by the FDA from 2000 to 2010, according to clinicaltrials.gov. To characterize a single "blockbuster" drug, several thousand patients may be profiled in different studies. Yet despite this massive collection of patient data, health information today is limited to population statistics and superficial logs of incidents, and healthcare remains a cottage data industry. We believe this data is underused. Most of the data and the specific knowledge derived from it remain in silos. Computational learning has been used locally within the health system, for example to process biological information at the discovery stage, to enhance detection and recognition in medical imaging, or to systematize diagnosis in medical informatics, but it remains bound to these data silos. As a result, most health decisions are restricted to using the superficial information that is shared across data boundaries.

The problem of intelligence as it relates to a health system that learns from different sources of data, enhances our understanding of health and disease, and enables individualized medicine remains open. The difficulty lies in the scarce availability of data, which has prevented research on algorithms that extract knowledge from terabytes of health information and generalize across disparate data sources. The recent availability of large databases of patient-level data from electronic health records creates an unprecedented opportunity to explore the application of computational learning techniques more broadly, and to stratify populations by their similarity against different measures of benefit and risk.

This information-centric approach generalizes from an array of examples in which known biological information has been used to increase the efficacy of therapies. Orphan diseases (particularly genetic diseases), certain cancer therapies, and research on diseases like multiple sclerosis are all examples of information-driven health. In these examples, highly specific information criteria (e.g. biomarkers) have been used successfully to isolate responding populations, develop combination therapies, and break a disease into distinct sub-types. They suggest that better information extraction from patient data, efficient mechanisms to search the space of diseases and therapies, and an evidence-based approach to refining the taxonomy of diseases are critical to driving up the efficacy of therapies. We take these as the initial set of core principles behind a learning health system.

In this project we propose exploring the application of machine learning and artificial intelligence to identify cohorts of patients within a study who are similar to a given patient, and in this manner to isolate 'similarly' responding subpopulations. This is critical for diseases that may exhibit a stratification of patient responses to a given therapy that cannot be explained by single-factor biomarkers or genetics alone, but rather depends on multiple factors, both biological and environmental. The central hypothesis behind this project is that patient data (laboratory, electronic health records, or clinical study data) contains sufficient information about the underlying biology of the patient and the disease to enable sophisticated inference and assist decision-making at the clinical, regulatory and drug discovery levels. The goal of this project is to determine whether this approach has enough merit to support a proof of concept, in healthcare, of the same kind of personalization tools that are pervasive on the Internet. At later stages we may use the ability of learning algorithms to generalize in order to compare data across studies and with observational and genetics data.

We expect validation to be a recurring challenge along the process. Validation is already a challenge in healthcare, and becomes an even greater one for the system as a whole. Unlike in the Turing test, comparing the outcomes of an intelligent health system against human performance or some biological ground truth is ill-defined. Defining strategies to validate learning in healthcare will be central to the development of a cogent intelligent system, and relates to the broader problem of validation of intelligence.
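As a rough illustration of the cohort idea described above (finding patients similar to a given patient and comparing how they responded), the following minimal sketch uses a plain k-nearest-neighbor search over standardized features. It is not the project's method: the feature set (age and a single biomarker level), the outcome scores, the treatment labels, the choice of Euclidean distance, and all function names are assumptions made up for this example, and the data is synthetic.

```python
import math
from collections import defaultdict


def fit_scaler(rows):
    """Return a function that standardizes a feature row (zero mean, unit variance)."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 when a column is constant, to avoid dividing by zero.
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return lambda row: [(x - m) / s for x, m, s in zip(row, means, stds)]


def similar_cohort(features, query, k):
    """Indices of the k patients closest to `query` by Euclidean distance."""
    order = sorted(range(len(features)),
                   key=lambda i: math.dist(features[i], query))
    return order[:k]


def rank_treatments(treatments, outcomes, cohort):
    """Rank treatments by mean outcome inside the cohort, best first."""
    by_treatment = defaultdict(list)
    for i in cohort:
        by_treatment[treatments[i]].append(outcomes[i])
    means = {t: sum(v) / len(v) for t, v in by_treatment.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


# Synthetic, invented data: (age, biomarker level) for eight patients.
patients = [(34, 1.2), (36, 1.1), (35, 1.3), (33, 1.0),
            (70, 4.8), (72, 5.1), (68, 4.9), (71, 5.0)]
treatments = ["A", "B", "A", "B", "A", "B", "A", "B"]
outcomes = [0.8, 0.3, 0.7, 0.4, 0.2, 0.9, 0.3, 0.8]  # higher is better

scale = fit_scaler(patients)
scaled = [scale(p) for p in patients]

# A new, hypothetical patient who resembles the younger group.
cohort = similar_cohort(scaled, scale((35, 1.15)), k=4)
ranking = rank_treatments(treatments, outcomes, cohort)
print([t for t, _ in ranking])  # ['A', 'B']
```

In this toy data the two age groups respond in opposite ways, so the ranking flips depending on which cohort the query patient falls into. Real patient data would need far more care: mixed data types, missing values, confounding in who received which treatment, and the validation difficulties discussed above.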

A long-term goal for this project is to outline a roadmap for developing a concept of intelligence in health that is compatible with parallel efforts in adaptive policy, regulatory science, and the economics of sharing clinical data, and that helps frame the broader issue of how to validate learning. These activities are part of a broader collaboration with faculty associated with the Engineering Systems Division, Sloan, and the Center for Biomedical Innovation. We anticipate that an important first outcome of this collaboration will be a deeper understanding of the landscape of data that is currently scattered throughout the health system.