In public health, there is a need to access health and healthcare-related data faster than ever before.
A pertinent example is the current coronavirus (COVID-19) outbreak. In early February 2020, just days after the first cases were confirmed in the UK, the Department of Health and Social Care, along with UK Research & Innovation, announced a £20M investment to facilitate rapid research on the virus by world-class UK academic and industry experts1. Encouragingly, this announcement was met with great interest across the sector, and my hope is that this funding will ultimately enable us to control the current outbreak.
Yet a number of the projects proposed in these applications hinge on expedited access to high-resolution population health and healthcare data, which reside, rightly so, in tightly controlled environments. In the UK, these data predominantly come from providers such as NHS Digital, which hosts Hospital Episode Statistics (HES), a database containing details of all admissions, A&E attendances and outpatient appointments at NHS hospitals in England, or the Clinical Practice Research Datalink (CPRD), which collects de-identified patient data from a network of GP practices across the UK.
These public health datasets are typically accessed by completing a paper-based application and study protocol, which are then checked and peer-reviewed before data are extracted and, following payment, sent to applicants.
This process is problematic for a number of reasons:
Health data are highly sensitive
These data are highly personal in nature. As such, stringent measures have been put in place to ensure that data subjects' rights are not violated and that data controllers only grant access to those with a sound rationale for processing and adequate data protection safeguards. These are reasonable controls, but the process of applying for data can be incredibly time-consuming, often taking months and sometimes years.
One of the problems is that understanding the contents of a given dataset without prior access is very hard. In many cases, data providers publish data dictionaries to explain the variables on offer, but these provide little additional information on data structure or quality.
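To illustrate the gap, a dictionary entry might tell you that a field holds an admission date, but only profiling the records themselves reveals how complete that field is or how rows relate to one another. The sketch below is a minimal, hypothetical example using a handful of synthetic records in Python; the field names (spell_id, admidate, diag_01) are assumptions for illustration, not an exact HES specification.

```python
import pandas as pd

# Synthetic, made-up records for illustration only -- not real HES data.
# Column names are assumptions, not the actual HES specification.
extract = pd.DataFrame({
    "spell_id": [1, 1, 2, 3, 4],
    "admidate": ["2019-03-01", None, "2019-05-12", "2019-06-03", None],
    "diag_01":  ["I21.9", "I21.9", None, "J18.1", "E11.9"],
})

# A data dictionary says what each column means; it does not tell you
# how complete the columns are, or that one spell can span several rows.
print(extract.isna().mean())               # proportion of missing values per column
print(extract.groupby("spell_id").size())  # number of rows (episodes) per spell
```

None of this is knowable from a variable list alone, yet it is exactly the information an applicant needs when deciding which fields and time points to request.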
As such, researchers are expected to develop and submit, up-front, a complete study protocol detailing every variable and time point required, their research questions and formulated hypotheses, and the expected outputs. However necessary these details are, this process ignores the fact that research is, by its very nature, organic: it is very difficult to derive hypotheses from data you have never seen.
Application success rates are low
Without access to the data ahead of time, researchers are forced to complete an application and protocol based on limited information. The most common result is rejection, often after a long wait.
This is a disheartening process, especially since, with more insight into the data being applied for, many applications would be approved first time. The review process often involves internal critique by the data provider before expert advice is sought from an advisory committee that meets on a semi-regular basis. Comments from unsuccessful applications are fed back to the applicant, who is then tasked with making amendments and re-submitting for review.
Data are expensive
Finally, after successive rounds of review, if an application is accepted, the researcher is asked to pay upfront for the dataset. Upon payment, data are released for a defined period of time and, in most cases, must be destroyed at the end of that period. The fees for access to these data are not cheap, again putting many junior researchers off applying in the first place.
A recent report by EY estimated the value of patient data in the UK at £10B per year2. Given the potential interest from industry players in acquiring these data, I suspect this may be an underestimate. Such a valuation will keep the cost of access high, but at what cost to research and, indirectly, to patients?
With demand for access to health data in the UK increasing, researchers and industry alike are at the mercy of data providers who dictate the rules of engagement. However, more can be done to lower the barriers to entry without compromising the integrity or safety of patient data.
First, in rare circumstances I have been able to negotiate access to an exploratory data extract, so that researchers can explore the dataset and formulate hypotheses in a secure environment ahead of developing a full data application and protocol. This is beneficial because it allows researchers to view the data in the form they would receive it, fully understand its structure, organically test research questions and refine hypotheses, and identify additional variables of interest that maximise the utility of the data.
Second, developing the application in concert with experienced researchers who know the dataset increases the likelihood of success. We have recently achieved this by bringing together a group of academics and support staff experienced in health-related data to raise awareness of, and facilitate applications to, commonly used datasets. This internal development process dramatically reduces the time needed to produce a successful application.
Finally, rather than supplying individual extracts to each researcher, data providers could supply institutions with a complete dataset, for an appropriate fee, with access managed through an approved advisory board. This would further reduce the time and complexity associated with researchers applying for single data extracts.
In conclusion, as the need for rapid access to complex and sensitive datasets grows, as demonstrated by the current COVID-19 outbreak, so too does the need to identify pragmatic ways of reducing barriers to entry.