

The Guardrails of Data Science: Regulations and Certifications for Health Care

January 20, 2021

In this post we explore the guidelines and formal regulations intended to provide tools and guardrails for data science in health care. Guardrails often take the form of generally accepted standards or best practices, presented by data scientists and established through rigorous scientific experiments. Regulations formalize best practices and set a standard for ensuring that data scientists, like clinical providers, do no harm to patients. In our exploration we will discover that the context in which health care data is used (e.g. performing health care operations vs conducting scientific research) determines which regulations apply to a data science effort.

HIPAA: Privacy and Patient Trust

HIPAA stands for the Health Insurance Portability and Accountability Act. It was signed by President Bill Clinton in 1996 and strengthened by the HITECH Act of 2009. It was created to ensure that personally identifiable information (PII) and protected health information (PHI) collected by health care organizations is not disclosed without the patient’s consent or knowledge.

To search for patterns in health-related data and draw conclusions, data-driven entities need access to patients' PHI, such as comorbidities, diagnoses, and historical records of medications. PII, such as age, address, or demographics, is collected so that this information can be included in the analysis. The full name of the patient, which helps track a patient across changes of insurer and so collect longer periods of data, is a key factor, particularly in chronic conditions like multiple sclerosis (MS).

This information is necessary to perform proper analysis in health-related fields; in fact, many innovations and scientific advances would not have been possible without this structured annotation. For example, the estimation of cancer risk based on genetic tests has shown great progress since we started collecting genetic data that reveals similarities and discrepancies between different cancer subtypes and diagnoses.

To ensure compliance under the HIPAA Security Rule, access to PHI and sensitive information is restricted to authorized personnel only. Typically, incoming data that contains PII and PHI is preprocessed to “de-identify” it before it is made accessible to the personnel who need it for analysis. De-identification consists of matching different sources of data (e.g. MRIs, EMR, biomarkers, claims data, etc.) to the same patient, dropping every PII field, and attaching to all of these data sources a random and unique “patient_id” that allows the sources to be matched, if required. A patient’s various data sources are often mapped to the patient using a master patient index solution. Data lineage policies and practices track where data comes from, how it has been processed, who accesses it, and where it has been moved to on its journey through a data science effort. This level of diligence is important to ensure that there is no inappropriate disclosure of PHI, security breach, or violation of policy and procedure that would negatively impact a patient’s privacy.
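The matching-and-dropping step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a compliant implementation: the field names in `PII_FIELDS` and the `mpi_key` column (standing in for a master patient index lookup) are hypothetical, and dropping a short list of fields does not by itself satisfy any formal de-identification standard.

```python
import uuid

# Hypothetical PII field names; a real list comes from your compliance team.
PII_FIELDS = {"full_name", "address", "date_of_birth"}


def deidentify(sources):
    """Link records from several sources to one random patient_id and drop PII.

    `sources` maps a source name (e.g. "claims") to a list of record dicts.
    Records are matched on a hypothetical "mpi_key" field that stands in for
    a master patient index lookup.
    """
    id_map = {}   # mpi_key -> random patient_id; store under restricted access
    cleaned = {}
    for source_name, records in sources.items():
        cleaned[source_name] = []
        for record in records:
            key = record["mpi_key"]
            if key not in id_map:
                id_map[key] = uuid.uuid4().hex  # random, not derived from PII
            safe = {k: v for k, v in record.items()
                    if k not in PII_FIELDS and k != "mpi_key"}
            safe["patient_id"] = id_map[key]
            cleaned[source_name].append(safe)
    return cleaned, id_map


sources = {
    "emr": [{"mpi_key": "A1", "full_name": "Jane Doe", "diagnosis": "MS"}],
    "claims": [{"mpi_key": "A1", "address": "1 Main St", "drug": "interferon"}],
}
cleaned, id_map = deidentify(sources)
```

Analysts would receive only `cleaned`; the `id_map` that links back to real patients stays with the small group authorized to see PII.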

The intent of PII protections is to mitigate and reduce risk, not to entirely eliminate it. De-identification and anonymization of data to enable its use in data science efforts is an example of regulatory risk mitigation. Cases such as Breyer v Germany demonstrate that it can be difficult to identify what is PII and to entirely eliminate risk through regulatory efforts. This 2016 case, which originated in the German courts, raised the question of whether a dynamically assigned IP (internet protocol) address, which is assigned whenever a user browses the Internet, should be considered PII. The answer, as in many cases, is that it depends on a complex combination of context and details. Technology is evolving through scientific progress, and safeguarding patient privacy will require ongoing diligence.

HIPAA noncompliance carries serious consequences at the State and Federal level. The penalties are based on the level of negligence and can range from $100 to $50,000 per violation (or per record), with a maximum penalty of $1.5 million per year for violations of an identical provision. Violations can also carry criminal charges that can result in jail time.

Global Privacy: GDPR

The General Data Protection Regulation (GDPR) is the European Union standard for personal data protection. This regulation affects numerous aspects of the protection of each individual's data, including:

  • the type of data and processing allowed,
  • the conditions under which individuals must opt in,
  • the ability to have their data removed,
  • increased transparency and accountability in data processing, especially regarding the sharing of data with third parties.

Whereas HIPAA’s focus is on protecting a person’s health data, GDPR emphasizes a person's ownership and control of their data. In particular, this includes the requirement that patients specifically opt into data collection, as well as the ability to have their data easily removed from companies to which they have previously given consent. The general takeaway from GDPR is that your system must allow customers to fully remove themselves with the same level of effort it took to onboard them. You must also provide information on what third parties are using the customers' data for. A good overview of GDPR requirements can be found here.
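As a rough sketch of what "removal with the same effort as onboarding" can look like in code, the toy registry below gives opting in and erasure one call each and keeps a per-user record of third-party sharing. The class and method names are hypothetical, the store is in-memory only, and real erasure must also reach backups, logs, and downstream copies, which this does not model.

```python
class ConsentRegistry:
    """Toy in-memory store: one call to opt in, one call to be erased."""

    def __init__(self):
        self._records = {}   # user_id -> personal data
        self._sharing = {}   # user_id -> {third_party: purpose}

    def opt_in(self, user_id, data):
        """Store data only after an explicit opt-in."""
        self._records[user_id] = dict(data)
        self._sharing[user_id] = {}

    def share(self, user_id, third_party, purpose):
        """Record that a third party uses this user's data, and why."""
        if user_id not in self._records:
            raise KeyError("no consent on file for this user")
        self._sharing[user_id][third_party] = purpose

    def sharing_report(self, user_id):
        """What third parties use this user's data, and for what."""
        return dict(self._sharing.get(user_id, {}))

    def erase(self, user_id):
        """Remove the user's data and sharing records in a single call."""
        self._records.pop(user_id, None)
        self._sharing.pop(user_id, None)

    def has_data(self, user_id):
        return user_id in self._records


registry = ConsentRegistry()
registry.opt_in("user-1", {"condition": "MS"})
registry.share("user-1", "lab-partner", "biomarker analysis")
report = registry.sharing_report("user-1")
registry.erase("user-1")
```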

The points in GDPR regarding enhanced accountability, transparency, and consent are, while not without cost, generally things that fall into the category of good scientific practice and good science communication. The staff of a company, university, or hospital should be able to explain to a trial subject or an app user the goals, methodology, and benefits of a study or analysis in a way that is compelling enough to convince a member of the public.

The ability to delete one’s data from an organization’s records raises a particularly interesting question about model development from an auditing perspective. If one or more individuals' data is used to train a model for use in a clinical setting, and those individuals later decide to delete their data from the organization performing that work, then that data is no longer available for the construction and training of future models. The model itself (i.e. the digital object that estimates a probability used to apply some clinical label) will, after training, testing, and validation, be stored for future use. Its creators will, however, no longer be able to train a subsequent model, whether an upgrade or an audit check, using exactly the same data. This means that model builders need both to understand for themselves and to communicate to the world the uncertainties associated with their model.

For example, if the first version of a model estimates the probability that some patient has a particular medical disorder at 5.3±1.4%, and an updated model trained on a slightly (or even completely) different data set estimates that probability as 4.9±1.3%, those results are statistically consistent with one another. In fact, many machine learning models have an element of stochasticity in the grouping or ordering of samples as the model arrives at its final configuration. The resulting random changes to model parameters are small, but not zero.

The concept of both experimental and modeling uncertainty is a vitally important one, and should be a part of the way we think about clinical tests, and indeed many other types of statistical analysis (physical measurements, lab tests, political polls, etc.). This, too, can be filed under the heading of good scientific practice that should be adopted more broadly regardless of any external regulatory burden.
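One simple way to check the consistency claim in the example above is to express the difference between the two estimates in units of their combined uncertainty. The sketch below assumes the two errors are independent and roughly Gaussian, which is a simplification: two versions of a model trained on overlapping data are not strictly independent. The function name is ours.

```python
import math


def z_score(est_a, err_a, est_b, err_b):
    """Difference between two estimates in units of their combined uncertainty.

    Assumes independent, roughly Gaussian errors added in quadrature.
    """
    return abs(est_a - est_b) / math.sqrt(err_a**2 + err_b**2)


# Model v1: 5.3 +/- 1.4 %, model v2: 4.9 +/- 1.3 %
z = z_score(5.3, 1.4, 4.9, 1.3)
print(f"z = {z:.2f}")  # about 0.21
```

A difference of roughly 0.2 combined standard deviations is far below any conventional threshold for a significant disagreement, which is what "statistically consistent" means here.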

Is Cloud Computing Compliant or Safe?

Cloud providers have taken the existing business model of renting physical space in a data center and abstracted away the physical layer, so that customers have only digital access to the resources they pay for. The largest vendors, Amazon Web Services, Microsoft Azure, and Google Cloud Platform, do the heavy lifting of obtaining security certifications and are regularly audited so that you do not have to do so yourself; they also provide business associate agreements (BAAs) upon request. The caveat is that just because the underlying physical infrastructure and networking abide by all the certifications the vendors proudly display on their websites does not mean that the workloads you run follow those standards. It is the responsibility of cloud customers to take HIPAA-eligible services and follow the guidelines that make their use of those services compliant.

Policy of Least Privilege (PoLP)

A philosophy that proposes individuals should only have access to the data necessary to conduct their work sounds reasonable, right? We believe PoLP is a foundational concept and a good approach to data governance and to the data stewardship in which data science teams play a role. It is important to implement a PoLP guideline at the beginning of any effort involving patient data, since retrofitting it into a project can be difficult or impossible. Generally, data scientists and engineers should be limited to the data sets and individual data elements that enable their effort, without limiting innovation or clinical insights. Implementing PoLP can be technically difficult and requires an organization to establish policies and procedures that are supported by all stakeholders. A strong PoLP policy protects patients while enabling the development of clinical insights. An often overlooked aspect of PoLP is that it applies not only to the humans in a data science effort but also to the engineering pipelines, machine learning algorithms, data storage systems, and overall workflows that the effort employs.
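At the level of individual data elements, PoLP can be pictured as an allow-list per role, where a "role" may be a person or an automated pipeline. The sketch below is only an illustration: the role names and field names are hypothetical, and in practice this control would be enforced by the data platform's access controls rather than in application code.

```python
# Hypothetical roles and field names, for illustration only.
ALLOWED_FIELDS = {
    "data_scientist": {"patient_id", "diagnosis", "medication"},
    "billing_pipeline": {"patient_id", "claim_amount"},
}


def project(record, role):
    """Return only the fields the given role (human or pipeline) may see."""
    allowed = ALLOWED_FIELDS.get(role, set())  # unknown roles see nothing
    return {k: v for k, v in record.items() if k in allowed}


record = {
    "patient_id": "p-001",
    "diagnosis": "MS",
    "medication": "interferon",
    "claim_amount": 120.0,
    "address": "1 Main St",
}
scientist_view = project(record, "data_scientist")
pipeline_view = project(record, "billing_pipeline")
unknown_view = project(record, "intern")
```

Defaulting unknown roles to an empty set means access must be granted explicitly, which is the point of least privilege.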

Software as a Medical Device (SaMD)

The International Medical Device Regulators Forum (IMDRF) is a voluntary group of medical device regulators that works collaboratively to harmonize medical device regulation and safeguard the use of medical technology across countries. Software as a Medical Device (SaMD) is a framework from the IMDRF SaMD Working Group, which is chaired by the US FDA and was established in 2013. The framework’s intent is to provide guidelines for the safe use of software that performs a medical purpose on its own, without being part of a hardware medical device, such as software that guides or suggests medical care. The evolving nature of data science and software requires that SaMD likewise evolve. Current IMDRF working group activities address Artificial Intelligence Medical Devices (AIMDs), the Medical Device Cybersecurity Guide, and Personalized Medical Devices, which demonstrates the complexity of defining and regulating software that plays a role in the treatment of patients.

The SaMD risk categorization framework has been proposed to establish a common lexicon and approach for determining the level of risk that software in the health care environment poses to patients and the public. There are four risk categories, which are shown in Table 1 below.

Table 1
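For readers who think in code, the categorization can be pictured as a lookup on two inputs: the state of the health care situation and the significance of the information the software provides. The mapping below is our recollection of the IMDRF risk categorization document, where category IV is the highest impact and category I the lowest; treat it as an assumption and verify it against the published IMDRF table before relying on it. The key strings and function name are ours.

```python
# (state of health care situation, significance of information) -> category.
# Mapping as we recall it from the IMDRF SaMD risk categorization document;
# verify against the published table before use.
SAMD_CATEGORY = {
    ("critical", "treat_or_diagnose"): "IV",
    ("critical", "drive_clinical_management"): "III",
    ("critical", "inform_clinical_management"): "II",
    ("serious", "treat_or_diagnose"): "III",
    ("serious", "drive_clinical_management"): "II",
    ("serious", "inform_clinical_management"): "I",
    ("non_serious", "treat_or_diagnose"): "II",
    ("non_serious", "drive_clinical_management"): "I",
    ("non_serious", "inform_clinical_management"): "I",
}


def samd_category(situation, significance):
    """Look up the risk category; raises KeyError on unknown inputs."""
    return SAMD_CATEGORY[(situation, significance)]


category = samd_category("serious", "drive_clinical_management")
```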

SaMD is evaluated using the Software as a Medical Device: Clinical Evaluation guide which describes a SaMD application’s performance with regard to analytical and technical accuracy as well as clinical validation. Clinical assessment is complemented by guidelines for software quality and engineering standards in the SaMD: Application of Quality Management System (QMS) document. The SaMD regulatory pathway is captured in Figure 1 below from the IMDRF documentation.

Figure 1

Under the FDA's proposed regulatory framework for artificial intelligence and machine learning-based software as a medical device, the FDA would expect a commitment from manufacturers to transparency and real-world performance monitoring, as well as periodic updates to the FDA on what changes were implemented as part of the approved pre-specifications and the algorithm change protocol.

The proposed regulatory framework could enable the FDA and manufacturers to evaluate and monitor a software product from its premarket development to postmarket performance. This potential framework allows for the FDA’s regulatory oversight to embrace the iterative improvement power of artificial intelligence and machine learning-based software as a medical device, while assuring patient safety.

Other Data Science Standards and Frameworks

There are numerous existing standards that directly or indirectly regulate data science efforts. This is not a comprehensive list; each effort requires a review of relevant regulations and consideration of evolving frameworks:

The Tradeoffs of Regulation vs Innovation

Innovation is inherently novel and uncertain--and therefore risky--while regulation implies control of such risks from new, untried products or services. Health care records, unlike credit cards or passwords, are especially sensitive because they can’t be canceled, changed, or reset in the event of a breach. Technological advances have led to increasingly stringent regulations because of the ability to identify patients based on their digital footprint. Blinding demographic variables and not sharing patient information across institutions are measures often taken as safeguards for privacy, but these limit data scientists’ ability to build precision models or sufficiently power studies. Putting securely engineered systems in place can be time-consuming and expensive.

Regulation is often perceived as hindering innovation, but it can also serve as an opportunity for organizations to be even more innovative while adhering to across-the-board guidelines. As with all data-driven medical solutions, regulatory agencies have been attempting to strike the best balance: allowing promising innovations into the medical marketplace, where they can be field-tested, while providing access to patients willing to accept the risk. Informed consent is a key part of understanding and communicating what terms a user is accepting when their data is used for business and research purposes. Having safeguards in place also ensures that every therapeutic, diagnostic, and device that advances is medically necessary, with proven efficacy and safety.

Regardless of the size of your data-driven organization, it is always a good idea to proactively plan out your project: 1) explore use cases that require handling PII/PHI, 2) categorize the risk and impact of scenarios, and 3) test the robustness and compliance of the system that you have in place (whether manual or automated). Smaller and typically more dynamic organizations in proof-of-concept or beta testing often want to work more nimbly, but poorly built prototypes can make it challenging to layer on encryption and security retroactively. Larger institutions with more patients and stakeholders are typically more risk-averse, and for them it is important to embed security checkpoints in, and synchronize them with, development cycles rather than react to a breach.

As discussed in our last article, the COVID-19 pandemic has led to the scientific method unfolding live, with guidelines and recommendations communicated to the public as more samples are acquired. This public health pressure to innovate has led to unique privacy considerations when it comes to contact tracing and free genetic testing, so make sure to understand the terms of service and level of risk mitigation whether you are a patient, provider, payer, or pharmaceutical company.

Next Article

In our upcoming blog posts we will explore our experience and thoughts about building machine learning models.

About David Hughes

David Hughes is the Principal Machine Learning Data Engineer for Octave Bioscience. He develops cloud-based architectures and solutions for surfacing clinical intelligence from complex medical data. He leverages his interest in graph based data and population analytics to support data science efforts. David is using his experience leading clinical pathways initiatives in oncology to facilitate stakeholder engagement in the development of pathways in neurodegenerative diseases. With Octave, he is building a data driven platform for improving patient experience, mitigating cost, and advancing health care delivery for patients and families.

About Octave Bioscience

The challenges for MS are significant, the issues are overwhelming, and the needs are mostly unmet. That is why Octave is creating a comprehensive, measurement driven Care Management Platform for MS. Our team is developing novel measurement tools that feed into structured analytical data models to improve patient management decisions, create better outcomes and lower costs. We are focused on neurodegenerative diseases starting with MS.

