The application of artificial intelligence (AI), and more specifically machine learning (ML), to support decision making and accelerate innovation has grown rapidly in recent years. Large, aggregated datasets are extremely valuable; they can help us to identify trends in a population and relationships between variables, to test hypotheses, and to evaluate outcomes.
One of the central challenges for AI developers is the vast amount of data required in order to train AI algorithms successfully. Many of today’s AI applications seek to solve human-centred problems, and therefore rely heavily on data about individuals – personal data – in order to function. In this article, we will look at the phases in the lifecycle of ML applications, the importance of personal data in those phases, and the approaches which can be adopted to minimise regulatory risk whilst allowing algorithms to achieve their aims.
Personal data – being data relating to an identified or identifiable individual – must only be collected and processed in accordance with data protection laws. In the UK, we are concerned principally with the Data Protection Act 2018 (DPA) and the EU General Data Protection Regulation (GDPR). Certain types of personal data, such as health or biometric data, are classified as special category data, which are subject to heightened restrictions on processing and must be handled with additional safeguards. In this article, we will look principally at how ML models can meet the legal requirement of data minimisation, which, as we will see, also assists in achieving data security and in meeting the legal requirement that personal data are retained no longer than necessary for the purposes for which they are processed. We should not, however, lose sight of the requirement for personal data to be processed lawfully. Non-exhaustively, that means the data are processed: for the performance of a contract; in the legitimate interests of the controller or a third party; in order to carry out a task in the public interest; or on the basis of the consent of the individual whose data are being processed. Another important aspect of data protection law in the context of automated decision making – including decisions reached using AI systems – is that individuals have the right not to be subject to decisions based solely on automated processing which produce legal or similarly significant effects. This should be borne in mind by organisations implementing AI systems.
Some of the most widely reported (and debated) AI applications at present rely on biometric and other special category data, for example, AI systems that process vast amounts of patient data in order to find specific symptoms that could help identify the presence of a disease, thus enabling research into potential causes and cures. The importance of considering the privacy impact of AI systems will only increase, as more powerful models are fuelled by increasing amounts of data. With the wider availability of hardware specialised for the processing required in ML models (such as neural processors/accelerators, found even in mobile devices), it is more feasible than ever to implement larger, more powerful models. Typically, more data are required to train these larger models and achieve these higher levels of performance.
Typical ML Pipeline
Before turning to the strategies for minimising the privacy impact of ML models, it is useful to have in mind the common phases for such models. Trained ML models process data in order to make some prediction about that data. For example, in order to verify a user’s identity, face recognition models receive a picture of a face and predict whether the received picture depicts a specified user or not. This phase, where a trained ML model is used to make predictions, is usually referred to as the inference phase. The phase before this is known as the training phase. An untrained model has access to training examples during this phase; the data comprising those examples will often be collected from one device and transferred to a server where the ML model training will take place. These training examples include some input data and a target for each data point. The target represents the result that we would ideally like our model to produce for each given data point. Then, to train the model, we generate predictions for some training examples, and adjust the model so that the next time it sees the training examples, the generated predictions are more similar to the corresponding targets. In the face recognition example, the target would be 1 (i.e. ‘True’) for images with the user’s face, and 0 (i.e. ‘False’) otherwise.
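The training loop described above can be sketched in a few lines of Python. This is an illustrative toy, not a production face recognition system: a single number stands in for an image, and a simple logistic model plays the role of the ML model; all names and figures are assumptions for the example.

```python
import math
import random

random.seed(0)

# Toy training examples: (input, target) pairs, where target 1 means
# "this is the specified user" and 0 means "someone else".
# A single feature stands in for a whole face image here.
examples = [(0.9, 1), (1.1, 1), (0.8, 1), (-1.0, 0), (-0.7, 0), (-1.2, 0)]

w, b = 0.0, 0.0  # model parameters, initially untrained
lr = 0.5         # learning rate

def predict(x):
    """Logistic model: output in (0, 1), interpreted as P(target = 1)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Training phase: on each pass, nudge the parameters so that the next
# prediction for each example is closer to its target.
for epoch in range(200):
    for x, t in examples:
        p = predict(x)
        w += lr * (t - p) * x  # gradient step for log-loss
        b += lr * (t - p)

# Inference phase: the trained model makes predictions on new data.
print(round(predict(1.0), 2))   # close to 1, i.e. 'True': matches the user
print(round(predict(-1.0), 2))  # close to 0, i.e. 'False'
```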
These are the two most important phases, but there are others in which data are processed. Before deploying ML models in practice, it is important to go through a validation phase. This involves making predictions on unseen data (i.e. data that the model has not been trained on) to ensure that the model will perform well when deployed. When picking a suitable model for deployment, several candidate models may be validated. The best performing model is then evaluated on a further, separate data set in the testing phase, in order to obtain an unbiased assessment of its performance. In practice, this is implemented by partitioning a large collection of data into separate sets for these distinct phases, with the majority of the data used to train the various models (e.g. 70%), and smaller portions reserved for validation and testing (e.g. 20% for validation and 10% for testing).
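The split described above can be sketched as follows; the 70/20/10 proportions match the example in the text, while the data themselves are illustrative:

```python
import random

random.seed(42)

# A hypothetical collection of 1,000 labelled examples.
data = [{"id": i, "target": i % 2} for i in range(1000)]

random.shuffle(data)  # shuffle before splitting to avoid ordering bias

n = len(data)
train_set = data[: int(0.7 * n)]             # 70%: train candidate models
val_set = data[int(0.7 * n) : int(0.9 * n)]  # 20%: choose the best model
test_set = data[int(0.9 * n) :]              # 10%: final unbiased assessment

# Every example lands in exactly one of the three sets.
print(len(train_set), len(val_set), len(test_set))  # 700 200 100
```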
While it is common to separate training and inference phases, this may not always be the case in practice. One example of this is on-device training in which a model is continually refined through use, e.g. a face recognition model which is refined over time as a user’s appearance changes.
Most ML methods depend on the assumption that the data used to train a model come from the same distribution as the data that will be received during inference. Any pre-processing approaches to preserving privacy discussed below (i.e. processing applied to raw data before input into the ML model) will therefore need to be implemented consistently across the training, testing and inference phases.
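As a minimal illustration of this consistency requirement, a standardisation step fitted on the training data must be stored and re-applied unchanged to any data seen at testing or inference time; the income figures below are hypothetical:

```python
from statistics import mean, stdev

# Hypothetical training incomes used to fit the pre-processing step.
train_incomes = [28_000, 35_000, 42_000, 51_000, 64_000]

# Fit the pre-processing parameters on the training data only.
mu, sigma = mean(train_incomes), stdev(train_incomes)

def standardise(income):
    """The same transformation, with the stored parameters, in every phase."""
    return (income - mu) / sigma

# Training phase: transform the training inputs.
train_inputs = [standardise(x) for x in train_incomes]

# Inference phase: a new data point is transformed with the stored mu and
# sigma, never with statistics recomputed from the incoming data.
new_input = standardise(47_000)
print(round(new_input, 3))
```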
Data Protection Techniques
Many applications of ML, such as face recognition, use an individual as a data point. Each example in the set of training examples is linked to a particular individual and each example might contain a set of descriptors about the individual. For example, a job recommendation service may use an ML model to match candidates with jobs; each training example will represent a particular candidate. The descriptors for each candidate may include information such as industry sector experience, years of experience, relevant education, etc. For training examples linked to an individual, the data protection principle of minimisation means that the amount of personal data collected, processed and retained should be no more than what is necessary for the purpose of the data processing. Similarly, in the testing and inference phases, minimising the personal data used in ML applications is a key component of regulatory compliance. Below, we examine some of the principal methods for minimising personal data in ML applications, from training through to inference.
Perturbation Methods – Adding Noise
A simple technique which seeks to reduce the privacy impact of data processing on individuals is to ‘perturb’ the training examples by adding ‘noise’ to the data (e.g. by replacing existing data points with modified values). This approach mitigates the risk of retaining a fully accurate data profile for each individual in the trained ML model, which would pose a heightened risk to those individuals in the event, for example, of a data breach. In the job recommendation service example, a candidate’s previous income may be a desirable feature to include in the training data. Instead of using (and storing) a candidate’s real income, we could draw values from a noise distribution and add them to the real figures to produce a ‘noisy estimate’ of each candidate’s income.
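A minimal sketch of this perturbation, assuming zero-mean Gaussian noise and hypothetical income figures:

```python
import random
from statistics import mean

random.seed(1)

# Hypothetical real incomes of five candidates (personal data).
real_incomes = [31_000, 38_000, 45_000, 52_000, 60_000]

NOISE_SCALE = 2_000  # standard deviation of the zero-mean Gaussian noise

# Store and use only the noisy estimates, never the true figures.
noisy_incomes = [income + random.gauss(0, NOISE_SCALE) for income in real_incomes]

# Each individual value is perturbed, but because the noise has zero mean,
# aggregate statistics over the data set remain approximately correct.
print([round(x) for x in noisy_incomes])
print(round(mean(real_incomes)), round(mean(noisy_incomes)))
```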
This method is simple and easy to implement in the inference phase as well as in training: all that is needed is a record of which noise distributions were chosen during training, and of the features to which noise was added.
There are, however, drawbacks to implementing perturbation as a privacy measure. First, with any method that perturbs individual features of the training data, there is a risk of reducing the predictive power of ML models, as patterns in the training data become corrupted by the added noise. More importantly, if outliers are present in the training data (e.g. if the highest earning candidate earns double the second highest), adding noise will do little to mitigate the risk that the outlier can be singled out from the data.
Using Random Matrices
Another perturbation method involves generating a random matrix and multiplying it with the training examples (also represented as a matrix) to transform human-readable, sensitive data into a matrix of ‘random’ numbers. Because all of the data are modified by the same transformation, the predictive power of ML models should not be affected. At inference, instead of sending raw personal data to the server where the model is hosted, the personal data can first be multiplied by the generated matrix in order to preserve an individual’s privacy. There is, however, a risk that the process could be reverse engineered to find the inverse transformation and recover the original data.
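A sketch of the idea in plain Python; the feature vectors and the choice of a Gaussian random matrix are illustrative assumptions. The key property is that, because every example passes through the same linear map, linear structure in the data survives the transformation:

```python
import random

random.seed(7)

def matvec(M, v):
    """Multiply matrix M by vector v."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

# Sensitive training examples as 3-dimensional feature vectors,
# e.g. income, years of experience, degree flag (hypothetical).
x = [31_000.0, 12.0, 1.0]
y = [52_000.0, 20.0, 0.0]

# Secret random 3x3 matrix, generated once, kept by the model owner,
# and applied to every example (at training and at inference).
R = [[random.gauss(0, 1) for _ in range(3)] for _ in range(3)]

x_t, y_t = matvec(R, x), matvec(R, y)  # unreadable 'random' numbers

# Linear structure is preserved: R(ax + by) = a*Rx + b*Ry, so models that
# rely on linear relationships behave consistently on transformed data.
combo = [0.3 * xi + 0.7 * yi for xi, yi in zip(x, y)]
lhs = matvec(R, combo)
rhs = [0.3 * xt + 0.7 * yt for xt, yt in zip(x_t, y_t)]
print(all(abs(l - r) < 1e-6 for l, r in zip(lhs, rhs)))  # True
```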
Synthetic Data
A related approach is to train ML models using synthetic training data. In this privacy-preserving technique, generative models are used to generate synthetic data which closely match the characteristics of the original training data set comprising personal data. Such generative models will of course need to be trained in the first place, which will require the use of personal data, although that training could make use of federated learning (examined in greater detail below), where personal data are only stored on a local device. Alternatively, a centralised trusted organisation could train these models in a secured location, with the personal data deleted once the generative models are trained (favouring compliance with the data retention principle that personal data are not held longer than necessary). Once the generative models are trained to produce high-quality synthetic data, that synthetic data can be used to train AI and ML applications. Care must be taken to ensure that the generative model is not merely ‘remembering’ examples from the personal data training set, as this would not achieve the aim of reducing the privacy impact on individuals. We should also remember that at the inference phase, personal data will still need to be processed to generate accurate predictions in relation to a specific individual. The bulk of the data required in ML applications, however, is needed to train models at the outset, so using synthetic data for training can greatly minimise the amount of personal data being processed.
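To illustrate the principle only, the sketch below uses a deliberately trivial 'generative model' (a fitted Gaussian) in place of the neural generative models used in practice; the income data are hypothetical:

```python
import random
from statistics import mean, stdev

random.seed(3)

# Hypothetical personal data: incomes of 500 real candidates.
real = [random.gauss(40_000, 8_000) for _ in range(500)]

# 'Train' a trivially simple generative model by estimating distribution
# parameters. (Real systems would use e.g. GANs or VAEs; this is a sketch.)
mu, sigma = mean(real), stdev(real)

# The personal data can now be deleted; only the model parameters remain.
synthetic = [random.gauss(mu, sigma) for _ in range(500)]

# Synthetic data matches the statistical shape of the original data set
# without reproducing any individual's actual record.
print(round(mean(synthetic)), round(stdev(synthetic)))
```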
Federated Learning
In addition to minimising the amount of personal data processed, data protection law requires that appropriate security measures are in place to protect personal data from improper disclosure. One approach to mitigating the risk of a serious data breach is to adopt ‘federated learning’, which allows ML models to be trained whilst personal data remain on local devices rather than being held on a single central server. This technique involves including the devices used to collect personal data in the training phase of ML applications: each device partly trains its own local version of an ML model and sends the results to a server, which then trains a ‘global’ version of the model. The global version of the model is updated and sent back to the participating devices in order to continue the training process. Whilst the ‘results’ of the local training (e.g. the gradients of an objective function used to update the parameters of a neural network) are derived from personal data, the risk of identifying individuals from those results is greatly reduced. Further security can be achieved by requiring the results sent to the server to be encrypted, such that decryption can only occur once a large enough number of participating devices have sent their results. At inference, the local version of the model held on a device can be used to generate personalised predictions, without user data being transmitted to or stored remotely on the cloud.
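The federated averaging loop described above can be sketched as follows. This is a toy, assuming a one-parameter model y = w·x and three simulated devices; real systems exchange neural network updates and add encryption on top:

```python
import random

random.seed(5)

def local_update(w_global, local_data, lr=0.1):
    """One gradient step for the model y = w*x, using only this device's
    own data. Only the updated weight leaves the device, not the data."""
    grad = sum(2 * (w_global * x - y) * x for x, y in local_data) / len(local_data)
    return w_global - lr * grad

# Hypothetical personal data held on three devices, each roughly
# following the relationship y = 2x plus a little noise.
devices = [[(x, 2 * x + random.gauss(0, 0.1)) for x in (1.0, 2.0, 3.0)]
           for _ in range(3)]

w = 0.0  # global model parameter, held on the server
for _ in range(50):
    # Server sends w out; each device trains locally; server averages.
    local_weights = [local_update(w, d) for d in devices]
    w = sum(local_weights) / len(local_weights)  # federated averaging

print(round(w, 1))  # converges towards the underlying slope of about 2
```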
On-Device Models
As alluded to in the previous sections, ML models can be implemented directly on the devices that use the generated predictions. In the user verification example mentioned previously, the face recognition model can be implemented on the device that needs to verify the user. With this privacy-preserving technique, data are not sent to the cloud during inference.
Local models are more difficult to implement in the training phase, as fully training ML models is typically too resource intensive for current personal handheld devices. As a result, personal data may still need to be sent to a server for training models – unless federated learning or other related techniques are also used during training.
One way to reduce the amount of personal data required to achieve sufficient predictive power in this setting is by adapting pre-trained ML models. For example, when implementing user verification, instead of training a face recognition model for each user from scratch (requiring a large number of examples for each specific user), pre-trained models may be adapted for this purpose. In particular, a powerful image recognition model may be trained on millions of examples of images to detect whether there is any face in a picture. This model would not require images containing a specific user for training. As this model has captured the idea of what a face is, large parts of this model may be extracted to form part of a face recognition model in order to implement user verification. Then, the entire face recognition model can be trained by deploying the model locally on a user’s device, and asking the user to complete a configuration process using only a few examples captured of that specific user.
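A sketch of this adaptation, with a fixed function standing in for the frozen pre-trained feature extractor and a small trainable 'head' fitted on a handful of hypothetical enrolment examples captured during device configuration:

```python
import math

# Stand-in for a large pre-trained model: a frozen feature extractor,
# notionally trained on millions of generic face images. Its parameters
# are NOT updated during on-device adaptation.
def pretrained_features(image):
    return [image[0] + image[1], image[0] - image[1]]

# Only this small head is trained on-device, so a handful of examples
# from the specific user suffices.
w, b = [0.0, 0.0], 0.0

def head(feats):
    z = sum(wi * f for wi, f in zip(w, feats)) + b
    return 1.0 / (1.0 + math.exp(-z))  # P(image depicts this user)

# Few-shot enrolment data: (image, 1) for the user, (image, 0) otherwise.
# Two numbers stand in for a whole image here.
enrolment = [([1.0, 0.2], 1), ([0.9, 0.3], 1),
             ([-0.8, 0.1], 0), ([-1.0, 0.0], 0)]

lr = 0.5
for _ in range(300):
    for img, t in enrolment:
        f = pretrained_features(img)  # frozen backbone
        p = head(f)
        w = [wi + lr * (t - p) * fi for wi, fi in zip(w, f)]  # train head only
        b += lr * (t - p)

print(head(pretrained_features([1.0, 0.1])) > 0.5)   # True: accepted as user
print(head(pretrained_features([-0.9, 0.2])) > 0.5)  # False: rejected
```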
What the Future Holds
We have seen different approaches to mitigating the privacy impact on individuals: minimising the personal data used in the different phases of AI applications, prioritising data security, and being mindful of the period for which data are retained. The greatest mitigation from a regulatory perspective is achieved when data no longer pertain to an identifiable individual (which includes no longer allowing the ‘singling out’ of a person, even without ascertaining their full identity); such data cease to be ‘personal data’. Anonymised data held in a secure system achieve that desirable result, effectively removing the regulatory burden of data protection law by stripping the data of their ability to identify an individual. However, the sophistication of AI algorithms, in tandem with the number of publicly accessible data points on individuals, means that AI applications are increasingly able to analyse data sets which have been anonymised (that is to say, which contain no personal data) in combination with other data sets, in order to re-associate data with individuals.
It is not without irony that AI algorithms now pose a threat to the use of anonymisation for achieving privacy compliance in AI systems. Whilst anonymisation continues to be a useful tool for reducing risk when processing large data sets in particular, new technologies which can result in such data being re-associated with individuals mean that anonymisation remains a tool, but not a complete solution, for regulatory compliance. As AI continues to gain prominence, we can expect to see an increased focus on data security in order to guard against the improper disclosure of anonymised data, which could otherwise become ‘personal data’ once again, presenting a significant risk for the owner of the AI system.