Masked Data vs Generated Test Data


Published: 27 November 2017
Author: Jared Holgate
Category: Engineering
Tags: DevOps,Testing,Data

Summary

What are the pros and cons of using masked production data vs generated test data? This article goes into a bit of detail on both.

Definitions

Masked / Scrubbed Data

Masked (sometimes called scrubbed) data is a copy of your production database that is processed to anonymise any sensitive data, such as user data, personally identifying data, financial data, etc.
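
To make the idea concrete, here is a minimal sketch of masking in Python. It is purely illustrative: the `customers` table, its columns and the `mask_value` helper are all hypothetical, and an in-memory SQLite database stands in for a copy of production. The unsalted hash is used only to keep the example short and would not be strong enough for real sensitive data.

```python
import hashlib
import sqlite3


def mask_value(value: str, prefix: str) -> str:
    """Replace a sensitive value with a stable placeholder.

    Illustration only: an unsalted hash of low-entropy data can be guessed,
    so a real masking tool would use a stronger technique.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}_{digest}"


def mask_customers(conn: sqlite3.Connection) -> None:
    """Overwrite the sensitive columns, leaving non-sensitive ones untouched."""
    rows = conn.execute("SELECT id, name, email FROM customers").fetchall()
    for row_id, name, email in rows:
        conn.execute(
            "UPDATE customers SET name = ?, email = ? WHERE id = ?",
            (
                mask_value(name, "name"),
                mask_value(email, "email") + "@example.com",
                row_id,
            ),
        )
    conn.commit()


# Hypothetical stand-in for a copy of the production database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER PRIMARY KEY, name TEXT, email TEXT, balance REAL)"
)
conn.execute(
    "INSERT INTO customers (name, email, balance) VALUES (?, ?, ?)",
    ("Alice Smith", "alice@corp.test", 120.5),
)

mask_customers(conn)
masked_rows = conn.execute("SELECT name, email, balance FROM customers").fetchall()
```

The shape and volume of the data survive (the `balance` column is unchanged), while the identifying columns no longer hold the original values. Every sensitive column has to be identified and handled explicitly, which is where much of the analysis effort goes.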

Generated Data

Generated data is produced by taking an empty database and filling it with completely fictional data.
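
As a rough sketch of generation, the Python below fills an empty schema with fictional rows. The `customers` table, the name lists and the `generate_customers` function are hypothetical and chosen only for illustration. A fixed seed is assumed so that the same data can be rebuilt on demand.

```python
import random
import sqlite3

# Hypothetical pools of obviously fictional values.
FIRST_NAMES = ["Ada", "Grace", "Linus", "Margaret"]
LAST_NAMES = ["Example", "Sample", "Testcase", "Placeholder"]


def generate_customers(conn: sqlite3.Connection, count: int, seed: int = 0) -> None:
    """Insert `count` fictional customers; the seed makes runs repeatable."""
    rng = random.Random(seed)
    for i in range(count):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        # The row index keeps the generated email addresses unique.
        email = f"{first}.{last}.{i}@example.com".lower()
        balance = round(rng.uniform(0, 1000), 2)
        conn.execute(
            "INSERT INTO customers (name, email, balance) VALUES (?, ?, ?)",
            (f"{first} {last}", email, balance),
        )
    conn.commit()


# Start from an empty database containing only the schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers "
    "(id INTEGER PRIMARY KEY, name TEXT, email TEXT, balance REAL)"
)

generate_customers(conn, count=5)
generated_rows = conn.execute("SELECT name, email, balance FROM customers").fetchall()
```

No production value ever enters the database, but the generator has to be told what realistic data looks like, so the schema and its rules still need to be analysed up front.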

Pros and Cons of Masked Data

Pros

Cons

Pros and Cons of Generated Data

Pros

Cons

Which to use?

If you need to be certain that your test database contains no confidential data at all, then generation is the best option.

I have found that the process of masking can often be complicated, time-consuming and flaky. When a very long-running masking process fails overnight, it is extremely frustrating to come in the next day and find that your data is not available.

Most people automatically assume that masking is the simplest solution and will give them the best representation of their production data. However, depending on the complexity of your database, effective masking can be extremely difficult to achieve. In those cases, generated data is often the more viable solution, even though it requires a similar level of analysis. Generated data can also be easier to maintain going forward if you are using the right tool.

In both cases, I strongly recommend that you do not attempt to use hand-crafted scripts. These quickly become unwieldy and difficult to maintain. There are plenty of tools out there covering a whole range of functionality and pricing.

Caveat: These are my own personal views, based on my experience of masking and generating data, and shouldn't be taken as gospel!