Masked Data vs Generated Test Data

Summary

What are the pros and cons of using Masked Production Data vs Generated Test Data. This article goes into a bit of detail on both.

Definitions

Masked / Scrubbed Data

Masked (sometimes called scrubbed) data is a copy of your production database that is processed to anonymise any sensitive data, such as user data, personally identifying data, financial data, etc.

Generated Data

Generated data takes an empty database and fills it with completely fictional data.

Pros and Cons of Masked Data

Pros

The volume of data exactly matches your production environment, so it is easy to replicate performance testing with the same quantity of data.
The referential integrity of the data is relatively easy to maintain.
Third party applications can be masked, even if you don’t have a full understanding of the complex schema.

Cons

A copy of the production database is required somewhere to perform the masking.
A lot of analysis may be required to determine which data to mask and with rules to make it useful for testing.
It’s possible to miss some rules and accidentally expose confidential data. It needs constant management, especially when new schema is added.
Large databases may need sub-setting to make them usable in dev / test environments.
Masking of large databases is extremely time consuming and can take several hours. It is not something that can be run on demand every time a copy is required.
A copy data management solution (e.g. Delphix, Actifio, etc) is often required to make the solution viable and enable quick mounting of databases in dev and test environments.
It is very difficult to add more data to a masked set if you want to test the limits of your application.
Masking needs to be consistent over time otherwise it makes it very difficult for testers to use. In other words, every time you mask John Smith you want it to always be Fred Bloggs.
Consistency across databases is often required, if one application feeds another. This can make the masking sets dependent and therefore very complex to maintain.
Masking can be reverse engineered in some cases or a determined person may be able to find patterns that give away information in ways you may not expect.
Complicated schema may be difficult to mask, especially if confidential data is part of a foreign key.
De-normalised data and data stored as XML or JSON is extremely difficult to mask effectively.
Masked data does not test edge cases that don’t exist in production data.
It is difficult to blow away and reset Masked data during a test run.
Schema and data changes can result in expected failure of the masking set.

Pros and Cons of Generated Data

Pros

There is no need to ever touch the production database.
Generated data is guaranteed to not contain and confidential information (unless you do something crazy!).
Generated data can easily be scaled up and down, so you can have more data than production to stress test your app or a small amount on a developers PC.
Generated data is relatively easy to make consistent over time and accross applications.
Generated data can test edge cases that don’t exist in production (yet).
Developers / /Testers can take responsibility for ongoing maintenance of generated test data as an when new columns / features are added.

Cons

A lot of analysis is required to determine what data to put in each field.
Depending on the tool, it can take time to generate the test data.
It may still be a requirement to manage test data with a copy data management solution to make it workable in a fast paced Continuous Delivery environment.
Generating data for a third party application may be very difficult unless they can supply you with a data set.

Which to use?

If you really want to ensure that your data is completely free from confidential data, then generation is the best option.

I have found that the process of masking can often be complicated, time consuming and flaky. When a very long running masking process fails over night, it can be extremely frustrating when you come in the next day and find you don’t have your data available.

Most people would automatically assume that masking is the simplest solution and will give them the best representation of their production data, however depending on the complexity of your database, effective masking can be extremely difficult to achieve. In those cases, despite a similar level of analysis required for Generated data, it is often a more viable solution to the problem. Generated data can also be easier to maintain going forward if you are using the right tool.

In both cases, I strongly recommend you do not attempt to use hand crafted scripts. These quickly become unwieldy and difficult to maintain. There are plenty of tools out there covering a whole range of functionality and pricing.

Caveat: These are my own personal views based on my experience of Masking and Generation of data and shouldn’t be taken as gospel!

Back to Home