Thursday, 2 January 2020

The data elements of successful data migration


Introduction


This post describes the data elements if successful data migration. It follows from the introductory post Four elements of a successful data migration.

The data element can be subdivided into four components to help us categorise the issues we must address and information we must find to fully understand the data we are migrating.
  • Quality
  • Location
  • Volume
  • Security
We can represent these four components using a pyramid.


Quality


The quality of the data being migrated is critical to the success of the migration because it determines whether or not the data is accepted by the data owner.

We must be sure that the migration process does not cause the quality of the data to deteriorate. To assess this we must understand the quality of the source data. This can be done by working with the client to determine the key metrics for the data quality. it is always necessary to involve the data owner in this exercise because we should ensure the metrics align with their goals and priorities for the data.


Quality Metrics


Here are some examples of data quality metrics that might be useful.
  • Total count of records.
  • Total count of contact records.
  • Count of Contact records missing a last name value.
  • Count of Contact records missing an email address.
  • Percentage of Contacts missing a last name value.
  • Percentage of Contacts missing an email address.
  • Sum of all Opportunity amounts.
  • Sum of opportunity amount by Owner.
  • Sum of Opportunity amount by Account.


Do not try to clean or improve the data quality during the migration


This might sound odd or even lazy but there is a good reason why we should avoid cleaning the data or removing duplicates from the data during the migration. We must provide evidence through our testing that we have migrated all the data that is in the scope of the migration. If we can’t prove this it may be that we have lost some records in transit. If we implement a transformation to remove duplicate records it becomes very difficult to determine if a discrepancy in row count between source and target is due to a legitimate merge of duplicates or an error in the process. Likewise if we are cleaning fields during the migration it becomes very difficult to determine whether

One of our goals should be to make it as easy as possible to prove that the migration has completed correctly.


Location


We should clearly understand from where we will get the data for the migration. This could range from a set of CSV files to a cloud system or API. Knowing the source of our data we then must decide on where it will reside during the migration. The two main options here are either an on premise location or in the cloud.

The answer to this question will depend on,
  • The tools we have chosen to use to move the data
  • The target system
  • The security requirements of the migration

There are some guidelines we can employ to inform our decision.
  • Treat production PII data as if it were toxic waste.
  • Limit the places you store it.
  • Limit who can access it.
  • Securely dispose of it when it is no longer needed.
A staging database or filestore can be a very useful tool in the migration process. These can be on premise or in the cloud.

If the client already uses databases on premise ask the client to spin up a database for you in their infrastructure. This has the benefit of keeping the data within the client's boundary of control.

Using a client owned on premise filestore for staging the migration data in flat files keeps the data within the clients boundary of control.

If the client does not have any existing databases or filestores the data could be staged in the cloud. The client will need to be comfortable with this approach and it may be a tougher sell than the on premise option. The considerations for a cloud staging database are,
  • Does the client already use a cloud platform that can be reused?
  • Does the client need to purchase access to a new platform?


Volume


The volume of data to migrate will inform our choice of process and technology.

The list below shows the appropriate choices for increasing volumes of data.


Up to Hundreds of Records

Manual migration by direct entry of data through the UI or use of the UI import wizards


Thousands of Records

UI import wizards if the data has only a few links to other objects and requires no transformation. Dataloader.io or Salesforce Dataloader if the data has many links to other objects mapped via external IDs. A manually managed process could be acceptable but an automated one would be preferred.


Tens of Thousands of Records

Salesforce Dataloader (Dataloader.io may require a fee at this volume level), Talend, Pentaho, an automatic and repeatable process is required to make this manageable.


Hundreds of Thousands or Millions of Records upwards

Talend, Pentaho, Mulesoft, using parallel threads where possible, an automatic repeatable process is a necessity.


Target and Source system capacity and limitations

It will also be necessary to validate that the target system can accommodate the expected data volume. With knowledge of the volume and chosen process technology we must then test the speed of the migration to determine the elapsed time that it will take so we can plan the migration. We need to determine whether the volume of records can be loaded within our migration window. If we predict that it will not fit we may then change our process from a single migration to a staged migration across a number of days or weeks.


Security


Security sits at the top of the pyramid because it is the overriding concern for the migration. We must ensure that the data remains secure during the migration process and the consequences if the data is not transferred securely can be severe.

To address this we need to understand the security classification of the data and be clear about who will have access to the data as it is migrated. Does the data contain personally identifiable information? Does the data contain sensitive information? What would be the impact of a loss of the data during the migration?

For example if the migration process is developed by a third party or consultancy, it may be prudent that they should not have access to the production data for the migration and so the process would be handed over to an internal staff member to execute.

We must also consider the legal obligations placed on us by the relevant legislation e.g. GDPR and the determination of who is the data processor for the migration.


Summary


There are four primary considerations relating to the data of a migration. They can be represented by a pyramid where the overriding consideration is the security of the data. The quality of the data will determine the usefulness of the data once migrated. The location and volume of data will determine what tools we may use for the migration process.

























Building a Virtual Consultancy

The COVID-19 pandemic has forced many organisations to embrace remote working. However for many of us in consulting working remotely is noth...