Surprisingly, in a world where information is supposedly ‘global’, there is a growing need to consolidate data from different sources in which the people, products, or places involved are not identified consistently. Typically the aim is to create the ‘Single Customer View’, with all the data on one customer in one place, but the same problem arises for products or services that carry different identifiers.

Identifying products is simply a question of ‘language’. As long as you have a conversion table (‘Bread’ in English is ‘Pain’ in French) and each data source identifies the language that it is using, there is no problem.
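As a rough illustration of the conversion-table idea, here is a minimal Python sketch. The table contents, the product IDs, and the function name `canonical_product` are invented for the example; a real table would come from whatever product catalogue you already hold.

```python
# Hypothetical conversion table: (language, local name) -> shared product ID.
CONVERSION = {
    ("en", "bread"): "PROD-001",
    ("fr", "pain"): "PROD-001",
    ("en", "milk"): "PROD-002",
    ("fr", "lait"): "PROD-002",
}


def canonical_product(language, name):
    """Return the shared product ID, or None if the table has no entry."""
    return CONVERSION.get((language.strip().lower(), name.strip().lower()))


# Both sources resolve to the same product once each declares its language.
english = canonical_product("en", "Bread")
french = canonical_product("fr", "Pain")
```

The lookup only works because each source states which ‘language’ it uses; without that, ‘Pain’ would be ambiguous.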

But if Mr J Smith with email address jsmith@gmail.com buys online, and John Smith with credit card 3333 4444 5555 6666 buys in the shop, how do we detect whether these are the same person?

By using a hierarchy of clues we can make a good effort. The starting point is to look for potential common factors: email address, credit card number (store this in hashed form!), physical address. This will find the majority of your matches, and will probably give enough for reliable reporting and segmentation of 90% or more of your customers. Further matching could be done by identifying ‘potentials’ (candidate customers) for any unknown sale, then exporting those potentials in a batch so that a human can try to decide between them. This might push the identification rate up to 95% or more.
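The first, automatic pass over the common factors could look something like the sketch below. It is only an illustration under stated assumptions: the record layout, the field names (`email`, `card_token`, `address`), the clue order, and the functions `card_token`, `clue_keys`, and `match_sale` are all invented here. It uses a keyed hash (HMAC) rather than a bare hash for the card number, because card numbers are short enough that an unkeyed hash can be guessed by brute force.

```python
import hashlib
import hmac

# Assumption: a secret kept outside the customer database.
SECRET_KEY = b"replace-with-a-real-secret"


def card_token(card_number):
    """Keyed hash of the card number, so the raw number is never stored."""
    digits = "".join(ch for ch in card_number if ch.isdigit())
    return hmac.new(SECRET_KEY, digits.encode(), hashlib.sha256).hexdigest()


def clue_keys(record):
    """Yield (clue type, normalised value) pairs in the assumed clue order."""
    if record.get("email"):
        yield ("email", record["email"].strip().lower())
    if record.get("card_token"):
        yield ("card", record["card_token"])
    if record.get("address"):
        yield ("address", " ".join(record["address"].lower().split()))


def match_sale(sale, customers):
    """Return (customer_id, clue type) for the first matching clue, else (None, None)."""
    index = {}
    for customer in customers:
        for key in clue_keys(customer):
            index.setdefault(key, customer["id"])
    for key in clue_keys(sale):
        if key in index:
            return index[key], key[0]
    return None, None


# Hypothetical data: the online customer has previously paid with this card.
customers = [{
    "id": "C1",
    "email": "jsmith@gmail.com",
    "card_token": card_token("3333 4444 5555 6666"),
}]
shop_sale = {"name": "John Smith", "card_token": card_token("3333444455556666")}
result = match_sale(shop_sale, customers)
```

Sales for which `match_sale` returns `(None, None)` are the ones that would go into the batch of potentials for human review.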

Even then, there will be some customers who cannot be identified. Suppose that J Smith buys from the shop using a new card, then wishes to return the purchase via the website. He will log onto the website, and the website will not know that he has bought anything. However, even this can be solved relatively simply by asking him for the number on his receipt, which then makes the identification absolute.
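The receipt step can be sketched in a few lines of Python. The names here (`link_by_receipt`, `sales_by_receipt`, the `customer_id` field, and the receipt number format) are hypothetical; the point is only that the receipt number acts as an exact key joining the shop sale to the logged-in web account.

```python
def link_by_receipt(receipt_number, account_id, sales_by_receipt):
    """Attach a shop sale to the logged-in web account via its receipt number.

    Returns True if the sale is now linked to account_id, False if the
    receipt is unknown or already belongs to a different customer.
    """
    key = receipt_number.strip().upper()
    sale = sales_by_receipt.get(key)
    if sale is None:
        return False  # unknown receipt: change nothing
    if sale.get("customer_id") not in (None, account_id):
        return False  # already linked to someone else: do not overwrite
    sale["customer_id"] = account_id
    return True


# Hypothetical shop sale that the earlier matching could not identify.
sales_by_receipt = {"R-1001": {"total": 12.50, "customer_id": None}}
linked = link_by_receipt(" r-1001 ", "web-42", sales_by_receipt)
```

Refusing to overwrite an existing link is a cautious design choice for the sketch: a conflicting claim is better passed to a human than resolved silently.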

So it really is not such a big problem after all…


