When a business owner approaches me to set up a new ETL pipeline for their emerging dataset, I always ask one question: “What’s the unique identifier of this dataset?” This question gets paraphrased according to the business owner’s knowledge and understanding on data and database, and often times I help them identify a candidate (either a single-column primary key, or a multi-column composite key).

What I don’t do is I never take no for an answer.

The statement that “This dataset has no unique identifier” invites trouble1 — ETL pipelines exist to bring order into the chaotic data from various sources. If they are adamant about the data being without a unique identifier, follow up with a few more questions:

  1. How are you gonna tell if the data have duplicate in the future? How do you ensure its quality down the road?
  2. Why do you need this dataset to begin with, given that you can’t ensure its quality?

Chances are you will find that they have absolutely no clue themselves. Kindly ask them to come back later.

  1. Especially with our archaic infrastructure that gets virtually zero IT support. We have stiches of solutions that depend on Perl, Python 2, Python 3, R, VBA, JavaScript, R, C#, .NET, Ruby on Rail, Oracle SQL, SQL Server, and MS Access, to name a few.