ID assignment in Big Data Projects

Many out there have built (Big) Data Platforms, or claim to have done so.

Some have won awards for their endeavors, some have built something even more amazing! Data products are spawning everywhere, every day.

Regardless of the case, everyone designing and implementing such platforms has to deal with an aspect of great importance: ID assignment.

Now let’s take a step back; this may seem like a trivial task for those out there dealing with relational DBs or rather traditional architectures and platforms.

A simple object = new Object() or findById may suffice.

  • But what about complex platforms?
  • Data platforms receiving data from various sources?
  • Data stored in any kind of storage engines or scraped one-off from another platform?

How can one be certain that everything will be correctly matched throughout a pipeline?

The answer is one and only one:

Special care should be taken with ID assignment!

Many practices may be applied to tackle this problem:

  • applying a hash function over crawled/scraped URLs,
  • some kind of internal identification process,
  • attempting to identify/extract each source’s unique identification method (everyone has one!).

Let’s review each of the above.

Hash Function over Crawled URLs

This is a somewhat safe approach; URLs are unique throughout the web, so chances are that a hash function on top of them will prove successful.

It does not, however, come without drawbacks.

It is not uncommon for a website’s URLs to be generated from the title of the content. The title is the piece of text carrying the most important information about the content, and the most SEO-friendly one!

So what about updates to the titles? These can lead to updates to the URL as well, and a new URL yields a new hash. So even though this is a rather straightforward choice, special care should be taken with such updates in order to avoid duplicates.
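As a minimal sketch of the idea, the snippet below hashes a lightly normalized URL with SHA-256. The function name url_id and the normalization rules (lower-casing scheme and host, dropping the fragment and a trailing slash) are our assumptions for illustration, not a prescribed recipe.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def url_id(url: str) -> str:
    """Return a stable hex ID for a URL by hashing a normalized form.

    Hypothetical helper: the normalization rules are illustrative assumptions.
    """
    parts = urlsplit(url.strip())
    # Lower-case scheme and host, drop the fragment and any trailing slash,
    # so trivially different spellings of one URL hash to the same ID.
    path = parts.path.rstrip("/") or "/"
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Note that normalization only absorbs cosmetic differences. If a title change rewrites the URL path, the hash changes too, so such updates still have to be detected separately.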

Internal Identification Process

Time for another approach: an internal identification process. This can be implemented either by deploying an API endpoint responsible for assigning an ID to each collected resource (if your architecture follows the microservice model), or as a simple method, function or bash script if you follow a monolithic approach.

The method suggested above has some very important pros, the most important being its black-box way of working. Once it has been perfected, you no longer have to worry about duplicates in your platform or about assigning the same ID to two different resources. Not bad at all!

Are there any drawbacks? Of course there are! First and foremost, time should be spent perfecting such a mechanism. We cannot stress enough the importance of ID assignment in (Big) Data Projects/Platforms, so you should definitely allot many hours (or story points) to such a task, since it will be the backbone of pretty much everything you build!

Another drawback we should point out is the rationale behind the identification process. Basing it solely on the collected content can lead to duplicates, as described in the previous case. A more complex process involving various factors (possibly differing per collected source) may prove more suitable.
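To make the black-box idea concrete, here is a minimal in-memory sketch. The class name IdRegistry and the choice of keying on a (source, source_key) pair are hypothetical; a real service would persist its state and handle concurrent callers.

```python
import itertools
from typing import Dict, Tuple


class IdRegistry:
    """Hypothetical in-memory ID service; a real one would persist its state."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, str], int] = {}
        self._counter = itertools.count(1)

    def assign(self, source: str, source_key: str) -> int:
        """Return the ID for (source, source_key), creating it on first sight."""
        key = (source, source_key)
        if key not in self._ids:
            self._ids[key] = next(self._counter)
        return self._ids[key]
```

Keying on the source as well as the key means the same key coming from two different sources never collides, while repeated sightings of one resource always get the same ID back.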

Remote Source Identification

Let’s turn our attention to the most challenging choice available: attempting to identify the remote source’s own ID assignment method.

Why is it the most challenging? Because it requires knowledge of the remote source’s tech stack. Although one may think this trivial when the collected data is in XLS or CSV format, where identification is rather straightforward, what if a CMS is employed?

Knowledge of the CMS is needed if one wants to assign a unique ID that avoids duplicates. For instance, Drupal assigns a unique ID to each piece of content, always present in meta tags and by default in the CSS classes of article tags.
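As a rough sketch of what such extraction could look like, the snippet below scans a page for a shortlink element whose href ends in /node/<id>. That this markup is present is an assumption on our side: it depends on the Drupal version and the site's configuration, so inspect the actual pages first. The names extract_node_id and _ShortlinkFinder are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Optional


class _ShortlinkFinder(HTMLParser):
    """Collects the node ID from the first <link rel="shortlink"> it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.node_id: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.node_id is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("rel") or "").lower() == "shortlink":
            # Assumed href shape: .../node/<digits>
            match = re.search(r"/node/(\d+)", attributes.get("href") or "")
            if match:
                self.node_id = match.group(1)


def extract_node_id(html: str) -> Optional[str]:
    """Return the remote node ID as a string, or None if it is not found."""
    finder = _ShortlinkFinder()
    finder.feed(html)
    return finder.node_id
```

Returning None rather than raising lets the caller fall back to another identification method when the markup is missing.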

However, not everything is a drawback with this method! If it is employed correctly, one should never have to worry about ID assignment; or almost never. Care need only be taken when a major migration takes place on the remote source’s side, a rather infrequent case.

This concludes our analysis of the various methods that can be employed as far as ID assignment in (Big) Data Platforms is concerned.

All of the above have pros and cons, as is the case with everything out there. As with every choice one has to make, you should weigh these pros and cons.

Our suggested approach?

Apply some kind of hybrid approach, taking advantage of the pros of various methods. It is what we have deployed so far in our platform, and it seems to work well.
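One possible shape of such a hybrid, purely as an illustration and not a description of our actual implementation: prefer the remote source's own ID when it is known, and fall back to a URL hash otherwise. The function name resource_id and the "source:id" / "url:hash" formats are hypothetical.

```python
import hashlib
from typing import Optional


def resource_id(source: str, url: str, remote_id: Optional[str] = None) -> str:
    """Build an ID from the remote ID if available, else from the URL hash."""
    if remote_id:
        # The remote ID survives title and URL changes on the source's side.
        return f"{source}:{remote_id}"
    # Fallback: hash the URL (hypothetical choice of SHA-256).
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return f"url:{digest}"
```

With a remote ID available, two URLs pointing at the same piece of content resolve to one ID; without it, the URL-hash caveats described earlier still apply.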

The most important note?

Know your data and your sources.

Keeping such knowledge in mind can prove crucial when choosing a unique identification method!

Data, data, data! Loves providing data-powered solutions to sectors varying from media and financial institutions to the food industry.