Automating Data Engineering Tasks - A Game-Changer Mentality

Mihalis Papakonstantinou
3 min read · Nov 30, 2020

For every data engineer out there (and this most probably extends to other domains as well), there are two categories your tasks fall under:

  • tasks that are challenging, and therefore interesting, and
  • tasks that are somewhat trivial.
Photo by Victor He on Unsplash

All of us favour the first set. Every professional in the field (in every field!) wants to constantly challenge themselves in order to evolve on a personal and professional level.

However, the second category is equally important.

To start with, every task starts in the first category and ends up in the second. Every single task one handles for the first time is a challenging one.

One needs to perform it over and over again for it to fall into the trivial category.

Have you ever considered how much of your time is dedicated, on a monthly basis, to trivial tasks?

Hopefully by now you are starting to think about it.

And you are led to the same conclusion: “Too much!”.

How many times did you have to:

  • create a new ETL workflow?
  • perform the same operations?
  • call the same endpoints?
  • save results to a file system directory or to a storage engine?
  • move data from one server to another?

And there are many more questions where that came from!

And many ways to tackle each task.

You can always open Sublime, IntelliJ, PyCharm or whatever editor you favour and create a new project.

Nothing compares with the excitement of a blank page!

This is true only for tasks never handled before, though. For all the rest, you mainly copy-paste stuff from previous projects or Stack Overflow.

And although Stack Overflow will (hopefully) always be there for you to copy-paste from, turning to previously implemented projects too often may mean there is room for automation to take place!

  • Why copy-paste the same code over and over again when you can create a new API endpoint to handle your requests?
  • Why manually trigger a workflow when ETL pipelines are out there?
  • Why upload stuff over (S)FTP with FileZilla when you can write a script to do that? (A minimal sketch follows this list.)
  • Why keep calendar reminders for backup ops when one or more scripts can take care of that?
  • Why manually add cronjobs when you can automatically generate a DAG file and put it in your Apache Airflow instance?
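
To make the scripted-upload point concrete, here is a minimal sketch of what such a script could look like, assuming Python with the paramiko library; the host, credentials and paths are placeholders, not a prescription:

```python
# Sketch: replace manual FileZilla sessions with a repeatable SFTP script.
# Host, user, key and paths below are placeholders - adapt to your setup.
import os
import paramiko

def sftp_upload(host, user, key_path, local_files, remote_dir):
    """Upload a batch of local files to a remote directory over SFTP."""
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(hostname=host, username=user,
                key_filename=os.path.expanduser(key_path))
    sftp = ssh.open_sftp()
    try:
        for local_path in local_files:
            remote_path = f"{remote_dir}/{os.path.basename(local_path)}"
            sftp.put(local_path, remote_path)
            print(f"uploaded {local_path} -> {remote_path}")
    finally:
        sftp.close()
        ssh.close()

if __name__ == "__main__":
    sftp_upload("example.org", "deploy", "~/.ssh/id_rsa",
                ["exports/daily_dump.csv"], "/data/incoming")
```

Wire it into a cronjob or a pipeline step and the manual upload session disappears for good.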

And although the list above could most probably go on forever, there is a key takeaway message here:

This can be fully automated!

  • A new project is always fun to build and can help pack the same ops under one roof.
  • Cronjobs and crontabs are there for you; all you need to do is create the script and pick the desired frequency (or go one step further and generate an Airflow DAG, as sketched after this list).
  • Scripts that upload files and directories are easy to write and take manual missteps out of the loop.
  • You may have more than 10 different storage engines; taking the time to write a backup script for each new instance means you never have to worry about it again, and that is no small thing!
  • ETL workflow frameworks can make your life way (way) easier, so why not employ them?
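
Tying the last few points together, here is a sketch of the DAG-generation idea: a tiny script that renders a backup DAG from a string template and drops it into Airflow's dags/ folder. The template, schedule, backup command and paths are assumptions (and the import path follows Airflow 2.x), so treat it as a starting point rather than a recipe:

```python
# Sketch: generate an Airflow DAG file per backup job instead of
# hand-editing crontabs. Paths and the backup command are placeholders.
from pathlib import Path

DAG_TEMPLATE = '''
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="{dag_id}",
    schedule_interval="{schedule}",
    start_date=datetime(2020, 11, 1),
    catchup=False,
) as dag:
    BashOperator(task_id="run_backup", bash_command="{command}")
'''

def generate_backup_dag(dag_id, schedule, command,
                        dags_dir="/opt/airflow/dags"):
    """Render the template and write it where Airflow discovers DAGs."""
    dag_source = DAG_TEMPLATE.format(
        dag_id=dag_id, schedule=schedule, command=command)
    Path(dags_dir, f"{dag_id}.py").write_text(dag_source)

if __name__ == "__main__":
    # One generated DAG per storage engine: add an instance, rerun this,
    # and its backups are scheduled - no crontab editing involved.
    generate_backup_dag(
        dag_id="backup_postgres_main",
        schedule="0 3 * * *",
        command="pg_dump -Fc main_db > /backups/main_db.dump",
    )
```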

If a project manager is reading, now is the time to stop! Because the above suggestions take time. They may not fit under a strict deadline, but attempting to automate your work as much as possible can save way more time in the long run!

It is something that is easier said than done!

Why?

Because it does not have anything to do with an individual's technical skills.

It is a matter of mentality!

If one always keeps the reusability and automation of processes in mind, the result can be amazing.

Both in terms of code quality and component reusability, and in terms of robustness and fewer bugs.


Mihalis Papakonstantinou

Data, data, data! Loves providing data-powered solutions to sectors ranging from media and financial institutions to the food industry.