Uptime Monitoring in Data Platforms
As we move deeper into the data era, "data platform" is a term that each and every one of us will hear more and more often.
From a data engineer’s point of view, building one is a challenging task. Choosing the right tool for each job, deploying it, and seeing it in action is a great feeling on its own.
What about uptime though?
You have built the best data platform out there, capable of handling huge amounts of data: harmonising it, linking it together, and enhancing it with ML/DL or any other technique available.
You are proud of yourself!
However, what if:
- a node in your Elasticsearch instance stops working?
- your MongoDB deployment consumes too much memory?
- the awesome model you have designed and implemented takes up too much CPU/GPU?
- the Apache Airflow webserver and/or scheduler stops working?
Depending on your setup there are many more questions where those came from, but you get the point.
One needs to ensure uptime for every component deployed to the stack.
How to ensure uptime in data platforms?
To this crucial question, all I can offer is an answer based on my experience building and maintaining a data platform that handles almost a billion data records of various types.
Our infrastructure follows a microservice architecture: every component is wrapped with a set of API endpoints, accessible through various ports on our servers.
Every newly deployed project, API service, or storage engine in the infrastructure should come with a check-up script.
What does this mean? Every time a new framework, tool, storage engine, or project is deployed, a check-up script is also set up as a cronjob.
As an example, consider such a script checking the uptime of an Elasticsearch instance running on port 9200:
- we use the command-line tool nmap to check the availability of the port,
- if the port is found to be closed, an email containing the last 50 lines of the respective service’s log is sent to a designated mail account,
- the service is stopped (should it not be already),
- the system’s cache is cleared, and
- the service is restarted.
And this script is executed every 2 minutes within our architecture, for every single component deployed.
This ensures that every component in our infrastructure will always be up and running, unless something truly unexpected happens!
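Such a check-up script might be sketched roughly as follows. This is a minimal sketch, not our exact implementation: the service name, log path, mail address, and script location are hypothetical example values.

```shell
#!/usr/bin/env bash
# Hypothetical check-up script for an Elasticsearch instance on port 9200.
PORT=9200
SERVICE=elasticsearch
LOG=/var/log/elasticsearch/elasticsearch.log
ALERT_MAIL=ops@example.com

port_is_open() {
  # nmap reports e.g. "9200/tcp open http" when the port accepts connections
  nmap -p "$1" localhost 2>/dev/null | grep -q "^$1/tcp[[:space:]]*open"
}

check_and_restart() {
  if ! port_is_open "$PORT"; then
    # alert with the last 50 log lines, then stop, clear caches, restart
    tail -n 50 "$LOG" | mail -s "$SERVICE down on $(hostname)" "$ALERT_MAIL"
    systemctl stop "$SERVICE"                  # in case it is half-alive
    sync && echo 3 > /proc/sys/vm/drop_caches  # clear the system cache (needs root)
    systemctl start "$SERVICE"
  fi
}

# The cron entry invokes the script with --run every 2 minutes:
#   */2 * * * * /opt/monitoring/check_es.sh --run
if [[ "${1:-}" == "--run" ]]; then
  check_and_restart
fi
```

The `--run` guard simply keeps the script side-effect free when sourced, which makes the port check easy to test in isolation.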
Not all problems come from the infrastructure
Many problems may come not from the infrastructure but from the implementation and wrapper code. One has to ensure that such cases are also monitored!
Thankfully, Elastic provides tools to that end: APM is out there and easy to integrate!
Let’s dive into a specific example of such an integration in a Spring Data project!
First, one has to download and install/run an APM server instance:
- head over to https://www.elastic.co/downloads/apm and get the latest version available,
- untar the file,
- edit the apm-server.yml file and set the desired configuration parameters; in this example we will set:
  - the port,
  - the connection to our dedicated monitoring Elasticsearch instance, and
  - the connection to the respective Kibana instance for visualization purposes,
- execute the respective binary.
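The relevant part of apm-server.yml might look like the fragment below. The hostnames are example values for illustration; 8200 is the APM Server’s default port.

```yaml
# apm-server.yml (example values)
apm-server:
  host: "0.0.0.0:8200"

# ship APM data to the dedicated monitoring Elasticsearch instance
output.elasticsearch:
  hosts: ["monitoring-es:9200"]

# point to the Kibana instance used for visualization
setup.kibana:
  host: "kibana:5601"
```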
Perfect, now our APM server is up and running! Let’s connect our Spring Data project to start sending over data.
To do that we should:
- include the required dependencies in our project,
- create a properties file with the connection parameters,
- attach the ElasticAPM to our project.
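Concretely, the last two steps might look as follows, assuming Elastic’s apm-agent-attach dependency; the service name, server URL, and package name are hypothetical:

```shell
# 1. Dependency (Maven coordinates): co.elastic.apm:apm-agent-attach
# 2. Connection parameters, read by the agent from the classpath:
mkdir -p src/main/resources
cat > src/main/resources/elasticapm.properties <<'EOF'
service_name=my-spring-data-api
server_url=http://apm-server:8200
application_packages=com.example
EOF
# 3. Attach the agent as the first line of the application's main method:
#      ElasticApmAttacher.attach();
```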
One more step is left and we are ready to investigate possible performance issues! This last step applies only to setups involving Tomcat (like our own).
One needs to link Elastic APM’s jar file to the Tomcat installation through an environment variable. To do this, edit the setenv.sh file located in the bin directory of Tomcat’s installation directory.
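A setenv.sh along these lines would do it; the agent jar path and the elastic.apm settings shown are example values that must match your own installation and project:

```shell
# bin/setenv.sh : load the Elastic APM agent into Tomcat's JVM
export CATALINA_OPTS="$CATALINA_OPTS -javaagent:/opt/elastic/elastic-apm-agent.jar"
export CATALINA_OPTS="$CATALINA_OPTS -Delastic.apm.service_name=my-spring-data-api"
export CATALINA_OPTS="$CATALINA_OPTS -Delastic.apm.server_url=http://apm-server:8200"
export CATALINA_OPTS="$CATALINA_OPTS -Delastic.apm.application_packages=com.example"
```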
Perfect! If everything was set up correctly, it is time to visualize and investigate the results. And what better place to do so than our dedicated Kibana instance!
And this is only the highest-level view of the analysis; you can dive way deeper, even down to the method level.
Using such a framework, we have been able to identify where our code underperforms (or, rather, where it takes a long time to execute) and focus our work accordingly. No small thing!
The result of the above?
Our data infrastructure has, over the past 5 years, experienced a downtime of under 1 hour!
And this proves the importance of such integrations in your data platform.
Uptime monitoring frameworks and tools are critical and real time-savers!