So, we are building a new monitoring solution. Several solutions are already on the market: Nagios, Munin, Cacti and, more recently, Graphite are commonly used. James Turnbull's blog has the details of his survey from last year: https://www.kartar.net/posts/monitoring-survey---tools.markdown/ (this year's survey is still in progress, feel free to participate: https://www.kartar.net/posts/monitoring-survey.markdown/)
After years of using these classic tools, we felt that the monitoring world deserved better ones. There are some competitors on the SaaS market, but no complete Open Source solution has succeeded. Today, we believe that a monitoring solution should:
- be scalable. Monitoring tools consume a lot of resources: we keep history for a long time, and we monitor tons of servers, containers and applications.
- be highly available. If your monitoring tool is down, you don't know whether you are still in business or are serving errors to your customers.
- be a combination of measurements and alarms. Knowing that a MySQL server is running is no longer enough.
- be collaborative. Teams are bigger, so we need to share information through a tool rather than by sending emails saying "I broke the web server, don't worry".
- be integrated. This tool needs to be integrated with other tools: instant messaging, DNS, Load Balancers, etc.
So we have started a new project to implement our ideas and to build what we did not find on the market. In a nutshell:
- we want a smart agent. The agent is the central piece: it runs on the "measured" equipment and is in charge of gathering metrics and sharing the equipment's state with the world. Our monitoring platform will be its first-class consumer, but Load Balancers and Service Discovery tools should also be able to use the state exposed by the Bleemeo agent, instead of each tool reinventing its own way to get the server "status".
- we want secure and real-time communications. Pushing data over HTTP takes a lot of resources, so we use MQTT (a lightweight publish/subscribe protocol originally designed at IBM for telemetry, now an OASIS standard) for real-time communications. This lets us detect in real time when we have lost an agent. All communications leaving a datacenter are secured with SSL. Munin and Nagios both have a 5-minute resolution by default. Graphite made a 10-second resolution common, and we use the same frequency: you want to see your infrastructure in real time to detect issues as soon as possible.
- we believe that in these cloud days, there is value in offering a solution as a service. YOU run your business, WE run the monitoring solution: we know our software better than anyone else. We know how to scale it and how to make sure it is always available. We don't think that paying to monitor a server with monthly granularity is a cloud approach, so we propose that you pay for your monitoring per hour, because your servers and containers have short life cycles nowadays.
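To make the "lost agent" idea above a bit more concrete, here is a minimal sketch in Python. It is not our actual implementation: the `AgentTracker` class, the agent names and the "three missed reports" threshold are assumptions chosen for illustration, and no MQTT library is involved. It only shows the principle that, with a report every 10 seconds, a silent agent can be flagged within tens of seconds rather than minutes.

```python
from dataclasses import dataclass, field

REPORT_INTERVAL = 10.0  # seconds between reports, the resolution mentioned above
MISSED_REPORTS = 3      # assumption: flag an agent after three missed reports


@dataclass
class AgentTracker:
    """Remembers when each agent last reported (hypothetical helper)."""

    last_seen: dict[str, float] = field(default_factory=dict)

    def report(self, agent_id: str, now: float) -> None:
        # Called whenever a message from the agent arrives.
        self.last_seen[agent_id] = now

    def lost_agents(self, now: float) -> list[str]:
        # An agent counts as lost once it has been silent for
        # MISSED_REPORTS full reporting intervals.
        deadline = REPORT_INTERVAL * MISSED_REPORTS
        return sorted(
            agent
            for agent, seen in self.last_seen.items()
            if now - seen >= deadline
        )


tracker = AgentTracker()
tracker.report("web-1", now=0.0)
tracker.report("db-1", now=0.0)
tracker.report("web-1", now=10.0)
tracker.report("web-1", now=20.0)
print(tracker.lost_agents(now=30.0))  # ['db-1']
```

An MQTT broker can also signal a dropped connection by itself (through keep-alive timeouts and Last Will messages), which this sketch does not model.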
We will start a private beta of our solution in the coming weeks. Feel free to register to participate in the beta phase. In the meantime, feel free to contact us if you have any questions, or follow our blog, Twitter, LinkedIn, Facebook and Instagram!