One platform for logs, metrics, traces, profiles, and OpenTelemetry: Inside Booking.com's move to Grafana Cloud

2025-04-25 9 min

Booking.com is one of the world’s leading online travel platforms. Visitors to the site can use it to make reservations for flights, cars, and all kinds of accommodations. Offering variety to external customers who want to one-stop shop is part of what has made the company so successful. But that same strategy did not work when it came to Booking.com’s observability.

In a recent ObservabilityCON On the Road talk, two Booking.com team members (Murugesan Ramaiah, a solution architect, and Ahmadali Shafiee, a site reliability engineer) discussed the company’s two-year journey from multiple observability solutions to a single, unified platform with Grafana Cloud.

In doing so, the Booking.com team was able to use Grafana Cloud to observe all their metrics, logs, traces, and profiles in one place. They also set themselves up for future growth by using as-code practices to support automation and consistency, as well as OpenTelemetry to centralize their telemetry with vendor-agnostic pipelines.

“We wanted to put together a centralized telemetry pipeline that should be interoperable out of the box,” he explained. “[It] should be able to embrace an agnostic way to observe, but also able to ingest the telemetry to any of the back ends, [and] supports the centralized telemetric pipeline.” 

During their talk, the pair walked through the changes and shared the lessons they learned along the way.

Note: Booking.com’s session from ObservabilityCON on the Road is now available to watch on demand.

 

More is less

Before migrating to Grafana Cloud, Booking.com was using multiple observability solutions across different business units. The company had four platforms and a variety of solutions for notifications, UI alerting, observability storage, and telemetry. 

“The main problem was that the majority of our observability stack has been homegrown,” Shafiee said, “and with our new developments, we needed to invest more time and effort to also maintain and develop new solutions for that.” The way it was, he added, was not sustainable in terms of work and cost.

The company had an “enormous” number of telemetry agents, Ramaiah said, because it runs its services on-premises and in the cloud. It also runs a lot of observability backend systems for metrics, logs, and traces. The observability group decided to create a new vision and strategy that would be simplified, modernized, and future-proof.

And rather than different tools for different types of telemetry, they wondered if they could use a single system that could scale to meet all their needs. “We wanted to unify metrics, logs, traces, into a specific system that can scale for today, but also the future needs,” he added.

For the first time, the team also wanted to invest in profiling and real user monitoring. 

“Profiling can help us to solve the shift left approach in the software development lifecycle, which we never used to do,” Ramaiah said. The goal was to enable the engineering teams to bring in profiling solutions, instrument their applications, and collect the data, so they could look at application performance in pre-production during the software development lifecycle and make informed decisions.

As for real user monitoring, that was especially important given that Booking.com is an internet company. Understanding customers’ experiences and sentiments is very valuable, Ramaiah said, and the company’s legacy UI and alerting solution were “boring” and “non-intuitive.”  

Overall, the team wanted to find a platform that was vendor-agnostic, centralized, modernized, scalable, and cost-efficient. 

Moving towards OpenTelemetry

The company’s applications run on-prem and on multiple public clouds. It also has a machine learning platform, a Gen AI platform, and a data platform. 

As a technology-driven travel company, Ramaiah explained, Booking.com also had a lot of telemetry agents, including Telegraf and VictoriaMetrics. There were home-grown solutions, costly AppDynamics systems, and some open source systems, too.

“We run one of the biggest Graphite instances in the world,” he said. “I can say that we have about 800 million active series. That’s really huge. They’re running more than 300 bare metal machines where Graphite basically horizontally scaled.” 

Ramaiah said they were driven toward OpenTelemetry, because its strategy aligned with the company’s own strategy. “They are interoperable, able to scale for the booking needs, also able to collect, process, enrich the telemetry, and then put them inside any of the backends, so that we are not vendor locked-in.”

With OpenTelemetry, they were able to stop using AppDynamics (a significant cost saving) as well as Thanos.
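The collect, process, enrich, and export flow Ramaiah describes can be illustrated with a small sketch. This is not Booking.com's pipeline and it does not use the OpenTelemetry SDK or Collector; every name in it (Record, InMemoryExporter, enrich_with, Pipeline) is hypothetical. It only shows the general idea of keeping telemetry in a neutral shape so that the same processed data can be sent to any number of backends.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


@dataclass
class Record:
    """One telemetry item (metric, log, or trace) in a backend-neutral shape."""
    signal: str   # "metric", "log", or "trace"
    name: str
    value: float
    attributes: Dict[str, str] = field(default_factory=dict)


class Exporter(Protocol):
    """Anything that can receive processed records (hypothetical interface)."""
    def export(self, records: List[Record]) -> None: ...


class InMemoryExporter:
    """Stand-in for a real backend: it simply stores what it receives."""
    def __init__(self, backend_name: str) -> None:
        self.backend_name = backend_name
        self.received: List[Record] = []

    def export(self, records: List[Record]) -> None:
        self.received.extend(records)


Processor = Callable[[Record], Record]


def enrich_with(extra: Dict[str, str]) -> Processor:
    """Build a processor that adds attributes; the record's own values win."""
    def process(record: Record) -> Record:
        merged = {**extra, **record.attributes}
        return Record(record.signal, record.name, record.value, merged)
    return process


class Pipeline:
    """Runs every record through the processors, then fans out to all exporters."""
    def __init__(self, processors: List[Processor], exporters: List[Exporter]) -> None:
        self.processors = processors
        self.exporters = exporters

    def ingest(self, records: List[Record]) -> None:
        processed = []
        for record in records:
            for processor in self.processors:
                record = processor(record)
            processed.append(record)
        # The same processed batch goes to every backend, so adding or
        # swapping a backend does not touch collection or processing.
        for exporter in self.exporters:
            exporter.export(processed)
```

In this sketch, moving to a different backend means passing a different exporter to Pipeline; the instrumented applications and the enrichment step stay unchanged, which is the property the team describes as not being locked in to a vendor.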

Centralizing metrics with Mimir

Booking.com has worked closely with Grafana Labs to redefine its observability strategy. Ramaiah said that it’s an ideal partnership because his company wanted to work with a vendor that natively supports OpenTelemetry and has a cloud-based product. 

Although the team had relied on Graphite for block storage, it was difficult to scale. They decided to replace it in order to centralize all the metrics into a single solution: Grafana Mimir. They chose Mimir because it also supported block storage and allowed them to scale horizontally in the cloud.

“Today, we store about 85 million metrics on Mimir,” he said. “They are natively integrated with OpenTelemetry, and we were able to get rid of AppDynamics and also Thanos. All the metrics now come into Mimir.”

Screenshot of how Booking.com uses the LGTM stack and OpenTelemetry

Unifying logs and profiling

Another goal for Booking.com is to unify all the logs into one centralized system. The company says it has one of the biggest Elasticsearch deployments (more than 800 bare metal servers) and has already deprecated a number of clusters. Although Elasticsearch is still being used for its full text search capabilities, Booking.com has switched over to Grafana Cloud Logs, which is powered by Grafana Loki, for its observability logging. As with Mimir, Ramaiah said the draw was that it could scale horizontally and was built on the cloud, for the cloud.

The company has successfully introduced Grafana Cloud Profiles, which is powered by Grafana Pyroscope. He said it has “great benefits [and] great values in terms of ensuring and building a quality software for all of Booking’s customers. We were into a RUM [Real User Monitoring] before, but we were blindfolded in some places.” 

More than 50 percent of the company’s revenues come from its mobile app, so observing the end-to-end customer journey on the web as well as on mobile is important. With that, Ramaiah explained, they are able to “connect that frontend observability with the backend, so we can provide that end-to-end distributed tracing and are also able to measure the customer satisfaction, the customer sentiments, using RUM solutions.” 

He added that they are working with a Grafana partner for mobile observability and expect to have success stories, but it is “too early to disclose anything on RUM and Synthetic.” 

As in other areas, Booking.com was being dragged down by a lot of conflicting UI and alerting solutions. Today, when it comes to UI, they follow Grafana’s “big tent” philosophy, Ramaiah said, so the company’s code-to-user interface is now Grafana UI for alerting visualizations as well.

Lessons learned 

Since 2023, Shafiee said, Booking.com has seen “great success” onboarding customers to the company’s OpenTelemetry pipelines, which rely heavily on Grafana Labs products.

He said they have been focusing on observability as code to make sure that everything is as automated as possible, and they’ve seen a “huge interest” in profiling. They are evaluating different machine learning solutions to replace their longtime, homegrown anomaly detection services, too.

They have learned many lessons along the way:

Create a strategy from the bottom up, not from the top down. “We align with the team, align with the leadership, and make sure that the company vision and the observability vision co-exist,” Ramaiah said.

Technology isn’t the biggest obstacle: it’s humans. When the team decided to replace its still-working, decade-old legacy system with OpenTelemetry, it caused a lot of chaos among the people who had been using the old system and were forced to make a change. To help them manage, Ramaiah explained, “We were honest and we were humble, but we were also transparent to say, ‘Well, of course, the technology is going away, but not you.’” Assuring them that they were valuable was key. Giving them time to digest what was happening, asking them what tools they needed for their jobs, and then bringing them into the new vision and strategy ensured there was alignment.

Allow for teamwork. “When it comes to transformations, it’s not about one team who can own everything,” Ramaiah said. Although they wanted to build an observability platform, they wanted to allow the engineering teams to have autonomy as well. “We want the platform not the end-to-end journey, so there is a collaboration required.” Once again, they had to be transparent and proactive in communicating, and also distribute work across the company. One way they did that was by having observability champions based on specific interests. “We formed a group specific to each vertical that we support, and then we worked as a community, we worked as a team to contribute some of the open source products, like OpenTelemetry, for example, and that’s how we were able to see a massive adoptions.”

Understand that everything has a cost. “Nothing is free in this world, including the open source product that you use,” Ramaiah said. He said it was important to look at the total cost of ownership of a product and know how much is being spent and on what — and to also know if it’s possible to deprecate a legacy product while adding a new one. 

Offer opportunities for learning and education. One thing they did to support this was create a “doc-athon.” They invited the observability group and all of the observability champions to revamp the observability documentation so everything was up to date for customers and internal teams, which in turn can increase adoption.

Provide support. “Irrespective of how good the documentations are, your customers are still going to ask questions,” Ramaiah said. In addition to establishing a good support process, they also set up an incident management process and collected continuous feedback.

Communicate. Shafiee said they had been used to troubleshooting the issues they were seeing in Grafana, but over the past two years, they improved how they communicated with the Grafana Labs team about technical challenges. “We learned to be more persuasive,” he said. “We learned to be able to have the right communication channels to fix our problems.” 

Ramaiah ended the talk with an important reminder: When something happens, if your observability doesn’t help answer the four W’s (What? Why? Where? and When?), “it’s time to change the observability platform and the strategy, not the questions.”

