September 26, 2023

Sept. 11th 2023 – brevo.com domain name expiration

Reading time about 10 min
Domain expiration incident

Summary

The corporate domain name used to host the Brevo platform (brevo.com) expired on September 11th, 2023. As usual, during a domain expiration, the domain was disabled and redirected to an empty page unrelated to Brevo’s activity, resulting in progressive service degradation impacting our customers around the world.

The domain was renewed 6 hours after its expiration, and services were restored progressively thereafter.

Root causes

We have identified three main isolated causes, which, when linked together, led to the incident.

Registrar account misconfiguration

Earlier this year, we acquired the domain brevo.com as part of our rebranding from Sendinblue to Brevo. During the acquisition, we were legally required to keep the same registrar as the previous domain owner, which was not our usual one. We created a new account on this registrar and transferred domain ownership to it. As usual, we enabled the necessary settings including domain transfer lock, automatic renewal, and payment method.

Broken notifications

During the summer, we disabled the email address configured to receive domain name expiration notifications from this registrar without realizing its importance.

Payment method failure

The payment method used to renew this domain expired before the domain renewal.

Missing monitoring

We monitor the status of all our domain names automatically through an external service. However, brevo.com was not included in this list because the new registrar hasn’t been included to our domain auto-discovery pipeline.

Trigger

While the root causes created a situation propitious to create an outage, it was missing an event to trigger it. This event happened on September 11th, 2023 when the domain expired.

Resolution

We attempted to renew the domain as soon as we noticed its expiration. A process that usually only takes a dozen minutes. Unfortunately, we met some unexpected issues, which made things worse:

  • The payment method update triggered their anti-fraud process and flagged our account for manual review
  • Despite our attempts to resolve the issue through their 24/7 support team, they were unable to validate our account or bypass the manual process, which was handled by their compliance team, operating only from the US West Coast (PDT) business hours

As the situation was at a complete standstill, we considered the following actions in order to restore our domain as quickly as possible:

  • Transfer our domain to another registrar: Although technically possible, this operation would potentially take several days to complete.
  • We deployed our platform on a backup domain to recover the service. We initiated the change without knowing its duration because we had not configured our configuration management to operate in this manner. Fortunately, the domain came back before we finished this step.
  • Escalate the issue to the registrar executive team we successfully reached someone from the company who escalated internally.

As soon as we escalated the issue internally, support sent us optimistic messages, confirming that they had validated the account and initiated the renewal process for the domain.

A few minutes later, we effectively renewed the domain. Our service became progressively available in the following hours, depending on DNS propagation.

We lost brevo.com the domain for 5 hours and 53 minutes.

The duration of the incident for our customers is complex to calculate, as it depends on their DNS configuration.

Our monitoring probes reported an approximately 6h outage.

Outage as seen from an external probe

Impacts

During the outage, the whole platform continued to operate normally. In the background, already scheduled campaigns were processed, emails were still being delivered, and redirection links were working.

The main disruptions were in accessing our websites, our public API, and our SMTP Relay under the brevo.com domain. Applications that continued to use our API and SMTP Relay under the domain sendinblue.com were not impacted.

Lesson learned

The incident resulted from a combination of multiple factors and revealed some mandatory actions we will take in the coming days or weeks in order to improve our stability and our communication.

Status page isolation

During the incident, our incident public communication tool – the status page – was not available. We understand the critical role this component plays during these events, and will take steps to improve its isolation from our platform, at all layers:

  • Domain Name: We will host the status page on a dedicated domain name
  • DNS: We will host the DNS for the domain name zone with a different provider than the one we use on our platform.
  • Hosting: we will make sure our status page provider does not use the same hosting stack as ours

Domain name registrar uniformization

It does not make sense to keep multiple registrars unless there is no other solution for specific top-level domains. We already manually reviewed all our domain name expirations, their auto-renewal, payment method, notification methods, and monitoring. We didn’t identify any anomaly.

However, we found a lack of documentation and centralization. Our teams were managing domain subscriptions autonomously, resulting in duplicate registrar accounts, or multiple registrars’ usage. We already created a new process to order and manage domain names by a single team, and are now working to migrate all already existing domain names under this team’s ownership. This process includes better notification management and mandatory domain name monitoring from an external service.

Configure a failover domain for critical services

Critical services, such as our public API or SMTP Relay, were not accessible under the brevo.com domain. Although we informed our customers early to switch to the sendinblue.com domain, we understand the inconvenience caused by these changes.

To avoid such issues in the future, we will work on improving those service resiliency, by deploying a second path to reach them.

Timeline (UTC Time)

  • 10th September: brevo.com domain was about to expire and a notification was sent to an expired email alias
  • 11th September:
    • 6:50 am: Domain name servers updated – the start of the outage, appearing slowly thanks to the DNS propagation
    • 6:58 am: First alerts triggered to the infrastructure team from external monitoring
    • 7:05 am: Initial investigation started
    • 7:40 am: War room created due to the magnitude of the issue
    • 8:35 am: Root cause identified
    • 8:40 am: Our registrar anti-fraud process locked our account due to suspicious activity (new payment method + contact details update), preventing us from renewing the domain
    • 8:50 am: Reached out support to unlock the account
    • 9:30 am: We initiated various escalations to the registrar’s executive management.
    • 12:41 pm: Account unlocked and domain retrieved. NameServers updated back to our Authoritative DNS provider. This marks the end of the domain misconfiguration
    • 12:42 pm: Alerts progressively start to resolve
    • 12:50 pm: Google and CloudFlare DNS Resolvers cache flushed to speed up resolution for end customers

Background

Domain name management

Domain names, such as brevo.com, are strings used to identify companies, corporations, government entities, and even individuals on the internet. Companies called “Registrars” borrow these names for a fixed period of time and announce the domains on a centralized system based on the DNS protocol. Registrars act as a bridge between Registries – the organizations owning Top Level Domain (such as .com, .tech, .io…) and companies or individuals looking for a domain name (called “Registrants”).

Registrars also offer a DNS service along with domain registration, in order to host the DNS configuration of the domain names they sell. Registrants can either decide to host their DNS zone on Registrar name servers, or on custom name servers of their choice. We, at Brevo, are using custom name servers for our production domain names.

That being said, taking an alternate registrar, like we did for brevo.com, is technically completely transparent for us and our customers, as they are not implied on the DNS resolution path.

Domain name registration and configuration schema

DNS Resolution

One of the main purposes of the DNS protocol is to convert a DNS name such as brevo.com, app.brevo.com, or status.brevo.com to an IP address.

To achieve this, we store the mapping between DNS names and IP addresses in what we call “DNS Zones,” and we distribute these zones across different DNS servers in a tree structure. We refer to the DNS server configured to serve a zone as the “Authoritative” server.

To optimize the response time and reduce the load on authoritative servers, you can use another type of DNS server between authoritative servers and end devices (such as computers, servers, and smartphones). These servers, known as “DNS Resolvers,” maintain an internal cache of the records they have previously requested. If a client requests an unknown DNS record, the Resolver will make a request to the Authoritative server and store the answer in its own cache. The cache duration is variable and depends on the DNS Record configuration on the Authoritative DNS Server. It may vary from a few seconds to many weeks.

DNS Servers type schema

F.A.Q.

The incident continued even after the announcement of the service restoration. Why?

This is due to the nature of the DNS protocol, which relies on multiple caching layers. The process called DNS Propagation consists of progressively refreshing caching layers of data.

After restoring the domain, we initiated a cache flush on two of the major public DNS resolvers: Google (8.8.8.8 and 8.8.4.4) and Cloudflare (1.1.1.1). Other DNS Resolvers don’t offer this possibility, so we were not able to force the cache flush for our domain.

DNS Misconfigured schema

This is also why the incident appeared progressively, why some clients were impacted while others were not, but also why we couldn’t speed up resolution time for still-impacted clients.

Did the incident compromise my data?

No.

For a brief period of 6 hours, our registrar owned and managed the domain. The purpose was solely to redirect everything behind brevo.com to an advertising page. During this time, nobody else accessed the domain. The servers and databases remained under our control.

No one else has ever acquired or manipulated the domain.