Seven years in the life of Hypergiants' off-nets
Published September 03, 2022
What is the research?
Many large tech organizations (also known as hypergiants; for example, Akamai, FAANG/MAGMA/MANGA/other abbreviations, and Alibaba) serve multimedia content (like video and games) to users all around the world. (There is a related talk, Internet Traffic 2009-2019 from Craig Labovitz, that describes how the “Internet is now largely a video and game delivery system”.) Serving this content with low latency poses difficult technical challenges. One solution is placing servers close to users in off-net networks that consumers directly connect to via their Internet Service Provider (ISP). (This approach is called off-net because the servers are “off” the main network. Networks that consumers directly connect to are sometimes called eyeball networks.)
The authors argue that understanding off-nets is important, as the growth of the pattern could change internet structure, routing, and performance. At the same time, there was limited tooling to understand the prevalence of off-net services, meaning that researchers had minimal visibility into the pattern’s impact. The paper aims to address this problem by increasing internet observability, and unlocking future research into off-nets.
What are the paper’s contributions?
The paper makes three main contributions:
- Development of a new approach to characterize off-nets, providing a new dataset on their deployment.
- Validation of the approach using third-party datasets.
- Analysis of the off-net dataset, providing insight into the pattern’s usage worldwide by multiple large tech organizations.
How does the system work?
One of the paper’s main goals is reliably detecting hypergiant off-nets worldwide.
The paper’s implementation differs from previous attempts to characterize off-nets. Prior research relied on DNS resolvers (see an example paper on studying YouTube’s server selection, which interestingly used PlanetLab, a global research network that was truly ahead of its time) or on enumerating patterns in hypergiant DNS records (see Open Connect Everywhere: A Glimpse at the Internet Ecosystem through the Lens of the Netflix CDN, a paper on Netflix’s Open Connect, or this blog post on Facebook’s CDN) to find servers. Both approaches had their downsides - for example, the first could stress open DNS resolver infrastructure, while the second relied on fragile DNS enumeration techniques.
To implement their solution, the authors use a new approach that relies on two data sources: Transport Layer Security (TLS) certificates and HTTP(S) fingerprints. (There is great background on TLS on Julia Evans’s blog - see Dissecting a TLS certificate, and her related zine.)
Nearly all hypergiants use TLS certificates to secure user traffic (see this report on TLS usage). Because hypergiants deploy the same services in off-nets and on-nets, similar certificates are present on servers in both network types. As a result, it is theoretically possible to identify a hypergiant’s off-nets by looking for servers outside its own network that present the same certificates as its on-net servers.
The paper discusses several complications with putting this idea into practice:
- For legacy reasons, subsidiaries of a hypergiant might not be using the same certificates - the paper cites LinkedIn and GitHub (acquisitions of Microsoft) using different certificates than the parent company.
- Hypergiants issue certificates for customers to deploy to their own servers, so the presence of a certificate doesn’t necessarily mean the server is owned by the hypergiant. (Cloudflare provides this option for customers, meaning that a customer’s origin server can present a Cloudflare-signed certificate.)
To limit the impact of these complications, the approach implemented by the paper also verifies HTTP(S) fingerprints, checking that a candidate off-net server returns stable/known headers for a given hypergiant.
The implementation combines both TLS certificates and HTTP(S) fingerprints in a two-pass process.
The first pass scrapes valid TLS certificates from a hypergiant’s on-nets, searching for the name of the organization in the Subject Info of returned certificates. (There are open source tools for inspecting TLS certificates, including certigo, that you can use!) Using this data, the system issues queries to IP addresses outside of the hypergiant’s network to identify candidate off-net servers (looking for a TLS certificate match).
Because of the complications (discussed above) of solely relying on certificates to determine off-nets, the paper then makes a second pass using HTTP(S) fingerprints over the candidate off-net servers, which are the servers from the first pass whose certificate matches one on a hypergiant on-net server. If a candidate returns headers matching the expected hypergiant, the implementation marks the server’s IP address as belonging to that hypergiant’s off-net.
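To make the two passes concrete, here is a minimal sketch that runs over already-collected scan records rather than live network scans. Everything in it is an assumption for illustration: the `ScanRecord` shape, the `ExampleGiant` organization, the header fingerprint, and the matching rule (organization keyword in the certificate subject plus DNS names already seen on-net). The paper’s actual rules may differ.

```python
from dataclasses import dataclass, field


@dataclass
class ScanRecord:
    """One scanned server (hypothetical shape, not the paper's data format)."""
    ip: str
    cert_org: str               # organization named in the certificate subject
    cert_dns_names: frozenset   # DNS names covered by the certificate
    headers: dict = field(default_factory=dict)  # HTTP(S) response headers


def first_pass(on_net, outside, org_keyword):
    """Pass 1: learn certificate names on-net, then find matching outside servers."""
    keyword = org_keyword.lower()
    known_names = set()
    for rec in on_net:
        if keyword in rec.cert_org.lower():
            known_names |= rec.cert_dns_names
    # Assumed rule: same organization and only DNS names already seen on-net.
    return [
        rec for rec in outside
        if keyword in rec.cert_org.lower()
        and rec.cert_dns_names
        and rec.cert_dns_names <= known_names
    ]


def second_pass(candidates, fingerprint):
    """Pass 2: keep candidates whose headers contain every expected pair."""
    confirmed = []
    for rec in candidates:
        lowered = {name.lower(): value for name, value in rec.headers.items()}
        if all(lowered.get(name.lower()) == value
               for name, value in fingerprint.items()):
            confirmed.append(rec.ip)
    return confirmed
```

A server that presents a matching certificate but returns unexpected headers (for example, a customer’s own origin server) survives the first pass and is dropped by the second, which is the point of combining the two signals.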
From there, the IPs are mapped to an Autonomous System (AS) (see Cloudflare’s docs on Autonomous Systems). As ASes can represent large networks with defined ownership, this mapping is helpful to develop an understanding of when a hypergiant has a server in a network owned by an internet service provider (ISP) or another organization that provides internet service to consumers. (The paper links to a few resources it uses for the mapping, one of which is a neat tool called RouteViews.)
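IP-to-AS mapping is commonly done with a longest-prefix match against announced prefixes. The sketch below shows that idea using only Python’s standard `ipaddress` module. The prefixes and AS numbers are made-up documentation values, and the linear scan is for readability rather than speed. It is not a description of the paper’s tooling.

```python
import ipaddress


def build_table(prefix_to_asn):
    """Parse prefix -> ASN pairs and sort them most-specific first."""
    table = [(ipaddress.ip_network(prefix), asn)
             for prefix, asn in prefix_to_asn.items()]
    table.sort(key=lambda entry: entry[0].prefixlen, reverse=True)
    return table


def ip_to_asn(ip, table):
    """Return the ASN of the longest matching prefix, or None if none match."""
    addr = ipaddress.ip_address(ip)
    for network, asn in table:
        if addr.version == network.version and addr in network:
            return asn
    return None


# Hypothetical prefixes and ASNs from the documentation ranges.
table = build_table({
    "198.51.100.0/24": 64500,
    "198.51.100.128/25": 64501,
})
print(ip_to_asn("198.51.100.200", table))  # 64501 (the /25 is more specific)
```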
The paper validates its approach to finding off-nets and assigning them to hypergiants in three ways: comparing to open source datasets, consulting with the hypergiants themselves, and evaluating relative to previously published results.
First, the authors compare gathered certificates and their assignments to hypergiants against open source databases, namely Rapid7 and Censys. The three datasets roughly match, although the existing Rapid7 and Censys datasets have fewer data points. The authors note that their “scan found around 20% more addresses, which we attribute to two causes. First, both Rapid7 and Censys have to respond to complaints and remove IP addresses from their scans. As both scans have run for years, more address space is excluded over time. A second reason for this difference is that our scan took almost four days to execute, which may trigger less rate limiting than the other, faster scans.”
The authors also consulted with four hypergiants on the veracity of the paper’s dataset:
All four agreed that the estimation of the off-net footprint is very good. One HG operator indicated that 6% of ASes we identified as hosting the HG’s off-nets were not on the HG’s list, and 11% from the HG’s list were not uncovered by our technique (while also indicating that the HG’s list may not be 100% correct)
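Note that the two percentages in the quote use different denominators: the first is relative to the ASes the technique identified, the second relative to the hypergiant’s own list. A small sketch of that comparison, with a hypothetical function name and toy AS numbers:

```python
def compare_as_sets(inferred, ground_truth):
    """Compare inferred off-net ASes against an operator-provided list."""
    inferred, ground_truth = set(inferred), set(ground_truth)
    extra = inferred - ground_truth    # identified, but not on the operator's list
    missed = ground_truth - inferred   # on the operator's list, but not identified

    def pct(part, whole):
        return 100 * len(part) / len(whole) if whole else 0.0

    return {
        "pct_inferred_not_on_list": pct(extra, inferred),
        "pct_list_not_uncovered": pct(missed, ground_truth),
    }


# Toy AS numbers, not data from the paper.
print(compare_as_sets({1, 2, 3, 4}, {2, 3, 4, 5, 6}))
# {'pct_inferred_not_on_list': 25.0, 'pct_list_not_uncovered': 40.0}
```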
Lastly, the paper compares against previous research on Facebook and Netflix off-nets, finding that the paper’s dataset roughly matches. A fun anecdote from the Facebook-related identification comes from Anurag Bhatia’s blog:
Back in 2019, I was in San Francisco, California for NANOG 75. While roaming around in the lobby, someone read the NANOG card hanging around my neck and greeted me. His 2nd line after greeting was “Oh I know that name, you are the guy who mapped our caching nodes” and we both laughed. I must say this specific category of the post has brought some attention around.
How is the dataset evaluated?
After verifying the dataset, the paper performs three main analyses: hypergiant off-net footprint growth, hypergiants’ reach to the world’s internet users, and hypergiant deployment overlap.
To measure hypergiant off-net footprint growth, the paper considers counts of ASes where a given hypergiant is present, as well as the size of the network measured by “customer cones”. (Sizing is based on the CAIDA AS relationship dataset, which indicates, “Small ASes have customer cones ≤ 10 ASes, Medium ASes have customer cones ≤ 100 ASes, Large ASes ≤ 1000 ASes, and XLarge ASes > 1000 ASes.”) Interestingly, the authors call out that hypergiants are present in many more Large/XLarge ASes than is typical for the internet (5% for hypergiants, vs 0.5% for the rest of the internet) - this conclusion makes sense in light of off-net deployments aiming to reach as many customers as possible.
The paper also breaks down off-net deployments by region. It is possible to see growth in off-net locations from hypergiants like Facebook, which the authors note started heavily investing in an internal CDN in mid-2017. (The paper cites this article when discussing FB’s move. Engineering Egress with Edge Fabric: Steering Oceans of Content to the World on the Facebook blog also discusses integration with many networks around the world. I hope to read papers on traffic engineering in a future paper review!)
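The size buckets quoted above translate directly into a threshold function. This is a small sketch of the bucketing only; the function name is my own, and the input is assumed to be the number of ASes in a customer cone.

```python
def cone_size_class(cone_size):
    """Bucket an AS by customer cone size, using the thresholds quoted above."""
    if cone_size <= 10:
        return "Small"
    if cone_size <= 100:
        return "Medium"
    if cone_size <= 1000:
        return "Large"
    return "XLarge"


for size in (10, 11, 1000, 1001):
    print(size, cone_size_class(size))
```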
These regional investments in off-nets impact the ability to put content closer to users, as visible from an increase in the percentage of Internet users connected to “ASes hosting Facebook’s off-net servers”.
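One way to read this metric: given an estimate of users per AS, sum the users in ASes that host an off-net and divide by all users. The sketch below assumes such per-AS estimates are available. The numbers and AS identifiers are invented, and the function name is hypothetical.

```python
def user_coverage_pct(users_per_as, hosting_ases):
    """Percent of users whose AS hosts at least one off-net server."""
    total = sum(users_per_as.values())
    if total == 0:
        return 0.0
    covered = sum(users for asn, users in users_per_as.items()
                  if asn in hosting_ases)
    return 100 * covered / total


# Invented user estimates per AS (documentation ASNs).
users = {64500: 600, 64501: 300, 64502: 100}
print(user_coverage_pct(users, {64500, 64502}))  # 70.0
```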
Lastly, the paper graphs the total number of ASes with off-nets, and the presence of hypergiants in them. Since 2013, the number of distinct ASes hosting a hypergiant has grown significantly. As of mid-2021, if an AS hosts any off-net for a top four hypergiant, it is very likely to also host an off-net for another of the top four. At the same time, the number of ASes that host more than one hypergiant has grown significantly.
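The overlap numbers boil down to counting, for each AS, how many hypergiants have an off-net there. A minimal sketch, assuming the dataset has been reduced to a mapping from hypergiant name to the set of ASes hosting its off-nets (the names and AS numbers below are placeholders):

```python
from collections import Counter


def overlap_summary(footprints):
    """Return (ASes hosting any hypergiant, ASes hosting more than one)."""
    hypergiants_per_as = Counter(
        asn for ases in footprints.values() for asn in set(ases)
    )
    total = len(hypergiants_per_as)
    multi = sum(1 for count in hypergiants_per_as.values() if count > 1)
    return total, multi


# Placeholder footprints, not data from the paper.
footprints = {"HG-A": {1, 2, 3}, "HG-B": {2, 3}, "HG-C": {3, 4}}
print(overlap_summary(footprints))  # (4, 2)
```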
I found this paper interesting as it creates a novel data source based on publicly available networking information (now made usable due to the increase in TLS-based encryption). Beyond developing a new methodology, the authors verify it using baselines from previous studies, advancing the internet observability state of the art. High performance networking and traffic engineering at scale make for a fascinating set of technical topics, and I’m looking forward to diving into related research in future paper reviews.