Hey there! Have a paper to suggest? Please submit this form!
https://forms.gle/s8XGxvSZxWKGSJyi6

| Status | Date | Week | Year | Title | Paper Link |
|---|---|---|---|---|---|
| Published! | 1/9/2022 | 3 | 2022 | ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling | Paper Link |
| Published! | 1/16/2022 | 4 | 2022 | Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications | Paper Link |
| Published! | 1/23/2022 | 5 | 2022 | The ties that un-bind: decoupling IP from web services and sockets for robust addressing agility at CDN-scale | Paper Link |
| | 1/30/2022 | 6 | 2022 | Presented at Alexey Charapko's reading group: https://www.youtube.com/watch?v=H-7OCFnTeMY | |
| Published! | 5/10/2022 | 20 | 2022 | Monarch: Google’s Planet-Scale In-Memory Time Series Database | Paper Link |
| Published! | 5/17/2022 | 21 | 2022 | Druid: a real-time analytical data store | https://dl.acm.org/doi/pdf/10.1145/2588555.2595631 |
| Published! | 6/4/2022 | 23 | 2022 | Data-Parallel Actors: A Programming Model for Scalable Query Serving Systems | https://www.usenix.org/system/files/nsdi22-paper-kraft.pdf |
| Published! | 7/4/2022 | 28 | 2022 | Sundial: Fault-tolerant Clock Synchronization for Datacenters | |
| Published! | 7/24/22 | 31 | 2022 | Metastable Failures in the Wild | |
| Published! | 8/25/22 | 35 | 2022 | Automatic Reliability Testing For Cluster Management Controllers | |
| Published! | 9/1/22 | 36 | 2022 | Seven years in the life of Hypergiants' off-nets | |
| Published! | 10/08/22 | 41 | 2022 | SDN in the stratosphere: loon's aerospace mesh network | SIGCOMM |
| Published! | 10/31/22 | 45 | 2022 | Design and evaluation of IPFS: a storage layer for the decentralized web | SIGCOMM |
| Published! | 12/11/22 | 51 | 2022 | Jupiter evolving: transforming google's datacenter network via optical circuit switches and software-defined networking | SIGCOMM |
| Published! | 1/19/2023 | 3 | 2023 | Elastic cloud services: scaling snowflake's control plane | SoCC 22 |
| Published! | 2/26/2023 | 9 | 2023 | Meta's Next-generation Realtime Monitoring and Analytics Platform | |
| Published! | 3/28/2023 | 13 | 2023 | Ambry: LinkedIn’s Scalable Geo-Distributed Object Store | |
| Published! | 4/16/2023 | 16 | 2023 | Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems | |
| Published! | 6/6/2023 | 23 | 2023 | TelaMalloc: Efficient On-Chip Memory Allocation for Production Machine Learning Accelerators | |
| Published! | 6/29/2023 | 26 | 2023 | Towards an Adaptable Systems Architecture for Memory Tiering at Warehouse-Scale | |
| Published! | 7/23/2023 | 30 | 2023 | Defcon: Preventing Overload with Graceful Feature Degradation | |

| Networking | Running BGP in Data Centers at Scale | |
| Networking | Orion: Google’s Software-Defined Networking Control Plane | |
| Networking | Taiji: managing global user traffic for large-scale internet services at the edge | |
| Networking | B4: Experience with a Globally-Deployed Software Defined WAN | |
| Networking | Maglev: A Fast and Reliable Network Load Balancer | |
| Databases | Spanner: Google’s Globally Distributed Database | |
| RPCs | Method overloading the circuit | https://dl.acm.org/doi/10.1145/3542929.3563466 |
| BlobStore | Ambry: LinkedIn’s Scalable Geo-Distributed Object Store | http://dprg.cs.uiuc.edu/data/files/2016/ambry.pdf |
| DistSys | Understanding and Detecting Software Upgrade Failures in Distributed Systems | |
| DB | Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service | https://www.usenix.org/system/files/atc22-elhemali.pdf |
| DB | CockroachDB: The Resilient Geo-Distributed SQL Database | |
| AI | Using deep learning to annotate the protein universe | https://www.nature.com/articles/s41587-021-01179-w.epdf?sharing_token=q6tRetZ422gIjtPMP4s5a9RgN0jAjWel9jnR3ZoTv0M6G6LioRgZ9bzThQkXRdrB3jzKxuUul1YK61iQvv0TpiY1g-t8hlEEJAPaWoOEQSPqrFygPoSzQFS2EpxMCyl-LsP8mRRne59fwzepXL22aNjligptda4Cl01WNl1U13I%3D |

| | ByteGraph: A High Performance Distributed Graph Database in ByteDance | VLDB |
| | Velox: Meta’s Unified Execution Engine | VLDB |
| | D3: A Dynamic Deadline-Driven Approach for Building Autonomous Vehicles | https://dl.acm.org/doi/pdf/10.1145/3492321.3519576 |
| | Building An Elastic Query Engine on Disaggregated Storage | |
| | How to fight production incidents?: an empirical study on a large-scale cloud service | |

Papers I want to read / write about soon!

Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems
F1 Query: Declarative Querying at Scale
Building An Elastic Query Engine on Disaggregated Storage
Elle: Inferring Isolation Anomalies from Experimental Observations
Conflict-free Replicated Data Types | Link
Dissecting performance bottlenecks of strongly-consistent replication protocols | Link
Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure
Toward formally verifying congestion control behavior
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Solving Large-Scale Granular Resource Allocation Problems Efficiently with POP
Evolution of Development Priorities in Key-value Stores Serving Large-scale Applications: The RocksDB Experience
Lineage stash: fault tolerance off the critical path
Consensus in the Cloud: Paxos Systems Demystified
https://cse.buffalo.edu/tech-reports/2016-02.orig.pdf
The Snowflake Elastic Data Warehouse
Debugging the OmniTable Way

Backlog (arbitrarily ordered)

Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
From cloud computing to sky computing
In reference to RPC: it's time to add distributed memory
Blending containers and virtual machines: a study of firecracker and gVisor
Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Cloud-Scale Infrastructure
Challenges and Opportunities for Autonomous Vehicle Query Systems
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
Overload Control for μs-Scale RPCs with Breakwater
Taiji: managing global user traffic for large-scale internet services at the edge
A buffer-based approach to rate adaptation: evidence from a large video streaming service
Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering
Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization
Azure Accelerated Networking: SmartNICs in the Public Cloud
Unikernels as Processes
Slim: OS Kernel Support for a Low-Overhead Container Overlay Network
Keeping Master Green at Scale
RPCValet: NI-Driven Tail-Aware Balancing of µs-Scale RPCs
TCP-Fuzz: Detecting Memory and Semantic Bugs in TCP Stacks with Fuzzing
Site-to-site internet traffic control
CRDTs for truly concurrent file systems

Caerus: NIMBLE Task Scheduling for Serverless Analytics
Ownership: A Distributed Futures System for Fine-Grained Tasks
When Cloud Storage Meets RDMA
ICARUS: Attacking low Earth orbit satellite networks
FAASNET: Scalable and Fast Provisioning of Custom Serverless Container Runtimes at Alibaba Cloud Function Compute
P3: Distributed Deep Graph Learning at Scale
GoJournal: a verified, concurrent, crash-safe journaling system
Argus: Debugging Performance Issues in Modern Desktop Applications with Annotated Causal Tracing
Experiences Deploying Multi-Vantage-Point Domain Validation at Let’s Encrypt
ReDMArk: Bypassing RDMA Security Mechanisms