Google: OpenAI hires Todd Underwood to head new SRE team

Nov 28, 2023 | Posted by Abdul-Rahman Oladimeji

OpenAI has hired Google’s Todd Underwood to head a new Site Reliability Engineering (SRE) team focused on research and training workloads. The company already has a functional team for the applied side, working on inference and API products. The SRE concept originated at Google and has since spread across the IT industry.

Underwood said on LinkedIn, "At Google, I created the Machine Learning Site Reliability Engineering (ML SRE) organization. We founded it in 2016 (there was already a Cloud ML SRE team; we built one for internal services and then combined them). After a reorganization split those teams up, I went off to work on Capital Engineering... Recently, I really wanted to get back to more SRE work but also to move closer to the ML infrastructure, especially the training infrastructure. Hence OpenAI!"

Underwood added, "I’m now in a position to build a new team of ML training infrastructure at some interesting scale (even interesting for folks coming from Google, I dare say). This is a team that will need to be involved in the infrastructure from the ground up to the model, with opportunities to work on hardware health of accelerators, job orchestration and execution, model dynamics, and of course a special focus on metrics and measurement."

