Doug Fuller, VP of Software Engineering at Cornelis Networks – Interview Series

As Vice President of Software Engineering, Doug is responsible for all aspects of the Cornelis Networks software stack, including the Omni-Path Architecture drivers, messaging software, and embedded device control systems. Before joining Cornelis Networks, Doug led software engineering teams at Red Hat in cloud storage and data services. Doug's career in HPC and cloud computing began at Ames National Laboratory's Scalable Computing Laboratory. Following several roles in university research computing, Doug joined the US Department of Energy's Oak Ridge National Laboratory in 2009, where he developed and integrated new technologies at the world-class Oak Ridge Leadership Computing Facility.

Cornelis Networks is a technology leader delivering purpose-built high-performance fabrics for High Performance Computing (HPC), High Performance Data Analytics (HPDA), and Artificial Intelligence (AI) to leading business, scientific, academic, and government organizations.

What initially attracted you to computer science?

It just seemed like I would enjoy working with technology. I enjoyed working with computers growing up; we had a modem at our school that let me try out the internet and I found it interesting. As a freshman in college, I met a USDOE computational scientist while volunteering for the National Science Bowl. He invited me to tour his HPC lab and I was hooked. I've been a supercomputer geek ever since.

You worked at Red Hat from 2015 to 2019, what were some of the projects you worked on and your key takeaways from this experience?

My primary project at Red Hat was Ceph distributed storage. I'd previously focused entirely on HPC, and this gave me an opportunity to work on technologies that were critical to cloud infrastructure. It rhymes. Most of the principles of scalability, manageability, and reliability are very similar even though they're aimed at solving slightly different problems. In terms of technology, my most significant takeaway was that cloud and HPC have a lot to learn from each other. We're increasingly building different projects with the same Lego set. It's really helped me understand how the enabling technologies, including fabrics, can come to bear on HPC, cloud, and AI applications alike. It's also where I really came to understand the value of Open Source and how to execute the Open Source, upstream-first software development philosophy that I brought with me to Cornelis Networks. Personally, Red Hat was where I really grew and matured as a leader.

You're currently the Vice President of Software Engineering at Cornelis Networks, what are some of your responsibilities and what does your average day look like?

As Vice President of Software Engineering, I'm responsible for all aspects of the Cornelis Networks software stack, including the Omni-Path Architecture drivers, messaging software, fabric management, and embedded device control systems. Cornelis Networks is an exciting place to be, especially in this moment and this market. Because of that, I'm not sure I have an "average" day. Some days I'm working with my team to solve the latest technology challenge. Other days I'm interacting with our hardware architects to make sure our next-generation products will deliver for our customers. I'm often in the field meeting with our amazing community of customers and collaborators, making sure we understand and anticipate their needs.

Cornelis Networks offers next-generation networking for High Performance Computing and AI applications, could you share some details on the hardware that is available?

Our hardware consists of a high-performance switched-fabric network solution. To that end, we provide all the necessary devices to fully integrate HPC, cloud, and AI fabrics. The Omni-Path Host-Fabric Interface (HFI) is a low-profile PCIe card for endpoint devices. We also produce a 48-port 1U "top-of-rack" switch. For larger deployments, we make two fully integrated "director-class" switches: one that packs 288 ports in 7U, and an 1152-port, 20U device.

Can you discuss the software that manages this infrastructure and how it's designed to decrease latency?

First, our embedded management platform provides easy installation and configuration, as well as access to a wide variety of performance and configuration metrics produced by our switch ASICs.

Our driver software is developed as part of the Linux kernel. In fact, we submit all our software patches to the Linux kernel community directly. That ensures that all of our customers enjoy maximum compatibility across Linux distributions and easy integration with other software such as Lustre. While not in the latency path, having an in-tree driver dramatically reduces installation complexity.

The Omni-Path fabric manager (FM) configures and routes an Omni-Path fabric. By optimizing traffic routes and recovering quickly from faults, the FM provides industry-leading performance and reliability on fabrics from tens to thousands of nodes.

Omni-Path Express (OPX) is our high-performance messaging software, recently released in November 2022. It was specifically designed to reduce latency compared to our earlier messaging software. We ran cycle-accurate simulations of our send and receive code paths in order to minimize instruction count and cache utilization. This produced dramatic results: when you're in the microsecond regime, every cycle counts!

We also integrated with the OpenFabrics Interfaces (OFI), an open standard produced by the OpenFabrics Alliance. OFI's modular architecture helps minimize latency by allowing higher-level software, such as MPI, to leverage fabric features without additional function calls.

The entire network is also designed to increase scalability, could you share some details on how it is able to scale so well?

Scalability is at the core of Omni-Path's design principles. At the lowest levels, we use Cray link-layer technology to correct link errors with no latency impact. This benefits fabrics at all scales but is especially important for large-scale fabrics, which naturally experience more link errors. Our fabric manager is focused both on programming optimal routing tables and on doing so rapidly. This ensures that routing for even the largest fabrics can be completed in a minimal amount of time.

Scalability is also a critical component of OPX. Minimizing cache utilization improves scalability on individual nodes with large core counts. Minimizing latency also improves scalability by improving time to completion for collective algorithms. Using our host-fabric interface resources more efficiently enables each core to communicate with more remote peers. The strategic choice of libfabric allows us to leverage software features like scalable endpoints using standard interfaces.

Could you share some details on how AI is incorporated into some of the workflows at Cornelis Networks?

We're not quite ready to talk externally about our internal uses of and plans for AI. That said, we do eat our own dog food, so we get to take advantage of the latency and scalability improvements we've made to Omni-Path to support AI workloads. It makes us all the more excited to share those benefits with our customers and partners. We have certainly observed that, as in traditional HPC, scaling out infrastructure is the only path forward, but the challenge is that network performance is easily stifled by Ethernet and other traditional networks.

What are some changes that you foresee in the industry with the advent of generative AI?

First off, the use of generative AI will make people more productive – no technology in history has made human beings obsolete. Every technology evolution and revolution we've had, from the cotton gin to the automated loom to the telephone, the internet, and beyond, has made certain jobs more efficient, but we haven't worked humanity out of existence.

Through the application of generative AI, I believe companies will advance technologically at a faster rate because those running the company will have more free time to focus on those advancements. For example, if generative AI provides more accurate forecasting, reporting, planning, etc., companies can focus on innovation in their field of expertise.

I specifically feel that AI will make each of us a multidisciplinary expert. For example, as a scalable software expert, I understand the connections between HPC, big data, cloud, and the AI applications that drive them toward solutions like Omni-Path. Equipped with a generative AI assistant, I can delve deeper into the details of the applications used by our customers. I have little doubt that this will help us design even more effective hardware and software for the markets and customers we serve.

I also foresee an overall improvement in software quality. AI can effectively serve as "another set of eyes" to statically analyze code and develop insights into bugs and performance problems. This will be particularly interesting at large scales, where performance issues can be especially difficult to spot and expensive to reproduce.

Finally, I hope and believe that generative AI will help our industry train and onboard more software professionals without previous experience in AI and HPC. Our field can seem daunting to many, and it can take time to learn to "think in parallel." Fundamentally, just as machines made it easier to manufacture things, generative AI will make it easier to conceive of and reason about concepts.

Is there anything that you would like to share about your work or Cornelis Networks in general?

I'd like to encourage anyone with the interest to pursue a career in computing, especially in HPC and AI. In this field, we're equipped with the most powerful computing resources ever built, and we bring them to bear against humanity's greatest challenges. It's an exciting place to be, and I've enjoyed it every step of the way. Generative AI brings our field to even greater heights as the demand for increasing capability rises drastically. I can't wait to see where we go next.
