
Marlos C. Machado, Adjunct Professor at the University of Alberta, Amii Fellow, CIFAR AI Chair – Interview Series

Marlos C. Machado is a Fellow in Residence at the Alberta Machine Intelligence Institute (Amii) and an adjunct professor at the University of Alberta, where he also holds a Canada CIFAR AI Chair. Marlos's research mostly focuses on the problem of reinforcement learning. He received his B.Sc. and M.Sc. from UFMG, in Brazil, and his Ph.D. from the University of Alberta, where he popularized the concept of temporally-extended exploration through options.

He was a researcher at DeepMind from 2021 to 2023 and at Google Brain from 2019 to 2021, during which period he made major contributions to reinforcement learning, particularly the application of deep reinforcement learning to control Loon's stratospheric balloons. Marlos's work has been published in the leading conferences and journals in AI, including Nature, JMLR, JAIR, NeurIPS, ICML, ICLR, and AAAI. His research has also been featured in popular media such as the BBC, Bloomberg TV, The Verge, and Wired.

We sat down for an interview at the 2023 Upper Bound conference on AI, held annually in Edmonton, AB and hosted by Amii (Alberta Machine Intelligence Institute).

Your primary focus has been on reinforcement learning. What draws you to this type of machine learning?

What I like about reinforcement learning is this idea that it's a really natural way, in my view, of learning: you learn by interaction. It feels like it's how we learn as humans, in a way. I don't love to anthropomorphize AI, but it's this intuitive process where you try things out, some things feel good, some things feel bad, and you learn to do the things that make you feel better. One of the things that fascinates me about reinforcement learning is the fact that, because you actually interact with the world, you're this agent that we talk about, trying things in the world, and the agent can come up with a hypothesis and test that hypothesis.

The reason this matters is that it allows the discovery of new behavior. For instance, one of the most famous examples is AlphaGo's move 37, the move they discuss in the documentary, which people say was creativity. It was something that had never been seen before; it left us all flabbergasted. It isn't written down anywhere; just by interacting with the world, you get to discover those things. You get this ability to discover. For example, one of the projects I worked on was flying balloons in the stratosphere, and we saw very similar things as well.

We saw behavior emerging that left everyone impressed, like, we never thought of that, but it's good. I think that reinforcement learning is uniquely situated to let us discover this type of behavior because you're interacting. In a way, one of the really difficult things is counterfactuals: what would have happened if I had done that instead of what I did? That is a very difficult problem in general, and in a lot of machine learning settings there's nothing you can do about it. In reinforcement learning you can ask, "What would have happened if I had done that?" I might as well try it next time I'm in this situation. This interactive aspect of it, I really like it.

Of course, I'm not going to be hypocritical; a lot of the cool applications that came with it made it quite interesting too. Going back decades, even when we talk about the early examples of big successes in reinforcement learning, all of this made it very attractive to me.

What was your favorite historical application?

I think there are two very famous ones. One is the helicopter that they flew at Stanford with reinforcement learning, and another one is TD-Gammon, the backgammon player that became a world champion. That was back in the '90s, and so during my PhD I made sure that I did an internship at IBM with Gerald Tesauro, and Gerald Tesauro was the guy leading the TD-Gammon project, so it was like, this is really cool. It's funny, because when I started doing reinforcement learning, it isn't that I was fully aware of what it was. When I was applying to grad school, I remember I went to a lot of professors' websites because I wanted to do machine learning, very generally, and I was reading the descriptions of everyone's research, and I was like, "Oh, this is interesting." When I look back, without knowing the field, I chose all the famous professors in reinforcement learning, not because they were famous, but because the descriptions of their research were appealing to me. I was like, "Oh, this website is really nice, I want to work with this guy and this guy and this woman," so in a way it was…

So you found them organically.

Exactly. When I look back I say, "Oh, these are the people I applied to work with a long time ago," or these are the papers where, before I really knew what I was doing, I was reading the description in someone else's paper and thinking, "Oh, this is something I should read." It consistently came back to reinforcement learning.

While at Google Brain, you worked on autonomous navigation of stratospheric balloons. Why was this a good use case for providing internet access to difficult-to-reach areas?

That part I'm not an expert on; that is the pitch that Loon, which was the subsidiary of Alphabet, was working on. When you look at the way we provide internet to a lot of people in the world, you build an antenna, say an antenna in Edmonton, and this antenna allows you to serve internet to a region of, let's say, five or six kilometers of radius. If you put an antenna in downtown New York, you're serving millions of people, but now imagine that you're trying to serve internet to a tribe in the Amazon rainforest. Maybe you have 50 people in the tribe; the economic cost of putting an antenna there makes it really hard, not to mention even accessing that region.

Economically speaking, it doesn't make sense to make a huge infrastructure investment in a difficult-to-reach region that is so sparsely populated. The idea of balloons was just, "But what if we could build an antenna that was really tall? What if we could build an antenna that's 20 kilometers tall?" Of course we don't know how to build that antenna, but we could put a balloon there, and then the balloon would be able to serve a region with a radius 10 times larger, or, if you talk in terms of area, 100 times larger. If you put it there, say in the middle of the forest or in the middle of the jungle, then maybe you can serve several tribes that would otherwise each require their own antenna.
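The back-of-the-envelope math here, a 10x radius meaning roughly 100x the coverage area, follows directly from the area of a circle. The radii below are illustrative round numbers, not Loon's actual figures:

```python
import math

def coverage_area_km2(radius_km: float) -> float:
    """Area served by an antenna with the given line-of-sight radius."""
    return math.pi * radius_km ** 2

ground_antenna = coverage_area_km2(5)    # ground antenna, ~5 km radius
balloon = coverage_area_km2(50)          # balloon "antenna", ~10x the radius

print(round(balloon / ground_antenna))   # -> 100: 10x the radius, 100x the area
```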

Serving internet access to those hard-to-reach regions was one of the motivations. I remember that Loon's motto was not to provide internet to the next billion people; it was to provide internet to the last billion people, which was extremely ambitious in a way. It isn't the next billion, it's the hardest billion people to reach.

What were the navigation issues that you were trying to solve?

The way these balloons work is that they are not propelled. It's just like the way people navigate hot air balloons: you either go up or down, you find the wind stream that's blowing you in a particular direction, and you ride that wind, and then it's like, "Oh, I don't want to go there anymore," so maybe you go up or you go down and you find a different one, and so on. That is what it does with these balloons as well, except it's not a hot air balloon; it's a fixed-volume balloon flying in the stratosphere.

All it can do, from a navigational perspective, is go up, go down, or stay where it is, and then it must find winds that are going to let it go where it wants to be. In that sense, that is how we would navigate, and there are so many challenges, actually. The first one, talking about the formulation first, is that you want to be in a region serving the internet, but you also want to make sure these balloons, which are solar-powered, retain power. There's this multi-objective optimization problem: not only make sure that I'm in the region I want to be in, but that I'm also being power efficient in a way. So that is the first thing.

This was the problem itself, but then when you look at the details, you don't know what the winds look like. You know what the winds look like where you are, but you don't know what the winds look like 500 meters above you. You have what we call in AI partial observability, so you don't have that data. You can have forecasts, and there are papers written about this, but the forecasts can often be up to 90 degrees wrong. It's a really difficult problem in the sense of how you deal with this partial observability, and it's an extremely high-dimensional problem, because we're talking about hundreds of different layers of wind, and then you have to consider the speed of the wind, the bearing of the wind, the way we modeled it, and how confident we are in that forecast, the uncertainty.

This just makes the problem very hard to reckon with. One of the things we struggled with the most in that project, after everything was done and so on, was just: how do we convey how hard this problem is? It's hard to wrap our minds around it, because it isn't a thing that you see on a screen; it's hundreds of dimensions of winds, and when was the last time that I had a measurement of that wind? In a way, you have to ingest all of that while you're thinking about power, the time of day, where you want to be. It's a lot.

What is the machine learning model learning? Is it simply wind patterns and temperature?

The way it works is that we had a model of the winds that was a machine learning system, but it was not reinforcement learning. You have historical data about all sorts of different altitudes, so we built a machine learning model on top of that. When I say "we", I was not part of this; this was a thing that Loon did even before Google Brain got involved. They had this wind model that went beyond just the different altitudes, because how do you interpolate between the different altitudes?

You can say, "Let's say two years ago, this is what the wind looked like, but what it looked like maybe 10 meters above, we don't know." Then you put a Gaussian process on top of that, and they had papers written on how good a model that was. The way we did it, starting from a reinforcement learning perspective, is that we had a very good simulator of the dynamics of the balloon, and we also had this wind simulator. Then what we did was go back in time and say, "Let's pretend that I'm in 2010." We have data for what the wind was like in 2010 across the whole world, but very coarse; then we can overlay this machine learning model, this Gaussian process, on top, so we actually get measurements of the winds, and then we can introduce noise, we can do all sorts of things.

Then eventually, because we have the dynamics of the balloon and we have the winds, and we're going back in time pretending that this is where we were, we actually had a simulator.
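The idea he describes, coarse historical wind layers with a Gaussian process interpolating between them, can be sketched with textbook GP regression over altitude. Every number below (altitudes, wind speeds, kernel parameters) is made up for illustration; this is not Loon's model:

```python
import numpy as np

# Hypothetical coarse historical data: wind speed (m/s) at a few altitude layers (km).
alt_obs = np.array([16.0, 17.5, 19.0, 20.5])
wind_obs = np.array([12.0, 8.0, 15.0, 11.0])

def rbf(a, b, length_km=1.0, variance=25.0):
    """Squared-exponential kernel over altitude."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_km) ** 2)

# Standard GP posterior at unobserved altitudes, with a small noise term.
noise = 1e-4
K = rbf(alt_obs, alt_obs) + noise * np.eye(len(alt_obs))
alt_query = np.linspace(16.0, 20.5, 10)
Ks = rbf(alt_query, alt_obs)
mean = Ks @ np.linalg.solve(K, wind_obs)
cov = rbf(alt_query, alt_query) - Ks @ np.linalg.solve(K, Ks.T)

# The posterior mean passes through the observed layers and fills in the gaps;
# the posterior variance says how much noise to inject in simulation.
```

Sampling from this posterior, rather than using only the mean, is what lets a simulator replay a past year with plausible, noisy winds instead of a single deterministic wind field.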

It’s like a digital twin back in time.

Exactly. We designed a reward function for staying on target while being a bit power efficient, and we had the balloon learn by interacting with this world. It could only interact with this world because we don't know how to model the weather and the winds going forward, but because we were pretending to be in the past, we managed to learn how to navigate. Basically, it was: do I go up, down, or stay, given everything that's going on around me? At the end of the day, the bottom line is that I want to serve internet to that region. That is what the problem was, in a way.
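The formulation he outlines, three discrete actions plus a reward trading off station-keeping against power, can be sketched as follows. The reward shape and every constant here are hypothetical stand-ins, not the actual Loon reward:

```python
from dataclasses import dataclass

ACTIONS = ("up", "down", "stay")  # the balloon's entire action space

@dataclass
class BalloonState:
    dist_to_target_km: float  # horizontal distance to the station-keeping target
    power_fraction: float     # battery charge in [0, 1]

def reward(state: BalloonState, radius_km: float = 50.0,
           power_weight: float = 0.1) -> float:
    """+1 for being inside the served region, plus a small bonus for
    retained power, so the learned policy is also power efficient."""
    on_station = 1.0 if state.dist_to_target_km <= radius_km else 0.0
    return on_station + power_weight * state.power_fraction
```

An agent would then pick one of `ACTIONS` each step to maximize the discounted sum of this reward under the simulated winds.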

What are some of the challenges in deploying reinforcement learning in the real world versus a game setting?

I think there are a couple of challenges. I don't even think it's necessarily about games versus the real world; it's about fundamental research versus applied research, because you can do applied research in games, let's say trying to deploy the next model in a game that's going to ship to millions of people. But I think that one of the main challenges is the engineering. A lot of times you use games as a research environment because they capture a lot of the properties that we care about, but they capture them in a more well-defined set of constraints. Because of that, we can do the research, we can validate the learning, but it's kind of a safer setting. Maybe "safer" isn't the right word, but it's a more constrained setting that we understand better.

It's not that the research necessarily needs to be very different, but I think the real world brings a lot of extra challenges. It's about deploying the systems, like safety constraints: we had to make sure the solution was safe. When you're just doing games, you don't necessarily think about that. How do you make sure the balloon isn't going to do something silly, or that the reinforcement learning agent didn't learn something we hadn't foreseen that's going to have bad consequences? Safety was one of our utmost concerns. Of course, if you're just playing games, then we're not really concerned about that; worst case, you lost the game.

That is one challenge; the other one is the engineering stack. It's very different from being a researcher on your own interacting with a computer game, where you want to validate something and that's fine. Now you have the engineering stack of a whole product that you have to deal with. It isn't that they're just going to let you go crazy and do whatever you want, so I think you have to become much more aware of that additional piece as well. I think the size of the team can also be vastly different. Loon at the time had dozens if not hundreds of people. We were of course interacting with only a small number of them, but they had a control room that would actually talk with aviation staff.

We were clueless about that, but you have many more stakeholders in a way. So I think a lot of the difference is, one, engineering, safety and so on, and the other one, of course, is that your assumptions don't hold. A lot of the assumptions that these algorithms are based on don't hold when they go to the real world, and then you have to figure out how to deal with that. The world is not as friendly as any application in games, basically, if you're talking about a very constrained game that you are doing on your own.

One example that I really love: they gave us everything, and we're like, "Okay, now we can try some of these things to solve this problem." Then we went and did it, and one week later, two weeks later, we come back to the Loon engineers like, "We solved your problem." We thought we were really smart. They looked at us with a smirk on their faces like, "You didn't; we know you cannot solve this problem, it's too hard." "No, we did, we absolutely solved your problem. Look, we have 100% accuracy." "That is literally impossible; sometimes you don't have the winds that let you…" "No, let's look at what is going on."

We figured out what was happening. The reinforcement learning algorithm learned to send the balloon to the center of the region, then it would go up, and up, and then the balloon would pop, and then the balloon would come down and stay inside the region forever. They're like, "That is clearly not what we want," but of course this was simulation. Then we said, "Oh yeah, so how do we fix that?" They're like, "Oh yeah, of course there are a couple of things, but one of them is that we make sure the balloon cannot go above the altitude at which it might burst."

These constraints in the real world, these aspects of how your solution actually interacts with other things, are easy to overlook when you're just a reinforcement learning researcher working on games. Then when you actually go to the real world, you're like, "Oh wait, these things have consequences, and I have to be aware of that." I think that is one of the main difficulties.

I think the other one is that the cycle of these experiments is really long. In a game, I can just hit play; worst case, after a week I have results. But if I really want to fly balloons in the stratosphere, well, we have this expression that I like to use in my talk, that we were A/B testing the stratosphere. Eventually, after we had the solution and were confident in it, we wanted to make sure that it was actually statistically better. We got 13 balloons, I think, and we flew them over the Pacific Ocean for more than a month, because that is how long it took for us to even validate that what we had come up with was actually better. The timescale is much longer as well, so you don't get that many chances to try stuff out.

Unlike games, there's not a million iterations of the same game running concurrently.

Yeah. We had that for training, because we were leveraging simulation, even though, again, the simulator is much slower than any game that you would have, but we were able to deal with that engineering-wise. When you do it in the real world, then it's different.

What research are you working on today?

Now I'm at the University of Alberta, and I have a research group here with a lot of students. My research is much more diverse in a way, because my students allow me to do that. One thing that I'm particularly excited about is this notion of continual learning. Pretty much every time we talk about machine learning in general, we do some computation, be it using a simulator or using a dataset and processing the data, and we learn a machine learning model, and we deploy that model and hope it does okay, and that is fine. A lot of times that is exactly what you want; a lot of times that is perfect. But sometimes it isn't, because sometimes the real world is too complex for you to expect that a model, no matter how big it is, was actually able to incorporate everything that you wanted it to, all the complexities of the world, so you have to adapt.

One of the projects that I'm involved with here at the University of Alberta, for example, is a water treatment plant. Basically, it's: how do we come up with reinforcement learning algorithms that are able to support humans in the decision-making process, or do it autonomously, for water treatment? We have the data, we can see the data, and sometimes the quality of the water changes within hours. So even if you say, "Every single day I'll train my machine learning model on yesterday's data, and I'll deploy it within hours," that model is not valid anymore, because there's data drift; it isn't stationary. It's really hard to model those things, because maybe there's a forest fire happening upstream, or maybe the snow is starting to melt, so you would need to model the whole world to be able to do that.

Of course nobody does that; we don't do that as humans, so what do we do? We adapt, we keep learning. We're like, "Oh, this thing that I was doing isn't working anymore, so I might as well learn to do something else." I think there are a lot of applications, mainly the real-world ones, that require you to be learning constantly and forever, and this is not the standard way that we talk about machine learning. Oftentimes we say, "I'll do a big batch of computation, and I'll deploy a model," and maybe I deploy the model while I'm already doing more computation, because I'll deploy another model days or weeks later, but sometimes the timescale of these things doesn't work out.

The question is, "How can we learn continually, forever, such that we're just getting better and adapting?" And this is really hard. We have a couple of papers about this; our current machinery is not able to do that. With a lot of the solutions we have that are the gold standard in the field, if you just have them keep learning instead of stopping and deploying, things get bad really quickly. That is one of the things that I'm really excited about. Now that we have done so many successful things deploying fixed models, and we will continue to do them, thinking as a researcher, "What's the frontier of the field?", I think one of the frontiers we have is this aspect of learning continually.
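The failure mode of train-once-and-deploy under drift can be seen in a toy stream where the target shifts partway through, standing in for, say, upstream water quality changing within hours. A frozen model keeps its stale estimate, while even a trivial continual learner, an exponential moving average here, adapts; all numbers are illustrative:

```python
import random

random.seed(0)

# A drifting stream: the mean jumps from 5.0 to 9.0 halfway through.
stream = ([random.gauss(5.0, 0.5) for _ in range(500)]
          + [random.gauss(9.0, 0.5) for _ in range(500)])

# "Train once, deploy": fit on the first half, then freeze.
frozen = sum(stream[:500]) / 500

# Continual learner: an exponential moving average, also trained on the
# first half, but updated on every subsequent sample too.
alpha, ema = 0.05, stream[0]
for x in stream[:500]:
    ema += alpha * (x - ema)

frozen_err = online_err = 0.0
for x in stream[500:]:          # evaluate after the drift
    frozen_err += abs(x - frozen)
    online_err += abs(x - ema)
    ema += alpha * (x - ema)    # keep adapting as data arrives

print(frozen_err > online_err)  # True: the frozen model degrades after the drift
```

The hard research version of this replaces the scalar statistic with a neural network that must keep its plasticity while learning forever, which is where current methods break down.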

I think that reinforcement learning is particularly suited to do this, because a lot of our algorithms process data as the data is coming in, so in a way a lot of the algorithms would naturally be fit to keep learning. That doesn't mean that they do, or that they're good at it, but we don't have to change the question we're asking, and I think there are a lot of interesting research questions about what we can do.

What future applications using this continual learning are you most excited about?

That is the billion-dollar question, because in a way I have been searching for those applications. I think that, as a researcher, being able to ask the right questions is more than half of the work, so in reinforcement learning I like to be driven by problems a lot of the time. It's just, "Oh look, we have this challenge, let's say flying balloons in the stratosphere, so now we have to figure out how to solve this," and along the way you're making scientific advances. Right now I'm working with colleagues like Adam White and Martha White on this; the project, actually led by them, is this water treatment plant. It's something that I'm really excited about, because it's one that is really hard to even describe with language in a way, so it isn't that all the current exciting successes that we have with language are easily applicable there.

They do require this continual learning aspect. As I was saying, the water changes very often, be it the turbidity, be it the temperature, and so on, and it operates at different timescales. I think that it's unavoidable that we need to learn continually. It has a huge social impact; it's hard to imagine something more important than actually providing drinking water to the population, and sometimes this matters a lot. It's easy to overlook the fact that sometimes in Canada, for example, when we go to the more sparsely populated regions, like in the northern part, sometimes we don't even have an operator to operate a water treatment plant. It isn't that this is necessarily supposed to replace operators; it's to actually empower us to do the things that otherwise we couldn't, because we just don't have the personnel or the capacity to do so.

I think that it has a huge potential social impact, and it's an extremely difficult research problem. We don't have a simulator, we don't have the means to obtain one, so we have to use the best data available, and we have to be learning online, so there are a lot of challenges there, and that is one of the things I'm excited about. Another one, and this is not something that I have been doing much, is cooling buildings. Again, thinking about weather, about climate change and things that we can have an impact on, very often it's just: how do we decide how we're going to cool a building? This building, where we have hundreds of people today, is very different than it was last week, and are we going to be using the exact same policy? At most we have a thermostat, so it's like, "Oh yeah, it's warm." We can probably be more clever about this and adapt; again, sometimes there are a lot of people in one room and not the other.

There are a lot of these opportunities in control systems that are high dimensional and very hard to reckon with in our minds, where we can probably do much better than the standard approaches that we have right now in the field.

In some places up to 75% of power consumption is literally A/C units, so that makes a lot of sense.

Exactly, and I think a lot of this, in your home, there are already in a way some products that do machine learning and then learn from their clients. In these buildings, you can have a much more fine-grained approach; Florida, Brazil, there are a lot of places that have this need. Cooling data centers is another one; there are some companies that are starting to do this. It sounds almost like sci-fi, but there's an ability to be constantly learning and adapting as the need comes. This can have a huge effect on these control problems that are high dimensional and so on, like when we were flying the balloons. For example, one of the things that we were able to show was exactly how reinforcement learning, and specifically deep reinforcement learning, can learn decisions based on sensors that are far more complex than what humans can design.

Just by definition, if you look at how a human would design a response curve for some sensor, it's like, "Well, it's probably going to be linear, or quadratic," but when you have a neural network, it can learn all the non-linearities that make for a much more fine-grained decision, which is sometimes quite effective.

