Spotlight on Lablet Research #19 - Mixed Initiative and Collaborative Learning in Adversarial Environments
Lablet: Vanderbilt University
Sub-Lablet: University of California, Berkeley
One of the goals of the research is to characterize the limiting behavior of machine learning algorithms deployed in competitive settings.
Led by Principal Investigator (PI) Claire Tomlin and Co-PI Shankar Sastry, this research project takes a game-theoretic approach to learning dynamic behavior safely through reachable sets, probabilistically safe planning around people, and safe policy-gradient reinforcement learning. Machine learning algorithms are increasingly deployed in competitive settings, yet an understanding of their behavior (convergence, optimality, etc.) in such settings is sorely lacking. The researchers are therefore studying the interplay between disturbance (attempts to force the system into the unsafe region) and control (attempts to stay safe), as well as fundamental issues with gradient play in games.
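As a minimal illustration of why gradient play in games can misbehave (a toy example under simplifying assumptions, not the project's own experiments), consider two players taking simultaneous gradient steps on the bilinear game f(x, y) = x·y, where one player minimizes over x and the other maximizes over y. The unique equilibrium is the origin, but the iterates spiral outward rather than converging:

```python
# Toy illustration (not from the project): simultaneous gradient play on the
# bilinear game f(x, y) = x * y, with player 1 minimizing over x and player 2
# maximizing over y. The unique Nash equilibrium is (0, 0), but plain gradient
# play spirals away from it instead of converging.

def gradient_play(x=1.0, y=1.0, step=0.1, iters=100):
    trajectory = [(x, y)]
    for _ in range(iters):
        grad_x = y   # df/dx
        grad_y = x   # df/dy
        x, y = x - step * grad_x, y + step * grad_y  # simultaneous updates
        trajectory.append((x, y))
    return trajectory

traj = gradient_play()
print("start:", traj[0])
print("end  :", traj[-1])   # distance from the equilibrium grows over time
```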
In many settings where multiple agents interact, the optimal choices for each agent depend heavily on the choices of the others. These coupled interactions are well described by a general-sum differential game, in which players have differing objectives, the state evolves in continuous time, and optimal play is characterized by Nash equilibria. Such problems often admit multiple Nash equilibria, and from the perspective of a single agent, this multiplicity of solutions introduces uncertainty about how the other agents will behave. This research proposes a general framework for resolving the ambiguity by reasoning about which equilibrium the other agents are aiming for. The researchers demonstrate the framework in simulations of a multi-player human-robot navigation problem, with two main conclusions: first, by inferring which equilibrium the humans are operating at, the robot predicts their trajectories more accurately; and second, by discovering and aligning itself to this equilibrium, the robot reduces the cost for all players.
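A minimal sketch of the equilibrium-inference idea, under simplifying assumptions (a finite set of candidate equilibria and Gaussian noise on observed human actions; the function and variable names here are hypothetical, not the project's code): the robot maintains a belief over which equilibrium the humans are playing and updates it from observed motion.

```python
import numpy as np

# Hypothetical sketch: Bayesian inference over which Nash equilibrium the other
# agents are operating at. Each candidate equilibrium predicts a nominal human
# action at the current state; the observed action is assumed to be that
# prediction plus Gaussian noise.

def update_equilibrium_belief(belief, predicted_actions, observed_action, noise_std=0.5):
    """belief: prior probabilities over candidate equilibria.
    predicted_actions: one predicted action vector per equilibrium.
    observed_action: the action the human was actually seen to take."""
    belief = np.asarray(belief, dtype=float)
    likelihoods = np.array([
        np.exp(-0.5 * np.sum((observed_action - pred) ** 2) / noise_std ** 2)
        for pred in predicted_actions
    ])
    posterior = belief * likelihoods
    return posterior / posterior.sum()

# Two candidate equilibria, e.g. "human yields" vs. "human goes first".
belief = [0.5, 0.5]
predicted = [np.array([0.0, -1.0]), np.array([1.0, 0.0])]
observed = np.array([0.9, 0.1])          # human clearly moving ahead
belief = update_equilibrium_belief(belief, predicted, observed)
print(belief)                            # mass shifts toward "human goes first"
```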
Many problems in robotics involve multiple decision-making agents. To operate efficiently in such settings, a robot must reason about the impact of its decisions on the behavior of other agents. Differential games offer an expressive theoretical framework for formulating these types of multi-agent problems. Unfortunately, most numerical solution techniques scale poorly with state dimension and are rarely used in real-time applications. For this reason, it is common to predict the future decisions of other agents and solve the resulting decoupled, i.e., single-agent, optimal control problem. This decoupling neglects the underlying interactive nature of the problem; however, efficient solution techniques do exist for broad classes of optimal control problems. The researchers take inspiration from one such technique, the Iterative Linear-Quadratic Regulator (ILQR), which solves repeated approximations with linear dynamics and quadratic costs. Analogously, the proposed algorithm solves repeated linear-quadratic games. The team experimentally benchmarks the algorithm on several examples with a variety of initial conditions and shows that the resulting strategies exhibit complex interactive behavior. The results indicate that the algorithm converges reliably and runs in real time. In a three-player, 14-state simulated intersection problem, the algorithm initially converges in under 0.25 s, and receding-horizon invocations converge in under 50 ms in a hardware collision-avoidance test.
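The efficient subroutine this approach relies on is a linear-quadratic game, which can be solved exactly by a coupled backward recursion. The sketch below (a toy two-player, discrete-time LQ game with made-up matrices; it is not the project's implementation, and the full algorithm would repeatedly re-linearize the dynamics and re-quadraticize the costs around the current trajectory before calling such a solver) computes feedback Nash gains for that subproblem:

```python
import numpy as np

def lq_game_feedback_nash(A, B1, B2, Q1, Q2, R1, R2, horizon):
    """Feedback Nash gains for a two-player discrete-time LQ game.
    Dynamics: x+ = A x + B1 u1 + B2 u2; player i minimizes
    sum_k x'Q_i x + u_i'R_i u_i.  Returns per-step gains (K1_k, K2_k)."""
    m1 = B1.shape[1]
    Z1, Z2 = Q1.copy(), Q2.copy()            # terminal value matrices (assumed = Q_i)
    gains = []
    for _ in range(horizon):
        # Coupled first-order conditions, linear in the stacked gains [K1; K2].
        S = np.block([
            [R1 + B1.T @ Z1 @ B1, B1.T @ Z1 @ B2],
            [B2.T @ Z2 @ B1,      R2 + B2.T @ Z2 @ B2],
        ])
        Y = np.vstack([B1.T @ Z1 @ A, B2.T @ Z2 @ A])
        K = np.linalg.solve(S, Y)
        K1, K2 = K[:m1], K[m1:]
        # Closed-loop dynamics and value recursion for each player.
        F = A - B1 @ K1 - B2 @ K2
        Z1 = Q1 + K1.T @ R1 @ K1 + F.T @ Z1 @ F
        Z2 = Q2 + K2.T @ R2 @ K2 + F.T @ Z2 @ F
        gains.append((K1, K2))
    return gains[::-1]                        # time-ordered, k = 0 .. horizon-1

# Tiny example with placeholder matrices.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B1 = np.array([[0.0], [0.1]])
B2 = np.array([[0.0], [0.05]])
Q1, Q2 = np.eye(2), np.diag([0.5, 2.0])
R1, R2 = np.array([[1.0]]), np.array([[1.0]])
K1_0, K2_0 = lq_game_feedback_nash(A, B1, B2, Q1, Q2, R1, R2, horizon=50)[0]
print(K1_0, K2_0)
```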
Partially Observable Markov Decision Processes (POMDPs) with continuous state and observation spaces have powerful flexibility for representing real-world decision and control problems but are notoriously difficult to solve. While recent online sampling-based algorithms that use observation likelihood weighting have shown unprecedented effectiveness in domains with continuous observation spaces, there has been no formal theoretical justification for this technique. This research offers such a justification, proving that a simplified algorithm, Partially Observable Weighted Sparse Sampling (POWSS), will estimate Q-values accurately with high probability and can be made to perform arbitrarily near the optimal solution by increasing computational power.
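The key technique the theory justifies is weighting sampled next states by the likelihood of the continuous observation actually received, rather than trying to match observations exactly (an event of probability zero in a continuous observation space). A minimal, hypothetical particle-belief update illustrating that weighting (a toy 1D Gaussian model, not the POWSS algorithm itself) might look as follows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1D model (hypothetical, for illustration only):
#   state transition  s' = s + a + process noise
#   observation       o  = s' + measurement noise
PROC_STD, OBS_STD = 0.3, 0.5

def obs_likelihood(o, s):
    return np.exp(-0.5 * ((o - s) / OBS_STD) ** 2)

def update_belief(particles, weights, action, observation):
    """Propagate particles through the transition model, then reweight each one
    by the likelihood of the observation that was actually received."""
    new_particles = particles + action + rng.normal(0.0, PROC_STD, size=particles.shape)
    new_weights = weights * obs_likelihood(observation, new_particles)
    return new_particles, new_weights / new_weights.sum()

particles = rng.normal(0.0, 1.0, size=500)        # prior belief over the state
weights = np.full(500, 1.0 / 500)
particles, weights = update_belief(particles, weights, action=0.5, observation=1.2)
print("posterior mean estimate:", np.sum(weights * particles))
```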
When the pandemic hit, the research team began thinking about the resilience of cyber-physical systems (CPS) to attack. While there has been a great deal of Science of Security work on how to keep cyber systems operating through an attack, there has been almost no work on what it takes to restart a shut-down societal system; yet a large part of resilience is the ability to restart. Although the researchers do not yet have results on the central resilience questions they are working toward, the highlight thus far has been an understanding of how infection models for disease spread (SEIR models: Susceptible, Exposed, Infected, Recovered), social distancing, partial shutdowns, and contact tracing link to models of economic activity. The team is formulating models of optimal decision-making for reopening sectors of the economy while keeping disease spread within acceptable bounds. The methods being developed involve a complex mixture of AI/ML applied to large, publicly available data sets and model parameter estimation, and the techniques will have widespread applicability to other classes of networks under cyber (or natural) attack.
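As a rough illustration of the modeling ingredients mentioned above (a textbook discrete-time SEIR recursion with an illustrative "openness" factor standing in for social distancing and partial shutdowns; the parameter values are placeholders, not the team's fitted estimates):

```python
# Illustrative discrete-time SEIR model in which the transmission rate is
# scaled by an "openness" factor representing social distancing and partial
# shutdowns. Parameter values are placeholders, not fitted estimates.

def seir_step(S, E, I, R, beta, sigma, gamma, openness, dt=1.0):
    N = S + E + I + R
    new_exposed   = openness * beta * S * I / N * dt   # new infections
    new_infected  = sigma * E * dt                     # incubation completed
    new_recovered = gamma * I * dt                     # recoveries
    return (S - new_exposed,
            E + new_exposed - new_infected,
            I + new_infected - new_recovered,
            R + new_recovered)

S, E, I, R = 999_000.0, 500.0, 500.0, 0.0
beta, sigma, gamma = 0.4, 1 / 5.2, 1 / 10   # contact, incubation, recovery rates
for day in range(180):
    openness = 1.0 if day < 30 else 0.5      # partial shutdown after day 30
    S, E, I, R = seir_step(S, E, I, R, beta, sigma, gamma, openness)
print(f"infected after 180 days: {I:,.0f}, recovered: {R:,.0f}")
```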
The PI and Co-PI developed a new course in systems theory at Berkeley for upper-level undergraduates and first- and second-year graduate students, on a rapprochement between control theory and reinforcement learning. The course presented a modern viewpoint on modeling, analysis, and control design, leveraging tools and successes from both systems and control theory and machine learning, and was notable for the rich work it featured in multi-agent systems.
More information on this project can be found here.