Continuously Generating Models of Cloud Systems with Applications in Privacy
ABSTRACT
Soundly assessing the security and privacy of a complex system requires understanding the flow of data between its components. In an ideal world, every system would be accompanied by an expressive and correct model that enables static reasoning and verification of security and privacy properties at multiple levels of abstraction. In reality, complex systems lack up to date models (or lack models entirely) at even one level of abstraction. We describe an approach to automatically and continuously generate models of cloud systems from ground truth. We present key observations of practical system design and implementation that imply that code analysis alone is insufficient to understand the behavior of a system. We make the case that the behavior of a system cannot be fully characterized without considering a concrete configuration and provide an in-depth exploration of the challenges of per-component configuration analysis and system-level configuration analysis. Component-level configuration can be used to control the connections between components at run-time and to enable or disable features in different deployment contexts (e.g., geographic regions). Additionally, configuration parameters can create implicit data flows that cannot be resolved through code analysis alone. For example, consider the ubiquitous Log4j library, which supports both capturing additional values from the environment in log messages and connecting log destinations that are never mentioned in source code. As an additional complication, many common libraries are able to consume and synthesize multiple configurations including both local disks and network sources. System-level configuration is any configuration affecting the semantics of a system but that is not associated with a single component. Examples include network configuration, operating system configuration, and run-time orchestration facilities. Infrastructure as Code (IaC) attempts to systematize the provisioning and setup of infrastructure to make systems reproducible and consistent. In principle, analyzing IaC should reveal all of the connections between components. However, in practice, analyzing IaC alone can be incomplete. For example, IaC may specify the precise hash of a container, but it typically will not precisely represent all of the processes running in that container; a log monitor running in such a container could add an implicit connection to a log aggregator service. We present a parameterized analysis pipeline that is able to continuously re-generate models of modern cloud system behavior by analyzing a combination of code, configuration, and infrastructure. As a worked example, we instantiate this framework with a precise reachability analysis to track the flow of private data through systems to give service owners a complete picture of how they handle potentially private data. The data mappings produced by this analysis can enable both assessments of privacy regulatory compliance and proactive mechanized privacy policy enforcement. While we will discuss our results in terms of privacy analysis, we will draw connections to safety and security analysis, as the framework is general.
BIO
Tristan Ravitch is a Senior Applied Scientist in AWS Privacy Engineering. He has been building program analysis and verification tools for 15 years with a focus on improving confidence in legacy systems. He has worked on binary analysis, static dataflow analysis, bounded model checking, and architectural recovery applied to systems ranging from microcontrollers to distributed cloud systems. While his passion is proving systems correct, Tristan has recently been working on automatically generating high-fidelity models to improve the connection between code and system models.