Measurements to Improve AI/ML Training Data Sets


The adequacy of training data is critical for a wide range of artificial intelligence (AI) and machine learning (ML) methods. Current approaches to assessing that adequacy are insufficient and can result in unreliable performance, bias, and problems in transferring models to new or changing environments. Concepts from combinatorial testing can add confidence by providing a quantitative measure of the thoroughness of a data set. 

Conventional methods of dealing with this problem include randomizing data selection and using very large volumes of training data to improve the chances of including relevant inputs. But how much random data is sufficient, especially in domains where only a limited volume of data is available? Another strategy is to ensure inclusion of at least one of each object type in the training data, but this may not adequately cover the range and distribution of object attributes (such as color or reflectivity), or the combinations of attributes across object types encountered in practice. As real-world examples have shown, the inability of an ML system to respond correctly to unanticipated inputs can lead to unacceptable outcomes. It is therefore essential to consider potential interactions between inputs, where inputs refer to the attribute values used within the training and testing data. 

Why not measure input space coverage directly? Concepts developed in the field of combinatorial testing for software systems can be applied to measure how thoroughly combinations of inputs have been included in the training and testing of machine learning models [1]. This poster will present the definition and background of a number of measures that are useful for this task and illustrate their utility with real-world examples. 
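To make the basic idea concrete, the sketch below computes a simple t-way combinatorial coverage value over records with discretized attributes: the fraction of all possible t-way attribute-value combinations that appear at least once in the data. This is a minimal illustration, not the measurement tooling described in [1]; the function name, attributes, and toy values are assumptions made for the example.

```python
# Minimal sketch of t-way combinatorial coverage over discretized attributes.
# The function and the toy attribute values are illustrative assumptions,
# not the measurement tools described in [1].
from itertools import combinations, product

def t_way_coverage(records, domains, t=2):
    """Fraction of possible t-way attribute-value combinations present in records.

    records: list of dicts mapping attribute name -> discrete value
    domains: dict mapping attribute name -> set of possible values
    """
    attrs = sorted(domains)
    covered = 0
    total = 0
    for attr_combo in combinations(attrs, t):
        possible = set(product(*(domains[a] for a in attr_combo)))
        observed = {tuple(r[a] for a in attr_combo) for r in records}
        covered += len(possible & observed)
        total += len(possible)
    return covered / total if total else 1.0

# Toy example: attributes of objects appearing in training images.
domains = {"color": {"red", "white", "gray"}, "reflectivity": {"low", "high"}}
records = [
    {"color": "red", "reflectivity": "low"},
    {"color": "white", "reflectivity": "high"},
]
print(t_way_coverage(records, domains, t=2))  # 2 of 6 possible pairs covered: ~0.33
```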

For one illustration of how combinatorial coverage concepts can be developed and applied as useful measures, we show how they may be used for the problem of transfer learning [2]. In machine learning, differences are almost certain to exist between training data sets, test data sets, and later real-world data. Because the time and data available for training are finite, the training data may not include some input combinations that will eventually show up in use. Transfer learning concerns predicting the performance of a model trained on one data set when it is applied to another, such as a new or changed environment, new ranges or values for the data, or other changes. 

To illustrate the deficiencies of conventional approaches to transfer learning, and to show how combinatorial methods produce improvements, consider an image recognition problem: determining whether a satellite image contains an airplane, using two data sets, one with 21,151 images and one with 10,849 images. If a model is trained on the first set and applied to the second, a surprising result occurs: the model trained on the larger data set loses accuracy when applied to the smaller data set. Conversely, the model trained on the smaller data set shows no performance drop when applied to the larger data set. This result seems backwards because larger data sets are generally expected to be more useful as training data. 

Using a combinatorial coverage measure called set difference combinatorial coverage (SDCC), we can understand this counterintuitive result and see that it could have been predicted from the SDCC measure. Combinatorial coverage provides a way to measure the strength of the training data. Evaluating this coverage in the training phase can help identify where even large volumes of data are inadequate. Evaluating it during transfer learning can help identify when a model may not transfer well and which samples should be added for retraining to cover the gap. A number of these measures are now being studied to ensure that even rare combinations or sequences of inputs produce correct and safe responses from autonomous systems. 
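The sketch below shows one plausible way a set-difference measure in the spirit of SDCC could be computed: the fraction of t-way attribute-value combinations that occur in a target data set but never in the source (training) data. It is a simplified illustration under assumed discretized attributes, not the SDCC implementation used in the study; the function names, attributes, and toy values are hypothetical.

```python
# Minimal sketch of a set-difference coverage check between a source (training)
# data set and a target data set, in the spirit of SDCC. Names and toy values
# are illustrative assumptions, not the authors' implementation.
from itertools import combinations

def t_way_combos(records, attrs, t=2):
    """All t-way (attribute-tuple, value-tuple) combinations appearing in records."""
    combos = set()
    for attr_combo in combinations(attrs, t):
        for r in records:
            combos.add((attr_combo, tuple(r[a] for a in attr_combo)))
    return combos

def set_difference_coverage(source, target, attrs, t=2):
    """Fraction of t-way combinations in target that never appear in source.

    A high value flags interactions the training (source) data never covered,
    indicating a transfer risk before the model is applied to the target data.
    """
    src = t_way_combos(source, attrs, t)
    tgt = t_way_combos(target, attrs, t)
    return len(tgt - src) / len(tgt) if tgt else 0.0

# Toy use: attributes derived from image metadata or labels.
attrs = ["object", "background", "lighting"]
train = [{"object": "plane", "background": "runway", "lighting": "day"}]
deploy = [{"object": "plane", "background": "runway", "lighting": "night"}]
print(set_difference_coverage(train, deploy, attrs, t=2))  # 2 of 3 combinations uncovered: ~0.67
```

In this toy case, a nonzero set difference indicates that the deployment data exercises attribute interactions absent from the training data, which is the kind of gap that can explain the accuracy drop in the satellite imagery example and can guide the selection of additional retraining samples.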

1. D. R. Kuhn, I. Dominguez Mendoza, R. N. Kacker, and Y. Lei, “Combinatorial Coverage Measurement Concepts and Applications,” in Proc. 2013 IEEE Sixth Intl. Conf. on Software Testing, Verification and Validation Workshops, 2013, pp. 352-361. 

2. E. Lanus, L. J. Freeman, D. R. Kuhn, and R. N. Kacker, “Combinatorial Testing Metrics for Machine Learning,” in Proc. 2021 IEEE Intl. Conf. on Software Testing, Verification and Validation Workshops, April 2021. 

This abstract summarizes the article “Assured Autonomy through Combinatorial Methods”, IEEE Computer, May 2024.


Rick Kuhn is a Computer Scientist at the National Institute of Standards and Technology and is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and the Washington Academy of Sciences. His research focuses on cybersecurity, empirical studies of software failure, and software verification and testing. He received an MS in computer science from the University of Maryland, College Park. Contact him at d.kuhn@nist.gov 

 


M S Raunak is a Computer Scientist at the National Institute of Standards and Technology, Gaithersburg, USA. His research interests include verification and validation of ‘difficult-to-test’ systems such as complex simulation models, cryptographic implementations, and machine learning algorithms. He is currently focused on developing metrics, tools, and techniques for developing explainable and trustworthy artificial intelligence and machine learning systems. Raunak received a Ph.D. in Computer Science from the University of Massachusetts Amherst. He is a member of the IEEE. Contact him at ms.raunak@nist.gov 


Raghu N. Kacker is a Scientist at the National Institute of Standards and Technology (NIST), Gaithersburg, MD 20899, USA. His research interests include testing of software-based systems for trust and security, and the mathematics of measurement. He received his Ph.D. from Iowa State University. He is a Fellow of the American Statistical Association, a Fellow of the American Society for Quality, and a Fellow of the Washington Academy of Sciences. Contact him at raghu.kacker@nist.gov


Jaganmohan Chandrasekaran is a postdoctoral associate at the Virginia Tech National Security Institute. His research is at the intersection of Software Engineering and AI, focusing on the reliability and trustworthiness of AI-enabled software systems. He received his Ph.D. in Computer Science from the University of Texas at Arlington. Contact him at jagan@vt.edu 


Erin Lanus is a Research Assistant Professor at the Virginia Tech National Security Institute and Affiliate Faculty in Computer Science at Virginia Tech. Her research in AI assurance leverages combinatorial interaction testing as well as metric and algorithm development for test and evaluation of AI/ML, with a focus on dataset quality. Lanus received her Ph.D. in Computer Science from Arizona State University. Contact her at lanus@vt.edu 


Tyler Cody is a Research Assistant Professor at the Virginia Tech National Security Institute. His research interest is in developing principles and best practices for the systems engineering of machine learning and artificial intelligence. He received his Ph.D. in Systems Engineering from the University of Virginia for his work on a systems theory of transfer learning. Contact him at tcody@vt.edu 


Laura Freeman is a Research Professor of Statistics and is dual-hatted as the Deputy Director of the Virginia Tech National Security Institute and Assistant Dean for Research for the College of Science. Her research leverages experimental methods to bring together cyber-physical systems, data science, artificial intelligence (AI), and machine learning to address critical challenges in national security. Freeman received her Ph.D. in Statistics from Virginia Tech. Contact her at laura.freeman@vt.edu 
 

License: CC-3.0