Metadata Discovery Problem - In order to enable a collaborative Model-based Systems Engineering (MBSE) environment through computer systems, it is essential to enable interoperability between tools and the reuse of previous engineering designs, saving cost and time. In this context, understanding the underlying concepts and relationships embedded in system artifacts becomes a cornerstone for properly exploiting engineering artifacts. MBSE tool-chains and suites, such as Matlab Simulink, can be applied to different engineering activities: architecture design (descriptive modeling), simulation (analytical modeling) or verification. Reuse capabilities in specific engineering tools are a non-functional aspect that is usually addressed by providing search capabilities based on artifact metadata. In this work, we aim to ease the reuse of the knowledge embedded in Simulink models through a solution called PhysicalModel2Simulink. The proposed approach makes use of an ontology for representing, indexing and retrieving information following a meta-model (mainly to semantically represent concepts and relationships). Under this schema, both meta-data and contents are represented using a common domain vocabulary and taxonomy, creating a property graph that can be exploited for system artifact discovery. To do so, a mapping between the Matlab Simulink meta-model and the RSHP (RelationSHiP) meta-model is defined to represent and serialize physical models in a repository. Then, a retrieval process is implemented on top of this repository to allow users to perform text-based queries and look up similar artifacts. To validate the proposed solution, 38 Simulink models have been used and 20 real user queries have been designed to study the effectiveness, in terms of precision and recall, of the proposed solution against the Matlab Simulink search capabilities.
Authored by Eduardo Cibrian, Roy Mendieta, Jose Alvarez-Rodriguez, Juan Llorens
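The entry above evaluates retrieval effectiveness in terms of precision and recall over a set of user queries. The following minimal sketch shows how such per-query scores could be computed against a relevance-judged result set; the function and model identifiers are illustrative, not taken from PhysicalModel2Simulink.

```python
# Minimal sketch of per-query precision/recall scoring, as used to compare
# a retrieval tool against a baseline. Names are illustrative only.

def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: model IDs returned for the query "hydraulic actuator".
retrieved_models = ["model_03", "model_11", "model_27"]
relevant_models = ["model_03", "model_27", "model_31"]

p, r = precision_recall(retrieved_models, relevant_models)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.67 recall=0.67
```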
Metadata Discovery Problem - Collaborative software development platforms like GitHub have gained tremendous popularity. Unfortunately, many users have reportedly leaked authentication secrets (e.g., textual passwords and API keys) in public Git repositories and caused security incidents and finical loss. Recently, several tools were built to investigate the secret leakage in GitHub. However, these tools could only discover and scan a limited portion of files in GitHub due to platform API restrictions and bandwidth limitations. In this paper, we present SecretHunter, a real-time large-scale comprehensive secret scanner for GitHub. SecretHunter resolves the file discovery and retrieval difficulty via two major improvements to the Git cloning process. Firstly, our system will retrieve file metadata from repositories before cloning file contents. The early metadata access can help identify newly committed files and enable many bandwidth optimizations such as filename filtering and object deduplication. Secondly, SecretHunter adopts a reinforcement learning model to analyze file contents being downloaded and infer whether the file is sensitive. If not, the download process can be aborted to conserve bandwidth. We conduct a one-month empirical study to evaluate SecretHunter. Our results show that SecretHunter discovers 57\% more leaked secrets than state-of-the-art tools. SecretHunter also reduces 85\% bandwidth consumption in the object retrieval process and can be used in low-bandwidth settings (e.g., 4G connections).
Authored by Elliott Wen, Jia Wang, Jens Dietrich
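The entry above describes retrieving file metadata before file contents during cloning. One standard way to get this metadata-first behavior with stock Git is a blobless partial clone, sketched below; this only illustrates the general idea (filename filtering before blob download) and is not SecretHunter's actual implementation. The repository URL and filename list are hypothetical.

```python
# Sketch of "metadata before content" retrieval with Git partial clone.
import subprocess

REPO_URL = "https://github.com/example/example-repo.git"  # hypothetical
CANDIDATE_NAMES = (".env", "credentials.json", "id_rsa")

# 1. Blobless clone: commits and trees (file metadata) only, no file contents.
subprocess.run(
    ["git", "clone", "--filter=blob:none", "--no-checkout", REPO_URL, "repo"],
    check=True,
)

# 2. List paths and blob hashes from tree metadata alone.
ls_tree = subprocess.run(
    ["git", "-C", "repo", "ls-tree", "-r", "HEAD"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

# 3. Fetch only blobs whose filenames look sensitive.
for line in ls_tree:
    meta, path = line.split("\t", 1)
    blob_sha = meta.split()[2]
    if path.endswith(CANDIDATE_NAMES):
        # cat-file triggers a lazy fetch of just this blob from the remote.
        content = subprocess.run(
            ["git", "-C", "repo", "cat-file", "-p", blob_sha],
            capture_output=True, text=True, check=True,
        ).stdout
        print(f"scan {path}: {len(content)} bytes")
```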
Metadata Discovery Problem - Millions of connected devices, such as connected cameras streaming video, are introduced into smart cities every year and are a valuable source of information. However, this rich source of information is mostly left untapped. Thus, in this paper, we propose distributed deep neural networks (DNNs) over edge visual Internet of Things (VIoT) devices for parallel, real-time video scene parsing and indexing, in conjunction with BigQuery retrieval on stored data in the cloud. The IoT video streams are parsed into adaptive meta-data of persons, attributes, actions, objects, and relations using pre-trained DNNs. The meta-data is cached at the edge-cloud for real-time analytics and also continuously transferred to the cloud for data fusion and BigQuery batch processing. The proposed distributed deep learning search platform bridges the gap across the edge-to-cloud computation continuum by utilizing state-of-the-art distributed deep learning and BigQuery search algorithms for the geo-distributed Visual Internet of Things (VIoT). We show that our proposed system supports real-time event-driven computing at 122 milliseconds on virtual IoT devices in parallel, and a batch query response time as low as 2.4 seconds on multi-table JOIN and GROUP-BY aggregations.
Authored by Arun Das, Mehdi Roopaei, Mo Jamshidi, Peyman Najafirad
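The entry above describes batch BigQuery processing over parsed video metadata, including multi-table JOIN and GROUP-BY aggregation. The sketch below shows what such a batch query could look like using the google-cloud-bigquery client; the dataset, table, and column names are hypothetical, not taken from the paper.

```python
# Sketch of a batch BigQuery multi-table JOIN and GROUP-BY over parsed
# video metadata. Dataset/table/column names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default GCP credentials are configured

sql = """
SELECT p.camera_id,
       a.action,
       COUNT(*) AS occurrences
FROM `viot_demo.person_detections` AS p
JOIN `viot_demo.action_detections` AS a
  ON p.frame_id = a.frame_id AND p.track_id = a.track_id
GROUP BY p.camera_id, a.action
ORDER BY occurrences DESC
"""

for row in client.query(sql).result():
    print(row.camera_id, row.action, row.occurrences)
```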
Metadata Discovery Problem - To conduct a well-designed and reproducible study, researchers must define and adhere to clear inclusion and exclusion criteria for subjects. Similarly, a well-run journal or conference should publish easily understood inclusion and exclusion criteria that determine which submissions will receive more detailed peer review. This will empower authors to identify the conferences and journals that are the best fit for their manuscripts while allowing organizers and peer reviewers to spend more time on the submissions that are of greatest interest. To provide a more systematic way of representing these criteria, we extend the syntax for concept-validating constraints of the Nexus-PORTAL-DOORS-Scribe cyberinfrastructure, which already serve as criteria for inclusion of records in a repository, to allow description of exclusion criteria.
Authored by Adam Craig, Carl Taswell
Metadata Discovery Problem - We present a methodology for constructing a spatial ontology-based dataset navigation model to allow cross-reference navigation between datasets. We defined the structure of a dataset as its metadata, its field names, and its actual values. We defined the relationships between datasets across three layers: a metadata layer, a field-names layer, and a data-value layer. The relationships in the metadata layer were defined as correspondences between metadata values. We standardized the field names in the datasets to discover relationships between field names. We designed a method to discover relationships between data values based on common-knowledge datasets for each domain. To confirm the validity of the presented methodology, we applied it to implement an ontology-based knowledge navigation model for actual disaster-related processes in operation. We built a knowledge navigation model based on spatial common knowledge.
Authored by Yun-Young Hwang, Sumi Shin
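The entry above organizes inter-dataset relationships into three layers (metadata, field names, data values). The sketch below shows one simple way such layered links could be represented and navigated; the class, field, and dataset names are illustrative, not from the paper.

```python
# Sketch of a 3-layer dataset relationship structure
# (metadata layer, field-name layer, data-value layer).
from dataclasses import dataclass

@dataclass(frozen=True)
class DatasetLink:
    layer: str        # "metadata" | "field_name" | "data_value"
    source: str       # id of the source dataset
    target: str       # id of the target dataset
    source_item: str  # metadata value, field name, or data value
    target_item: str

links = [
    DatasetLink("metadata", "flood_reports", "shelter_locations",
                "Seoul", "Seoul"),
    DatasetLink("field_name", "flood_reports", "rainfall_records",
                "district", "district"),
    DatasetLink("data_value", "shelter_locations", "road_network",
                "Gangnam-gu", "Gangnam-gu"),
]

# Cross-reference navigation: all datasets reachable from "flood_reports".
reachable = {link.target for link in links if link.source == "flood_reports"}
print(reachable)
```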
Metadata Discovery Problem - We defined and expressed graph-based relationships between pieces of knowledge to allow cross-reference navigation of the knowledge as an ontology. We present a methodology for constructing an ontology-based knowledge navigation model to allow cross-reference navigation between pieces of knowledge, related concepts and datasets. We defined the structure of a dataset as its metadata, the field names of the actual values, and the actual values. We defined the relationships between datasets across three layers: a metadata layer, a field-names layer, and a data-value layer. The relationships in the metadata layer were defined as correspondences between metadata values. We standardized the field names in the datasets to discover relationships between field names. We designed a method to discover relationships between data values based on common knowledge for each domain. To confirm the validity of the presented methodology, we applied it to implement an ontology-based knowledge navigation model for actual disaster-related processes in operation. We built a knowledge navigation model based on spatial common knowledge to confirm that the configuration of the knowledge navigation model was correct.
Authored by Yun-Young Hwang, Jiseong Son, Sumi Shin
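Both entries above rely on standardizing field names to discover correspondences in the field-name layer. The sketch below illustrates one plausible standardization-and-matching step; the synonym table and dataset schemas are made up for illustration.

```python
# Sketch of field-name standardization for discovering correspondences
# between dataset schemas. Synonym table and schemas are hypothetical.
SYNONYMS = {"addr": "address", "tel": "phone", "lat": "latitude",
            "lon": "longitude", "lng": "longitude"}

def standardize(field: str) -> str:
    key = field.strip().lower().replace("-", "_")
    return SYNONYMS.get(key, key)

schema_a = ["Addr", "Tel", "Lat", "Lng"]
schema_b = ["address", "phone", "latitude", "longitude", "capacity"]

# Field pairs that would be linked in the field-name layer.
matches = [(a, b) for a in schema_a for b in schema_b
           if standardize(a) == standardize(b)]
print(matches)
```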
Metadata Discovery Problem - The OPC UA (Open Platform Communications Unified Architecture) technology is found in many industrial applications as it addresses many of Industry 4.0’s requirements. One of its appeals is its service-oriented architecture. Nonetheless, it requires engineering effort during deployment and maintenance to bind or associate the correct services to a client or consumer system. We propose the integration of OPC UA with the Eclipse Arrowhead Framework (EAF) to enable automatic service discovery and binding at runtime, reducing delays, costs, and errors. The integration also enables the client system to obtain service endpoints by querying the service attributes or metadata. Moreover, this forms a bridge to other industrial communication technologies, such as Modbus TCP (Transmission Control Protocol), as the framework is not limited to a specific protocol. To demonstrate the idea, an indexed line with an industrial PLC (programmable logic controller) running an OPC UA server is used to show that the desired service endpoints are revealed at runtime when querying their descriptive attributes or metadata through the EAF’s Orchestrator system.
Authored by Aparajita Tripathy, Jan Van Deventer, Cristina Paniagua, Jerker Delsing
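The entry above describes discovering service endpoints by querying metadata through the EAF Orchestrator. The sketch below shows the general shape of such a request over HTTP; the endpoint path, JSON field names, and metadata values are assumptions modeled loosely on Eclipse Arrowhead 4.x style deployments and may differ from the framework's actual API, so treat this as an illustration only.

```python
# Sketch of querying an Arrowhead-style Orchestrator for a service by
# metadata. Endpoint path, field names, and values are assumptions.
import requests

ORCHESTRATOR_URL = "https://orchestrator.local:8441/orchestrator/orchestration"

service_request_form = {
    "requesterSystem": {
        "systemName": "consumer-demo",
        "address": "192.168.0.10",
        "port": 8080,
    },
    "requestedService": {
        "serviceDefinitionRequirement": "conveyor-position",  # hypothetical
        "metadataRequirements": {"protocol": "opc-ua", "line": "indexed-line-1"},
    },
    "orchestrationFlags": {"overrideStore": True, "metadataSearch": True},
}

response = requests.post(ORCHESTRATOR_URL, json=service_request_form,
                         verify=False, timeout=10)
for match in response.json().get("response", []):
    provider = match["provider"]
    print(provider["address"], provider["port"], match.get("metadata"))
```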
Metadata Discovery Problem - Researchers seeking to apply computational methods are increasingly turning to scientific digital archives containing images of specimens. Unfortunately, metadata errors can inhibit the discovery and use of scientific archival images. One such case is the NSF-sponsored Biology Guided Neural Network (BGNN) project, where an abundance of metadata errors has significantly delayed the development of a proposed, new class of neural networks. This paper reports on research addressing this challenge. We present a prototype workflow for specimen scientific name metadata verification that is grounded in Computational Archival Science (CAS), and report on a taxonomy of specimen name metadata error types with preliminary solutions. Our 3-phased workflow includes tag extraction, text processing, and interactive assessment. A baseline test with the prototype workflow identified at least 15 scientific name metadata errors out of 857 manually reviewed, potentially erroneous specimen images, corresponding to a ∼0.2\% error rate for the full image dataset. The prototype workflow minimizes the amount of time domain experts need to spend reviewing archive metadata for correctness and AI-readiness before these archival images can be utilized in downstream analysis.
Authored by Joel Pepper, Andrew Senin, Dom Jebbia, David Breen, Jane Greenberg
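The entry above describes a workflow of tag extraction, text processing, and interactive assessment for scientific name verification. The sketch below illustrates one plausible text-processing step: fuzzy matching an extracted name against a reference taxonomy. The taxon list and tag values are hypothetical; the BGNN workflow's actual processing is more involved.

```python
# Sketch of checking an extracted scientific-name tag against a reference
# taxonomy with fuzzy matching. Reference list and inputs are hypothetical.
import difflib

REFERENCE_TAXA = ["Notropis atherinoides", "Notropis hudsonius",
                  "Lepomis macrochirus", "Perca flavescens"]

def verify_name(extracted: str, cutoff: float = 0.85):
    """Return (is_valid, suggestion) for an extracted scientific name."""
    if extracted in REFERENCE_TAXA:
        return True, extracted
    candidates = difflib.get_close_matches(extracted, REFERENCE_TAXA,
                                           n=1, cutoff=cutoff)
    return False, candidates[0] if candidates else None

print(verify_name("Notropis atherinoide"))  # (False, 'Notropis atherinoides')
```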
Metadata Discovery Problem - Semantic segmentation is one of the key research areas in computer vision, with important applications in areas such as autonomous driving and medical image diagnosis. In recent years, the technology has advanced rapidly, and current models achieve high accuracy and efficient speed on some widely used datasets. However, the semantic segmentation task still suffers from an inability to generate accurate boundaries when feature information is insufficient. Especially in the field of medical image segmentation, most medical image datasets have class imbalance issues, and there are always variations in factors such as shape and color between different datasets and cell types. Therefore, it is difficult to establish algorithms that generalize across different classes and remain robust across different datasets. In this paper, we propose a conditional data preprocessing strategy, the Conditional Metadata Embedding (CME) data preprocessing strategy. The CME method embeds conditional information into the training data, which can assist the model in overcoming the differences between datasets and extracting useful feature information from the images. The experimental results show that the CME data preprocessing method helps different models achieve higher segmentation performance on different datasets, which demonstrates the high practicality and robustness of the method.
Authored by Juntuo Wang, Qiaochu Zhao, Dongheng Lin, Erick Purwanto, Ka Man
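The entry above describes embedding conditional metadata into training data so that models can account for differences between datasets. The sketch below shows one plausible reading of that idea: each sample gets an extra input channel derived from a learned embedding of its dataset/condition id. This is an illustrative interpretation, not the paper's exact CME procedure.

```python
# Sketch of conditioning training inputs on dataset-level metadata by
# appending a learned condition channel. Illustrative only.
import torch
import torch.nn as nn

class ConditionChannel(nn.Module):
    def __init__(self, num_conditions: int):
        super().__init__()
        self.embed = nn.Embedding(num_conditions, 1)  # one scalar per condition

    def forward(self, images: torch.Tensor, condition_ids: torch.Tensor):
        # images: (N, C, H, W), condition_ids: (N,)
        n, _, h, w = images.shape
        cond = self.embed(condition_ids).view(n, 1, 1, 1).expand(n, 1, h, w)
        return torch.cat([images, cond], dim=1)  # (N, C+1, H, W)

# Hypothetical usage: a batch of RGB images drawn from two source datasets.
module = ConditionChannel(num_conditions=2)
batch = torch.randn(4, 3, 256, 256)
ids = torch.tensor([0, 0, 1, 1])
print(module(batch, ids).shape)  # torch.Size([4, 4, 256, 256])
```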
Metadata Discovery Problem - Open Educational Resources (OER) are educational materials that are available in different repositories such as Merlot, SkillsCommons, MIT OpenCourseWare, etc. The quality of metadata facilitates the search and discovery of educational resources. This work evaluates the metadata quality of 4142 OER from SkillsCommons. We applied supervised machine learning algorithms (Support Vector Machine and Random Forest Classifier) for the automatic classification of two metadata fields: description and material type. Based on our data and models, the performance of a first classification effort is reported, with an accuracy of 70\%.
Authored by Veronica Segarra-Faggioni, Audrey Romero-Pelaez
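The entry above classifies OER metadata with a Support Vector Machine and a Random Forest. The sketch below shows the general shape of such a pipeline with scikit-learn; the toy records and labels are made up, whereas the paper worked with 4142 SkillsCommons OER.

```python
# Sketch of supervised classification of OER metadata: TF-IDF features over
# descriptions, with SVM and Random Forest classifiers. Toy data only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

descriptions = [
    "Slide deck introducing basic welding safety procedures",
    "Interactive quiz on introductory accounting terms",
    "Video lecture covering CNC machine setup",
    "Workbook of practice problems for medical billing codes",
]
material_types = ["presentation", "assessment", "video", "workbook"]

for clf in (SVC(kernel="linear"), RandomForestClassifier(n_estimators=100)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(descriptions, material_types)
    print(type(clf).__name__, model.predict(["Quiz about welding terminology"]))
```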