Multicore Computing Security 2015

 

 

As Big Data and Cloud applications have grown, the size and capacity of hardware support have grown as well. Multicore and many-core systems have security problems related to the Science of Security issues of resiliency, composability, and measurement. The research work cited here was presented in 2015.




F. Dupros, F. Boulahya, H. Aochi and P. Thierry, “Communication-Avoiding Seismic Numerical Kernels on Multicore Processors,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 330-335. doi: 10.1109/HPCC-CSS-ICESS.2015.230

Abstract: The finite-difference method is routinely used to simulate seismic wave propagation both in the oil and gas industry and in strong motion analysis in seismology. This numerical method also lies at the heart of a significant fraction of numerical solvers in other fields. In terms of computational efficiency, one of the main difficulties is to deal with the disadvantageous ratio between the limited pointwise computation and the intensive memory access required, leading to a memory-bound situation. Naive sequential implementations offer poor cache reuse and in general achieve a low fraction of the peak performance of the processors. The situation is worse on multicore computing nodes with several levels of memory hierarchy, where each cache miss corresponds to a costly memory access. Additionally, the memory bandwidth available on multicore chips improves slowly relative to the number of computing cores, which induces a dramatic reduction of the expected parallel performance. In this article, we introduce a cache-efficient algorithm for stencil-based computations using a decomposition along both the space and time directions. We report a maximum speedup of 3.59x over the standard implementation.

Keywords: cache storage; finite difference methods; gas industry; geophysics computing; multiprocessing systems; petroleum industry; seismic waves; seismology; wave propagation; Naive sequential implementations; cache-efficient algorithm; cache-reuse; communication-avoiding seismic numerical kernel; computational efficiency; finite-difference method; gas industry; memory bandwidth; memory hierarchy; multicore chips; multicore computing nodes; multicore processors; numerical method; numerical solvers; oil industry; peak performance; pointwise computation; seismic wave propagation simulation; seismology; stencil-based computations; strong motion analysis; Memory management; Multicore processing; Optimization; Program processors; Seismic waves; Standards; communication-avoiding; multicore; seismic (ID#: 16-9931)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336184&isnumber=7336120
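
For readers unfamiliar with the cache-blocking idea this abstract builds on, the following minimal sketch (not the authors' communication-avoiding algorithm, which also tiles along the time dimension) shows spatial blocking of a 2D 5-point stencil with OpenMP; the grid size N and block size B are illustrative placeholders.

// Minimal sketch: spatial cache blocking for a 2D 5-point stencil.
// Illustrates the general blocking idea only; the paper's algorithm
// additionally decomposes along the time dimension.
#include <algorithm>
#include <vector>
#include <cstdio>

int main() {
    const int N = 1024;      // grid size (illustrative)
    const int B = 64;        // block edge, sized so a tile fits in cache
    std::vector<double> u(N * N, 1.0), v(N * N, 0.0);

    #pragma omp parallel for collapse(2) schedule(static)
    for (int bi = 1; bi < N - 1; bi += B)
        for (int bj = 1; bj < N - 1; bj += B)
            for (int i = bi; i < std::min(bi + B, N - 1); ++i)
                for (int j = bj; j < std::min(bj + B, N - 1); ++j)
                    v[i * N + j] = 0.25 * (u[(i - 1) * N + j] + u[(i + 1) * N + j] +
                                           u[i * N + j - 1] + u[i * N + j + 1]);

    std::printf("v[center] = %f\n", v[(N / 2) * N + N / 2]);
    return 0;
}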

 

X. Lin et al., “Realistic Task Parallelization of the H.264 Decoding Algorithm for Multiprocessors,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 871-874. doi: 10.1109/HPCC-CSS-ICESS.2015.33

Abstract: Hardware technology has developed ahead of software technology in recent years. Companies lack software techniques that can fully utilize modern multi-core computing resources, mainly due to the difficulty of uncovering the inherent parallelism inside software. This problem exists in products ranging from energy-sensitive smartphones to performance-eager data centers. In this paper, we present a case study on the parallelization of the complex industry-standard H.264 HDTV decoder application on multi-core systems. An optimal schedule of the tasks is obtained and implemented by a carefully defined software parallelization framework (SPF). The parallel software framework is proposed together with a set of rules to direct parallel software programming (PSPR). A pre-processing phase based on the rules is applied to the source code to make the SPF applicable. The task-level parallel version of the H.264 decoder is implemented and tested extensively on a workstation running Linux. Significant performance improvement is observed for a set of benchmarks composed of 720p videos. The SPF and the PSPR will together serve as a reference for future parallel software implementations and direct the development of automated tools.

Keywords: Linux; high definition television; multiprocessing systems; parallel programming; source code (software); video coding; H.264 HDTV decoder application; H.264 decoding algorithm; PSPR; SPF; data centers; energy-sensitive smart phones; multicore computing resources; multiprocessors; optimal task schedule; parallel software implementations; parallel software programming; performance improvement; preprocessing phase; realistic task parallelization; software parallelization framework; source code; task-level parallel; workstation; Decoding; Industries; Parallel processing; Parallel programming; Software; Software algorithms; Videos (ID#: 16-9932)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336273&isnumber=7336120

 

M. Paulitsch, “Keynote,” 2015 27th Euromicro Conference on Real-Time Systems, Lund, 2015, pp. xiii-xiii. doi: 10.1109/ECRTS.2015.7

Abstract: Summary form only given. Mixed-criticality embedded systems are getting more attention due to savings in cost, weight, and power, fueled by the ever-increasing performance of processors. Introduced into practice more than two decades ago, e.g., in aerospace with the concept of time and space partitioning, optimization and different underlying hardware architectures like multicore processors continue to challenge system designers. This talk presents a mix of different aspects of mixed-criticality system architectures and designs and underlying approaches of the past, with excursions into real space, aerospace and railway systems. With the advent of multicore systems-on-chip and multicore processors, many of the original assumptions and solutions are challenged and sometimes invalidated, and new problems emerge that require special attention. We will walk through current and future challenges, look at point solutions, and discuss possible research needs. The interplay of safety, security, system design, performance optimization, scheduling aspects, and application needs and constraints, combined with modern computing architectures like multicore processors, provides a fertile ground for research and discussions in this field.

Keywords: computer architecture; embedded systems; scheduling; security of data; system-on-chip; computing architectures; hardware architectures; mixed-criticality embedded systems; mixed-criticality system architecture; multicore processors; multicore system-on-chip; performance optimization; safety; scheduling aspects; security; system design (ID#: 16-9933)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7176020&isnumber=7175989

 

A. Cilardo, J. Flich, M. Gagliardi and R. T. Gavila, “Customizable Heterogeneous Acceleration for Tomorrow's High-Performance Computing,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1181-1185. doi: 10.1109/HPCC-CSS-ICESS.2015.303

Abstract: High-performance computing as we know it today is experiencing unprecedented changes, encompassing all levels from technology to use cases. This paper explores the adoption of customizable, deeply heterogeneous manycore systems for future QoS-sensitive and power-efficient high-performance computing. At the heart of the proposed architecture is a NoC-based manycore system embracing medium-end CPUs, GPU-like processors, and reconfigurable hardware regions. The paper discusses the high-level design principles inspiring this innovative architecture as well as the key role that heterogeneous acceleration, ranging from multicore processors and GPUs down to FPGAs, might play for tomorrow's high-performance computing.

Keywords: field programmable gate arrays; graphics processing units; multiprocessing systems; network-on-chip; parallel processing; power aware computing; quality of service; FPGA; GPU-like processors; NoC-based many-core system; QoS-sensitive computing; customizable heterogeneous acceleration; heterogeneous acceleration; heterogeneous manycore systems; high-level design principles; high-performance computing; innovative architecture; medium-end CPU; multicore processors; power-efficient high-performance computing; reconfigurable hardware regions; Acceleration; Computer architecture; Field programmable gate arrays; Hardware; Program processors; Quality of service; Registers (ID#: 16-9934)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336329&isnumber=7336120

 

Q. Luo, F. Xiao, Y. Zhou and Z. Ming, “Performance Profiling of VMs on NUMA Multicore Platform by Inspecting the Uncore Data Flow,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 914-917. doi: 10.1109/HPCC-CSS-ICESS.2015.47

Abstract: Recently, the NUMA (Non-Uniform Memory Access) multicore platform has become more and more popular, providing hardware-level support for many hot fields such as cloud computing and big data, and deploying virtual machines on NUMA is a key technology. However, performance degradation in virtual machines is not negligible, because the guest OS has little or inaccurate knowledge about the underlying hardware. Our research focuses on performance profiling of VMs on the multicore platform by inspecting the uncore data flow, and we design a performance profiling tool called VMMprof based on PMU (Performance Monitoring Units). It supports the uncore part of the processor, which is a new function beyond the capabilities of existing tools. Experiments show that VMMprof can obtain typical factors which affect the performance of the processes and the whole system.

Keywords: data flow computing; memory architecture; multiprocessing systems; performance evaluation; virtual machines; NUMA multicore platform; PMU; VM performance profiling; VMMprof; hardware level support; nonuniform memory access; performance degradation; performance monitoring units; performance profiling tool; uncore data flow; uncore data flow inspection; virtual machines; Bandwidth; Hardware; Monitoring; Multicore processing; Phasor measurement units; Sockets; Virtual machining; NUMA; VMs; uncore (ID#: 16-9935)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336283&isnumber=7336120
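
VMMprof itself is not publicly described beyond this abstract; the sketch below only illustrates the underlying mechanism such tools build on, counting a hardware PMU event for the calling thread via the Linux perf_event_open interface. Uncore events of the kind VMMprof targets would require platform-specific raw event configurations, which are omitted here, and the workload is a placeholder.

// Minimal sketch: counting cache misses for the current thread with the
// Linux perf_event_open interface. Uncore monitoring would instead program
// platform-specific uncore PMU events (raw configs), not shown here.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <cstring>
#include <cstdint>
#include <cstdio>
#include <vector>

static long perf_event_open(perf_event_attr* attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags) {
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main() {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_CACHE_MISSES;   // last-level cache misses
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_event_open(&attr, 0 /*this process*/, -1 /*any cpu*/, -1, 0);
    if (fd < 0) { std::perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    // Placeholder workload: stride through a buffer to generate cache misses.
    std::vector<char> buf(64 * 1024 * 1024);
    for (size_t i = 0; i < buf.size(); i += 64) buf[i]++;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    std::printf("cache misses: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}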

 

S. H. VanderLeest and D. White, “MPSoC Hypervisor: The Safe & Secure Future of Avionics,” 2015 IEEE/AIAA 34th Digital Avionics Systems Conference (DASC), Prague, 2015, pp. 6B5-1–6B5-14. doi: 10.1109/DASC.2015.7311448

Abstract: Future avionics must provide increased performance and security while maintaining safety. The additional security capabilities now being required in commercial avionics equipment arise from integration and centralization of processing capabilities combined with passenger expectations for enhanced communications connectivity. Certification of airborne electronic hardware has long provided rigorous assurance of the safety of flight, but security of information is a more recent requirement for avionics processors and communications systems. In this paper, we explore promising options for future avionics equipment leveraging the latest embedded processing hardware and software technologies and techniques. The Xilinx Zynq® UltraScale+TM MultiProcessor System on Chip (MPSoC) provides one promising avionics solution from a hardware standpoint. The MPSoC provides a high performance heterogeneous multicore processing system and programmable logic in a single device with enhanced safety and security features. Combining this processor solution with a safe and secure software hypervisor solution unlocks many opportunities to address the next generation of airborne computing requirements while satisfying embedded multicore hardware and software certification objectives. In this paper we review the Zynq MPSoC and use of a software hypervisor to provide robust partitioning via virtualization. Partitioning is well established to support safety of flight in Integrated Modular Avionics (IMA) while maintaining reasonable performance. Security is a more recent concern, gaining attention as a vulnerability that can also affect safety in unanticipated ways. Hypervisor-based partitioning provides strong isolation that can reduce covert side channels of information exchange and support Multiple Independent Levels of Security (MILS).

Keywords: aerospace computing; air safety; avionics; certification; multiprocessing systems; security of data; software engineering; system-on-chip; virtualisation; IMA; MILS; MPSoC hypervisor; Zynq UltraScale+TM multiprocessor system on chip; airborne computing; airborne electronic hardware certification; avionics processors; commercial avionics equipment; communication systems; embedded multicore hardware certification; enhanced safety features; flight safety; high performance heterogeneous multicore processing system; hypervisor-based partitioning; integrated modular avionics; multiple independent levels of security; programmable logic; secure software hypervisor solution; security features; software certification; virtualization; Aerospace electronics; Hardware; Multicore processing; Program processors; Safety; Security; Virtual machine monitors (ID#: 16-9936)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7311448&isnumber=7311321

 

A. Haidar, A. YarKhan, C. Cao, P. Luszczek, S. Tomov and J. Dongarra, “Flexible Linear Algebra Development and Scheduling with Cholesky Factorization,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 861-864. doi: 10.1109/HPCC-CSS-ICESS.2015.285

Abstract: Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore CPUs and GPUs. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriate sized workload-grain. Our task-programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.

Keywords: distributed memory systems; graphics processing units; mathematics computing; matrix decomposition; parallel processing; resource allocation; scheduling; Cholesky factorization; GPU; compute nodes; distributed heterogeneous machine; distributed memory nodes; distributed resources; flexible linear algebra development; flexible linear algebra scheduling; heterogeneous compute resources; high performance computing environments; multicore-CPU; parallel execution; resource-specialization code; serial code; task-programming abstraction; task-programming model; task-superscalar concept; workload parallelism; Graphics processing units; Hardware; Linear algebra; Multicore processing; Parallel processing; Runtime; Scalability; accelerator-based distributed memory computers; heterogeneous HPC computing; superscalar dataflow scheduling (ID#: 16-9937)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336271&isnumber=7336120
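
The authors' task-superscalar runtime is not reproduced here; as a rough illustration of the serial-looking, dependence-driven style the abstract describes, the sketch below expresses the tile-level task graph of a Cholesky factorization with OpenMP task dependences. The kernels are print-only stubs and the tile count NT is arbitrary.

// Minimal sketch of the tile-level task graph of a Cholesky factorization,
// expressed with OpenMP task dependences. The kernels are stubs that only
// print the operation; a real code would call POTRF/TRSM/SYRK/GEMM on tiles.
#include <cstdio>

constexpr int NT = 4;          // number of tile rows/columns (illustrative)
char dep[NT][NT];              // one dependence "handle" per tile

void potrf(int k)               { std::printf("POTRF(%d,%d)\n", k, k); }
void trsm (int i, int k)        { std::printf("TRSM (%d,%d)\n", i, k); }
void syrk (int i, int k)        { std::printf("SYRK (%d) with panel %d\n", i, k); }
void gemm (int i, int j, int k) { std::printf("GEMM (%d,%d) with panel %d\n", i, j, k); }

int main() {
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; ++k) {
        #pragma omp task depend(inout: dep[k][k])
        potrf(k);
        for (int i = k + 1; i < NT; ++i) {
            #pragma omp task depend(in: dep[k][k]) depend(inout: dep[i][k])
            trsm(i, k);
        }
        for (int i = k + 1; i < NT; ++i) {
            for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: dep[i][k], dep[j][k]) depend(inout: dep[i][j])
                gemm(i, j, k);
            }
            #pragma omp task depend(in: dep[i][k]) depend(inout: dep[i][i])
            syrk(i, k);
        }
    }
    return 0;
}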

 

J. Xue et al., “Task-D: A Task Based Programming Framework for Distributed System,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1663-1668. doi: 10.1109/HPCC-CSS-ICESS.2015.299

Abstract: We present Task-D, a task-based distributed programming framework. Traditionally, programming distributed systems requires either low-level MPI or high-level pattern-based models such as Hadoop/Spark. Task-based models are widely and well used in multicore and heterogeneous environments, but rarely in distributed ones. Our Task-D tries to bridge this gap by creating a higher-level abstraction than MPI, while providing more flexibility than Hadoop/Spark for task-based distributed programming. The Task-D framework relieves programmers from the complexities involved in distributed programming. We provide a set of APIs that can be directly embedded into user code to enable the program to run in a distributed fashion across heterogeneous computing nodes. We also explore the design space and the necessary features the runtime should support, including data communication among tasks, data sharing among programs, resource management, memory transfers, job scheduling, automatic workload balancing and fault tolerance, etc. A prototype system is realized as one implementation of Task-D. A distributed ALS algorithm is implemented using the Task-D APIs and achieves significant performance gains compared to a Spark-based implementation. We conclude that task-based models are well suited to distributed programming. Task-D not only improves programmability for distributed environments, but also leverages performance with effective runtime support.

Keywords: application program interfaces; message passing; parallel programming; automatic workload balancing; data communication; distributed ALS algorithm; distributed programming; distributed system; heterogeneous computing node; high-level pattern based; job scheduling; low-level MPI; resource management; task-D API; task-based programming framework; Algorithm design and analysis; Data communication; Fault tolerance; Fault tolerant systems; Programming; Resource management; Synchronization (ID#: 16-9938)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336408&isnumber=7336120

 

A. Martínez, C. Domínguez, H. Hassan, J. M. Martínez and P. López, “Using GPU and SIMD Implementations to Improve Performance of Robotic Emotional Processes,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1876-1881. doi: 10.1109/HPCC-CSS-ICESS.2015.288

Abstract: Future robotic systems are being implemented using control architectures based on emotions. In these architectures, the emotional processes decide which behaviors the robot must activate to fulfill the objectives. The number of emotional processes increases with the complexity level of the application, limiting the processing capacity of the control processor to solve complex problems. Fortunately, the potential parallelism of emotional processes permits their execution in parallel. In this paper, different alternatives are used to exploit the parallelism of the emotional processes. On the one hand, we take advantage of the multiple cores and single instruction multiple data (SIMD) instruction sets already available on modern microprocessors. On the other hand, we also consider using a GPU. Different numbers of cores with and without SIMD instructions enabled and a GPU-based implementation are compared to analyze their suitability to cope with robotic applications. The applications are set up taking into account different conditions and states of the robot. Experimental results show that the single processor can undertake most of the simple problems at a speed of 1 m/s. For a speed of 2 m/s, an 8-core processor permits solving most of the problems. When the most constrained problem is required, the solution is to combine SIMD instructions with multicore or to use a co-processor GPU to provide the needed computing power.

Keywords: graphics processing units; intelligent robots; mobile robots; multiprocessing systems; parallel processing; GPU co-processor; GPU-based implementation; SIMD instructions; application complexity level; complex problems; control architectures; control processor processing capacity; emotional process parallelism; microprocessors; multicore processer; multiple cores; robot behaviors; robotic emotional process performance improvement; single-instruction multiple data instructions sets; Appraisal; Complexity theory; Computer architecture; Graphics processing units; Instruction sets; Parallel processing; Robots; GPU; OpenMP; robotic systems (ID#: 16-9939)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336446&isnumber=7336120
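
As a minimal illustration of combining core-level and SIMD-level parallelism of the kind the paper evaluates (not the authors' robotic architecture), the sketch below uses OpenMP's combined parallel for simd construct; the "intensity" formula and array sizes are made-up stand-ins for a real appraisal function.

// Minimal sketch: evaluating many independent "emotional processes" by
// combining multicore (parallel for) and SIMD (simd) parallelism with OpenMP.
// The intensity formula is a made-up stand-in for a real appraisal function.
#include <vector>
#include <cmath>
#include <cstdio>

int main() {
    const int n = 1 << 20;                       // number of emotional processes
    std::vector<float> stimulus(n, 0.5f), intensity(n);

    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        intensity[i] = 1.0f / (1.0f + std::exp(-4.0f * (stimulus[i] - 0.5f)));

    std::printf("intensity[0] = %f\n", intensity[0]);
    return 0;
}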

 

C. Yount, “Vector Folding: Improving Stencil Performance via Multi-Dimensional SIMD-Vector Representation,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 865-870. doi: 10.1109/HPCC-CSS-ICESS.2015.27

Abstract: Stencil computation is an important class of algorithms used in a large variety of scientific-simulation applications. Modern CPUs are employing increasingly longer SIMD vector registers and operations to improve computational throughput. However, the traditional use of vectors to contain sequential data elements along one dimension is not always the most efficient representation, especially in the multicore and hyper-threaded context where caches are shared among many simultaneous compute streams. This paper presents a general technique for representing data in vectors for 2D and 3D stencils. This method reduces the number of memory accesses required by storing a small multi-dimensional block of data in each vector compared to the single dimension in the traditional approach. Experiments on an Intel Xeon Phi Coprocessor show performance speedups over traditional vectors ranging from 1.2x to 2.7x, depending on the problem size and stencil type. This technique is independent of and complementary to a variety of existing stencil-computation tuning algorithms such as cache blocking, loop tiling, and wavefront parallelization.

Keywords: data structures; multiprocessing systems; parallel processing; CPU; Intel Xeon Phi Coprocessor; hyper-threaded context; memory access; multidimensional SIMD-vector representation; multidimensional block; scientific-simulation application; sequential data element; stencil computation; stencil performance; vector folding; Jacobian matrices; Layout; Memory management; Registers; Shape; Three-dimensional displays; Intel; SIMD; Xeon Phi; high-performance computing; stencil; vectorization (ID#: 16-9940)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336272&isnumber=7336120
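
The paper's implementation uses SIMD registers on the Xeon Phi; the plain scalar sketch below only illustrates the data-layout idea, mapping each grid point into a "folded" vector that holds a 4x2 block instead of 8 consecutive elements of one row. The fold shape and grid sizes are illustrative.

// Minimal sketch of the "vector folding" layout: each 8-element vector holds
// a 4x2 (x by y) block of a 2D grid instead of 8 consecutive x values.
// Plain scalar C++ showing only the index mapping; a real kernel would
// operate on these blocks with SIMD registers.
#include <vector>
#include <cstdio>

constexpr int VX = 4, VY = 2;            // folded vector shape: 4 in x, 2 in y
constexpr int NX = 16, NY = 8;           // grid size (multiples of VX, VY)

// Map a grid point (x, y) to (vector index, lane index) in folded storage.
inline void fold(int x, int y, int& vec, int& lane) {
    int bx = x / VX, by = y / VY;        // which block
    vec  = by * (NX / VX) + bx;          // block (vector) index, row-major
    lane = (y % VY) * VX + (x % VX);     // position inside the vector
}

int main() {
    std::vector<double> grid(NX * NY);
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x) grid[y * NX + x] = y * NX + x;

    // Folded storage: one 8-element "vector" per 4x2 block.
    std::vector<double> folded(NX * NY);
    for (int y = 0; y < NY; ++y)
        for (int x = 0; x < NX; ++x) {
            int vec, lane;
            fold(x, y, vec, lane);
            folded[vec * (VX * VY) + lane] = grid[y * NX + x];
        }

    int vec, lane;
    fold(5, 3, vec, lane);
    std::printf("grid(5,3)=%g lives in vector %d, lane %d\n",
                folded[vec * (VX * VY) + lane], vec, lane);
    return 0;
}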

 

M. Raab, “Global and Thread-Local Activation of Contextual Program Execution Environments,” Object/Component/Service-Oriented Real-Time Distributed Computing Workshops (ISORCW), 2015 IEEE International Symposium on, Auckland, 2015, pp. 34-41. doi: 10.1109/ISORCW.2015.52

Abstract: Ubiquitous computing often demands applications to be both customizable and context-aware: Users expect smart devices to adapt to the context and respect their preferences. Currently, these features are not well-supported in a multi-core embedded setup. The aim of this paper is to describe a tool that supports both thread-local and global context-awareness. The tool is based on code generation using a simple specification language and a library that persists the customizations. In a case study and benchmark we evaluate a web server application on embedded hardware. Our web server application uses contexts to represent user sessions, language settings, and sensor states. The results show that the tool has minimal overhead, is well-suited for ubiquitous computing, and takes full advantage of multi-core processors.

Keywords: Internet; multiprocessing systems; program compilers; programming environments; software libraries; specification languages; ubiquitous computing; Web server application; code generation; contextual program execution environments; global context-awareness; language settings; multicore processors; sensor states; smart devices; software library; specification language; thread-local context-awareness; user sessions; Accuracy; Context; Hardware; Instruction sets; Security; Synchronization; Web servers; context oriented programming; customization; multi-core; persistency (ID#: 16-9941)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7160121&isnumber=7160107
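
The paper's tool generates context layers from a specification language and persists customizations; the sketch below merely illustrates the distinction between globally and thread-locally activated contexts using C++ thread_local storage. The Context fields and function names are invented for illustration, not the paper's API.

// Minimal sketch of global vs. thread-local context activation. The context
// here is just a struct; a code-generation tool would produce such layers
// from a specification and persist them.
#include <thread>
#include <string>
#include <cstdio>

struct Context { std::string language = "en"; bool sensor_active = false; };

Context global_ctx;                      // shared by all threads (global activation)
thread_local Context local_ctx;          // one per thread (thread-local activation)

void handle_session(const std::string& lang) {
    local_ctx.language = lang;           // affects only this thread's sessions
    std::printf("thread %zu: local=%s global_sensor=%d\n",
                std::hash<std::thread::id>{}(std::this_thread::get_id()),
                local_ctx.language.c_str(), (int)global_ctx.sensor_active);
}

int main() {
    global_ctx.sensor_active = true;     // a device-wide context switch
    std::thread a(handle_session, "de"), b(handle_session, "fr");
    a.join(); b.join();
    return 0;
}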

 

C. E. Tuncali, G. Fainekos and Y. H. Lee, “Automatic Parallelization of Simulink Models for Multi-Core Architectures,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 964-971. doi: 10.1109/HPCC-CSS-ICESS.2015.232

Abstract: This paper addresses the problem of parallelizing existing single-rate Simulink models for embedded control applications on multi-core architectures, considering the communication cost between blocks on different CPU cores. Utilizing the block diagram of the Simulink model, we derive the dependency graph between the different blocks. In order to solve the scheduling problem, we describe a Mixed Integer Linear Programming (MILP) formulation for optimally mapping the Simulink blocks to different CPU cores. Since the number of variables and constraints for the MILP solver grows exponentially as the model size increases, solving this problem in a reasonable time becomes harder. To address this issue, we introduce a set of techniques for reducing the number of constraints in the MILP formulation. By using the proposed techniques, the MILP solver finds solutions that are closer to the optimal solution within a given time bound. We study the scalability and efficiency of our approach with synthetic benchmarks of randomly generated directed acyclic graphs. We also use the “Fault-Tolerant Fuel Control System” demo from Simulink and a Diesel engine controller from Toyota as case studies for demonstrating the applicability of our approach to real-world problems.

Keywords: control engineering computing; diesel engines; directed graphs; embedded systems; fault tolerant control; fuel systems; integer programming; linear programming; parallel architectures; processor scheduling; CPU cores; MILP formulation; MILP solver constraints; MILP solver variables; Simulink model parallelization problem; Toyota; block diagram; communication cost; dependency graph; diesel engine controller; embedded control applications; fault-tolerant fuel control system; mixed integer linear programming formulation; multicore architecture; randomly generated directed acyclic graphs; scheduling problem; synthetic benchmarks; Bismuth; Computational modeling; Job shop scheduling; Multicore processing; Optimization; Software packages; Multiprocessing; Simulink; embedded systems; model based development; optimization; task allocation (ID#: 16-9942)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336295&isnumber=7336120
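
The paper's exact formulation and constraint-reduction techniques are not given in the abstract; as a rough sketch of the usual shape of such a block-to-core mapping MILP, one may write (all symbols are generic assumptions, not the authors' notation):

\begin{align*}
\min\ & T\\
\text{s.t. } & \sum_{c} x_{b,c} = 1 \quad \forall b,\\
& z_{b,b'} \ge x_{b,c} - x_{b',c} \quad \forall (b,b') \in E,\ \forall c,\\
& \sum_{b} e_b\, x_{b,c} + \sum_{(b,b') \in E} \gamma_{b,b'}\, z_{b,b'} \le T \quad \forall c,\\
& x_{b,c},\ z_{b,b'} \in \{0,1\}.
\end{align*}

Here x_{b,c} = 1 assigns block b to core c, e_b is the block's execution cost, E is the set of communicating block pairs, gamma_{b,b'} is the penalty charged when a pair is split across cores, and T is the bound being minimized; charging the whole cross-core communication term to every core is a deliberate simplification of the kind such formulations refine.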

 

S. Li, J. Meng, L. Yu, J. Ma, T. Chen and M. Wu, “Buffer Filter: A Last-Level Cache Management Policy for CPU-GPGPU Heterogeneous System,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 266-271. doi:10.1109/HPCC-CSS-ICESS.2015.290

Abstract: There is a growing trend towards heterogeneous systems, which contain CPUs and GPGPUs in a single chip. Managing the various on-chip resources shared between CPUs and GPGPUs, however, is a big issue, and the last-level cache (LLC) is one of the most critical resources due to its impact on system performance. Some well-known cache replacement policies like LRU and DRRIP, designed for a CPU, are not so well suited to heterogeneous systems, because the LLC will be dominated by memory accesses from thousands of threads of GPGPU applications, which may lead to significant performance degradation for the CPU. Another reason is that a GPGPU is able to tolerate memory latency when the number of active threads in the GPGPU is sufficient, but those policies do not exploit this feature. In this paper we propose a novel shared LLC management policy for CPU-GPGPU heterogeneous systems called Buffer Filter which takes advantage of the memory latency tolerance of GPGPUs. This policy has the ability to restrict streaming requests of the GPGPU by adding a buffer to the memory system and to vacate LLC space for cache-sensitive CPU applications. Although there is some IPC loss for the GPGPU, the memory latency tolerance ensures the basic performance of the GPGPU's applications. The experiments show that the Buffer Filter is able to filter out up to 50% to 75% of the total GPGPU streaming requests at the cost of a small GPGPU IPC decrease and to improve the hit rate of CPU applications by 2x to 7x.

Keywords: cache storage; graphics processing units; CPU-GPGPU heterogeneous system; buffer filter; cache replacement policies; cache-sensitive CPU applications; general-purpose graphics processing unit; last-level cache management policy; memory access; memory latency tolerance; on-chip resources; shared LLC management policy; Benchmark testing; Central Processing Unit; Instruction sets; Memory management; Multicore processing; Parallel processing; System performance; heterogeneous system; multicore; shared last-level cache (ID#: 16-9943)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336174&isnumber=7336120

 

N. S. V. Rao, D. Towsley, G. Vardoyan, B. W. Settlemyer, I. T. Foster and R. Kettimuthu, “Sustained Wide-Area TCP Memory Transfers over Dedicated Connections,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1603-1606. doi: 10.1109/HPCC-CSS-ICESS.2015.86

Abstract: Wide-area memory transfers between on-going computations and remote steering, analysis and visualization sites can be utilized in several High-Performance Computing (HPC) scenarios. Dedicated network connections with high capacity, low loss rates and low competing traffic are typically provisioned over current HPC infrastructures to support such transfers. To gain insights into such transfers, we collected throughput measurements for different versions of TCP between dedicated multi-core servers over emulated 10 Gbps connections with round trip times (rtt) in the range 0-366 ms. Existing TCP models and measurements over shared links are well-known to exhibit monotonically decreasing, convex throughput profiles as rtt is increased. In sharp contrast, these measurements show two distinct regimes: a concave profile at lower rtts and a convex profile at higher rtts. We present analytical results that explain these regimes: (a) at lower rtt, rapid throughput increase due to slow-start leads to the concave profile, and (b) at higher rtt, the TCP congestion avoidance phase with slower dynamics dominates. In both cases, however, we analytically show that throughput decreases with rtt, albeit at different rates, as confirmed by the measurements. These results provide practical TCP solutions for these transfers without requiring additional hardware and software, unlike InfiniBand and UDP solutions, respectively.

Keywords: network servers; parallel processing; sustainable development; telecommunication congestion control; telecommunication links; telecommunication traffic; transport protocols; wide area networks; HPC; concave profile; congestion avoidance; convex profile; dedicated connection; high-performance computing; multicore server; remote steering; shared link; sustained wide area TCP memory transfer; visualization site; Current measurement; Data transfer; Hardware; Linux; Software; Supercomputers; Throughput; TCP; dedicated connections; memory transfers; throughput measurements (ID#: 16-9944)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336397&isnumber=7336120

 

Y. Li, Y. Zhao and H. Gao, “Using Artificial Neural Network for Predicting Thread Partitioning in Speculative Multithreading,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 823-826. doi: 10.1109/HPCC-CSS-ICESS.2015.28

Abstract: Speculative multithreading (SpMT) is a thread-level automatic parallelization technique to accelerate sequential programs on multi-core processors; it partitions programs into multiple threads that are speculatively executed in the presence of ambiguous data and control dependences, while the correctness of the programs is guaranteed by hardware support. Thread granularity, the number of parallel threads, and the partition positions are crucial to the performance improvement in SpMT, for they determine the amount of resources (CPU, memory, cache, waiting cycles, etc.) and affect the efficiency of every PE (Processing Element). Conventionally, these three parameters are determined by heuristic rules. Although it is simple to partition threads with them, they are a one-size-fits-all strategy and cannot guarantee an optimal thread partitioning. This paper proposes an Artificial Neural Network (ANN) based approach to learn and determine the thread partitioning strategy. Using the ANN-based thread partitioning approach, an unseen irregular program can obtain a stable, much higher speedup than with the Heuristic Rules (HR) based approach. On Prophet, a generic SpMT processor used to evaluate the performance of multithreaded programs, the novel thread partitioning policy is evaluated and reaches an average speedup of 1.80 on a 4-core processor. Experiments show that our proposed approach obtains a significant increase in speedup, and the Olden benchmarks deliver a performance improvement 2.36% better than the traditional heuristic-rules-based approach. The results indicate that our approach finds the best partitioning scheme for each program and is more stable across programs.

Keywords: multi-threading; multiprocessing systems; neural nets; ANN-based thread partition approach; HR based approach; Olden benchmark; PE; Prophet; SpMT processor; artificial neural network; heuristic rules; multicore; multithreaded programs; one-size-fits-all strategy; parallel threads; partition position; processing element; sequential programs; speculative multithreading; thread granularity; thread partitioning policy; thread partitioning prediction; thread-level automatic parallelization technique; Cascading style sheets; Conferences; Cyberspace; Embedded software; High performance computing; Safety; Security; Machine learning; Prophet; thread partitioning (ID#: 16-9945)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336262&isnumber=7336120
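
The abstract does not give the network's features or topology; the sketch below is only a generic illustration of the prediction step, a tiny fixed-weight feed-forward pass mapping a few hypothetical program features to the three partition parameters named in the abstract. All weights and feature values are placeholders; the real predictor is trained offline.

// Minimal sketch of an ANN-style predictor: a tiny fixed-weight feed-forward
// pass mapping a few program features (placeholders, e.g., loop count, call
// depth, dependence density) to thread granularity, thread count, and
// partition position. Weights and features are invented for illustration.
#include <array>
#include <cmath>
#include <cstdio>

constexpr int IN = 3, HID = 4, OUT = 3;

float sigmoid(float x) { return 1.0f / (1.0f + std::exp(-x)); }

std::array<float, OUT> predict(const std::array<float, IN>& feat,
                               const float w1[HID][IN], const float b1[HID],
                               const float w2[OUT][HID], const float b2[OUT]) {
    std::array<float, HID> h{};
    for (int j = 0; j < HID; ++j) {                 // hidden layer
        float s = b1[j];
        for (int i = 0; i < IN; ++i) s += w1[j][i] * feat[i];
        h[j] = sigmoid(s);
    }
    std::array<float, OUT> out{};
    for (int k = 0; k < OUT; ++k) {                 // output layer
        float s = b2[k];
        for (int j = 0; j < HID; ++j) s += w2[k][j] * h[j];
        out[k] = sigmoid(s);
    }
    return out;
}

int main() {
    float w1[HID][IN] = {{0.2f,-0.1f,0.4f},{0.3f,0.2f,-0.2f},{-0.4f,0.1f,0.3f},{0.1f,0.1f,0.1f}};
    float b1[HID] = {0.1f, -0.1f, 0.0f, 0.2f};
    float w2[OUT][HID] = {{0.3f,-0.2f,0.1f,0.4f},{0.2f,0.2f,-0.3f,0.1f},{-0.1f,0.4f,0.2f,0.2f}};
    float b2[OUT] = {0.0f, 0.1f, -0.1f};

    auto p = predict({0.7f, 0.3f, 0.5f}, w1, b1, w2, b2);
    std::printf("granularity=%.2f threads=%.2f position=%.2f\n", p[0], p[1], p[2]);
    return 0;
}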

 

T. C. Xu, V. Leppänen, P. Liljeberg, J. Plosila and H. Tenhunen, “Trio: A Triple Class On-Chip Network Design for Efficient Multicore Processors,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 951-956. doi: 10.1109/HPCC-CSS-ICESS.2015.44

Abstract: We propose and analyse an on-chip interconnect design for improving the efficiency of multicore processors. Conventional interconnection networks are usually based on a single homogeneous network with uniform processing of all traffic. While the design is simplified, this approach can have performance bottlenecks and limitations on system efficiency. We investigate the traffic pattern of several real-world applications. Based on a directory cache coherence protocol, we characterise and categorize the traffic in terms of various aspects. It is discovered that control and unicast packets dominate the network, while the percentages of data and multicast messages are relatively low. Furthermore, we find most of the invalidation messages are multicast messages, and most of the multicast messages are invalidation messages. The multicast invalidation messages usually have a higher number of destination nodes compared with other multicast messages. These observations lead to the proposed triple-class interconnect, where a dedicated multicast-capable network is responsible for the control messages and the data messages are handled by another network. Using a detailed full-system simulation environment, the proposed design is compared with the homogeneous baseline network, as well as two other network designs. Experimental results show that the average network latency and energy-delay product of the proposed design have improved 24.4% and 10.2% compared with the baseline network.

Keywords: cache storage; multiprocessing systems; multiprocessor interconnection networks; network synthesis; network-on-chip; Trio; average network latency; dedicated multicast-capable network; destination nodes; directory cache coherence protocol; energy delay product; full system simulation environment; homogeneous baseline network; multicast invalidation messages; multicore processors; on-chip interconnect design; traffic pattern; triple class on-chip network design; unicast packets; Coherence; Multicore processing; Ports (Computers); Program processors; Protocols; System-on-chip; Unicast; cache; design; efficient; multicore; network-on-chip (ID#: 16-9946)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336293&isnumber=7336120

 

M. Shekhar, H. Ramaprasad and F. Mueller, “Evaluation of Memory Access Arbitration Algorithm on Tilera's TILEPro64 Platform,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1154-1159. doi: 10.1109/HPCC-CSS-ICESS.2015.245

Abstract: As real-time embedded systems demand more and more computing power under reasonable energy budgets, multi-core platforms are a viable option. However, deploying real-time applications on multi-core platforms introduce several predictability challenges. One of these challenges is bounding the latency of memory accesses issued by real-time tasks. This challenge is exacerbated as the number of cores and, hence, the degree of resource sharing increases. Over the last several years, researchers have proposed techniques to overcome this challenge. In prior work, we proposed an arbitration policy for memory access requests over a Network-on-Chip. In this paper, we implement and evaluate variants of our arbitration policy on a real hardware platform, namely Tilera's TilePro64 platform.

Keywords: embedded systems; multiprocessing systems; network-on-chip; storage management;TILEPro64 platform; memory access arbitration algorithm; multicore platforms; network-on-chip; real-time embedded systems; Dynamic scheduling; Engines; Hardware; Instruction sets; Memory management; Real-time systems; System-on-chip (ID#: 16-9947)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336325&isnumber=7336120

 

R. Reuillon et al., “Tutorials,” High Performance Computing & Simulation (HPCS), 2015 International Conference on, Amsterdam, 2015, pp. 1-16. doi: 10.1109/HPCSim.2015.7237009

Abstract: These tutorials discuss the following: Model Exploration using OpenMOLE: A Workflow Engine for Large Scale Distributed Design of Experiments and Parameter Tuning; Getting Up To Speed On OpenMP 4.0; Science Gateways - Leveraging Modeling and Simulations in HPC Infrastructures via Increased Usability; Cloud Security, Access Control and Compliance; The EGI Federated Cloud; Getting Started with the AVX-512 on the Multicore and Manycore Platforms.

Keywords: authorisation; cloud computing; multiprocessing systems; parallel processing; EGI federated cloud; HPC Infrastructure; OpenMOLE; access control; cloud security; high performance computing-and-simulation; manycore platform; multicore platform; Biological system modeling; Computational modeling; Distributed computing; High performance computing; Logic gates; Tuning; Tutorials (ID#: 16-9948)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7237009&isnumber=7237005

 

M. D. Grammatikakis, P. Petrakis, A. Papagrigoriou, G. Kornaros and M. Coppola, “High-Level Security Services Based on a Hardware NoC Firewall Module,” Intelligent Solutions in Embedded Systems (WISES), 2015 12th International Workshop on, Ancona, 2015, pp. 73-78. doi: (not provided)

Abstract: Security services are typically based on deploying different types of modules, e.g. firewall, intrusion detection or prevention systems, or cryptographic function accelerators. In this study, we focus on extending the functionality of a hardware Network-on-Chip (NoC) Firewall on the Zynq 7020 FPGA of a Zedboard. The NoC Firewall checks the physical address and rejects untrusted CPU requests to on-chip memory, thus protecting legitimate processes running in a multicore SoC from the injection of malicious instructions or data into shared memory. Based on a validated kernel-space Linux system driver for the NoC Firewall, which is seen as a reconfigurable, memory-mapped device on top of the AMBA AXI4 interconnect fabric, we develop higher-layer security services that focus on physical address protection based on a set of rules. While our primary scenario concentrates on monitors and actors related to protection from malicious (or corrupt) drivers, other interesting use cases related to healthcare ethics are also put into context.

Keywords: field programmable gate arrays; firewalls; multiprocessing systems; network-on-chip; AMBA AXI4 interconnect fabric; Zedboard; Zynq 7020 FPGA; corrupt drivers; hardware NoC firewall module; healthcare ethics; high-level security services; malicious drivers; malicious instructions; multicore SoC; network-on-chip; on-chip memory; physical address protection; reconfigurable memory-mapped device; shared memory; untrusted CPU requests; validated kernel-space Linux system driver; Field programmable gate arrays; Firewalls (computing); Hardware; Linux; Network interfaces; Registers; Linux driver; NoC; firewall; multicore SoC (ID#: 16-9949)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7356985&isnumber=7356973
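
The actual module is implemented in hardware behind a memory-mapped Linux driver; the sketch below is only a software illustration of the rule check such a firewall performs, allowing a master's request only if its physical address falls in a configured range. The rule fields and master IDs are invented for illustration and do not reflect the Zynq module's register layout.

// Minimal software sketch of a NoC-firewall-style check: a request from an
// on-chip master is allowed only if its physical address falls inside one of
// the ranges configured for that master; everything else is rejected.
#include <cstdint>
#include <vector>
#include <cstdio>

struct Rule { int master_id; uint64_t base; uint64_t size; bool write_allowed; };

bool allowed(const std::vector<Rule>& rules, int master_id,
             uint64_t addr, bool is_write) {
    for (const Rule& r : rules)
        if (r.master_id == master_id &&
            addr >= r.base && addr < r.base + r.size &&
            (!is_write || r.write_allowed))
            return true;
    return false;                         // default deny: reject untrusted requests
}

int main() {
    std::vector<Rule> rules = {
        {0, 0x10000000, 0x01000000, true},   // master 0: read/write 16 MiB window
        {1, 0x10000000, 0x01000000, false},  // master 1: read-only on the same window
    };
    std::printf("master 1 write to 0x10000100: %s\n",
                allowed(rules, 1, 0x10000100, true) ? "allowed" : "rejected");
    return 0;
}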

 

F. Liu, Y. Yarom, Q. Ge, G. Heiser and R. B. Lee, “Last-Level Cache Side-Channel Attacks are Practical,” 2015 IEEE Symposium on Security and Privacy, San Jose, CA, 2015, pp. 605-622. doi: 10.1109/SP.2015.43

Abstract: We present an effective implementation of the Prime+Probe side-channel attack against the last-level cache. We measure the capacity of the covert channel the attack creates and demonstrate a cross-core, cross-VM attack on multiple versions of GnuPG. Our technique achieves a high attack resolution without relying on weaknesses in the OS or virtual machine monitor or on sharing memory between attacker and victim.

Keywords: cache storage; cloud computing; security of data; virtual machines; GnuPG; IaaS cloud computing; Prime+Probe side-channel attack; covert channel; cross-VM attack; cross-core attack; last-level cache side-channel attacks; virtual machine monitor; Cryptography; Indexes; Memory management; Monitoring; Multicore processing; Probes; Virtual machine monitors; ElGamal; covert channel; cross-VM side channel; last-level cache; side-channel attack (ID#: 16-9950)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7163050&isnumber=7163005
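
As a rough illustration of the measurement primitive behind Prime+Probe (not the paper's attack, which additionally constructs last-level-cache eviction sets and synchronizes with a victim across cores and VMs), the sketch below times reloads of previously primed lines; a slow reload suggests eviction by other activity. The threshold, buffer layout, and stride are illustrative.

// Minimal sketch of the Prime+Probe timing primitive: prime a set of lines,
// then measure reload latency; slow reloads indicate eviction. Building real
// LLC eviction sets and targeting a victim are deliberately omitted.
#include <x86intrin.h>
#include <cstdint>
#include <cstdio>
#include <vector>

inline uint64_t access_time(volatile uint8_t* p) {
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    (void)*p;                              // load the line
    uint64_t t1 = __rdtscp(&aux);
    return t1 - t0;
}

int main() {
    const uint64_t THRESHOLD = 150;            // cycles; platform dependent
    std::vector<uint8_t> buf(16 * 4096, 1);    // stand-in for a real eviction set
    volatile uint8_t* lines = buf.data();

    for (size_t i = 0; i < buf.size(); i += 4096)   // Prime: load the monitored lines
        (void)lines[i];

    // ... other activity (a victim) would run here and may evict these lines ...

    for (size_t i = 0; i < buf.size(); i += 4096) { // Probe: slow reloads = evicted
        uint64_t t = access_time(&lines[i]);
        std::printf("line %2zu: %4llu cycles%s\n", i / 4096,
                    (unsigned long long)t, t > THRESHOLD ? "  (evicted)" : "");
    }
    return 0;
}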

 

C. Li, W. Hu, P. Wang, M. Song and X. Cao, “A Novel Critical Path Based Routing Method for NOC,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1546-1551. doi: 10.1109/HPCC-CSS-ICESS.2015.159

Abstract: As more and more cores are integrated onto a single chip and connected by links, the network on chip (NOC) provides a new on-chip structure. Tasks are mapped to the cores on the chip and have communication requirements according to their relationships. When the communication data are transmitted on the network, they need to be given a suitable path to the target cores with low latency. In this paper, we propose a new routing method based on the static critical path for NOC. The tasks with multiple threads are analyzed first and the running paths of the tasks are marked. The static critical path can be found according to the lengths of the running paths. The messages on the critical path are marked as critical messages. When the messages arrive at the routers on the chip, the critical messages are forwarded first according to their importance. This new routing method has been tested in a simulation environment. The experimental results show that this method can accelerate the transmission speed of critical messages and improve the performance of the tasks.

Keywords: network routing; network-on-chip; NOC; chip structure; communication data transmission; communication requirements; critical messages; critical path; critical path based routing method latency; multithreads; network on chip; running path length; simulation environment; static critical path; target cores; task mapping; task performance improvement; Algorithm design and analysis; Message systems; Multicore processing; Program processors; Quality of service; Routing; System-on-chip; Critical Path; Network on Chip; Routing Method (ID#: 16-9951)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336388&isnumber=7336120

 

S. Zhang and S. Su, “Design of Parallel Algorithms for Super Long Integer Operation Based on Multi-Core CPUs,” 2015 11th International Conference on Computational Intelligence and Security (CIS), Shenzhen, 2015, pp. 335-339. doi: 10.1109/CIS.2015.88

Abstract: In cryptographic applications, super long integer operations are often used. However, cryptographic algorithms generally run on a computer with a single-core CPU, and the related computing process is a type of serial execution. In this paper, we investigate how to parallelize the operations on super long integers in a multi-core computing environment. The significance of this study lies in the fact that, with the spread of multi-core computing devices and the enhancement of multi-core computing ability, we need to make the basic arithmetic of super long integers run in parallel, which means blocking super long integers, running the data blocks on multiple cores' threads, converting the original serial execution into multi-core parallel computation, and storing the multi-thread results after formatting them. Experiments show that if the thread scheduling time is longer than the computation, parallel algorithms execute faster; otherwise, serial algorithms are better. On the whole, parallel algorithms can utilize the computing ability of multi-core hardware more efficiently.

Keywords: cryptography; digital arithmetic; microprocessor chips; multiprocessing systems; parallel algorithms; cryptographic applications; data blocks; multicore CPU; multicore computer environment; multicore hardware; multicore parallel computation; parallel algorithm design; serial execution; single-core CPU; super long integer operation; super long integers; Algorithm design and analysis; Bismuth; Computers; Cryptography; Instruction sets; Operating systems; Parallel algorithms; algorithms; multi-core; multi-thread; parallel computation; super long integers (ID#: 16-9952)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7397102&isnumber=7396229
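
The paper's blocking and scheduling scheme is not detailed in the abstract; the sketch below only illustrates the general idea of block-wise parallelism for one operation, big-integer addition: limbs are summed in parallel and a single sequential pass then propagates carries. Limb width and operand sizes are illustrative.

// Minimal sketch of block-parallel big-integer addition: limbs are added in
// parallel across threads, then a single sequential pass propagates carries.
// Shows only the parallelization idea, not the paper's full design.
#include <vector>
#include <cstdint>
#include <cstdio>

// a, b are little-endian arrays of 32-bit limbs of equal length; sum receives
// the result, one limb per element (plus a possible final carry limb).
void big_add(const std::vector<uint32_t>& a, const std::vector<uint32_t>& b,
             std::vector<uint64_t>& sum) {
    const size_t n = a.size();
    sum.resize(n);

    // Parallel, carry-free limb additions (each partial sum fits in 64 bits).
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        sum[i] = (uint64_t)a[i] + b[i];

    // Sequential carry propagation (cheap: one linear pass over the limbs).
    uint64_t carry = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t s = sum[i] + carry;
        sum[i] = s & 0xFFFFFFFFu;
        carry = s >> 32;
    }
    if (carry) sum.push_back(carry);
}

int main() {
    std::vector<uint32_t> a(4, 0xFFFFFFFFu), b(4, 1);   // two 128-bit operands
    std::vector<uint64_t> s;
    big_add(a, b, s);
    for (size_t i = s.size(); i-- > 0;) std::printf("%08llx ", (unsigned long long)s[i]);
    std::printf("\n");
    return 0;
}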

 

N. Khalilzad, H. R. Faragardi and T. Nolte, “Towards Energy-Aware Placement of Real-Time Virtual Machines in a Cloud Data Center,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1657-1662. doi: 10.1109/HPCC-CSS-ICESS.2015.22

Abstract: Cloud computing is an evolving paradigm which is becoming an adoptable technology for a variety of applications. However, cloud infrastructures must be able to fulfill application requirements before cloud solutions can be adopted. Cloud infrastructure providers communicate the characteristics of their services to their customers through Service Level Agreements (SLAs). In order for a real-time application to be able to use cloud technology, cloud infrastructure providers have to be able to provide timing guarantees in the SLAs. In this paper, we present our ongoing work on a cloud solution in which periodic tasks are provided as a service in the Software as a Service (SaaS) model. Tasks belonging to a certain application are mapped to a Virtual Machine (VM). We also study the problem of VM placement on a cloud infrastructure. We propose a placement mechanism which minimizes the energy consumption of the data center by consolidating VMs onto a minimum number of servers while respecting the timing requirements of the virtual machines.

Keywords: cloud computing; computer centres; contracts; power aware computing; timing; virtual machines; virtual storage; SLA; SaaS; VM placement; application requirements; cloud data center; cloud infrastructure; energy aware placement; energy consumption minimisation; real-time virtual machine; service level agreement; software as a service; timing guarantee; Cloud computing; Energy consumption; Multicore processing; Power demand; Real-time systems; Servers; Timing; Real-time cloud; VM placement; energy aware allocation (ID#: 16-9953)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336407&isnumber=7336120
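
The paper's placement mechanism respects real-time timing requirements, which are not reproduced here; the sketch below shows a generic consolidation heuristic (first-fit decreasing by utilization) with a plain utilization bound standing in for that schedulability test. The utilization values and the per-server capacity of 1.0 are illustrative.

// Minimal sketch of consolidation-style VM placement: first-fit decreasing by
// CPU utilization, opening a new server only when no existing one can admit
// the VM. A plain utilization bound stands in for a real schedulability test.
#include <vector>
#include <algorithm>
#include <functional>
#include <cstdio>

int main() {
    std::vector<double> vm_util = {0.45, 0.30, 0.62, 0.10, 0.55, 0.25};
    const double capacity = 1.0;                // admissible utilization per server

    std::sort(vm_util.begin(), vm_util.end(), std::greater<double>());
    std::vector<double> servers;                // current load of each open server

    for (double u : vm_util) {
        bool placed = false;
        for (double& load : servers)
            if (load + u <= capacity) { load += u; placed = true; break; }
        if (!placed) servers.push_back(u);      // power on another server
    }
    std::printf("VMs consolidated onto %zu servers\n", servers.size());
    return 0;
}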

 

J. Ye, S. Li and T. Chen, “Shared Write Buffer to Support Data Sharing Among Speculative Multi-Threading Cores,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 835-838. doi: 10.1109/HPCC-CSS-ICESS.2015.287

Abstract: Speculative Multi-threading (SpMT), a.k.a. Thread Level Speculation (TLS), is one of the most notable research directions in the automatic extraction of thread-level parallelism (TLP), which is growing increasingly appealing in the multi-core and many-core era. The SpMT threads are extracted from a single thread and are tightly coupled by data dependences. Traditional private L1 caches with a coherence mechanism do not suit such intense data sharing among SpMT threads. We propose a Shared Write Buffer (SWB) that resides in parallel with the private L1 caches, but with much smaller capacity and a short access delay. When a core writes a datum to the L1 cache, it writes the SWB first, and when it reads a datum, it reads from the SWB as well as from the L1. Because the SWB is shared among the cores, it can often return a datum more quickly than the L1 if the latter needs to go through a coherence process to load the datum. This way the SWB improves the performance of SpMT inter-core data sharing and mitigates the coherence overhead.

Keywords: cache storage; multi-threading; multiprocessing systems; SWB; SpMT intercore data sharing; SpMT thread extraction; TLS; access delay; automatic TLP extraction; coherence overhead mitigation; data dependences; data sharing; datum; performance improvement; private L1 caches; shared write buffer; speculative multithreading cores; thread level parallelism; thread level speculation; Coherence; Delays; Instruction sets; Message systems; Multicore processing; Protocols; Cache; Multi-Core; Shared Write Buffer; SpMT; Speculative Multi-Threading (ID#: 16-9954)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336265&isnumber=7336120

 

D. Münch, M. Paulitsch, O. Hanka and A. Herkersdorf, “MPIOV: Scaling Hardware-Based I/O Virtualization for Mixed-Criticality Embedded Real-Time Systems Using Non Transparent Bridges to (Multi-Core) Multi-Processor Systems,” 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, 2015, pp. 579-584. doi: (not provided)

Abstract: Safety-critical systems consolidating multiple functionalities of different criticality (so-called mixed-criticality systems) require separation between these functionalities to assure safety and security properties. Performance-hungry and safety-critical applications (like a radar processing system steering an autonomous flying aircraft) may demand an embedded high-performance computing cluster of more than one (multi-core) processor. This paper presents the Multi-Processor I/O Virtualization (MPIOV) concept to enable hardware-based Input/Output (I/O) virtualization or sharing with separation among multiple (multi-core) processors in (mixed-criticality) embedded real-time systems, which usually do not have means for separation like an Input/Output Memory Management Unit (IOMMU). The concept uses a Non-Transparent Bridge (NTB) to connect each processing host to the management host, while checking the target address and source / origin ID to decide whether or not to block a transaction. It is a standardized, portable and non-proprietary platform-independent spatial separation solution that does not require an IOMMU in the processor. Furthermore, the concept sketches an approach for PCI Express (PCIe)-based systems to enable sharing of up to 2048 (virtual) functions per endpoint, while still being compatible to the plain PCIe standard. A practical evaluation demonstrates that the impact to performance degradation (transfer time, transfer rate) is negligible (about 0.01%) compared to a system without separation.

Keywords: multiprocessing systems; parallel processing; safety-critical software; virtualisation; IOMMU; MPIOV; NTB; PCI express; embedded high-performance computing cluster; hardware-based I-O virtualization; input-output memory management unit; mixed-criticality embedded real-time system; multiprocessor I/O virtualization; multiprocessor system; nontransparent bridge; safety-critical systems; Aerospace electronics; Bridges; Memory management; Multicore processing; Real-time systems; Standards; Virtualization; IOMPU; hardware-based I/O virtualization; mixed-criticality systems; multi-core; multiprocessor; non-transparent bridge (NTB); real-time embedded systems; spatial separation (ID#: 16-9955)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7092453&isnumber=7092347

 

P. Sun, S. Chandrasekaran, S. Zhu and B. Chapman, “Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 843-847. doi: 10.1109/HPCC-CSS-ICESS.2015.88

Abstract: Heterogeneous multicore embedded systems are growing rapidly, with cores of varying types and capacity. Programming these devices and fully exploiting the hardware has been a real challenge. Existing programming models and their runtimes are typically designed for general-purpose computation and are mostly too heavyweight to be adopted on resource-constrained embedded systems. Embedded programmers are still expected to use low-level, proprietary APIs, making the resulting software less and less portable. These challenges motivated us to explore how OpenMP, a high-level directive-based model, could be used for embedded platforms. In this paper, we translate OpenMP to the Multicore Association Task Management API (MTAPI), a standard API for leveraging task parallelism on embedded platforms. Our results demonstrate that the performance of our OpenMP runtime library is comparable to state-of-the-art task-parallel solutions. We believe this approach provides a portable solution, since it abstracts the low-level details of the hardware and no longer depends on vendor-specific APIs.
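
For readers unfamiliar with OpenMP tasking, the directives involved look like the standard example below; in the approach described above, task constructs of this kind are what the authors' runtime maps onto MTAPI tasks. The translation layer itself is not shown, and the example is ordinary OpenMP rather than the authors' code.

```cpp
#include <cstdio>
#include <omp.h>

// Classic OpenMP tasking example: each recursive call spawns tasks that the
// runtime schedules across cores; taskwait joins the two child tasks.
static long fib(int n) {
    if (n < 2) return n;
    long x, y;
    #pragma omp task shared(x)
    x = fib(n - 1);
    #pragma omp task shared(y)
    y = fib(n - 2);
    #pragma omp taskwait
    return x + y;
}

int main() {
    long result = 0;
    #pragma omp parallel
    {
        #pragma omp single          // one thread creates the root task tree
        result = fib(20);
    }
    std::printf("fib(20) = %ld\n", result);
    return 0;
}
```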

Keywords: application program interfaces; embedded systems; multiprocessing systems; parallel processing; MCA; MTAPI; OpenMP runtime library; OpenMP task parallelism; heterogeneous multicore embedded system; high-level directive-based model; multicore association task management API; multicore embedded system; resource-constrained embedded system; vendor-specific API; Computational modeling; Embedded systems; Hardware; Multicore processing; Parallel processing; Programming; Heterogeneous Multicore Embedded Systems; OpenMP; Parallel Computing (ID#: 16-9956)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336267&isnumber=7336120

 

P. Bogdan and Y. Xue, “Mathematical Models and Control Algorithms for Dynamic Optimization of Multicore Platforms: A Complex Dynamics Approach,” Computer-Aided Design (ICCAD), 2015 IEEE/ACM International Conference on, Austin, TX, 2015, pp. 170-175. doi: 10.1109/ICCAD.2015.7372566

Abstract: The continuous increase in integration densities has contributed to a shift from Dennard scaling to a parallelization era of multi-/many-core chips. However, for multicores to rapidly percolate application domains from consumer multimedia to high-end functionality (e.g., security, healthcare, big data), power/energy and thermal efficiency challenges must be addressed. Increased power densities can raise on-chip temperatures, which in turn decrease chip reliability and performance and increase cooling costs. For a dependable multicore system, dynamic optimization (power/thermal management) has to rely on accurate yet low-complexity workload models. Towards this end, we present a class of mathematical models that generalize prior approaches and capture their time dependence and long-range memory with minimum complexity. This modeling framework serves as the basis for defining new efficient control and prediction algorithms for hierarchical dynamic power management of future data-centers-on-a-chip.
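
The paper's models generalize autoregressive workload models; as a much simpler, hedged stand-in, the sketch below shows a one-step AR(p) workload prediction of the kind a dynamic power manager could act on. The coefficients are assumed to have been fitted offline; none of this is the authors' actual formulation.

```cpp
#include <cstddef>
#include <vector>

// One-step AR(p) prediction of the next workload sample, the kind of estimate
// a DVFS or thermal governor could act on. The cited work generalizes such
// autoregressive forms to capture long-range memory; this is only a stand-in.
double predict_next(const std::vector<double>& history,       // recent workload samples
                    const std::vector<double>& coefficients)  // AR coefficients a_1..a_p (fitted offline)
{
    double prediction = 0.0;
    const std::size_t p = coefficients.size();
    if (history.size() < p) return prediction;  // not enough history yet
    for (std::size_t i = 0; i < p; ++i) {
        // a_1 multiplies the most recent sample, a_2 the one before it, and so on.
        prediction += coefficients[i] * history[history.size() - 1 - i];
    }
    return prediction;
}
```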

Keywords: multiprocessing systems; power aware computing; temperature; Dennard scaling; chip performance; chip reliability; complex dynamics approach; control algorithm; data-centers-on-a-chip; dynamic optimization; hierarchical dynamic power management; many-core chips; multicore chips; multicore platform; on-chip temperature; power density; power management; prediction algorithm; thermal management; Autoregressive processes; Heuristic algorithms; Mathematical model; Measurement; Multicore processing; Optimization; Stochastic processes (ID#: 16-9957)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7372566&isnumber=7372533

 

J. Tian, W. Hu, C. Li, T. Li and W. Luo, “Multi-Thread Connection Based Scheduling Algorithm for Network on Chip,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS),  New York, NY, 2015, pp. 1473-1478. doi: 10.1109/HPCC-CSS-ICESS.2015.160

Abstract: More and more cores are integrated onto a single chip to improve performance and reduce the power consumption of the CPU without increasing its frequency. The cores are connected by links and organized as a network, the so-called network on chip (NOC), a promising paradigm. The NOC improves CPU performance without increasing power consumption. However, a new problem remains: how to schedule threads onto the different cores to take full advantage of the NOC. In this paper, we propose a new multi-thread scheduling algorithm for NOC based on thread connections. The connection relationships among threads are analyzed and the threads are divided into sets; at the same time, the network topology of the NOC is analyzed, the connections among cores are captured in the NOC model, and the cores are divided into regions. Thread sets and core regions are then matched according to their features, and the multi-thread scheduling algorithm maps each thread set to its corresponding core region. Within a core region, the threads of a set are scheduled using appropriate approaches. Experiments show that the proposed algorithm improves program performance and enhances the utilization of NOC cores.
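
As a rough, hypothetical illustration of the grouping-and-mapping idea (not the paper's algorithm), the sketch below clusters communicating threads into sets with union-find and assigns each set to a core region of the NOC.

```cpp
#include <cstddef>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

// Threads that exchange data are grouped into sets; each set is then mapped
// to one core region so that communicating threads stay physically close.
struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

// edges: pairs of thread IDs that communicate; returns thread -> region ID.
std::vector<int> map_threads_to_regions(int num_threads,
                                        const std::vector<std::pair<int, int>>& edges,
                                        int num_regions) {
    UnionFind uf(num_threads);
    for (const auto& e : edges) uf.unite(e.first, e.second);

    std::map<int, int> set_to_region;  // representative thread -> assigned region
    std::vector<int> region_of(num_threads);
    int next_region = 0;
    for (int t = 0; t < num_threads; ++t) {
        int root = uf.find(t);
        auto it = set_to_region.find(root);
        if (it == set_to_region.end()) {
            it = set_to_region.emplace(root, next_region++ % num_regions).first;
        }
        region_of[t] = it->second;  // all threads of a set land in the same region
    }
    return region_of;
}
```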

Keywords: multi-threading; network theory (graphs); network-on-chip; performance evaluation; power aware computing; processor scheduling; CPU; NOC core; multithread connection based scheduling; multithread connection-based scheduling algorithm; network topology;  power consumption; Algorithm design and analysis; Heuristic algorithms; Instruction sets; Multicore processing; Network topology; Scheduling algorithms; System-on-chip; Algorithm; Network on Chip; Scheduling; Thread Connection (ID#: 16-9958)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336376&isnumber=7336120

 

Y. Li, J. Niu, M. Qiu and X. Long, “Optimizing Tasks Assignment on Heterogeneous Multi-Core Real-Time Systems with Minimum Energy,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 577-582. doi: 10.1109/HPCC-CSS-ICESS.2015.126

Abstract: The main challenge for embedded real-time systems, especially mobile devices, is the trade-off between system performance and energy efficiency. By studying the relationship between energy consumption, execution time, and completion probability of tasks on heterogeneous multi-core architectures, we propose an Accelerated Search algorithm based on dynamic programming to obtain a combination of task schemes that can be completed in a given time with a confidence probability while consuming the minimum possible energy. We adopt a DAG (Directed Acyclic Graph) to represent the precedence relations between tasks and develop a Minimum-Energy Model to find the optimal task assignment. Heterogeneous multi-core architectures can execute tasks at different voltage levels with DVFS, which leads to different execution times and energy consumption. The experimental results demonstrate that our approach outperforms state-of-the-art algorithms in this field (maximum improvement of 24.6%).
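
A heavily simplified sketch of the underlying idea follows: it assumes a serial chain of tasks with deterministic execution times (rather than a DAG with completion probabilities) and uses dynamic programming to pick one DVFS level per task that minimizes energy within a deadline. All names and the structure are assumptions, not the paper's Accelerated Search algorithm.

```cpp
#include <limits>
#include <vector>

// Each task can run at several DVFS levels, each with its own time and energy.
struct LevelOption {
    int time;       // execution time at this level (discrete units)
    double energy;  // energy consumed at this level
};

// options[i] lists the available levels for task i; returns the minimum energy
// of any level assignment whose total time fits the deadline, or +infinity.
double min_energy_chain(const std::vector<std::vector<LevelOption>>& options, int deadline) {
    const double INF = std::numeric_limits<double>::infinity();
    // best[t] = minimum energy to finish all tasks processed so far in total time t.
    std::vector<double> best(deadline + 1, INF);
    best[0] = 0.0;
    for (const auto& task : options) {
        std::vector<double> next(deadline + 1, INF);
        for (int t = 0; t <= deadline; ++t) {
            if (best[t] == INF) continue;
            for (const auto& lvl : task) {
                int nt = t + lvl.time;
                if (nt <= deadline && best[t] + lvl.energy < next[nt]) {
                    next[nt] = best[t] + lvl.energy;  // extend the schedule with this level
                }
            }
        }
        best.swap(next);
    }
    double answer = INF;
    for (double e : best) if (e < answer) answer = e;
    return answer;
}
```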

Keywords: directed graphs; dynamic programming; embedded systems; energy conservation; energy consumption; mobile computing; multiprocessing systems; power aware computing; probability; search problems; DAG; DVFS; accelerated search algorithm; confidence probability; directed acyclic graph; embedded real-time systems; energy efficiency; execution time; heterogeneous multicore real-time systems; minimum energy model; mobile devices; precedent relation; system performance; task assignment optimization; task completion probability; voltage level; Algorithm design and analysis; Dynamic programming; Energy consumption; Heuristic algorithms; Multicore processing; Real-time systems; Time factors; heterogeneous multi-core real-time system; minimum energy; probability statistics; tasks assignment (ID#: 16-9959)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336220&isnumber=7336120

 

A. S. S. Mohamed, A. A. El-Moursy and H. A. H. Fahmy, “Real-Time Memory Controller for Embedded Multi-Core System,” 2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 839-842. doi: 10.1109/HPCC-CSS-ICESS.2015.133

Abstract: Modern chip multi-cores (CMPs) are increasingly in demand because of their high performance, especially in real-time embedded systems. At the same time, bounded latencies have become vital to guarantee high performance and fairness for applications running on CMP cores. We propose a new memory controller (MCES) that prioritizes cores and assigns them defined quotas within a unified epoch. Our approach works with a variety of generations of double data rate DRAM (DDR DRAM). MCES achieves an overall performance improvement reaching 35% for a 4-core system.
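
The combination of priorities and per-core quotas within an epoch could be modeled roughly as below; this is an assumed software illustration, not the authors' hardware controller.

```cpp
#include <vector>

// Each core gets a request quota per epoch; among cores with remaining quota
// and pending requests, the highest-priority core is served first.
struct CoreState {
    int priority;    // larger value = higher priority
    int quota;       // requests allowed per epoch
    int used = 0;    // requests already served in the current epoch
    int pending = 0; // outstanding memory requests
};

// Returns the index of the core to serve next, or -1 if nothing is eligible.
int pick_core(const std::vector<CoreState>& cores) {
    int chosen = -1;
    for (int i = 0; i < static_cast<int>(cores.size()); ++i) {
        const CoreState& c = cores[i];
        if (c.pending == 0 || c.used >= c.quota) continue;  // no request or quota exhausted
        if (chosen < 0 || c.priority > cores[chosen].priority) chosen = i;
    }
    return chosen;
}

// At an epoch boundary every core's usage counter is reset.
void start_new_epoch(std::vector<CoreState>& cores) {
    for (auto& c : cores) c.used = 0;
}
```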

Keywords: DRAM chips; embedded systems; multiprocessing systems; CMP cores; DDR-DRAM; MCES; bounded latencies; chip multicores; double-data-rate DRAM generation; embedded multicore system; real-time embedded systems; real-time memory controller; unified epoch; Arrays; Interference; Multicore processing; Random access memory; Real-time systems; Scheduling; Time factors; CMPs; Memory Controller; Real-Time (ID#: 16-9960)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336266&isnumber=7336120

 

F. M. M. u. Islam and M. Lin, “A Framework for Learning Based DVFS Technique Selection and Frequency Scaling for Multi-Core Real-Time Systems,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 721-726. doi: 10.1109/HPCC-CSS-ICESS.2015.313

Abstract: Multi-core processors have become very popular in recent years due to their higher throughput and lower energy consumption compared with unicore processors. They are widely used in portable devices and real-time systems. Despite this enormous promise, limited battery capacity restricts their potential, and hence improving system-level energy management is still a major research area. To reduce energy consumption, dynamic voltage and frequency scaling (DVFS) is commonly used in modern processors. Previously, we used reinforcement learning to scale voltage and frequency based on task execution characteristics. We also designed a learning-based method to choose a suitable DVFS technique to apply in different states. In this paper, we propose a generalized framework that integrates these two approaches for real-time systems on multi-core processors. The framework is generalized in the sense that it can work with different scheduling policies and existing DVFS techniques.
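
The reinforcement-learning component can be pictured with a generic Q-learning table over (state, frequency-level) pairs, as sketched below. The states, actions, and reward are placeholders; the authors' exact learning formulation is not reproduced here.

```cpp
#include <cstddef>
#include <vector>

// Tabular Q-learning: states could encode workload or slack, actions could be
// frequency levels or DVFS techniques, and the reward trades energy savings
// against deadline misses.
struct QTable {
    std::vector<std::vector<double>> q;  // q[state][action]
    double alpha = 0.1;                  // learning rate
    double gamma = 0.9;                  // discount factor

    QTable(std::size_t states, std::size_t actions)
        : q(states, std::vector<double>(actions, 0.0)) {}

    std::size_t best_action(std::size_t state) const {
        std::size_t best = 0;
        for (std::size_t a = 1; a < q[state].size(); ++a)
            if (q[state][a] > q[state][best]) best = a;
        return best;
    }

    // Standard one-step Q-learning update after observing (s, a, reward, s').
    void update(std::size_t s, std::size_t a, double reward, std::size_t s_next) {
        double target = reward + gamma * q[s_next][best_action(s_next)];
        q[s][a] += alpha * (target - q[s][a]);
    }
};
```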

Keywords: learning (artificial intelligence); multiprocessing systems; power aware computing; real-time systems; dynamic voltage and frequency scaling; learning-based DVFS technique selection; multicore processor; multicore real-time system; reinforcement learning-based method; system level energy management; unicore processor; Energy consumption; Heuristic algorithms; Multicore processing; Power demand; Program processors; Real-time systems; Vehicle dynamics; Dynamic voltage and frequency scaling; Energy efficiency; Machine learning; Multi-core processors; time systems (ID#: 16-9961)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336243&isnumber=7336120

 

M. A. Aguilar, J. F. Eusse, R. Leupers, G. Ascheid and M. Odendahl, “Extraction of Kahn Process Networks from While Loops in Embedded Software,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1078-1085. doi: 10.1109/HPCC-CSS-ICESS.2015.158

Abstract: Many embedded applications such as multimedia, signal processing, and wireless communications exhibit a streaming processing behavior. To take full advantage of modern multi- and many-core embedded platforms, these applications have to be parallelized by describing them in a given parallel Model of Computation (MoC). One of the most prominent MoCs is the Kahn Process Network (KPN), as it allows multiple forms of parallelism to be expressed and is suitable for efficient mapping and scheduling onto parallel embedded platforms. However, describing streaming applications manually as a KPN is a challenging task, especially since they spend most of their execution time in loops with an unbounded number of iterations. These loops are in several cases implemented as while loops, which are difficult to analyze. In this paper, we present an approach to guide the derivation of KPNs from embedded streaming applications dominated by multiple types of while loops. We evaluate the applicability of our approach on a commercial embedded platform with eight DSP cores using realistic benchmarks. Results measured on the platform show that we are able to speed up sequential benchmarks on average by a factor of up to 4.3x and in the best case by up to 7.7x. Additionally, to evaluate the effectiveness of our approach, we compared it against a state-of-the-art parallelization framework.
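
To make the target representation concrete, the sketch below shows a tiny KPN-style pair of processes connected by a blocking FIFO channel, each built around a while loop, which is the kind of structure the paper derives from sequential code. It is a generic illustration, not output of the authors' tool.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Unbounded FIFO channel with blocking reads, as in a Kahn process network.
template <typename T>
class Channel {
    std::queue<T> fifo;
    std::mutex m;
    std::condition_variable cv;
public:
    void put(T v) {
        { std::lock_guard<std::mutex> lk(m); fifo.push(v); }
        cv.notify_one();
    }
    T get() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !fifo.empty(); });
        T v = fifo.front(); fifo.pop();
        return v;
    }
};

int main() {
    Channel<int> ch;
    std::thread producer([&] {
        int sample = 0;
        while (sample < 10) ch.put(sample++);   // streaming source process
        ch.put(-1);                             // end-of-stream token
    });
    std::thread consumer([&] {
        while (true) {                          // unbounded while loop turned into a process
            int v = ch.get();
            if (v < 0) break;
            std::printf("processed %d\n", v);
        }
    });
    producer.join();
    consumer.join();
    return 0;
}
```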

Keywords: digital signal processing chips; embedded systems; parallel processing; program control structures; DSP core embedded platform; KPN; Kahn process network extraction; MoC; embedded software; embedded streaming applications; execution time; many-core embedded platforms; multicore embedded platforms; parallel embedded platforms; parallel model-of-computation; parallelized applications; sequential benchmarks; while loops; Computational modeling; Data mining; Long Term Evolution; Parallel processing; Runtime; Switches; Uplink; DSP; Kahn Process Networks; MPSoCs; Parallelization; While Loops (ID#: 16-9962)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336312&isnumber=7336120

 

V. Gunes and T. Givargis, “XGRID: A Scalable Many-Core Embedded Processor,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1143-1146. doi: 10.1109/HPCC-CSS-ICESS.2015.99

Abstract: The demand for compute cycles in embedded systems is rapidly increasing. In this paper, we introduce the XGRID embedded many-core system-on-chip architecture. XGRID makes use of a novel, FPGA-like, programmable interconnect infrastructure, offering scalability and deterministic communication using hardware-supported message passing among cores. Our experiments with XGRID are very encouraging. A number of parallel benchmarks are evaluated on the XGRID processor using the application mapping technique described in this work. We have validated our scalability claim by running our benchmarks on XGRID configurations with varying core counts. We have also validated our assertions on the XGRID architecture by comparing XGRID against the Graphite many-core architecture and have shown that XGRID outperforms Graphite.

Keywords: embedded systems; field programmable gate arrays; multiprocessing systems; parallel architectures; system-on-chip; FPGA-like, programmable interconnect infrastructure; XGRID embedded many-core system-on-chip architecture; application mapping technique; compute cycles; core count; deterministic communication; hardware supported message passing; parallel benchmarks; scalable many-core embedded processor; Benchmark testing; Communication channels; Discrete cosine transforms; Field programmable gate arrays; Multicore processing; Switches; Embedded Processors; Many-core; Multi-core; System-on-Chip Architectures (ID#: 16-9963)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336323&isnumber=7336120

 

K. Rushaidat, L. Schwiebert, B. Jackman, J. Mick and J. Potoff, “Evaluation of Hybrid Parallel Cell List Algorithms for Monte Carlo Simulation,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 1859-1864. doi: 10.1109/HPCC-CSS-ICESS.2015.260

Abstract: This paper describes efficient, scalable parallel implementations of the conventional cell list method and a modified cell list method to calculate the total system intermolecular Lennard-Jones force interactions in the Monte Carlo Gibbs ensemble. We targeted this part of the Gibbs ensemble for optimization because it is the most computationally demanding part of the force interactions in the simulation, as it involves all the molecules in the system. The modified cell list implementation reduces the number of particles that are outside the interaction range by making the cells smaller, thus reducing the number of unnecessary distance evaluations. Evaluation of the two cell list methods is done using a hybrid MPI+OpenMP approach and a hybrid MPI+CUDA approach. The cell list methods are evaluated on a small cluster of multicore CPUs, Intel Phi coprocessors, and GPUs. The performance results are evaluated using different combinations of MPI processes, threads, and problem sizes.
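
The conventional cell list idea referenced above can be sketched in serial form as follows: the simulation box is divided into cells at least as large as the cutoff (the modified method uses smaller cells), so interaction candidates of a particle come only from the 27 surrounding cells. The hybrid MPI/OpenMP/CUDA layering evaluated in the paper is omitted, and positions are assumed to be folded into [0, box).

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Schematic serial cell list for short-range pair interactions.
struct CellList {
    double box, cell_size;
    int cells_per_dim;
    std::vector<std::vector<int>> members;  // particle indices per cell

    CellList(double box_length, double cutoff) : box(box_length) {
        cells_per_dim = std::max(1, static_cast<int>(box / cutoff));  // cells >= cutoff wide
        cell_size = box / cells_per_dim;
        members.resize(static_cast<std::size_t>(cells_per_dim) * cells_per_dim * cells_per_dim);
    }

    int cell_of(const std::array<double, 3>& r) const {
        int ix = static_cast<int>(r[0] / cell_size) % cells_per_dim;
        int iy = static_cast<int>(r[1] / cell_size) % cells_per_dim;
        int iz = static_cast<int>(r[2] / cell_size) % cells_per_dim;
        return (ix * cells_per_dim + iy) * cells_per_dim + iz;
    }

    void build(const std::vector<std::array<double, 3>>& positions) {
        for (auto& m : members) m.clear();
        for (int i = 0; i < static_cast<int>(positions.size()); ++i)
            members[cell_of(positions[i])].push_back(i);
    }

    // Candidate partners: particles in the 27 cells around r, with periodic
    // wrap-around; actual distances must still be checked against the cutoff.
    std::vector<int> candidates(const std::array<double, 3>& r) const {
        int ix = static_cast<int>(r[0] / cell_size);
        int iy = static_cast<int>(r[1] / cell_size);
        int iz = static_cast<int>(r[2] / cell_size);
        std::vector<int> out;
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    int cx = (ix + dx + cells_per_dim) % cells_per_dim;
                    int cy = (iy + dy + cells_per_dim) % cells_per_dim;
                    int cz = (iz + dz + cells_per_dim) % cells_per_dim;
                    const auto& cell = members[(cx * cells_per_dim + cy) * cells_per_dim + cz];
                    out.insert(out.end(), cell.begin(), cell.end());
                }
        return out;
    }
};
```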

Keywords: Monte Carlo methods; application program interfaces; cellular biophysics; graphics processing units; intermolecular forces; materials science computing; message passing; multi-threading; parallel architectures; GPU; Intel Phi coprocessors; Monte Carlo Gibbs ensemble; Monte Carlo simulation; conventional-cell list method; distance evaluations; force interactions; hybrid MPI-plus-CUDA approach; hybrid MPI-plus-OpenMP approach; hybrid parallel cell list algorithm evaluation; modified cell list implementation; multicore CPU; performance evaluation; scalable-parallel implementations; total system intermolecular Lennard-Jones force interactions; Computational modeling; Force; Graphics processing units; Microcell networks; Monte Carlo methods; Solid modeling; Cell List; Gibbs Ensemble; Hybrid Parallel Architectures; Monte Carlo Simulations (ID#: 16-9964)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336443&isnumber=7336120

 

J. C. Beard and R. D. Chamberlain, “Run Time Approximation of Non-Blocking Service Rates for Streaming Systems,” 2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), New York, NY, 2015, pp. 792-797. doi: 10.1109/HPCC-CSS-ICESS.2015.64

Abstract: Stream processing is a compute paradigm that promises safe and efficient parallelism. Its realization requires optimization of multiple parameters such as kernel placement and communications. Most techniques to optimize streaming systems use queueing network models or network flow models, which often require estimates of the execution rate of each compute kernel. This is known as the non-blocking “service rate” of the kernel within the queueing literature. Current approaches to divining service rates are static. To maintain a tuned application during execution (while online) with non-static workloads, dynamic instrumentation of service rate is highly desirable. Our approach enables online service rate monitoring for streaming applications under most conditions, obviating the need to rely on steady state predictions for what are likely non-steady state phenomena. This work describes an algorithm to approximate non-blocking service rate, its implementation in the open source RaftLib framework, and validates the methodology using streaming applications on multi-core hardware.
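
The idea of approximating a non-blocking service rate online can be pictured as timing only the periods in which a kernel is actually doing work and dividing processed items by that accumulated busy time, as in the hedged sketch below. This is not RaftLib's actual instrumentation API; all names are assumptions.

```cpp
#include <chrono>
#include <cstdint>

// Accumulates only non-blocking work time and yields items per second of
// service time, an approximation of the kernel's non-blocking service rate.
class ServiceRateMonitor {
    using clock = std::chrono::steady_clock;
    clock::time_point start_;
    std::chrono::nanoseconds busy_{0};
    std::uint64_t items_ = 0;
public:
    // Call when the kernel begins processing (input available, output space free).
    void begin_work() { start_ = clock::now(); }

    // Call when the kernel yields or would block, reporting items just processed.
    void end_work(std::uint64_t items_processed) {
        busy_ += std::chrono::duration_cast<std::chrono::nanoseconds>(clock::now() - start_);
        items_ += items_processed;
    }

    // Approximate non-blocking service rate in items per second.
    double rate() const {
        double seconds = std::chrono::duration<double>(busy_).count();
        return seconds > 0.0 ? static_cast<double>(items_) / seconds : 0.0;
    }
};
```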

Keywords: data flow computing; multiprocessing systems; public domain software; compute kernel execution rate; dynamic instrumentation; kernel communications; kernel placement; multicore hardware; multiple parameter optimization; nonblocking service rate approximation; nonstatic workloads; nonsteady state phenomena; online service rate monitoring; open source RaftLib framework; parallelism; run-time approximation; service rate; steady state predictions; stream processing; streaming system optimization; streaming systems; Approximation methods; Computational modeling; Instruments; Kernel; Monitoring; Servers; Timing; instrumentation; parallel processing; raftlib (ID#: 16-9965)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336255&isnumber=7336120

 

M. Kiperberg, A. Resh and N. J. Zaidenberg, “Remote Attestation of Software and Execution-Environment in Modern Machines,” Cyber Security and Cloud Computing (CSCloud), 2015 IEEE 2nd International Conference on, New York, NY, 2015, pp. 335-341. doi: 10.1109/CSCloud.2015.52

Abstract: The research on network security concentrates mainly on securing the communication channels between two endpoints, which is insufficient if the authenticity of one of the endpoints cannot be determined with certainty. Previously presented methods allow one endpoint, the authentication authority, to authenticate another remote machine. These methods are, however, inadequate for modern machines that have multiple processors, introduce virtualization extensions, exhibit a greater variety of side effects, and suffer from nondeterminism. This paper addresses the advances of modern machines with respect to the method presented by Kennell. The authors describe how a remote attestation procedure, involving a challenge, needs to be structured in order to provide correct attestation of a modern remote target system.
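
A drastically simplified picture of a checksum-based attestation challenge is given below: the verifier supplies a nonce and the target returns a checksum of its code region seeded with that nonce. Real schemes such as the one discussed above additionally bind timing and micro-architectural side effects; the function here is purely illustrative and not the paper's procedure.

```cpp
#include <cstddef>
#include <cstdint>

// Toy attestation response: mix a verifier-supplied nonce into a rolling
// checksum over the code region so that precomputed answers are useless.
std::uint64_t attestation_response(const std::uint8_t* code, std::size_t len, std::uint64_t nonce) {
    std::uint64_t h = nonce ^ 0x9e3779b97f4a7c15ULL;   // seed the checksum with the challenge
    for (std::size_t i = 0; i < len; ++i) {
        h ^= code[i];
        h *= 0x100000001b3ULL;                          // FNV-style mixing step
    }
    return h;
}
```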

Keywords: security of data; virtual machines; virtualisation; authentication authority; communication channel security; execution-environment; network security; nondeterminism; remote machine authentication; remote software attestation; remote target system; virtualization extensions; Authentication; Computer architecture; Hardware; Program processors; Protocols; Virtualization; Dynamic Root of Trust; Multicore; Rootkit Detection; Self-checksumming Code; Software-based Root-of-trust; Trusted Computing; Virtualization (ID#: 16-9966)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7371504&isnumber=7371418

 

Y. Shen and K. Elphinstone, “Microkernel Mechanisms for Improving the Trustworthiness of Commodity Hardware,” Dependable Computing Conference (EDCC), 2015 Eleventh European, Paris, 2015, pp. 155-166. doi: 10.1109/EDCC.2015.16

Abstract: Trustworthy isolation is required to consolidate safety and security critical software systems on a single hardware platform. Recent advances in formally verifying correctness and isolation properties of a microkernel should enable mutually distrusting software to co-exist on the same platform with a high level of assurance of correct operation. However, commodity hardware is susceptible to transient faults triggered by cosmic rays, and alpha particle strikes, and thus may invalidate the isolation guarantees, or trigger failure in isolated applications. To increase trustworthiness of commodity hardware, we apply redundant execution techniques from the dependability community to a modern microkernel. We leverage the hardware redundancy provided by multicore processors to perform transient fault detection for applications and for the microkernel itself. This paper presents the mechanisms and framework for microkernel based systems to implement redundant execution for improved trustworthiness. It evaluates the performance of the resulting system on x86-64 and ARM platforms.
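
In the spirit of the redundant execution described above (though the paper implements it inside the microkernel rather than as an application library), a software dual-modular-redundancy sketch looks roughly like this; the result type is assumed to be comparable, and real deployments would pin the replicas to distinct cores.

```cpp
#include <future>
#include <optional>
#include <type_traits>

// Run the same computation twice, ideally on different cores, and treat a
// mismatch between the two results as a detected transient fault.
template <typename Fn>
std::optional<std::invoke_result_t<Fn>> run_redundantly(Fn fn) {
    auto a = std::async(std::launch::async, fn);   // first replica
    auto b = std::async(std::launch::async, fn);   // second replica
    auto ra = a.get();
    auto rb = b.get();
    if (ra != rb) return std::nullopt;             // divergence: likely transient fault
    return ra;
}
```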

Keywords: multiprocessing systems; operating system kernels; redundancy; safety-critical software; security of data; 64 platforms; ARM platforms; alpha particle strikes; commodity hardware trustworthiness; correctness formal verification; cosmic rays; dependability community; hardware redundancy; isolation properties; microkernel mechanisms; modern microkernel; multicore processors; redundant execution techniques; safety critical software systems; security critical software systems; transient fault detection; trustworthy isolation; x86 platforms; Hardware; Kernel; Multicore processing; Program processors; Security; Transient analysis; Microkernel; Reliability; SEUs; Security; Trustworthy Systems (ID#: 16-9967)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7371963&isnumber=7371940

 

N. Druml et al., “Time-of-Flight 3D Imaging for Mixed-Critical Systems,” 2015 IEEE 13th International Conference on Industrial Informatics (INDIN), Cambridge, 2015, pp. 1432-1437. doi: 10.1109/INDIN.2015.7281943

Abstract: Computer vision is becoming more and more important in the fields of consumer electronics, cyber-physical systems, and automotive technology. Recognizing and classifying the environment reliably is imperative for safety-critical applications, which are omnipresent, e.g., in the automotive and aviation domains. For this purpose, Time-of-Flight imaging technology is suitable, as it enables robust and cost-efficient three-dimensional sensing of the environment. However, the resource limitations of safety- and security-certified processor systems, as well as compliance with safety standards, pose a challenge for the development and integration of complex Time-of-Flight-based applications. Here we present a Time-of-Flight system approach that focuses in particular on the automotive domain. This approach is based on an automotive processing platform that complies with safety and security standards. By employing state-of-the-art hardware/software and multi-core concepts, a robust Time-of-Flight system solution is introduced that can be used in a mixed-critical application context. In this work we demonstrate the feasibility of the proposed hardware/software architecture by means of a prototype for the automotive domain. Raw Time-of-Flight sensor data is acquired and 3D data is calculated at up to 80 FPS without the use of dedicated hardware accelerators. In a next step, safety-critical automotive applications (e.g., parking assistance) can exploit this 3D data in a mixed-critical environment while respecting the requirements of ISO 26262.
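
As general background on the 3D calculation mentioned above (not the specific pipeline of the cited work), continuous-wave Time-of-Flight cameras recover depth from the measured phase shift via depth = c · Δφ / (4π · f_mod), as in this small helper:

```cpp
// Standard continuous-wave Time-of-Flight depth relation (general background).
double tof_depth_metres(double phase_radians, double modulation_frequency_hz) {
    constexpr double c = 299792458.0;                // speed of light in m/s
    constexpr double pi = 3.14159265358979323846;
    return c * phase_radians / (4.0 * pi * modulation_frequency_hz);
}
```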

Keywords: computer vision; image sensors; road safety; safety systems; software architecture; traffic engineering computing; 3D data; ISO 26262; automotive processing platform; automotive technology;  consumer electronics; cyber-physical systems; hardware accelerators; hardware-software architecture; mixed-critical systems; multicore concepts; raw time-of-flight sensor data; safety-critical automotive applications; security-certified processor systems; time-of-flight 3D imaging; Automotive engineering; Cameras; Hardware; Safety; Sensors; Three-dimensional displays; 3D sensing; Time-of-Flight; automotive applications; functional safety; mixed-critical; multi-core (ID#: 16-9968)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7281943&isnumber=7281697


Note:

Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to news@scienceofsecurity.net for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.