Multicore Computing Security 2015

 

 

 

As high-performance computing has evolved toward larger and faster systems, new approaches to security have emerged. The articles cited here focus on security issues related to multicore environments. Multicore computing relates to the Science of Security hard problems of scalability, resilience, and metrics. The work cited here was presented in 2015.


Dupros, F.; Boulahya, F.; Aochi, H.; Thierry, P., "Communication-Avoiding Seismic Numerical Kernels on Multicore Processors," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 330-335, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.230

Abstract: The finite-difference method is routinely used to simulate seismic wave propagation both in the oil and gas industry and in strong motion analysis in seismology. This numerical method also lies at the heart of a significant fraction of numerical solvers in other fields. In terms of computational efficiency, one of the main difficulties is to deal with the disadvantageous ratio between the limited pointwise computation and the intensive memory access required, leading to a memory-bound situation. Naive sequential implementations offer poor cache reuse and in general achieve a low fraction of the peak performance of the processors. The situation is worse on multicore computing nodes with several levels of memory hierarchy. In this case, each cache miss corresponds to a costly memory access. Additionally, the memory bandwidth available on multicore chips improves slowly relative to the number of computing cores, which induces a dramatic reduction of the expected parallel performance. In this article, we introduce a cache-efficient algorithm for stencil-based computations using a decomposition along both the space and the time directions. We report a maximum speedup of 3.59x over the standard implementation.

Keywords: cache storage; finite difference methods; gas industry; geophysics computing; multiprocessing systems; petroleum industry; seismic waves; seismology; wave propagation; Naive sequential implementations; cache-efficient algorithm; cache-reuse; communication-avoiding seismic numerical kernel; computational efficiency; finite-difference method; gas industry; memory bandwidth; memory hierarchy; multicore chips; multicore computing nodes; multicore processors; numerical method; numerical solvers; oil industry; peak performance; pointwise computation; seismic wave propagation simulation; seismology; stencil-based computations; strong motion analysis; Memory management; Multicore processing; Optimization; Program processors; Seismic waves; Standards; communication-avoiding; multicore; seismic (ID#: 16-9382)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336184&isnumber=7336120
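
The space-time decomposition described above can be made concrete with a short sketch. This is not the authors' kernel: it is a minimal C illustration of overlapped (ghost-zone) space-time tiling for a 1-D three-point stencil, with grid size, tile width, time-block depth, and the clamped boundary treatment all chosen as assumptions for brevity.

/* Overlapped (ghost-zone) space-time tiling for a 1-D 3-point stencil.
 * Each tile of width W is extended by TB halo cells per side, advanced
 * TB time steps in a private buffer, then only its interior is written
 * back, so the tile stays cache-resident across TB time steps. */
#include <string.h>

#define N   4096      /* grid points (assumed)              */
#define T   64        /* total time steps (assumed)         */
#define W   256       /* spatial tile width (assumed)       */
#define TB  8         /* time steps fused per tile pass     */

void stencil_tiled(double *u /* length N, updated in place */)
{
    double a[W + 2 * TB], b[W + 2 * TB];
    for (int t0 = 0; t0 < T; t0 += TB) {
        double next[N];
        for (int x0 = 0; x0 < N; x0 += W) {
            int lo = x0 - TB, hi = x0 + W + TB;            /* extended tile */
            for (int i = lo; i < hi; i++)                  /* clamped loads  */
                a[i - lo] = u[i < 0 ? 0 : (i >= N ? N - 1 : i)];
            int len = hi - lo;
            for (int t = 0; t < TB; t++) {                 /* advance TB steps */
                for (int i = 1; i < len - 1; i++)
                    b[i] = 0.25 * a[i - 1] + 0.5 * a[i] + 0.25 * a[i + 1];
                b[0] = a[0]; b[len - 1] = a[len - 1];      /* halo edges held */
                memcpy(a, b, len * sizeof(double));
            }
            /* After TB steps, only the interior W cells are valid. */
            memcpy(&next[x0], &a[TB], W * sizeof(double));
        }
        memcpy(u, next, sizeof(next));
    }
}

Because each tile pass is independent within a time block, the x0 loop is also the natural place to apply thread-level parallelism.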

 

Xiaohao Lin; Weichen Liu; Chunming Xiao; Jie Dai; Xianlu Luo; Dan Zhang; Duo Liu; Kaijie Wu; Qingfeng Zhuge; Sha, E.H.-M., "Realistic Task Parallelization of the H.264 Decoding Algorithm for Multiprocessors," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 871-874, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.33

Abstract: Hardware technology has developed ahead of software technology in recent years. Companies lack software techniques that can fully utilize modern multi-core computing resources, mainly due to the difficulty of investigating the inherent parallelism inside software. This problem exists in products ranging from energy-sensitive smartphones to performance-eager data centers. In this paper, we present a case study on the parallelization of the complex industry-standard H.264 HDTV decoder application on multi-core systems. An optimal schedule of the tasks is obtained and implemented by a carefully defined software parallelization framework (SPF). The parallel software framework is proposed together with a set of rules to direct parallel software programming (PSPR). A pre-processing phase based on the rules is applied to the source code to make the SPF applicable. The task-level parallel version of the H.264 decoder is implemented and tested extensively on a workstation running Linux. Significant performance improvement is observed for a set of benchmarks composed of 720p videos. The SPF and the PSPR will together serve as a reference for future parallel software implementations and direct the development of automated tools.

Keywords: Linux; high definition television; multiprocessing systems; parallel programming; source code (software); video coding; H.264 HDTV decoder application; H.264 decoding algorithm; Linux; PSPR; SPF; data centers; energy-sensitive smart phones; multicore computing resources; multiprocessors; optimal task schedule; parallel software implementations; parallel software programming; performance improvement; preprocessing phase; realistic task parallelization; software parallelization framework; source code; task-level parallel; workstation; Decoding; Industries; Parallel processing; Parallel programming; Software; Software algorithms; Videos (ID#: 16-9383)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336273&isnumber=7336120

 

Cilardo, A.; Flich, J.; Gagliardi, M.; Gavila, R.T., "Customizable Heterogeneous Acceleration for Tomorrow's High-Performance Computing," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1181-1185, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.303

Abstract: High-performance computing as we know it today is experiencing unprecedented changes, encompassing all levels from technology to use cases. This paper explores the adoption of customizable, deeply heterogeneous manycore systems for future QoS-sensitive and power-efficient high-performance computing. At the heart of the proposed architecture is a NoC-based manycore system embracing medium-end CPUs, GPU-like processors, and reconfigurable hardware regions. The paper discusses the high-level design principles inspiring this innovative architecture as well as the key role that heterogeneous acceleration, ranging from multicore processors and GPUs down to FPGAs, might play for tomorrow's high-performance computing.

Keywords: field programmable gate arrays; graphics processing units; multiprocessing systems; network-on-chip; parallel processing; power aware computing; quality of service; FPGA; GPU-like processors; NoC-based many-core system; QoS-sensitive computing; customizable heterogeneous acceleration; heterogeneous acceleration; heterogeneous manycore systems; high-level design principles; high-performance computing; innovative architecture; medium-end CPU; multicore processors; power-efficient high-performance computing; reconfigurable hardware regions; Acceleration; Computer architecture; Field programmable gate arrays; Hardware; Program processors; Quality of service; Registers (ID#: 16-9384)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336329&isnumber=7336120 

 

Haidar, A.; YarKhan, A.; Chongxiao Cao; Luszczek, P.; Tomov, S.; Dongarra, J., "Flexible Linear Algebra Development and Scheduling with Cholesky Factorization," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 861-864, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.285

Abstract: Modern high performance computing environments are composed of networks of compute nodes that often contain a variety of heterogeneous compute resources, such as multicore CPUs and GPUs. One challenge faced by domain scientists is how to efficiently use all these distributed, heterogeneous resources. In order to use the GPUs effectively, the workload parallelism needs to be much greater than the parallelism for a multicore-CPU. Additionally, effectively using distributed memory nodes brings out another level of complexity where the workload must be carefully partitioned over the nodes. In this work we are using a lightweight runtime environment to handle many of the complexities in such distributed, heterogeneous systems. The runtime environment uses task-superscalar concepts to enable the developer to write serial code while providing parallel execution. The task-programming model allows the developer to write resource-specialization code, so that each resource gets the appropriate sized workload-grain. Our task-programming abstraction enables the developer to write a single algorithm that will execute efficiently across the distributed heterogeneous machine. We demonstrate the effectiveness of our approach with performance results for dense linear algebra applications, specifically the Cholesky factorization.

Keywords: distributed memory systems; graphics processing units; mathematics computing; matrix decomposition; parallel processing; resource allocation; scheduling; Cholesky factorization; GPU; compute nodes; distributed heterogeneous machine; distributed memory nodes; distributed resources; flexible linear algebra development; flexible linear algebra scheduling; heterogeneous compute resources; high performance computing environments; multicore-CPU; parallel execution; resource-specialization code; serial code; task-programming abstraction; task-programming model; task-superscalar concept; workload parallelism; Graphics processing units; Hardware; Linear algebra; Multicore processing; Parallel processing; Runtime; Scalability; Cholesky factorization; accelerator-based distributed memory computers; heterogeneous HPC computing; superscalar dataflow scheduling (ID#: 16-9385)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336271&isnumber=7336120
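
The task-superscalar style the abstract describes can be sketched, on a single shared-memory node only, using OpenMP task dependences. This is not the cited distributed heterogeneous runtime; the sketch assumes an OpenMP 4.5+ compiler and Fortran BLAS/LAPACK symbols (dpotrf_, dtrsm_, dsyrk_, dgemm_), stores the matrix as BS x BS column-major tiles referenced through an array of pointers, and uses the first element of each tile as the dependence proxy.

/* Tiled right-looking Cholesky (lower triangle) written as dependent tasks.
 * A[i][j] points to tile (i,j); the runtime orders tasks via the depend
 * clauses, giving the dataflow execution the abstract refers to. */
extern void dpotrf_(const char *uplo, const int *n, double *a, const int *lda, int *info);
extern void dtrsm_(const char *side, const char *uplo, const char *trans, const char *diag,
                   const int *m, const int *n, const double *alpha,
                   const double *a, const int *lda, double *b, const int *ldb);
extern void dsyrk_(const char *uplo, const char *trans, const int *n, const int *k,
                   const double *alpha, const double *a, const int *lda,
                   const double *beta, double *c, const int *ldc);
extern void dgemm_(const char *transa, const char *transb, const int *m, const int *n,
                   const int *k, const double *alpha, const double *a, const int *lda,
                   const double *b, const int *ldb, const double *beta,
                   double *c, const int *ldc);

void cholesky_tiled(int NT, int BS, double *A[NT][NT])
{
    const double one = 1.0, mone = -1.0;
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NT; k++) {
        #pragma omp task depend(inout: A[k][k][0])
        { int info; dpotrf_("L", &BS, A[k][k], &BS, &info); }

        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[k][k][0]) depend(inout: A[i][k][0])
            dtrsm_("R", "L", "T", "N", &BS, &BS, &one, A[k][k], &BS, A[i][k], &BS);
        }
        for (int i = k + 1; i < NT; i++) {
            #pragma omp task depend(in: A[i][k][0]) depend(inout: A[i][i][0])
            dsyrk_("L", "N", &BS, &BS, &mone, A[i][k], &BS, &one, A[i][i], &BS);
            for (int j = k + 1; j < i; j++) {
                #pragma omp task depend(in: A[i][k][0], A[j][k][0]) depend(inout: A[i][j][0])
                dgemm_("N", "T", &BS, &BS, &BS, &mone, A[i][k], &BS, A[j][k], &BS, &one, A[i][j], &BS);
            }
        }
    }
}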

 

Qiuming Luo; Feng Xiao; Yuanyuan Zhou; Zhong Ming, "Performance Profiling of VMs on NUMA Multicore Platform by Inspecting the Uncore Data Flow," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 914-917, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.47

Abstract: The NUMA (Non-Uniform Memory Access) multicore platform has become increasingly popular, providing hardware-level support for hot fields such as cloud computing and big data, and deploying virtual machines on NUMA is a key technology. However, performance degradation in virtual machines is not negligible, because the guest OS has little or inaccurate knowledge of the underlying hardware. Our research focuses on performance profiling of VMs on multicore platforms by inspecting the uncore data flow, and we design a performance profiling tool called VMMprof based on PMUs (Performance Monitoring Units). It supports the uncore part of the processor, a capability beyond that of existing tools. Experiments show that VMMprof can identify typical factors that affect the performance of individual processes and the whole system.

Keywords: data flow computing; memory architecture; multiprocessing systems; performance evaluation; virtual machines; NUMA multicore platform; PMU; VM performance profiling; VMMprof; hardware level support; nonuniform memory access; performance degradation; performance monitoring units; performance profiling tool; uncore data flow; uncore data flow inspection; virtual machines; Bandwidth; Hardware; Monitoring; Multicore processing; Phasor measurement units; Sockets; Virtual machining; NUMA; PMU; VMMprof; VMs; uncore (ID#: 16-9386)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336283&isnumber=7336120
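
VMMprof itself is not publicly reproduced here; the sketch below only shows the generic Linux perf_event mechanism that a PMU-based profiler of this kind could build on. The raw event code is a hypothetical placeholder: real uncore events are platform specific and are enumerated under /sys/bus/event_source/devices/.

/* Count one PMU event on CPU 0 for all tasks (hedged sketch; the raw event
 * code below is a placeholder, not a real uncore event on any platform). */
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
                            int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size     = sizeof(attr);
    attr.type     = PERF_TYPE_RAW;   /* uncore PMUs export their own type ids */
    attr.config   = 0x01b7;          /* placeholder event code (assumption)   */
    attr.disabled = 1;

    /* pid = -1, cpu = 0: system-wide counting on CPU 0 (needs privileges). */
    int fd = (int)perf_event_open(&attr, -1, 0, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    sleep(1);                                    /* measurement window */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("events on cpu0: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}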

 

Jiachen Xue; Chong Chen; Lin Ma; Teng Su; Chen Tian; Wenqin Zheng; Ziang Hu, "Task-D: A Task Based Programming Framework for Distributed System," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1663-1668, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.299

Abstract: We present Task-D, a task-based distributed programming framework. Traditionally, distributed programs are written using either low-level MPI or high-level pattern-based models such as Hadoop/Spark. Task-based models are widely and successfully used for multicore and heterogeneous environments, but rarely for distributed ones. Task-D tries to bridge this gap by creating a higher-level abstraction than MPI while providing more flexibility than Hadoop/Spark for task-based distributed programming. The Task-D framework relieves programmers of the complexities involved in distributed programming. We provide a set of APIs that can be directly embedded into user code to enable the program to run in a distributed fashion across heterogeneous computing nodes. We also explore the design space and the necessary features the runtime should support, including data communication among tasks, data sharing among programs, resource management, memory transfers, job scheduling, automatic workload balancing, fault tolerance, etc. A prototype system is realized as one implementation of Task-D. A distributed ALS algorithm is implemented using the Task-D APIs and achieves significant performance gains compared to a Spark-based implementation. We conclude that task-based models are well suited to distributed programming. Task-D not only improves programmability for distributed environments but also leverages performance through effective runtime support.

Keywords: application program interfaces; message passing; parallel programming; automatic workload balancing; data communication; distributed ALS algorithm; distributed programming; distributed system; heterogeneous computing node; high-level pattern based; job scheduling; low-level MPI; resource management; task-D API; task-based programming framework; Algorithm design and analysis; Data communication; Fault tolerance; Fault tolerant systems; Programming; Resource management; Synchronization (ID#: 16-9387)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336408&isnumber=7336120

 

Beard, J.C.; Chamberlain, R.D., "Run Time Approximation of Non-blocking Service Rates for Streaming Systems," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 792-797, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.64

Abstract: Stream processing is a compute paradigm that promises safe and efficient parallelism. Its realization requires optimization of multiple parameters such as kernel placement and communications. Most techniques to optimize streaming systems use queueing network models or network flow models, which often require estimates of the execution rate of each compute kernel. This is known as the non-blocking "service rate" of the kernel within the queueing literature. Current approaches to divining service rates are static. To maintain a tuned application during execution (while online) with non-static workloads, dynamic instrumentation of service rate is highly desirable. Our approach enables online service rate monitoring for streaming applications under most conditions, obviating the need to rely on steady state predictions for what are likely non-steady state phenomena. This work describes an algorithm to approximate non-blocking service rate, its implementation in the open source RaftLib framework, and validates the methodology using streaming applications on multi-core hardware.

Keywords: data flow computing; multiprocessing systems; public domain software; compute kernel execution rate; dynamic instrumentation; kernel communications; kernel placement; multicore hardware; multiple parameter optimization; nonblocking service rate approximation; nonstatic workloads; nonsteady state phenomena; online service rate monitoring; open source RaftLib framework; parallelism; run-time approximation; service rate; steady state predictions; stream processing; streaming system optimization; streaming systems; Approximation methods; Computational modeling; Instruments; Kernel; Monitoring; Servers; Timing; instrumentation; parallel processing; raftlib; stream processing (ID#: 16-9388)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336255&isnumber=7336120
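
The quantity being approximated can be made concrete with a small sketch. This is not the RaftLib implementation; it is a generic C illustration of the idea of accumulating only non-blocked service time and dividing the items processed by it.

/* Estimating a kernel's non-blocking service rate at run time: accumulate
 * only the wall-clock time spent inside the compute routine while neither
 * the input queue was empty nor the output queue was full, then divide the
 * items processed by that busy time. */
#include <stdint.h>
#include <time.h>

struct rate_probe {
    uint64_t items;        /* items processed while unblocked       */
    double   busy_s;       /* accumulated non-blocked service time  */
};

static double now_s(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Call around each kernel invocation that could both read its input and
 * write its output without blocking, e.g.
 *   double t0 = now_s(); run_kernel(); probe_sample(&p, t0, now_s(), n); */
static void probe_sample(struct rate_probe *p, double t_start, double t_end,
                         uint64_t items_done)
{
    p->busy_s += t_end - t_start;
    p->items  += items_done;
}

static double probe_service_rate(const struct rate_probe *p)
{
    return p->busy_s > 0.0 ? (double)p->items / p->busy_s : 0.0;   /* items/s */
}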

 

Bogdan, P.; Yuankun Xue, "Mathematical Models and Control Algorithms for Dynamic Optimization of Multicore Platforms: A Complex Dynamics Approach," in Computer-Aided Design (ICCAD), 2015 IEEE/ACM International Conference on, pp. 170-175, 2-6 Nov. 2015. doi: 10.1109/ICCAD.2015.7372566

Abstract: The continuous increase in integration densities contributed to a shift from Dennard's scaling to a parallelization era of multi-/many-core chips. However, for multicores to rapidly percolate the application domain from consumer multimedia to high-end functionality (e.g., security, healthcare, big data), power/energy and thermal efficiency challenges must be addressed. Increased power densities can raise on-chip temperatures, which in turn decrease chip reliability and performance, and increase cooling costs. For a dependable multicore system, dynamic optimization (power / thermal management) has to rely on accurate yet low complexity workload models. Towards this end, we present a class of mathematical models that generalize prior approaches and capture their time dependence and long-range memory with minimum complexity. This modeling framework serves as the basis for defining new efficient control and prediction algorithms for hierarchical dynamic power management of future data-centers-on-a-chip.

Keywords: multiprocessing systems; power aware computing; temperature; Dennard scaling; chip performance; chip reliability; complex dynamics approach; control algorithm; data-centers-on-a-chip; dynamic optimization; hierarchical dynamic power management; many-core chips; multicore chips; multicore platform; on-chip temperature; power density; power management; prediction algorithm; thermal management; Autoregressive processes; Heuristic algorithms; Mathematical model; Measurement; Multicore processing; Optimization; Stochastic processes (ID#: 16-9389)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7372566&isnumber=7372533

 

Mohamed, A.S.S.; El-Moursy, A.A.; Fahmy, H.A.H., "Real-Time Memory Controller for Embedded Multi-core System," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 839-842, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.133

Abstract: Modern chip multiprocessors (CMPs) are in increasing demand because of their high performance, especially in real-time embedded systems. At the same time, bounded latencies have become vital to guarantee high performance and fairness for applications running on CMP cores. We propose MCES, a memory controller that prioritizes cores and assigns them defined quotas within a unified epoch. Our approach works across a variety of generations of double data rate DRAM (DDR DRAM). MCES achieves an overall performance improvement of up to 35% for a 4-core system.

Keywords: DRAM chips; embedded systems; multiprocessing systems; CMP cores; DDR-DRAM; MCES; bounded latencies; chip multicores; double-data-rate DRAM generation; embedded multicore system; real-time embedded systems; real-time memory controller; unified epoch; Arrays; Interference; Multicore processing; Random access memory; Real-time systems; Scheduling; Time factors; CMPs; Memory Controller; Real-Time (ID#: 16-9390)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336266&isnumber=7336120
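
The MCES design details are not given in the abstract, so the following is only a hedged C sketch of the general idea it names: per-core request quotas replenished at the start of each unified epoch, with prioritized selection among cores that still hold quota. The field names and the epoch length are assumptions.

/* Epoch-based arbitration sketch: each core has a fixed request quota per
 * epoch; pending requests from cores with remaining quota are served in
 * static priority order, and quotas are replenished when the epoch ends. */
#include <stdbool.h>
#include <stdint.h>

#define NCORES        4
#define EPOCH_CYCLES  1000

struct arbiter {
    uint32_t quota[NCORES];      /* remaining requests this epoch           */
    uint32_t budget[NCORES];     /* per-core quota (e.g. set by criticality) */
    uint8_t  priority[NCORES];   /* static priority, 0 = highest             */
    uint64_t epoch_start;        /* cycle at which the current epoch began   */
};

/* pending[c] is true if core c has an outstanding memory request.
 * Returns the core to service next, or -1 if none may be served now. */
static int arbitrate(struct arbiter *a, const bool pending[NCORES], uint64_t cycle)
{
    if (cycle - a->epoch_start >= EPOCH_CYCLES) {        /* start a new epoch */
        for (int c = 0; c < NCORES; c++)
            a->quota[c] = a->budget[c];
        a->epoch_start = cycle;
    }
    int best = -1;
    for (int c = 0; c < NCORES; c++)                     /* highest-priority core */
        if (pending[c] && a->quota[c] > 0 &&
            (best < 0 || a->priority[c] < a->priority[best]))
            best = c;
    if (best >= 0)
        a->quota[best]--;                                /* charge the quota */
    return best;
}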

 

Songyuan Li; Jinglei Meng; Licheng Yu; Jianliang Ma; Tianzhou Chen; Minghui Wu, "Buffer Filter: A Last-Level Cache Management Policy for CPU-GPGPU Heterogeneous System," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 266-271, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.290

Abstract: There is a growing trend towards heterogeneous systems, which contain CPUs and GPGPUs in a single chip. Managing the various on-chip resources shared between CPUs and GPGPUs is a big issue, however, and the last-level cache (LLC) is one of the most critical resources due to its impact on system performance. Well-known cache replacement policies such as LRU and DRRIP, designed for a CPU, are not well suited to heterogeneous systems, because the LLC is dominated by memory accesses from thousands of threads of GPGPU applications, which may lead to significant performance degradation for the CPU. Another reason is that a GPGPU can tolerate memory latency when it has a sufficient number of active threads, but those policies do not exploit this feature. In this paper we propose a novel shared LLC management policy for CPU-GPGPU heterogeneous systems called Buffer Filter which takes advantage of the memory latency tolerance of GPGPUs. The policy restricts streaming requests from the GPGPU by adding a buffer to the memory system, vacating LLC space for cache-sensitive CPU applications. Although the GPGPU suffers some IPC loss, its memory latency tolerance preserves the basic performance of GPGPU applications. The experiments show that the Buffer Filter is able to filter 50% to 75% of the total GPGPU streaming requests at the cost of a small GPGPU IPC decrease, and improves the hit rate of CPU applications by 2x to 7x.

Keywords: cache storage; graphics processing units; CPU-GPGPU heterogeneous system; buffer filter; cache replacement policies; cache-sensitive CPU applications; general-purpose graphics processing unit; last-level cache management policy; memory access; memory latency tolerance; on-chip resources; shared LLC management policy; Benchmark testing; Central Processing Unit; Instruction sets; Memory management; Multicore processing; Parallel processing; System performance; heterogeneous system; multicore; shared last-level cache (ID#: 16-9391)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336174&isnumber=7336120

 

Muhammad Mahbub ul Islam, F.; Man Lin, "A Framework for Learning Based DVFS Technique Selection and Frequency Scaling for Multi-core Real-Time Systems," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 721-726, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.313

Abstract: Multi-core processors have become very popular in recent years due to their higher throughput and lower energy consumption compared with unicore processors. They are widely used in portable devices and real-time systems. Despite their enormous promise, limited battery capacity restricts their potential, and hence improving system-level energy management remains a major research area. In order to reduce energy consumption, dynamic voltage and frequency scaling (DVFS) is commonly used in modern processors. Previously, we used reinforcement learning to scale voltage and frequency based on task execution characteristics. We also designed a learning-based method to choose a suitable DVFS technique to execute at different states. In this paper, we propose a generalized framework which integrates these two approaches for real-time systems on multi-core processors. The framework is generalized in the sense that it can work with different scheduling policies and existing DVFS techniques.

Keywords: learning (artificial intelligence); multiprocessing systems; power aware computing; real-time systems; dynamic voltage and frequency scaling; learning-based DVFS technique selection; multicore processor; multicore real-time system; reinforcement learning-based method; system level energy management; unicore processor; Energy consumption; Heuristic algorithms; Multicore processing; Power demand; Program processors; Real-time systems; Vehicle dynamics; Dynamic voltage and frequency scaling; Energy efficiency; Machine learning; Multi-core processors; time systems (ID#: 16-9392)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336243&isnumber=7336120
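
The reinforcement-learning step the abstract builds on can be illustrated with a minimal tabular Q-learning sketch in C. The state discretization, reward definition, and constants are assumptions, not the paper's formulation.

/* Tabular Q-learning for picking a frequency level from a discretized state
 * (e.g. a workload-intensity bucket); the reward would trade off energy
 * against deadline slack. */
#include <stdlib.h>

#define NSTATES  8     /* e.g. buckets of observed utilization (assumed) */
#define NFREQS   4     /* available P-states (assumed)                   */

static double Q[NSTATES][NFREQS];
static const double ALPHA = 0.1, GAMMA = 0.9, EPSILON = 0.1;

static int choose_freq(int state)
{
    if ((double)rand() / RAND_MAX < EPSILON)          /* explore */
        return rand() % NFREQS;
    int best = 0;                                     /* exploit */
    for (int f = 1; f < NFREQS; f++)
        if (Q[state][f] > Q[state][best]) best = f;
    return best;
}

/* Standard Q-learning update after observing (state s, action f) -> reward,
 * next state s_next. */
static void update_q(int s, int f, double reward, int s_next)
{
    double max_next = Q[s_next][0];
    for (int g = 1; g < NFREQS; g++)
        if (Q[s_next][g] > max_next) max_next = Q[s_next][g];
    Q[s][f] += ALPHA * (reward + GAMMA * max_next - Q[s][f]);
}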

 

Ying Li; Jianwei Niu; Meikang Qiu; Xiang Long, "Optimizing Tasks Assignment on Heterogeneous Multi-core Real-Time Systems with Minimum Energy," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 577-582, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.126

Abstract: The main challenge for embedded real-time systems, especially for mobile devices, is the trade-off between system performance and energy efficiency. By studying the relationship between energy consumption, execution time, and completion probability of tasks on heterogeneous multi-core architectures, we propose an Accelerated Search algorithm based on dynamic programming to obtain a combination of task schemes that can be completed in a given time with a confidence probability while consuming the minimum possible energy. We adopt a DAG (Directed Acyclic Graph) to represent the precedence relations between tasks and develop a Minimum-Energy Model to find the optimal task assignment. Heterogeneous multi-core architectures can execute tasks at different voltage levels with DVFS, which leads to different execution times and energy consumption. The experimental results demonstrate that our approach outperforms state-of-the-art algorithms in this field (maximum improvement of 24.6%).

Keywords: directed graphs; dynamic programming; embedded systems; energy conservation; energy consumption; mobile computing; multiprocessing systems; power aware computing; probability; search problems; DAG; DVFS; accelerated search algorithm; confidence probability; directed acyclic graph; dynamic programming; embedded real-time systems; energy consumption; energy efficiency; execution time; heterogeneous multicore real-time systems; minimum energy model; mobile devices; precedent relation; system performance; task assignment optimization; task completion probability; voltage level; Algorithm design and analysis; Dynamic programming; Energy consumption; Heuristic algorithms; Multicore processing; Real-time systems; Time factors; heterogeneous multi-core real-time system; minimum energy; probability statistics; tasks assignment (ID#: 16-9393)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336220&isnumber=7336120
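
A much-simplified sketch of the dynamic-programming idea follows: each task offers several execution schemes (core type / voltage level) with an energy cost and a discretized time, and a knapsack-style table tracks the minimum energy achievable within a time budget. The paper's DAG precedence constraints, parallel execution across cores, and completion probabilities are omitted; this only illustrates the DP skeleton under those stated simplifications.

/* Knapsack-style DP over a discretized time budget: dp[t] is the minimum
 * energy to finish the tasks considered so far within t time units, with
 * tasks treated as if executed back to back (a deliberate simplification). */
#include <float.h>

#define T_MAX  100          /* discretized time budget (assumed units) */

struct scheme { int time; double energy; };

double min_energy(int ntasks, int nschemes, struct scheme s[ntasks][nschemes])
{
    double dp[T_MAX + 1], next[T_MAX + 1];
    for (int t = 0; t <= T_MAX; t++) dp[t] = 0.0;          /* no tasks yet */

    for (int i = 0; i < ntasks; i++) {
        for (int t = 0; t <= T_MAX; t++) next[t] = DBL_MAX;
        for (int t = 0; t <= T_MAX; t++) {
            if (dp[t] == DBL_MAX) continue;                /* unreachable   */
            for (int k = 0; k < nschemes; k++) {           /* pick a scheme */
                int tt = t + s[i][k].time;
                if (tt > T_MAX) continue;                  /* misses budget */
                double e = dp[t] + s[i][k].energy;
                if (e < next[tt]) next[tt] = e;
            }
        }
        for (int t = 0; t <= T_MAX; t++) dp[t] = next[t];
    }
    double best = DBL_MAX;                 /* any finish time within budget */
    for (int t = 0; t <= T_MAX; t++)
        if (dp[t] < best) best = dp[t];
    return best;                           /* DBL_MAX means infeasible      */
}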

 

Aguilar, M.A.; Eusse, J.F.; Leupers, R.; Ascheid, G.; Odendahl, M., "Extraction of Kahn Process Networks from While Loops in Embedded Software," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1078-1085, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.158

Abstract: Many embedded applications such as multimedia, signal processing and wireless communications present a streaming processing behavior. In order to take full advantage of modern multi- and many-core embedded platforms, these applications have to be parallelized by describing them in a given parallel Model of Computation (MoC). One of the most prominent MoCs is the Kahn Process Network (KPN), as it allows multiple forms of parallelism to be expressed and is suitable for efficient mapping and scheduling onto parallel embedded platforms. However, describing streaming applications manually as a KPN is a challenging task, especially since they spend most of their execution time in loops with an unbounded number of iterations. These loops are in several cases implemented as while loops, which are difficult to analyze. In this paper, we present an approach to guide the derivation of KPNs from embedded streaming applications dominated by multiple types of while loops. We evaluate the applicability of our approach on a commercial embedded platform with eight DSP cores using realistic benchmarks. Results measured on the platform show that we are able to speed up sequential benchmarks on average by a factor of up to 4.3x and in the best case up to 7.7x. Additionally, to evaluate the effectiveness of our approach, we compared it against a state-of-the-art parallelization framework.

Keywords: digital signal processing chips; embedded systems; parallel processing; program control structures; DSP core embedded platform; KPN; Kahn process network extraction; MoC; embedded software; embedded streaming applications; execution time; many-core embedded platforms; multicore embedded platforms; parallel embedded platforms; parallel model-of-computation; parallelized applications; sequential benchmarks; while loops; Computational modeling; Data mining; Long Term Evolution; Parallel processing; Runtime; Switches; Uplink; DSP; Kahn Process Networks; MPSoCs; Parallelization; While Loops (ID#: 16-9394)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336312&isnumber=7336120

 

Rushaidat, K.; Schwiebert, L.; Jackman, B.; Mick, J.; Potoff, J., "Evaluation of Hybrid Parallel Cell List Algorithms for Monte Carlo Simulation," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1859-1864, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.260

Abstract: This paper describes efficient, scalable parallel implementations of the conventional cell list method and a modified cell list method to calculate the total system intermolecular Lennard-Jones force interactions in the Monte Carlo Gibbs ensemble. We targeted this part of the Gibbs ensemble for optimization because it is the most computationally demanding part of the force interactions in the simulation, as it involves all the molecules in the system. The modified cell list implementation reduces the number of particles that are outside the interaction range by making the cells smaller, thus reducing the number of unnecessary distance evaluations. Evaluation of the two cell list methods is done using a hybrid MPI+OpenMP approach and a hybrid MPI+CUDA approach. The cell list methods are evaluated on a small cluster of multicore CPUs, Intel Phi coprocessors, and GPUs. The performance results are evaluated using different combinations of MPI processes, threads, and problem sizes.

Keywords: Monte Carlo methods; application program interfaces; cellular biophysics; graphics processing units; intermolecular forces; materials science computing; message passing; multi-threading; parallel architectures; GPU; Intel Phi coprocessors; Monte Carlo Gibbs ensemble; Monte Carlo simulation; conventional-cell list method; distance evaluations; force interactions; hybrid MPI-plus-CUDA approach; hybrid MPI-plus-OpenMP approach; hybrid parallel cell list algorithm evaluation; modified cell list implementation; multicore CPU; performance evaluation; scalable-parallel implementations; total system intermolecular Lennard-Jones force interactions; Computational modeling; Force; Graphics processing units; Microcell networks; Monte Carlo methods; Solid modeling; Cell List; Gibbs Ensemble; Hybrid Parallel Architectures; Monte Carlo Simulations (ID#: 16-9395)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336443&isnumber=7336120
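
The conventional cell list the abstract starts from can be sketched briefly. This is not the paper's MPI+OpenMP/CUDA code; it is a generic linked-cell construction in C, assuming a cubic box with coordinates already wrapped into [0, box) and box >= cutoff.

/* Linked-cell list for pairwise cutoff interactions: the box is divided into
 * cells no smaller than the cutoff, particles are binned by position, and
 * each particle later only tests particles in its own and neighboring cells. */
#include <math.h>
#include <stdlib.h>

struct cells {
    int     ncell;      /* cells per box edge                  */
    double  cell_len;   /* edge length of one cell             */
    int    *head;       /* head[c]: first particle in cell c   */
    int    *next;       /* next[i]: next particle after i      */
};

static void build_cells(struct cells *cl, const double *x, const double *y,
                        const double *z, int n, double box, double cutoff)
{
    cl->ncell    = (int)floor(box / cutoff);        /* each cell >= cutoff wide */
    cl->cell_len = box / cl->ncell;
    int ntot     = cl->ncell * cl->ncell * cl->ncell;
    cl->head     = malloc(ntot * sizeof(int));
    cl->next     = malloc(n * sizeof(int));
    for (int c = 0; c < ntot; c++) cl->head[c] = -1;

    for (int i = 0; i < n; i++) {
        int cx = (int)(x[i] / cl->cell_len) % cl->ncell;
        int cy = (int)(y[i] / cl->cell_len) % cl->ncell;
        int cz = (int)(z[i] / cl->cell_len) % cl->ncell;
        int c  = (cz * cl->ncell + cy) * cl->ncell + cx;
        cl->next[i] = cl->head[c];                   /* push onto cell's list */
        cl->head[c] = i;
    }
}

The "modified" variant the abstract mentions corresponds to choosing smaller cells (more cells per cutoff radius), which shrinks the searched volume at the price of iterating over more neighbor cells.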

 

Peng Sun; Chandrasekaran, S.; Suyang Zhu; Chapman, B., "Deploying OpenMP Task Parallelism on Multicore Embedded Systems with MCA Task APIs," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 843-847, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.88

Abstract: Heterogeneous multicore embedded systems are rapidly growing, with cores of varying types and capacity. Programming these devices and exploiting the hardware has been a real challenge. Existing programming models and their execution environments are typically meant for general-purpose computation and are mostly too heavy to be adopted on resource-constrained embedded systems. Embedded programmers are still expected to use low-level and proprietary APIs, making the software built less and less portable. These challenges motivated us to explore how OpenMP, a high-level directive-based model, could be used for embedded platforms. In this paper, we translate OpenMP to the Multicore Association Task Management API (MTAPI), which is a standard API for leveraging task parallelism on embedded platforms. Our results demonstrate that the performance of our OpenMP runtime library is comparable to state-of-the-art task-parallel solutions. We believe this approach provides a portable solution, since it abstracts the low-level details of the hardware and no longer depends on vendor-specific APIs.

Keywords: application program interfaces; embedded systems; multiprocessing systems; parallel processing; MCA; MTAPI; OpenMP runtime library; OpenMP task parallelism; heterogeneous multicore embedded system; high-level directive-based model; multicore association task management API; multicore embedded system; resource-constrained embedded system; vendor-specific API; Computational modeling; Embedded systems; Hardware; Multicore processing; Parallel processing; Programming; Heterogeneous Multicore Embedded Systems; MTAPI; OpenMP; Parallel Computing (ID#: 16-9396)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336267&isnumber=7336120
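
The MTAPI translation itself is not reproduced here; the sketch below only shows the style of OpenMP tasking code that such a translation layer accepts, using the canonical recursive Fibonacci example. Each omp task region is the unit that would become a runtime-managed task on the embedded platform.

/* Canonical OpenMP tasking example; compile with an OpenMP-capable compiler
 * (e.g. -fopenmp).  Without OpenMP the pragmas are ignored and the code runs
 * serially with the same result. */
#include <stdio.h>

static long fib(long n)
{
    long a, b;
    if (n < 2) return n;
    #pragma omp task shared(a)       /* child task: fib(n-1) */
    a = fib(n - 1);
    #pragma omp task shared(b)       /* child task: fib(n-2) */
    b = fib(n - 2);
    #pragma omp taskwait             /* join before combining results */
    return a + b;
}

int main(void)
{
    long r;
    #pragma omp parallel
    #pragma omp single               /* one thread spawns the task tree */
    r = fib(30);
    printf("fib(30) = %ld\n", r);
    return 0;
}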

 

Xu, T.C.; Leppanen, V.; Liljeberg, P.; Plosila, J.; Tenhunen, H., "Trio: A Triple Class On-chip Network Design for Efficient Multicore Processors," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 951-956, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.44

Abstract: We propose and analyse an on-chip interconnect design for improving the efficiency of multicore processors. Conventional interconnection networks are usually based on a single homogeneous network with uniform processing of all traffic. While this simplifies the design, the approach can create performance bottlenecks and limit system efficiency. We investigate the traffic patterns of several real-world applications. Based on a directory cache coherence protocol, we characterise and categorize the traffic in terms of various aspects. We discover that control and unicast packets dominate the network, while the percentages of data and multicast messages are relatively low. Furthermore, we find that most invalidation messages are multicast messages, and most multicast messages are invalidation messages. Multicast invalidation messages usually have a higher number of destination nodes than other multicast messages. These observations lead to the proposed triple-class interconnect, where a dedicated multicast-capable network is responsible for the control messages and the data messages are handled by another network. Using a detailed full-system simulation environment, the proposed design is compared with a homogeneous baseline network as well as two other network designs. Experimental results show that the average network latency and energy-delay product of the proposed design improve by 24.4% and 10.2% compared with the baseline network.

Keywords: cache storage; multiprocessing systems; multiprocessor interconnection networks; network synthesis; network-on-chip; Trio; average network latency; dedicated multicast-capable network; destination nodes; directory cache coherence protocol; energy delay product; full system simulation environment; homogeneous baseline network; multicast invalidation messages; multicore processors; on-chip interconnect design; traffic pattern; triple class on-chip network design; unicast packets; Coherence; Multicore processing; Ports (Computers); Program processors; Protocols; System-on-chip; Unicast; cache; design; efficient; multicore; network-on-chip (ID#: 16-9397)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336293&isnumber=7336120

 

Rao, N.S.V.; Towsley, D.; Vardoyan, G.; Settlemyer, B.W.; Foster, I.T.; Kettimuthu, R., "Sustained Wide-Area TCP Memory Transfers over Dedicated Connections," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1603-1606, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.86

Abstract: Wide-area memory transfers between on-going computations and remote steering, analysis and visualization sites can be utilized in several High-Performance Computing (HPC) scenarios. Dedicated network connections with high capacity, low loss rates and low competing traffic are typically provisioned over current HPC infrastructures to support such transfers. To gain insights into such transfers, we collected throughput measurements for different versions of TCP between dedicated multi-core servers over emulated 10 Gbps connections with round trip times (rtt) in the range 0-366 ms. Existing TCP models and measurements over shared links are well known to exhibit monotonically decreasing, convex throughput profiles as rtt is increased. In sharp contrast, these measurements show two distinct regimes: a concave profile at lower rtts and a convex profile at higher rtts. We present analytical results that explain these regimes: (a) at lower rtt, rapid throughput increase due to slow start leads to the concave profile, and (b) at higher rtt, the TCP congestion avoidance phase with slower dynamics dominates. In both cases, however, we analytically show that throughput decreases with rtt, albeit at different rates, as confirmed by the measurements. These results provide practical TCP solutions for these transfers without requiring additional hardware or software, unlike InfiniBand and UDP solutions, respectively.

Keywords: network servers; parallel processing; sustainable development; telecommunication congestion control; telecommunication links; telecommunication traffic; transport protocols; wide area networks; HPC; concave profile; congestion avoidance; convex profile; dedicated connection; high-performance computing; multicore server; remote steering; shared link; sustained wide area TCP memory transfer; visualization site; Current measurement; Data transfer; Hardware; Linux; Software; Supercomputers; Throughput; TCP; dedicated connections; memory transfers; throughput measurements (ID#: 16-9398)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336397&isnumber=7336120
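
The two regimes the abstract explains can be summarized with standard first-order TCP relations (these are textbook approximations, not the paper's analysis); symbols: \tau is the round trip time, MSS the segment size, W_max the maximum window, and p the loss probability.

\begin{align*}
  \text{slow start (low } \tau\text{):}\quad
     & D(n) \approx \mathrm{MSS}\,(2^{\,n}-1)
       \quad\text{bytes delivered after } n \text{ round trips},\\
  \text{window-limited steady state:}\quad
     & \Theta \approx \frac{W_{\max}}{\tau},\\
  \text{loss-limited congestion avoidance:}\quad
     & \Theta \approx \frac{\mathrm{MSS}}{\tau}\,\sqrt{\frac{3}{2p}}
       \quad\text{(Mathis et al.)}.
\end{align*}

At small \tau a fixed-duration transfer spends proportionally more time past slow start, so throughput falls off slowly (concave); at large \tau the window-limited or loss-limited 1/\tau behaviour dominates (convex), matching the profiles reported above.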

 

Shekhar, M.; Ramaprasad, H.; Mueller, F., "Evaluation of Memory Access Arbitration Algorithm on Tilera's TILEPro64 Platform," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1154-1159, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.245

Abstract: As real-time embedded systems demand more and more computing power under reasonable energy budgets, multi-core platforms are a viable option. However, deploying real-time applications on multi-core platforms introduces several predictability challenges. One of these challenges is bounding the latency of memory accesses issued by real-time tasks. This challenge is exacerbated as the number of cores and, hence, the degree of resource sharing increases. Over the last several years, researchers have proposed techniques to overcome this challenge. In prior work, we proposed an arbitration policy for memory access requests over a Network-on-Chip. In this paper, we implement and evaluate variants of our arbitration policy on a real hardware platform, namely Tilera's TilePro64.

Keywords: embedded systems; multiprocessing systems; network-on-chip; storage management; TILEPro64 platform; memory access arbitration algorithm; multicore platforms; network-on-chip; real-time embedded systems; Dynamic scheduling; Engines; Hardware; Instruction sets; Memory management; Real-time systems; System-on-chip (ID#: 16-9399)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336325&isnumber=7336120

 

Gunes, V.; Givargis, T., "XGRID: A Scalable Many-Core Embedded Processor," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1143-1146, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.99

Abstract: The demand for compute cycles needed by embedded systems is rapidly increasing. In this paper, we introduce the XGRID embedded many-core system-on-chip architecture. XGRID makes use of a novel, FPGA-like, programmable interconnect infrastructure, offering scalability and deterministic communication using hardware-supported message passing among cores. Our experiments with XGRID are very encouraging. A number of parallel benchmarks are evaluated on the XGRID processor using the application mapping technique described in this work. We have validated our scalability claim by running our benchmarks on XGRID configurations varying in core count. We have also validated our assertions on the XGRID architecture by comparing XGRID against the Graphite many-core architecture and have shown that XGRID outperforms Graphite.

Keywords: embedded systems; field programmable gate arrays; multiprocessing systems; parallel architectures; system-on-chip; FPGA-like, programmable interconnect infrastructure; XGRID embedded many-core system-on-chip architecture; application mapping technique; compute cycles; core count; deterministic communication; hardware supported message passing; parallel benchmarks; scalable many-core embedded processor; Benchmark testing; Communication channels; Discrete cosine transforms; Field programmable gate arrays; Multicore processing; Switches; Embedded Processors; Many-core; Multi-core; System-on-Chip Architectures (ID#: 16-9400)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336323&isnumber=7336120

 

Jia Tian; Wei Hu; Chunqiang Li; Tianpei Li; Wenjun Luo, "Multi-thread Connection Based Scheduling Algorithm for Network on Chip," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on,  pp. 1473-1478, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.160

Abstract: More and more cores are integrated onto a single chip to improve performance and reduce CPU power consumption without increasing frequency. The cores are connected by links and organized as a network, called a network on chip (NOC), a promising paradigm. NOC improves CPU performance without increased power consumption. However, a new problem remains: how to schedule threads onto the different cores to take full advantage of the NOC. In this paper, we propose a new multi-thread scheduling algorithm for NOC based on thread connection. The connection relationships among threads are analyzed and the threads divided into thread sets; at the same time, the network topology of the NOC is analyzed, and the connection relationships among cores are captured in the NOC model and divided into regions. A correspondence between thread sets and core regions is then established according to their features, and the scheduling algorithm maps thread sets to the corresponding core regions. Within a core region, the threads of the same set are scheduled with appropriate approaches. The experiments show that the proposed algorithm can improve program performance and enhance the utilization of NOC cores.

Keywords: multi-threading; network theory (graphs); network-on-chip; performance evaluation; power aware computing; processor scheduling; CPU; NOC core; multithread connection based scheduling; multithread connection-based scheduling algorithm; network topology; network-on-chip; power consumption; Algorithm design and analysis; Heuristic algorithms; Instruction sets; Multicore processing; Network topology; Scheduling algorithms; System-on-chip; Algorithm; Network on Chip; Scheduling; Thread Connection (ID#: 16-9401)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336376&isnumber=7336120

 

Yuxiang Li; Yinliang Zhao; Huan Gao, "Using Artificial Neural Network for Predicting Thread Partitioning in Speculative Multithreading," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 823-826, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.28

Abstract: Speculative multithreading (SpMT) is a thread-level automatic parallelization technique to accelerate sequential programs on multi-core. It partitions programs into multiple threads to be speculatively executed in the presence of ambiguous data and control dependences, while the correctness of the programs is guaranteed by hardware support. Thread granularity, the number of parallel threads, and partition positions are crucial to the performance improvement in SpMT, for they determine the amount of resources (CPU, memory, cache, waiting cycles, etc.) and affect the efficiency of every PE (Processing Element). Conventionally, these three parameters are determined by heuristic rules. Although partitioning threads with heuristic rules is simple, they are a one-size-fits-all strategy and cannot guarantee an optimal thread partitioning. This paper proposes an Artificial Neural Network (ANN) based approach to learn and determine the thread partition strategy. Using the ANN-based thread partition approach, an unseen irregular program can obtain a stable, much higher speedup than with the Heuristic Rules (HR) based approach. On Prophet, a generic SpMT processor used to evaluate the performance of multithreaded programs, the novel thread partitioning policy is evaluated and reaches an average speedup of 1.80 on a 4-core processor. Experiments show that our proposed approach obtains a significant increase in speedup, with the Olden benchmarks delivering a 2.36% better performance improvement than the traditional heuristic-rules-based approach. The results indicate that our approach finds the best partitioning scheme for each program and is more stable across programs.

Keywords: multi-threading; multiprocessing systems; neural nets; ANN-based thread partition approach; HR based approach; Olden benchmark; PE; Prophet; SpMT processor; artificial neural network; heuristic rules; multicore; multithreaded programs; one-size-fits-all strategy; parallel threads; partition position; processing element; sequential programs; speculative multithreading; thread granularity; thread partitioning policy; thread partitioning prediction; thread-level automatic parallelization technique; Cascading style sheets; Conferences; Cyberspace; Embedded software; High performance computing; Safety; Security; Machine learning; Prophet; speculative multithreading; thread partitioning (ID#: 16-9402)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336262&isnumber=7336120

 

Yanyan Shen; Elphinstone, K., "Microkernel Mechanisms for Improving the Trustworthiness of Commodity Hardware," in Dependable Computing Conference (EDCC), 2015 Eleventh European, pp. 155-166, 7-11 Sept. 2015. doi: 10.1109/EDCC.2015.16

Abstract: Trustworthy isolation is required to consolidate safety and security critical software systems on a single hardware platform. Recent advances in formally verifying correctness and isolation properties of a microkernel should enable mutually distrusting software to co-exist on the same platform with a high level of assurance of correct operation. However, commodity hardware is susceptible to transient faults triggered by cosmic rays, and alpha particle strikes, and thus may invalidate the isolation guarantees, or trigger failure in isolated applications. To increase trustworthiness of commodity hardware, we apply redundant execution techniques from the dependability community to a modern microkernel. We leverage the hardware redundancy provided by multicore processors to perform transient fault detection for applications and for the microkernel itself. This paper presents the mechanisms and framework for microkernel based systems to implement redundant execution for improved trustworthiness. It evaluates the performance of the resulting system on x86-64 and ARM platforms.

Keywords: multiprocessing systems; operating system kernels; redundancy; safety-critical software; security of data; x86-64 platforms; ARM platforms; alpha particle strikes; commodity hardware trustworthiness; correctness formal verification; cosmic rays; dependability community; hardware redundancy; isolation properties; microkernel mechanisms; modern microkernel; multicore processors; redundant execution techniques; safety critical software systems; security critical software systems; transient fault detection; trustworthy isolation; x86-64 platforms; Hardware; Kernel; Multicore processing; Program processors; Security; Transient analysis; Microkernel; Reliability; SEUs; Security; Trustworthy Systems (ID#: 16-9403)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7371963&isnumber=7371940

 

Grammatikakis, M.D.; Petrakis, P.; Papagrigoriou, A.; Kornaros, G.; Coppola, M., "High-Level Security Services Based On A Hardware NoC Firewall Module," in Intelligent Solutions in Embedded Systems (WISES), 2015 12th International Workshop on, pp.73-78, 29-30 Oct. 2015. Doi:  (not provided)

Abstract: Security services are typically based on deploying different types of modules, e.g. firewall, intrusion detection or prevention systems, or cryptographic function accelerators. In this study, we focus on extending the functionality of a hardware Network-on-Chip (NoC) Firewall on the Zynq 7020 FPGA of a Zedboard. The NoC Firewall checks the physical address and rejects untrusted CPU requests to on-chip memory, thus protecting legitimate processes running in a multicore SoC from the injection of malicious instructions or data into shared memory. Based on a validated kernel-space Linux system driver for the NoC Firewall, which is seen as a reconfigurable, memory-mapped device on top of an AMBA AXI4 interconnect fabric, we develop higher-layer security services that enforce physical address protection based on a set of rules. While our primary scenario concentrates on monitors and actors related to protection from malicious (or corrupt) drivers, other interesting use cases related to healthcare ethics are also put into context.

Keywords: field programmable gate arrays; firewalls; multiprocessing systems; network-on-chip; AMBA AXI4 interconnect fabric; Zedboard; Zynq 7020 FPGA; corrupt drivers; hardware NoC firewall module; healthcare ethics; high-level security services; malicious drivers; malicious instructions; multicore SoC; network-on-chip; on-chip memory; physical address protection; reconfigurable memory-mapped device; shared memory; untrusted CPU requests; validated kernel-space Linux system driver; Field programmable gate arrays; Firewalls (computing); Hardware; Linux; Network interfaces; Registers; Linux driver; NoC; firewall; multicore SoC (ID#: 16-9404)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7356985&isnumber=7356973
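
The rule-based physical-address check the abstract describes can be sketched as follows. This is a hypothetical illustration, not the register interface of the cited NoC Firewall or its Linux driver: each rule grants specific operations on an address range to specific initiators, and anything not explicitly permitted is rejected.

/* Default-deny physical-address filtering: a request is allowed only if some
 * rule covers its address range, its initiator, and the requested operation. */
#include <stdbool.h>
#include <stdint.h>

enum op { OP_READ = 1, OP_WRITE = 2 };

struct fw_rule {
    uint32_t base, limit;    /* physical address range [base, limit)      */
    uint8_t  initiator_mask; /* bit i set: initiator i may use this rule  */
    uint8_t  allowed_ops;    /* OR of enum op values                      */
};

static bool fw_check(const struct fw_rule *rules, int nrules,
                     uint8_t initiator, uint32_t addr, enum op op)
{
    for (int i = 0; i < nrules; i++) {
        const struct fw_rule *r = &rules[i];
        if (addr >= r->base && addr < r->limit &&
            (r->initiator_mask & (1u << initiator)) &&
            (r->allowed_ops & op))
            return true;             /* explicitly permitted */
    }
    return false;                    /* default deny: reject the request */
}

In the hardware realization the rules live in memory-mapped registers and the check is applied to every interconnect transaction; the kernel driver's job is to program and audit those rules.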

 

Tuncali, C.E.; Fainekos, G.; Yann-Hang Lee, "Automatic Parallelization of Simulink Models for Multi-core Architectures," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on,  pp. 964-971, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.232

Abstract: This paper addresses the problem of parallelizing existing single-rate Simulink models for embedded control applications on multi-core architectures, considering the communication cost between blocks on different CPU cores. Utilizing the block diagram of the Simulink model, we derive the dependency graph between the different blocks. In order to solve the scheduling problem, we describe a Mixed Integer Linear Programming (MILP) formulation for optimally mapping the Simulink blocks to different CPU cores. Since the number of variables and constraints for the MILP solver grows exponentially as model size increases, solving this problem in a reasonable time becomes harder. To address this issue, we introduce a set of techniques for reducing the number of constraints in the MILP formulation. By using the proposed techniques, the MILP solver finds solutions that are closer to the optimal solution within a given time bound. We study the scalability and efficiency of our approach with synthetic benchmarks of randomly generated directed acyclic graphs. We also use the "Fault-Tolerant Fuel Control System" demo from Simulink and a Diesel engine controller from Toyota as case studies for demonstrating the applicability of our approach to real-world problems.

Keywords: control engineering computing; diesel engines; directed graphs; embedded systems; fault tolerant control; fuel systems; integer programming; linear programming; parallel architectures; processor scheduling; CPU cores; MILP formulation; MILP solver constraints; MILP solver variables; Simulink model parallelization problem; Toyota; block diagram; communication cost; dependency graph; diesel engine controller; embedded control applications; fault-tolerant fuel control system; mixed integer linear programming formulation; multicore architecture; randomly generated directed acyclic graphs; scheduling problem; synthetic benchmarks; Bismuth; Computational modeling; Job shop scheduling; Multicore processing; Optimization; Software packages; Multiprocessing; Simulink; embedded systems; model based development; optimization; task allocation (ID#: 16-9405)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336295&isnumber=7336120
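
The kind of MILP the abstract refers to can be sketched in our own notation (the paper's actual formulation, its precedence handling, and its constraint-reduction techniques are not reproduced here): x_{b,c}=1 if block b is mapped to core c, w_b is the block's computation cost, d_{a,b} the data exchanged along dependency edge (a,b), \lambda the relative cost of inter-core communication, and y_{a,b,c} linearizes co-location of a and b on core c.

\begin{align*}
  \min_{x,\,y,\,T}\;\; & T \;+\; \lambda \sum_{(a,b)\in E} d_{a,b}\Big(1 - \sum_{c} y_{a,b,c}\Big)\\
  \text{s.t.}\;\; & \sum_{c} x_{b,c} = 1 \quad \forall\, b
      \quad\text{(each block on exactly one core)}\\
  & \sum_{b} w_{b}\, x_{b,c} \;\le\; T \quad \forall\, c
      \quad\text{(per-core computation load)}\\
  & y_{a,b,c} \le x_{a,c},\qquad y_{a,b,c} \le x_{b,c}
      \quad \forall\,(a,b)\in E,\ \forall\, c
      \quad\text{(co-location linearization)}\\
  & x_{b,c},\, y_{a,b,c} \in \{0,1\},\qquad T \ge 0.
\end{align*}

Because setting y_{a,b,c}=1 lowers the objective, the solver drives it to x_{a,c}\,x_{b,c}, so only pairs split across cores pay the communication penalty; the number of y variables and linearization constraints is what grows quickly with model size, which is what the constraint-reduction techniques target.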

 

Ye, J.; Songyuan Li; Tianzhou Chen, "Shared Write Buffer to Support Data Sharing Among Speculative Multi-threading Cores," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 835-838, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.287

Abstract: Speculative Multi-threading (SpMT), a.k.a. Thread Level Speculation (TLS), is a prominent research direction for the automatic extraction of thread-level parallelism (TLP), which is increasingly appealing in the multi-core and many-core era. SpMT threads are extracted from a single thread and are tightly coupled by data dependences. Traditional private L1 caches with a coherence mechanism do not suit such intense data sharing among SpMT threads. We propose a Shared Write Buffer (SWB) that resides in parallel with the private L1 caches, but with much smaller capacity and short access delay. When a core writes a datum to the L1 cache, it writes to the SWB first, and when it reads a datum, it reads from the SWB as well as from the L1. Because the SWB is shared among the cores, it can often return a datum more quickly than the L1 if the latter must go through a coherence process to load it. In this way the SWB improves the performance of SpMT inter-core data sharing and mitigates the coherence overhead.

Keywords: cache storage; multi-threading; multiprocessing systems; SWB; SpMT intercore data sharing; SpMT thread extraction; TLS; access delay; automatic TLP extraction; coherence overhead mitigation; data dependences; data sharing; datum; performance improvement; private L1 caches; shared write buffer; speculative multithreading cores; thread level parallelism; thread level speculation; Coherence; Delays; Instruction sets; Message systems; Multicore processing; Protocols; Cache; Multi-Core; Shared Write Buffer; SpMT; Speculative Multi-Threading (ID#: 16-9406)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336265&isnumber=7336120

 

Chunqiang Li; Wei Hu; Puzhang Wang; Mengke Song; Xinwei Cao, "A Novel Critical Path Based Routing Method for NOC," in High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS), 2015 IEEE 12th International Conference on Embedded Software and Systems (ICESS), 2015 IEEE 17th International Conference on, pp. 1546-1551, 24-26 Aug. 2015. doi: 10.1109/HPCC-CSS-ICESS.2015.159

Abstract: As more and more cores are integrated onto a single chip and connected by links, the network on chip (NOC) provides a new on-chip structure. Tasks are mapped to the cores on the chip and have communication requirements according to their relationships. When communication data are transmitted on the network, they need a suitable path to the target cores with low latency. In this paper, we propose a new routing method for NOC based on the static critical path. Multi-threaded tasks are analyzed first and their running paths marked; the static critical path is found according to the lengths of the running paths, and the messages on the critical path are marked as critical messages. When messages arrive at the on-chip routers, critical messages are forwarded first according to their importance. This new routing method has been tested in a simulation environment. The experimental results show that it can accelerate the transmission of critical messages and improve task performance.

Keywords: network routing; network-on-chip; NOC; chip structure; communication data transmission; communication requirements; critical messages; critical path; critical path based routing method; latency; multithreads; network on chip; running path length; simulation environment; static critical path; target cores; task mapping; task performance improvement; Algorithm design and analysis; Message systems; Multicore processing; Program processors; Quality of service; Routing; System-on-chip; Critical Path; Network on Chip; Routing Method (ID#: 16-9407)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7336388&isnumber=7336120

 

Raab, M., "Global and Thread-Local Activation of Contextual Program Execution Environments," in Object/Component/Service-Oriented Real-Time Distributed Computing Workshops (ISORCW), 2015 IEEE International Symposium on, pp. 34-41, 13-17 April 2015. doi: 10.1109/ISORCW.2015.52

Abstract: Ubiquitous computing often demands applications to be both customizable and context-aware: Users expect smart devices to adapt to the context and respect their preferences. Currently, these features are not well-supported in a multi-core embedded setup. The aim of this paper is to describe a tool that supports both thread-local and global context-awareness. The tool is based on code generation using a simple specification language and a library that persists the customizations. In a case study and benchmark we evaluate a web server application on embedded hardware. Our web server application uses contexts to represent user sessions, language settings, and sensor states. The results show that the tool has minimal overhead, is well-suited for ubiquitous computing, and takes full advantage of multi-core processors.

Keywords: Internet; multiprocessing systems; program compilers; programming environments; software libraries; specification languages; ubiquitous computing; Web server application; code generation; contextual program execution environments; global context-awareness; language settings; multicore processors; sensor states; smart devices; software library; specification language; thread-local context-awareness; ubiquitous computing; user sessions; Accuracy; Context; Hardware; Instruction sets; Security; Synchronization; Web servers; context oriented programming; customization; multi-core; persistency; ubiquitous computing (ID#: 16-9408)

URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7160121&isnumber=7160107


Note:

Articles listed on these pages have been found on publicly available internet pages and are cited with links to those pages. Some of the information included herein has been reprinted with permission from the authors or data repositories. Direct any requests via Email to news@scienceofsecurity.net for removal of the links or modifications to specific citations. Please include the ID# of the specific citation in your correspondence.