Pervasive Parallelism Seminar Series 2016/17: Abstracts and Biographies


Blair Archibald (EPCC): Power, Precision and a PhD – A Whirlwind Tour of My Time at EPCC

Wednesday, 21 Sep 2016 @14:00
JCMB 4325A [King’s Buildings]

Abstract: Unfortunately my time at EPCC is coming to an end. In this talk I will give a whirlwind tour of my time here.

We start in the Adept project, looking at how energy scales with parallel efficiency, the potential hidden cost of programming languages, and some interesting future research directions.

Next stop: the ExaFLOW project, where we will discover how to analyse floating-point behavior of scientific (CFD) codes by way of binary instrumentation.

Finally I’ll briefly discuss the work I’m going back to for the remainder of my PhD.


Toomas Remmelg: Matrix Multiplication Beyond Auto-Tuning: rewriting based GPU code generation

Thursday, 22 Sept 2016 @15:30

Abstract: Graphics Processing Units (GPUs) are used as general purpose parallel accelerators in a wide range of applications. They are found in most computing systems, and mobile devices are no exception. The recent availability of programming APIs such as OpenCL for mobile GPUs promises to open up new types of applications on these devices.

However, producing high performance GPU code is extremely difficult.

Subtle differences in device characteristics can lead to large performance variations when different optimizations are applied. As we will see, this is especially true for a mobile GPU such as the ARM Mali GPU, which has a very different architecture from desktop-class GPUs. Code optimized and tuned for one type of GPU is unlikely to achieve its performance potential on another type of GPU.

Auto-tuners have traditionally been an answer to this performance portability challenge. For instance, they have been successful on CPUs for matrix operations, which are used as building blocks in many high-performance applications. However, they are much harder to design for different classes of GPUs, given the wide variety of hardware characteristics.

In this paper, we take a different perspective and show how performance portability for matrix multiplication is achieved using a compiler approach. This approach is based on a recently developed generic technique that combines a high-level programming model with a system of rewrite rules. Programs are automatically rewritten in successive steps, where optimization decisions are made. This approach is truly performance portable, resulting in high-performance code for very different types of architectures such as desktop and mobile GPUs.

In particular, we achieve a speedup of 1.7x over a state-of-the-art auto-tuner on the ARM Mali GPU.

Tom Ashby [Imec, Belgium]: (A small slice of) High Performance Computing, Machine Learning and the Life Sciences

Monday, 26 September 2016 @13:00

Abstract: In this talk I will introduce Imec (one of Europe’s foremost technology research institutes), the Intel ExaScience lab at Imec, and the topics that we’re working on. These bring together parts of HPC and machine learning to solve issues in the applied life sciences. The technical part of the talk will focus on a machine learning algorithm developed in collaboration with academics and industrial researchers in drug discovery, and how it is implemented in an HPC setting. I will also briefly touch on some systems research topics in some of these areas. I am very interested in the work of others on any of these topics – if you are available to chat on Monday morning before the talk please let me know at

Bio: Tom Ashby first studied Artificial Intelligence and Computer Science at the University of Edinburgh, followed by a PhD at ICSA on HPC and computational science. After moving to Imec in Belgium, he has worked on software tools for parallelisation and data movement in embedded systems, image analysis software for miniaturized hyperspectral cameras, and the use of HPC for applications in drug discovery and development. He has the most fun when learning about other domains and how to apply systems engineering ideas to them.

Michael Wong [Codeplay]: Towards Heterogeneous Programming in C++17 and beyond

Thursday, 13 October 2016 @15:00
JCMB 4325A

Abstract: Current semiconductor trends show a major shift in computer system architectures towards heterogeneous systems that combine a CPU with other processors such as GPUs, DSPs and FPGAs that work together, performing different tasks in parallel. This shift has brought a dramatic change in programming paradigms in the pursuit of a common language that will provide efficiency and performance portability across these systems. A wide range of programming languages and models including OpenMP has emerged over the last decade, all with this goal in their sights.

This new world of heterogeneous systems brings with it many new challenges for the C++ community and for the language itself, and if these were to be overcome then C++ would have the potential to standardize the future of heterogeneous programming. C++17 has added the Parallelism Technical Specification (TS), which includes parallel algorithms, and the Concurrency TS already contains so-called “futures.” But this is only the beginning and is only for CPUs. What people don’t know is that when married with Parallelism and with the linchpin of executors being worked on now, it can form the basis of Massive Parallel Dispatch for Heterogeneous Computing for C++.

In fact this work has already begun in the C++ Study Group SG14 in the form of a mandate to support massive parallel dispatch for heterogeneous devices in the C++ standard using recent models such as SYCL and HPX. A recent approach to solving these challenges comes in the form of SYCL, a shared-source C++ programming model for OpenCL. SYCL takes a very different approach from many before it, in that it is specifically designed to be standard C++ without any extensions to the language.

Mario Antonioletti, Selina Aragon, Paul Graham, Neil Chue Hong, Malcolm Illingworth, Mike Jackson, Giacomo Peru [EPCC]: The Software Sustainability Institute

Wednesday October 19th @ 14:00
JCMB 4325A

Abstract: 92% of academics use research software, 69% say that their research would not be practical without it, and 56% develop their own software. Worryingly, 21% of those have no training in software development!*

The Software Sustainability Institute is a national facility for these researchers, the UK’s research software community. Funded by EPSRC, BBSRC, ESRC and JISC, the Institute is led by EPCC in collaboration with the universities of Manchester, Oxford and Southampton. The Institute’s mission is to cultivate better, more sustainable research software to enable world-class research.

In this seminar, the EPCC members of the Institute will give an overview of the Institute, its five teams – community, policy, software, training, and communications – and how EPCC contributes to the goal of the Institute – “better software, better research”!

Dan Sorin [Duke University, USA]: Designing Processors to Accelerate Robot Motion Planning

Thursday, 27 October @ 16:00

Abstract: We have developed a hardware accelerator for motion planning, a critical operation in robotics. I will present the microarchitecture of our accelerator and describe a prototype implementation on an FPGA.

Experimental results show that, compared to the state of the art, the accelerator improves performance by three orders of magnitude and improves power consumption by more than one order of magnitude. These gains are achieved through careful hardware/software co-design. We have modified conventional motion planning algorithms to aggressively pre-compute collision data, and we have implemented a microarchitecture that leverages the parallelism present in the problem.

Time permitting, I will briefly present one other research project in my lab: verification-aware computer architecture.

Bio: Daniel J. Sorin is the W.H. Gardner Jr. Professor of Electrical and Computer Engineering at Duke University. His research interests are in computer architecture, with a focus on fault tolerance, verification, and memory system design. He is the author of “Fault Tolerant Computer Architecture” and a co-author of “A Primer on Memory Consistency and Cache Coherence.” He is the recipient of an NSF Career Award and Duke’s Imhoff Distinguished Teaching Award. He received a PhD and MS in electrical and computer engineering from the University of Wisconsin, and he received a BSE in electrical engineering from Duke University.

Fred de Haro [CEO, Pycom]: LoRa, Sigfox and LTE-M: How do they compare?

Friday, 28th October @ 09:30

Abstract: Low-Power Wide-Area Networks (LPWANs) are being widely promoted for IoT deployment, with LoRa, Sigfox and LTE-M amongst the leading contenders to win the technology race. For developers, a persistent question is “what’s the best wireless network for this or that type of connected application?” In this talk, Fred de Haro will present Pycom’s views on “the race” and will explain the defining characteristics of these competing LPWAN technologies and their suitability for different IoT use-cases.

Bio: Fred de Haro is CEO and Co-Founder of Pycom. He has 27 years’ experience building start-up and pre-IPO companies, including NAVTEQ and Tele Atlas (sold for $4B to TomTom). His diverse industry and vertical market experience spans IT hardware, software, intellectual property and licensing, as well as automotive, retail, Internet, VARs, system integrators, ISPs, ASPs, independent software vendors, telecom (mobile operators and handset manufacturers), B2B, OEMs, consumer products, aerospace, defense and manufacturing.

John McAllister [Queen’s University Belfast]: Dataflow in the Age of MPSoC-FPGA

Friday, 4th November @ 14:00

Abstract: ‘MPSoC’-FPGA, incorporating multicore, GPU and FPGA programmable logic components, means that never before has such an array of resources been available to a designer on a single, programmable, commodity device. It poses serious questions of embedded programming and development processes, and is forcing the previously separate disciplines of development for software-programmable devices and configurable logic to find a way to co-exist that allows developers to realise high performance, efficient systems in a productive manner.

Dataflow programming and autocoding techniques have a major role to play for these platforms in the case of streaming signal processing workloads. Their ability to emphasise crucial features of the behaviour of such applications for efficient implementation is invaluable given the heterogeneity of the target platforms, but there is a long way to go before that potential is realised. Specifically, the major thrust of dataflow programming research to date is (unintentionally) ‘stove-piped’, exploiting specific dialects to harness specific aspects of specific embedded technologies, such as task, data or pipeline parallelism. Given the heterogeneity of MPSoC-FPGA, is it time for a rethink of dataflow programming and design so that all of the resources available can be harnessed in the most efficient way?

This talk will present initial work which suggests there may be considerable benefit to doing so. In particular, it studies the suitability of dataflow as an enabling technology for high performance programmable accelerators and for High Level Synthesis (HLS) for FPGA and presents initial ideas on how these can extend to rapid deployment of applications on MPSoC applications.

Bio: Dr. John McAllister is a Senior Lecturer at Queen’s University Belfast focused on languages, compilers and custom computing architectures for data processing.

Ole C. Weidner [LEGO, Denmark]: A Scalable Framework for Operational Context Awareness on HPC Platforms

Friday, 4th November @ 15:00

Abstract: With computational methods, tools and workflows becoming ubiquitous in more scientific domains and disciplines, the software applications and user communities on high performance computing (HPC) platforms are rapidly growing more diverse. Many of the newly emerging HPC applications move beyond tightly-coupled, compute-centric methods and algorithms and embrace more heterogeneous, multi-component workflows, dynamic and ad hoc computation and data-centric methodologies. For frameworks and applications diverging from the traditional HPC profile, a significant category of performance problems and waste of computational resources arises from suboptimal mapping of application components to HPC resources due to a lack of insight into their operational context. There are many tools which allow scientists and application developers to observe, diagnose and understand aspects of the behaviour and performance characteristics of their applications, but almost none of them are comprehensive across architectures, environments and applications.

Our research aims to develop a comprehensive framework to deliver operational context awareness on HPC platforms through (i) an extensible context awareness model, the C* Model, (ii) a generic context awareness interface, and (iii) an architecture proposal for supporting context awareness at extreme scales. Use-cases and existing approaches for context awareness are evaluated in depth and a set of characteristics and requirements are defined along with a precise definition of context awareness. Based on these requirements, the extensible, graph-based C* Model which introduces and amalgamates the concepts of physical and virtual or derived context is formalised. A generic interface to C* is designed, which exposes context information and a set of interaction methods for applications, users and platform services to aid them in the most crucial tasks, detection and mitigation of performance problems and waste of computational resources. This framework is hypothetically examined in the context of peta- and exascale applications and platforms and the details of the architectural challenges are worked out. Suggestions are made for a scalable system architecture that supports real-time context awareness.

Bio: Ole is a part-time PhD student in the Data Intensive Research (DIR) Group under Malcolm Atkinson. His research interests lie at the intersection of dynamic, self-aware, and autonomous distributed applications, and the application of data-intensive paradigms to the operation and real-time analysis of large-scale HPC platforms.

In his day job, Ole works as a Big Data Engineer at the LEGO Group in Denmark where he is designing an enterprise Big Data platform and application ecosystem. Before he joined LEGO in late 2015, Ole spent 8 years in the U.S., working as a Research Associate at Louisiana State University and Rutgers University, New Jersey. He was a key contributor to the design and implementation of several widely used distributed computing frameworks, including SAGA (The Simple API for Grid Applications) and RADICAL-Pilot, a distributed Pilot-Job system.

Anastasis Georgoulas: Probabilistic Programming Process Algebra

Tuesday, 15th November 2016 @16:00
IF 4.31/4.33

Abstract: Stochastic processes such as Continuous Time Markov Chains are often used to model natural or engineered systems. A large class of languages, such as process algebras, have been developed to formally describe and analyse such systems, but are not applicable when our knowledge of the system is incomplete, as is often the case. On the other hand, the probabilistic programming paradigm is currently receiving much attention as a way of describing probabilistic models and automatically applying sophisticated inference algorithms to match observed behaviour; however, existing languages are not well-suited to dynamical systems. In this talk, I will present the work of my PhD, particularly the definition of ProPPA – a modelling language incorporating aspects of probabilistic programming into the framework of process algebras. The idea behind the language is to enable the formal description of Markovian stochastic systems with uncertainty, as well as the statistical inference of their parameters based on observed data. I will discuss the implications that introducing uncertainty into the system description has on the syntax and, particularly, the semantics of the language.

(Joint work with Jane Hillston and Guido Sanguinetti)
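
As a rough illustration of what "Markovian stochastic systems with uncertainty" and "statistical inference of their parameters based on observed data" can mean in practice, the sketch below simulates a tiny Continuous Time Markov Chain whose rate is unknown and keeps the rate values that reproduce an observation. It is plain Python, not ProPPA syntax; the decay model, the uniform prior, the observed count of 61 and the rejection-sampling scheme are all assumptions chosen for illustration and are not taken from the talk.

```python
import random

def simulate_decay(n0, rate, t_end, rng):
    """Simulate a pure-death CTMC: each of n individuals disappears at `rate`.

    Returns the population remaining at time t_end.
    """
    n, t = n0, 0.0
    while n > 0 and rate > 0.0:
        # In state n the total exit rate is rate * n, so the holding
        # time is exponentially distributed with that rate.
        t += rng.expovariate(rate * n)
        if t > t_end:
            break
        n -= 1
    return n

def infer_rate(observed, n0, t_end, draws, tolerance, rng):
    """Rejection sampling: keep prior draws whose simulation matches the data."""
    accepted = []
    for _ in range(draws):
        rate = rng.uniform(0.0, 2.0)  # assumed prior over the unknown rate
        simulated = simulate_decay(n0, rate, t_end, rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(rate)
    return accepted

rng = random.Random(0)
# Hypothetical observation: 61 of 100 individuals remain at time 1.0.
samples = infer_rate(observed=61, n0=100, t_end=1.0, draws=2000, tolerance=3, rng=rng)
posterior_mean = sum(samples) / len(samples)
print(f"accepted {len(samples)} draws, posterior mean rate ~ {posterior_mean:.2f}")
```

The accepted draws approximate a posterior distribution over the rate; a language in the spirit described above would let the modeller state the uncertain rate in the model itself and leave the choice of inference algorithm to the tool.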

KC Sivaramakrishnan: Practical Algebraic Effect Handlers in Multicore OCaml

Tuesday, 22nd November 2016 @16:00
IF 4.31/4.33

Abstract: Algebraic effects and handlers have recently garnered well-deserved attention as a modular abstraction for programming with computational effects. Algebraic effects declare abstract side-effecting operations whose interpretation is given by the handlers. Languages such as Eff have demonstrated that handlers can be used as a more composable alternative to monads for implementing effects in a pure language.

In this talk, I will describe the native implementation of algebraic effect handlers in Multicore OCaml. The original motivation for this work was to provide built-in support for concurrency in OCaml without tying the language to a particular concurrency implementation. However, algebraic effects support many interesting examples beyond concurrency. A key requirement while adding effect handlers to OCaml is performance backwards compatibility. I will discuss the challenges and the design choices we have made in order to make algebraic effect handlers a practical programming abstraction for Multicore OCaml.
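
The separation between declaring an operation and interpreting it can be sketched outside OCaml. The example below is an emulation using Python generators, not Multicore OCaml syntax and not the implementation described in the talk: the names `program`, `run`, `ask` and `log` are hypothetical, and a generator can only be resumed once, so this captures just the one-shot flavour of handlers.

```python
def program():
    """Effectful computation: it only requests operations, never performs them."""
    name = yield ("ask", "name")      # request a value from whoever handles "ask"
    yield ("log", f"hello {name}")    # request that a message be recorded
    return len(name)

def run(computation, handlers):
    """Drive a generator, resuming it with the result of each handled operation."""
    try:
        request = next(computation)
        while True:
            operation, argument = request
            # The handler decides what the operation means; its result
            # becomes the value of the `yield` inside the computation.
            request = computation.send(handlers[operation](argument))
    except StopIteration as finished:
        return finished.value

# The same program under two different interpretations of its effects.
log_a, log_b = [], []
result_a = run(program(), {"ask": lambda key: "world", "log": log_a.append})
result_b = run(program(), {"ask": lambda key: key.upper(),
                           "log": lambda msg: log_b.append(msg[::-1])})
print(result_a, log_a)  # 5 ['hello world']
print(result_b, log_b)  # 4 ['EMAN olleh']
```

Because `program` never mentions how `ask` or `log` are carried out, swapping the handler dictionary changes its behaviour without touching its code, which is the modularity the abstract attributes to handlers.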

Jos Martin [MathWorks]: Parallel Computing with MATLAB

Thursday, 24th November 2016 @11:00
IF 4.31/33

Abstract: MATLAB is used extensively in many industrial contexts, from designing and simulating planes and cars to modelling risk in the finance sector. The commonality in all its uses is the need to undertake significant mathematical computation. And the need for this is only growing. As it grows, so does the need for more computational power and hence the demand for cluster and GPU computing. This talk aims to introduce the parallel language in MATLAB and put its use in context with some industrial applications.

Bio: Jos is currently a Senior Engineering Manager at MathWorks with responsibility for the development of all HPC and parallel computing products, MATLAB Drive and MATLAB Connector. He has led the parallel computing team since its inception in 2003 and in that time has architected much of the toolbox, particularly the core infrastructure and parallel language areas. At SC08, on behalf of MathWorks, he was a joint winner of the HPC Challenge Class 2 award for Most Productive Implementation. He received a D.Phil in Atomic and Laser Physics and an MA in Physics from Oxford University, UK. After completing his D.Phil he held a Royal Society Post-Doctoral Fellowship at the University of Otago, New Zealand. His area of research was Experimental Bose-Einstein Condensation (BEC), a branch of low-temperature atomic physics.

Aline Viana: Toward a More Tactful Networking

Thursday, 9 February 2017 @15:30
IF 4.31/33

Abstract: Worldwide urbanization is bringing a variety of challenges to every city and, in particular, to telecommunications networks. In order to manage the complexity of the smart urban environment of tomorrow, the understanding of human behavior has to become an integral part of networking system/protocol/service design. This brings the idea of a more personalized networking. Networking (protocols, systems, services) has for years been impersonal with respect to “who and how” the users are, and centered on the needs of networking actors (operators, providers, or protocols). Thus, networks have usually been designed to adapt to network conditions (e.g., physical link conditions, topology changes) and are protocol or service specific (e.g., successful delivery of messages, geographical network coverage). Hence, they were very often oblivious to users’ behavior and current needs. This is now over. Human beings behind a smart device or a vehicle (electric and/or smart) can no longer be ignored, since their behavior impacts the way they use network resources and services.

Thus, the future of networking systems lies in a better understanding of the way humans behave: the social norms and structure dictating their behavior will influence the way they interact with network services and demand resources. In this talk, I will present my work in this direction and how “human behavior” has been leveraged in a variety of networking solutions.

Bio: Aline Carneiro Viana is a CR1 at INRIA Saclay – Ile de France. She received her habilitation from Université Pierre et Marie Curie, Paris, France in 2011. From November 2009 to October 2010, Dr. Viana was on sabbatical leave at the Telecommunication Networks Group (TKN) of the Technische Universität Berlin (TU-Berlin), Germany. Dr. Viana got her PhD in Computer Science from the Université Pierre et Marie Curie – Paris VI in 2005. After holding a postdoctoral position at IRISA/INRIA Rennes – Bretagne Atlantique in the PARIS research team, she obtained a permanent position at INRIA Saclay – Ile de France in 2006.

Dr. Viana’s research addresses the design of solutions for smart cities, mobile and self-organizing networks, with a focus on human behavior analysis, opportunistic routing and data dissemination, and social mobile wireless networks. She has published more than 90 papers in these fields, in well-known conferences such as ACM MobiHoc, IEEE SECON, IEEE Infocom, ACM MSWiM and IEEE PERCOM, and in journals such as IEEE Transactions on Mobile Computing, Pervasive and Mobile Computing (PMC) Elsevier, Ad Hoc Networks Elsevier, ACM Computing Surveys and Computer Networks Elsevier, the main conferences and journals of the mobile and wireless networking community. She is also Associate Editor of ACM Computer Communication Review (ACM CCR) and a member of the Editorial Board of the Wireless Communications and Mobile Computing Open Access Journal of John Wiley & Sons and Hindawi.

Michele Weiland and Adrian Jackson [EPCC]: NEXTGenIO: making IO great again

Wednesday, 8 March @14:00
James Clerk Maxwell Building 4325B

Abstract: Current HPC systems perform on the order of tens to hundreds of petaFLOPs. Although a single petaFLOP already represents one million billion computations per second, more complex demands on scientific modelling and simulation mean even faster computation is necessary. The next step is Exascale computing, which is up to 1000x faster than current Petascale systems. Researchers in HPC are aiming to build an HPC system capable of Exascale computation by 2022.

One of the major roadblocks to achieving this goal is the I/O bottleneck. Current systems are capable of processing data quickly, but speeds are limited by how fast the system is able to read and write data. This represents a significant loss of time and energy in the system. Being able to widen, and ultimately eliminate, this bottleneck would greatly increase the performance and efficiency of HPC systems.

NEXTGenIO aims to solve this problem by bridging the gap between memory and storage. This will use Intel’s revolutionary new 3D XPoint non-volatile memory, which will sit between conventional memory and disk storage. NEXTGenIO will design the hardware and software to exploit the new memory technology. The goal is to build a system with 100x faster I/O than current HPC systems, a significant step towards Exascale computation.

In this seminar we will talk about the NEXTGenIO project. We will give an overview of the aims of the project, introduce our system architecture, and go on to discuss the challenges and potential benefits of persistent memory in compute nodes.

Fiona Reid [ARM]: Memory Models and Transactional Memory – from textbook to reality

Wednesday, 15 March @14:00
James Clerk Maxwell Building 5215

Abstract: CP2K is a popular computational chemistry code and currently the second most heavily used code on ARCHER. As part of the Intel Parallel Computing Centre (IPCC) project, EPCC has been looking at the performance of CP2K on the Intel Xeon Phi hardware. Previous seminars in 2013 and 2015 described the first generation Xeon Phi (aka Knights Corner) hardware and the porting and performance of CP2K on this hardware. This seminar will focus on the performance of CP2K on the second generation of Xeon Phi processors (aka Knights Landing). We will describe the process of porting to KNL, including various gotchas, and also present performance data from Xeon Phi (both KNC and KNL) and Intel Xeon processors.

Stephan Diestelhorst [ARM]: Memory Models and Transactional Memory – from textbook to reality

Friday, 17 March @14:00
IF 4.31/4.33

Abstract: In parallel architectures, communication between agents frequently happens through shared memory. The semantics of concurrently accessing locations in that memory is governed by models of coherence and memory consistency. Recently, transactional memory has been proposed to raise the level of abstraction and simplify reasoning about concurrent applications.

In this presentation, I will quickly characterise the current state of affairs in modelling and implementing these approaches, but will put a focus on showing real world challenges that occur when embedding these pristine concepts into complex real world substrates.

Bio: Stephan Diestelhorst is a Principal Research Engineer at ARM Ltd, in Cambridge, UK. He and his group research memory technologies and system interactions for future compute systems, including non-volatile memories, transactional memory, and new power-efficient system architectures. Before joining ARM 3.5 years ago, Stephan worked at AMD’s Operating System Research Center in Dresden, Germany, where he deeply investigated transactional memory and worked on memory consistency model challenges for 7 years.

Jos Martin [MathWorks]: Parallel Computing with Matlab

Tuesday, 21 March @14:00
JCMB Lecture Theatre C

Abstract: MATLAB is used extensively in many industrial contexts, from designing and simulating planes and cars to modelling risk in the finance sector. The commonality in all its uses is the need to undertake significant mathematical computation, and the need for this is only growing. As it grows, so does the need for more computational power and hence the demand for cluster and GPU computing. This talk aims to introduce the parallel language in MATLAB and put its use in context with some industrial applications.

Bio:  Jos is currently a Senior Engineering Manager at MathWorks with responsibility for the development of all HPC and parallel computing products, MATLAB Drive and MATLAB Connector. He has led the parallel computing team since its inception in 2003 and in that time has architected much of the toolbox, particularly the core infrastructure and parallel language areas. At SC08, on behalf of MathWorks, he was a joint winner of the HPC Challenge Class 2 award for Most Productive Implementation. He received a D.Phil in Atomic and Laser Physics and an MA in Physics from Oxford University, UK. After completing his D.Phil he held a Royal Society Post-Doctoral Fellowship at the University of Otago, New Zealand. His area of research was Experimental Bose-Einstein Condensation (BEC), a branch of low-temperature atomic physics.

Professor Guevara Noubir [Northeastern University]: Robustness and Privacy in Wireless Systems

Tuesday, 21 March @15:00
IF 4.31/4.33

Abstract:  Wireless communication is not only a key technology underlying the mobile revolution, it is also used to connect, monitor, alert, and interact with physical infrastructures such as smart-grids, transportation networks, and even implantable devices. Building secure and robust wireless networks raises several theoretical and practical problems. Solving such problems requires novel approaches to circumvent the resource limitations of such systems. In this talk, I will review some of the major vulnerabilities inherent to the design of current wireless networks. I will then present specific problems and results that address some of the issues in wireless and mobile networks including smart-interference mitigation, and location tracking.

Bio: Guevara Noubir is a Professor in the College of Computer and Information Science at Northeastern University. He received a PhD in Computer Science from the Swiss Federal Institute of Technology in Lausanne (EPFL 1996) and an engineering diploma (MS) from École Nationale Supérieure d’Informatique et de Mathématiques Appliquées at Grenoble (ENSIMAG 1991).

Prior to joining the faculty at Northeastern University, he was a senior research scientist at CSEM SA (Switzerland) where he led several research projects in the area of wireless and mobile networking. In particular, he contributed to the definition of the third generation Universal Mobile Telecommunication System (UMTS) standardized as 3GPP WCDMA and was the lead of the Data Networking Stack for the first 3G demonstrator in the world (as part of the FRAMES EU Research Project). In 2013, Noubir led Northeastern University’s team in the DARPA Spectrum Challenge competition winning the 2013 Cooperative Challenge. Dr Noubir held visiting research positions at Eurecom, MIT, and UNL.

He is a Senior Member of the IEEE, a member of the ACM, and a recipient of the NSF CAREER Award. He serves on the editorial boards of IEEE Transactions on Mobile Computing and ACM Transactions on Privacy and Security, and has co-chaired several ACM and IEEE conferences in the fields of mobile, wireless, and security (ACM WiSec, IEEE CNS, IEEE SECON, IEEE WoWMoM).

His research covers both theoretical and practical aspects of secure and robust wireless and mobile systems. His current interests include leveraging mechanisms such as social networking authentication and low power ZigBee, to secure residential broadband networks, and boosting the robustness of wireless systems against smart attacks.

Ben Bennett [SGI/HPE]: Memory-Centric Computing – an HPE Perspective

Tuesday, 28 March @14:00
JCMB Lecture Theatre C

Abstract: Dr Ben Bennett, SGI’s Director of HPC Strategy, will outline the shared memory SGI UV technology and how it fits into the HPE product portfolio, as well as lifting the lid on the HPE Laboratories work on The Machine technology.

SGI products and services are used for high-performance computing (HPC) and big data analytics in the scientific, technical, business and government communities to solve challenging data-intensive computing, data management and virtualization problems. The company has approximately 1,100 employees worldwide.

In November 2016, Hewlett Packard Enterprise (HPE) announced that it had completed its acquisition of SGI. SGI’s highly complementary portfolio, including its in-memory high-performance data analytics technology, will extend and strengthen HPE’s current leadership position in the growing mission-critical and high-performance computing segments of the server market. The combined HPE and SGI portfolio, including a comprehensive services capability, will support private and public sector customers seeking larger supercomputer installations, including U.S. federal agencies as well as enterprises looking to leverage high-performance computing for business insights and a competitive edge.

HPE and SGI believe that by combining complementary product portfolios and go-to-market approaches they will be able to strengthen the leading position and financial performance of the combined business. The combined HPE and SGI portfolio will accelerate the development of new solutions in fields such as weather mapping, genomics research, life sciences, and cybersecurity for both public and private organizations.

Owen Thomas (Founding Partner, Red Oak HPC consultancy): HPC: What’s it worth to you?

Friday, 31 March @ 14:00-16:00
JCMB 6206

Abstract: By and large the HPC community struggles to articulate the value of HPC as an enterprise. This is further compounded by a general unwillingness to engage in “tawdry” discussions about money.

This is a hugely damaging state of affairs for the community of users and businesses that rely on HPC and the economy in general; the failure to engage with these discussions is often interpreted as evidence of a lack of benefit, followed by “uncomfortable discussions” about future funding.

This lecture will attempt to remove some of the fears about discussions of value and, hopefully, encourage people to engage more positively with them. After a brief introduction to the main vocabulary and ideas behind some of the metrics, I shall present some value models and then talk about how they might be applied.

Bio: Owen has over 20 years’ experience in the high-performance computing sector, advising on strategy, supporting procurements and managing highly technical projects from initiation to completion. He has worked client side and supplier side across the commercial and public sectors and he is a successful and highly experienced consultant. He has dealt with every type of challenge that can arise in a technical project – contract, financial, risk, personnel and quality concerns.

Red Oak Consulting is a specialist high-performance computing consultancy, providing expert advice to all parts of the HPC lifecycle.

Founded in 2004, Red Oak has built up a substantial customer base of users in government, industry, research and academia. We offer tailored solutions for high-end computer technologies and their applications and we also have an active innovation and research team that undertakes studies on behalf of customers.

Kenji Takeda [Microsoft Azure]: Real HPC in the Cloud, really!

Tuesday, 4 April @15:00
JCMB Lecture Theatre B

Abstract: Cloud computing empowers researchers, scientists and engineers to quickly experiment, develop, and deploy compute and data-intensive applications at large-scale. We will describe how hyper-scale, global cloud computing pushes the limits of distributed computing design, architecture, and operations.

We will describe how real high-performance computing with GPUs is now possible in the cloud, scaling to thousands of cores with InfiniBand networking and bare-metal performance. We will look at the future of FPGA (programmable silicon) computing at scale. Finally, we will explain how the Azure for Research program can provide awards, training, and guidance on how you can use cloud computing for your own research.

Bio: Dr Kenji Takeda is Director of the Microsoft Azure for Research program. He has extensive experience in Cloud Computing, Data Science, High Performance and High Productivity Computing, Data-intensive Science, Scientific Workflows, Scholarly Communication, Engineering and Educational Outreach. He has a passion for developing novel computational approaches to tackle fundamental and applied problems in science and engineering. Kenji is a visiting industry fellow at the Alan Turing Institute, and visiting senior lecturer at the University of Southampton. He was previously Co-Director of the Microsoft Institute for High Performance Computing, and Senior Lecturer in Aeronautics, at the University of Southampton, UK. There he worked with leading high value manufacturing companies such as Airbus, AgustaWestland, BAE Systems, Rolls-Royce and Formula One teams, to develop state-of-the-art capability for improving science and engineering processes. He also worked in the areas of aerodynamics, aeroacoustics, CFD, and flight simulation.

Derek Williams [IBM Austin]: Power8 Hardware Transactional Memory implementation

Friday, 7 April @14:30
IF 4.31/33

Abstract: A discussion of the details of the Power8 Hardware Transactional Memory implementation, including the various tracking structures and mechanisms used, as well as a general overview of how Transactional Memory fits into the Power ISA weak memory model.

Bio: Derek Williams is a Senior Engineer in the IBM Systems Group working on POWER processor storage subsystem development. He received B.S. and M.S. degrees in electrical engineering from the University of Texas in 1990 and 1994 respectively. He has worked at IBM on RS/6000, PowerPC and Power architectures in lab bringup, verification, logic design, and microarchitecture roles for cache controllers and storage subsystems. He has coauthored papers on the Power ISA memory model and POWER8 transactional memory architecture. Mr. Williams is an IBM Master Inventor and has filed for more than 250 patents and holds 150 U.S. patents.


Paul Gratz [Texas A&M University]: Coordinated Speculation in the Memory System

Thursday, 18th May @15:30
IF 4.31/33

Abstract: The scaling of multi-core processors poses a challenge to memory system design. Increased cores generate more accesses to shared caches causing conflict misses as unrelated processes compete for the same cache sets. Each miss represents significant waste: wasted time as the requested data is transferred from a slow main memory, wasted energy and bandwidth when transferring cache block words that will ultimately go unused.

In this talk I will explore the means to leverage memory reference speculation to reduce waste and improve efficiency in multi-core processor memory systems. In particular, we will examine memory reference speculation in the context of holistic cache management, merging data prefetching and replacement policy. I will outline a novel confidence path-based prefetching scheme, the Signature Path Prefetcher (SPP). SPP uses a compressed history based signature that accurately predicts complex address patterns, allowing the prefetcher to run far ahead of the current demand stream. SPP uses a dynamically constructed path confidence in its prediction to adaptively throttle itself on a per-prefetch stream basis. Unlike other algorithms, SPP tracks complex patterns across physical page boundaries and continues prefetching as soon as they move to new pages. Memory reference speculation can be leveraged far beyond data prefetching. In particular, the path confidence developed in SPP represents a proxy of use distance for future memory access, providing new possibilities to integrate memory management techniques at different levels of cache.

In the second half of my talk I will outline how the path confidence derived from the SPP prefetcher can be used as a direct representation of reuse distance to inform replacement policy as well as cache-level placement in the larger cache hierarchy. Finally, I will discuss how similar mechanisms can be used for page placement, prefetching and replacement in systems incorporating emerging non-volatile memory technologies.
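The signature and path-confidence idea described in this abstract can be illustrated with a toy model. The following is a minimal sketch only, not the speaker's implementation: the signature width, the shift amount, the confidence threshold and every name in it (`ToyPathPrefetcher`, `update_signature`, and so on) are assumptions chosen for illustration. It compresses the history of block-address deltas into a small signature, counts which delta followed each signature, and then looks ahead along the most likely path, multiplying per-step confidences and stopping once the product falls below the threshold.

```python
from collections import defaultdict

SIG_BITS = 12      # assumed signature width (illustrative)
SIG_SHIFT = 3      # assumed shift applied per delta (illustrative)
THRESHOLD = 0.25   # assumed path-confidence cutoff (illustrative)
SIG_MASK = (1 << SIG_BITS) - 1


def update_signature(sig, delta):
    """Fold one block delta into the compressed history signature."""
    return ((sig << SIG_SHIFT) ^ (delta & SIG_MASK)) & SIG_MASK


class ToyPathPrefetcher:
    """Toy model: learns signature -> next-delta counts, then looks ahead."""

    def __init__(self):
        # pattern[signature][delta] = times `delta` followed `signature`
        self.pattern = defaultdict(lambda: defaultdict(int))
        self.sig = 0
        self.last_block = None

    def train(self, block):
        """Record one demand access, given as a cache-block number."""
        if self.last_block is not None:
            delta = block - self.last_block
            self.pattern[self.sig][delta] += 1
            self.sig = update_signature(self.sig, delta)
        self.last_block = block

    def predict(self, max_depth=8):
        """Follow the most likely delta path until confidence runs out."""
        prefetches = []
        sig, block, confidence = self.sig, self.last_block, 1.0
        for _ in range(max_depth):
            counts = self.pattern.get(sig)
            if not counts:
                break  # no history for this signature
            delta, hits = max(counts.items(), key=lambda kv: kv[1])
            # Path confidence is the product of per-step confidences.
            confidence *= hits / sum(counts.values())
            if confidence < THRESHOLD:
                break  # throttle: the path is too uncertain
            block += delta
            prefetches.append(block)
            sig = update_signature(sig, delta)
        return prefetches


prefetcher = ToyPathPrefetcher()
for blk in range(20):          # a simple unit-stride access stream
    prefetcher.train(blk)
lookahead = prefetcher.predict(max_depth=4)
```

On a regular stream the per-step confidence stays near 1, so the lookahead runs deep; on an irregular stream the product decays quickly and the walk stops early, which is the self-throttling behaviour the abstract describes.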

Bio: Paul V. Gratz is an Associate Professor in the department of Electrical and Computer Engineering at Texas A&M University. His research interests include energy efficient and reliable design in the context of high performance computer architecture, processor memory systems and on-chip interconnection networks. He received his B.S. and M.S. degrees in Electrical Engineering from The University of Florida in 1994 and 1997 respectively. From 1997 to 2002 he was a design engineer with Intel Corporation. He received his Ph.D. degree in Electrical and Computer Engineering from the University of Texas at Austin in 2008. His papers “Path Confidence based Lookahead Prefetching” and “B-Fetch: Branch Prediction Directed Prefetching for Chip-Multiprocessors” were nominated for best papers at MICRO ’16 and MICRO ’14 respectively. At ASPLOS ’09, Dr. Gratz received a best paper award for “An Evaluation of the TRIPS Computer System.” In 2016 he received the “Distinguished Achievement Award in Teaching – College Level” from the Texas A&M Association of Former Students.

Prof. David Kaeli [Northeastern University, US]: A Cross-Layer Approach to Accelerating Heterogeneous Computing

Monday, 12th June @11:00
IF 4.31/33

Abstract: GPU computing is alive and well! The GPU has allowed researchers to overcome a number of computational barriers in important problem domains. But challenges remain in using GPUs to target more general-purpose applications. GPUs achieve impressive speedups when compared to CPUs, since GPUs have a large number of compute cores and high memory bandwidth. Recent GPU performance is approaching 10 teraflops of single precision performance on a single device.

In this talk we will discuss current trends with GPUs, including some advanced features that allow them to exploit multi-kernel and multi-context grains of parallelism. Further, we consider how GPUs can be treated as cloud-based resources, enabling a GPU-enabled server to deliver HPC cloud services by leveraging virtualization and collaborative filtering.

Finally, we argue for new heterogeneous workloads and discuss the role of the Heterogeneous Systems Architecture (HSA), a standard that further supports integration of the CPU and GPU into a common framework. We present a new class of benchmarks specifically tailored to evaluate the benefits of features supported in the new HSA programming model.

Bio: David Kaeli received his BS and PhD in Electrical Engineering from Rutgers University, and an MS in Computer Engineering from Syracuse University. He served as the Associate Dean of Undergraduate Programs for the College of Engineering and is presently a COE Distinguished Full Professor on the ECE faculty at Northeastern University, Boston, MA. He is the Director of the Northeastern University Computer Architecture Research Laboratory (NUCAR). Prior to joining Northeastern in 1993, Kaeli spent 12 years at IBM, the last 7 at T.J. Watson Research Center, Yorktown Heights, NY.

Dr. Kaeli has published over 300 critically reviewed publications, 7 books, and 13 patents. His research spans a range of areas, from microarchitecture to back-end compilers and big data applications. His current research topics include graphics processors, hardware/software security, virtualization, heterogeneous computing and multi-layer reliability.

He serves as an Associate Editor of the IEEE Transactions on Parallel and Distributed Systems, the ACM Transactions on Computer Architecture and Code Optimization, and the Journal of Parallel and Distributed Computing. Dr. Kaeli is an IEEE Fellow and an ACM Distinguished Scientist.

Prof. Bill Langdon [UCL]: Applying Genetic Improvement to Bioinformatics Software

Thursday, 15th June @11:00
IF 4.31/33

Abstract: Genetic Improvement (GI) uses modern search based software engineering (SBSE) techniques, often Genetic Programming (GP), to optimise existing programs. After a short introduction to genetic programming, including a quick survey of applications of GP, I will introduce the use of evolutionary computing to improve human written code. In particular, I will discuss using GP to speed up parallel applications for DNA sequence alignment written in C/C++ and nVidia’s CUDA.

The current volume of “Genetic Programming and Evolvable Machines” contains a special issue on Genetic Improvement (March 2017), which in turn includes a survey of genetic improvement for general purpose computing on graphics cards. Experiments with growing and grafting new code with genetic programming (GGGP) in combination with human input on existing CUDA GPU code show that semi-automated approaches can give useful speed ups. In the case of BarraCUDA, these have been adopted and the GI version, 0.7.107x, exceeds 3000 SourceForge downloads.

Bio: W. B. Langdon is a professorial research fellow at UCL. He worked on distributed real time databases for control and monitoring of power stations at the Central Electricity Research Laboratories. He then joined Logica to work on distributed control of gas pipelines and later on computer and telecommunications networks. After returning to academe to gain a PhD in genetic programming at UCL (sponsored by National Grid plc.), he worked at the University of Birmingham, the CWI, UCL, GSK, Essex University, King’s College, London and now for a third time at University College, London. Research visits have included Canada, Sweden and Germany. He has been working on genetic programming for 20 years and has co-authored three GP books.

Prof. Tobias Weinzierl [Durham University]: It is all still an (Exa)HyPE

Wednesday, 28 June @14:00
JCMB 4325A

Abstract: ExaHyPE is an H2020 project where an international consortium of scientists writes a simulation engine for hyperbolic equation system solvers based upon the ADER-DG paradigm. Two grand challenges are tackled with this engine: long-range seismic risk assessment and the search for gravitational waves emitted by rotating binary neutron stars. The code itself is based upon a merger of flexible spacetree data structures with highly optimised compute kernels for the majority of the simulation cells. It provides a very simple and transparent domain specific language as front-end that allows users to rapidly set up parallel PDE solvers discretised with ADER-DG or Finite Volumes on dynamically adaptive Cartesian meshes.

This talk starts with a brief overview of ExaHyPE and demonstrates how ExaHyPE codes are programmed, before it sketches the algorithmic workflow of the underlying ADER-DG scheme. We rephrase steps of this workflow in the language of tasks.

We then focus on a few methodological questions: how can we deploy these tasks to manycores, what execution patterns arise, and are the new OpenMP task features of any use? How can we rearrange ADER-DG’s workflow such that we reduce accesses to the memory, i.e. weaken the pressure on the memory subsystem? How can we reprogram the most expensive tasks such that they exploit the wide vector registers coming along with the manycores? A brief outlook on MPI parallelisation wraps up this methodological talk.

We focus on results obtained on Intel KNL nodes provided by the RSC Group, on Intel Broadwell results from Durham’s supercomputer Hamilton, and on results from the SuperMUC phase 2 supercomputer at Leibniz Supercomputing Centre.

This is joint work with groups from Frankfurt’s FIAS, the University of Trento, as well as Ludwig-Maximilians-University Munich and Technical University of Munich.