Space Industry and Business News




TECH SPACE
Bug repellent for supercomputers proves effective
by Anne M Stark
Livermore CA (SPX) Nov 20, 2012



Lawrence Livermore National Laboratory (LLNL) researchers have used the Stack Trace Analysis Tool (STAT), a highly scalable, lightweight tool, to debug a program running more than one million MPI processes on the IBM Blue Gene/Q (BGQ)-based Sequoia supercomputer.

The achievement is a significant milestone in LLNL's multi-year collaboration with the University of Wisconsin, Madison (UW) and the University of New Mexico (UNM) to ensure supercomputers run more efficiently.

STAT, a 2011 R&D 100 Award winner, has played a significant role in scaling up the Sequoia supercomputer. It has helped both early access users and system integrators quickly isolate a wide range of errors, including particularly perplexing issues that manifested only at extremely large scales of up to 1,179,648 compute cores.

During the Sequoia scale-up, bugs in applications as well as defects in system software and hardware have manifested themselves as failures in applications. It is important to quickly diagnose errors so they can be reported to experts who can analyze them in detail and ultimately solve the problem.

"STAT has been indispensable in this capacity, helping the multi-disciplined integration team keep pace with the aggressive system scale-up schedule," said LLNL computer scientist Greg Lee.

"While testing a subsystem of Blue Gene/Q, my test program consistently failed only when scaled to 1,179,648 MPI processes. Although the test program was simple, the sheer scale at which this program ran made debugging efforts highly challenging. But when I applied STAT, it quickly revealed that one particular rank process was consistently stuck in a system call," said Dong Ahn, a computer scientist in Livermore Computing.

Based on this finding, a system expert took a close look at the compute core on which this rank process was running and discovered a hardware defect. "Replacing the component suddenly got the entire Sequoia system back to life," Ahn said.
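The idea of spotting one stuck rank among many can be illustrated with a minimal Python sketch. It groups ranks that share an identical call stack and lists the smallest group first, so an outlier stands out. This is only an illustration under stated assumptions, not STAT's actual implementation or interface: the function name `group_ranks_by_trace`, the frame names and the eight-rank sample data are all hypothetical.

```python
from collections import defaultdict


def group_ranks_by_trace(traces):
    """Group ranks whose call stacks are identical.

    `traces` maps a rank number to a sequence of function names,
    outermost frame first. Returns a list of (trace, ranks) pairs,
    smallest group first, so outliers appear at the top.
    """
    groups = defaultdict(list)
    for rank, trace in traces.items():
        groups[tuple(trace)].append(rank)
    return sorted(
        ((trace, sorted(ranks)) for trace, ranks in groups.items()),
        key=lambda group: (len(group[1]), group[0]),
    )


# Hypothetical sample: eight ranks wait in a barrier, except rank 5,
# which is stuck in a write call. Real traces would come from a debugger.
traces = {rank: ("main", "MPI_Barrier") for rank in range(8)}
traces[5] = ("main", "do_io", "write")

groups = group_ranks_by_trace(traces)
for trace, ranks in groups:
    print(f"{len(ranks):>3} rank(s) {ranks}: {' > '.join(trace)}")
```

With more than a million processes, a summary of this kind is far shorter than the raw list of traces, which is what makes a single misbehaving rank visible.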

"Putting this exercise into perspective, this error was due to a defect in a tiny hardware unit, the decrementor, of a single hardware thread out of a total of 4.7 million hardware threads. I felt it was like finding a needle in a haystack over a coffee break."
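The 4.7 million figure is consistent with the core count quoted earlier, as this short Python check suggests. It assumes four hardware threads per Blue Gene/Q core, a value the article does not state; the constant names are illustrative.

```python
CORES = 1_179_648       # compute cores quoted in the article
THREADS_PER_CORE = 4    # assumption: four hardware threads per core

hardware_threads = CORES * THREADS_PER_CORE
fraction_faulty = 1 / hardware_threads  # share of one defective thread

print(f"{hardware_threads:,} hardware threads in total")
print(f"one faulty thread is {fraction_faulty:.2e} of the total")
```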

Sequoia delivers 20 petaflops of peak performance and was ranked No. 1 on the June 2012 TOP500 list. It is currently ranked No. 2, behind Oak Ridge National Laboratory's Titan.

LLNL plans to use Sequoia's impressive computational capability to advance understanding of fundamental physics and engineering questions that arise in the National Nuclear Security Administration's (NNSA) program to ensure the safety, security and effectiveness of the United States' nuclear deterrent without testing.

Sequoia also will support NNSA/DOE programs at LLNL that focus on nonproliferation, counterterrorism, energy, security, health and climate change.

As LLNL takes delivery of the Sequoia system and works to move it into production, computer scientists will migrate applications that have been running on earlier systems to this newer architecture.

This is a period of intense activity for LLNL's application teams as they gain experience with the new hardware and software environment.

"Having a highly effective debugging tool that scales to the full system is vital to the installation and acceptance process for Sequoia. It is critical that our development teams have a comprehensive parallel debugging tool set as they iron out the inevitable issues that come up with running on a new system like Sequoia," said Kim Cupps, leader of the Livermore Computing Division at LLNL.

STAT is particularly important for LLNL because supercomputer simulations are essential in virtually every mission area of the Laboratory.

The tool also has been used at other sites and proved to be effective on a wide range of supercomputer platforms, including Linux clusters and Cray systems.

The team is actively pursuing further optimization of STAT technologies and is exploring commercialization strategies. More information about STAT, including a link to the source code, is available on the Web.





The content herein, unless otherwise known to be public domain, is Copyright 1995-2014 - Space Media Network. AFP, UPI and IANS news wire stories are copyright Agence France-Presse, United Press International and Indo-Asia News Service. ESA Portal Reports are copyright European Space Agency. All NASA sourced material is public domain. Additional copyrights may apply in whole or part to other bona fide parties. Advertising does not imply endorsement, agreement or approval of any opinions, statements or information provided by Space Media Network on any Web page published or hosted by Space Media Network.