Summary of Second Meeting (July 8-9, 2009, Los Alamos, NM)

Context

The first workshop reinforced the need for new reliability approaches and solutions as we scaled to smaller device feature sizes and larger system sizes. We further identified a host of promising directions to improve reliability above the device level while containing or reducing reliability overhead. It was clear that we had a problem and research vectors to address it.

While there was broad agreement that problems were coming or were already here, there was less agreement on the exact nature of the problem, how it would impact our systems, and how fast it was coming. In the meeting and its aftermath, it became clear that we lacked a common roadmap and agreement on the key metrics and composability. Further, it was clear that we needed to discuss more carefully the complex multi-dimensional design space (area-delay-energy-reliability-thermal...) in which we operated, as well as the various environments (sea-level, atmosphere, space) and system sizes (handhelds to supercomputers). It was not yet clear whether the reliability problems encountered in harsh environments like space, the reliability problems of large-scale systems like supercomputers, and the reliability challenges associated with voltage scaling to reduce power in consumer electronics had common underlying causes that would benefit from the same kinds of solutions. Alternatively, it might turn out that these different domains demand distinct solutions because they operate in different regions of this complex design space with different primary limitations.

Workshop Target

From this experience, it was clear we needed to divide the group into narrower constituencies to get a better characterization of the problems in each domain. The hope was, and is, that with a clear view of the problems in each domain, we could then look for underlying commonality as well as identify necessary points of divergence. This led us to form a set of constituency groups organized around key system types, including:

The idea was for each of these groups to identify its own key challenge problems without getting bogged down, at this point, with the needs of the other areas. We could then compare the challenges and identify common problems and opportunities for common solutions if they arise. These constituency groups were to help quantify the challenges and identify the impact of not solving these problems (or the beneficial impact of solving them).

We succeeded in forming three of these challenge groups (space/avionics, large-scale systems, and consumer electronics) between the meetings, and they were able to meet and develop their thoughts to varying degrees before the meeting. The meeting served as an opportunity to compare notes among these groups, clarify the target for the groups, and meet face-to-face to further the effort. All the groups presented works in progress and will continue working toward crisper, quantified, and prioritized descriptions of their challenges.

We failed to find critical mass in life-critical systems and infrastructure. Forming these groups remains an ongoing effort, and the workshop helped identify more people to contact. The hope is that these groups can be formed soon, work during August and September, and be ready to brief out results at the final workshop in October.

In addition to the constituency groups, we also created a group on metrics to help identify the proper way to define the challenge problems and a group on roadmapping to provide a common reference for how bad silicon devices could become and how fast. The metrics group expanded its scope to consider the composition of metrics and which metrics would enable and characterize cross-layer cooperation and optimization. These groups also presented their status and plans at the workshop and compared notes with the constituency challenge groups.

As a result, this workshop was very much a work-in-progress status report meeting for all involved, giving everyone glimpses of the key issues for the study groups and possible study outcomes and helping keep the groups converging toward the overall goals of the visioning effort. The workshop was roughly organized as:

Presentations and Status

Workshop Opener

The workshop opened with a brief on the goal and status of the study by the organizers (slides). This served both as an introduction and reminder of the goals of the study for the participants and as an opportunity to try out a refined version of the study vision. The briefing highlighted some of the areas of need (e.g., quantified challenge problems) that the focus groups would address in order to provide a complete and solid story. This was also a chance to review the timeline for the study itself so that all participants could see how it should come together.

Metrics

The metrics group has formed a healthy team that was represented by a smaller subset at the meeting itself. Through pre-meeting teleconferences they had scoped their effort and identified subgroups within it. (slides)

Roadmap

The roadmapping group was just being constituted at the time of the meeting and has since begun to meet via regular teleconferences. The workshop provided a jumping-off point to define and refine the goals for the roadmap effort. (slides)

Space/Avionics

The space and avionics group formed earliest, was able to meet in person at other conferences, and had the most in-depth discussions prior to the workshop. (brief-in slides, brief-out slides)

Large-Scale Systems

The large-scale systems group was able to initiate some discussions in advance of the meeting. (brief-in slides, brief-out slides)

Consumer Electronics

The consumer electronics group used the workshop to kick off their discussions. (brief-in slides, brief-out slides) The commercial sector has been seeing an increased need to address failures. So far, solutions have been ad hoc fixes applied wherever a product trips over its acceptable failure rate (e.g., applying ECC selectively to reduce the failure rate).

Challenges:

Potential Solutions:

Common themes identifiable so far...

The groups continue to work on and refine their challenges. From the discussions at the workshop, a few themes are beginning to emerge:

Next Steps

All of these groups need further work to refine their story and make their challenges more quantitative. They will continue to work during the coming month to draft more complete answers for input into the full study report.

Our working schedule is:

We still need to constitute the infrastructure and life-critical groups. We took further contact suggestions from the workshop and are following up on them. Additional suggestions are still appreciated, especially in the life-critical area.

The organizers plan to talk with individual NSF Program Managers in August to help them see where we are and get input on what else may be needed to provide a compelling, convincing, and useful story to them. The organizers are also open to talking to program managers in other agencies where appropriate. Suggestions and contacts are welcome.

The final workshop will be at the IBM conference center in Austin, TX in October. We are taking input on the dates and are particularly trying to select a date that will allow NSF program managers to attend. Preliminary polling is leaning toward late October (the 29th/30th), but we have not yet set a final date.

Meetings/Second/Summary (last edited 2009-08-10 23:00:39 by AndreDeHon)