Differences between revisions 1 and 2

Final Report Outline

(Draft in Progress)

Four Pieces:

Executive Summary
Research Solicitation
Program Vision (main description)
Appendices

Executive Summary

Target: 2p (starting point might be vision document from June)

(Congressional Slogan?)
scaling good, computers everywhere...critical to our lives and economy at many levels
challenges ahead due to reliability and power
hints there is something we can do
need research to develop new solutions
here's what we think we can do for you
need leadership (government leadership)

Research Solicitation

Target: 2p

outline key challenge ahead
kind of research solicited
program structure (? program coverage of promising directions?)
evaluation criteria

Program Vision

What are you trying to do?
- Allow continued scaling benefits
  - Reduce energy/operation
  - Reduce $$/gate
  - Increase ops/time with limited power-density budget
- While maintaining or improving safety
- Navigate inflection points in energy and reliability
Why now?
- inflection points in reliability, energy
- critical deployment of computation
- system size?
How is it done today?
- Demand reliable, consistent device operation
  - Margin for worst-case device effects across billions of devices, over a multi-year lifetime
  - Discard components when devices fail
- System-level redundancy
- The niches where above is not good enough are small but important (avionics, medical)
  - Spend considerable $$, energy for reliability
  - E.g. Brute-force replication
  - Many-year performance lag behind commercial systems
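To make the cost of "brute-force replication" concrete, here is a minimal sketch of triple modular redundancy with majority voting. It is an illustration, not a method taken from this outline; the names `majority_vote`, `run_replicated`, `good_adder`, and `faulty_adder` are hypothetical.

```python
from collections import Counter


def majority_vote(replica_outputs):
    """Return the value a strict majority of replicas agree on, else raise."""
    value, count = Counter(replica_outputs).most_common(1)[0]
    if count * 2 <= len(replica_outputs):
        raise ValueError("no majority among replica outputs")
    return value


def run_replicated(units, *args):
    """Run the same operation on every replica and vote on the results."""
    return majority_vote([unit(*args) for unit in units])


def good_adder(a, b):
    return a + b


def faulty_adder(a, b):
    # Hypothetical upset: bit 2 of the result is flipped.
    return (a + b) ^ 0x4


# One of three replicas is faulty; the vote masks its error.
result = run_replicated([good_adder, faulty_adder, good_adder], 20, 22)
print(result)
```

The vote masks one bad replica, but every operation is paid for three times, which is the overhead the bullets above describe.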
Trends?
- power limited trends? ... gap from margining?
- voltage scaling (ITRS)
- decreased dopants --> variability (ITRS)
- roadmap work on rate of variation-induced defects
- increasing transistors/chip
- increasing system sizes (supercomputers, data centers)
- decreased opportunity for burn-in
- increased wearout effects
- decreased critical charge --> increased upset susceptibility
- roadmap work on {intrinsic,extrinsic} upsets?
- GDP in electronics?
- electronics in critical systems
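A rough worked example of why device count matters for the trends above, assuming (as a simplification not made in this outline) that each of N devices fails independently with probability p over the period of interest. The symbols N and p and the sample values are illustrative only.

```latex
P_{\text{no failure}} = (1 - p)^{N} \approx e^{-Np} \quad (p \ll 1)

% Illustrative values: N = 10^{9} devices, p = 10^{-9} per device
P_{\text{no failure}} \approx e^{-1} \approx 0.37
```

Under these assumptions, holding the system-level failure probability fixed while N grows requires a proportionally smaller p per device, or mitigation above the device level.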
What can we accomplish?
- Build reliable systems from unreliable components
  - Efficiently compensate for unpredictable devices through cooperation at higher levels of system stack
- ground goals
  - scale how much further?
  - allow how many more ops/Joule?
  - how close to raw scaling?
  - extend component life by how much?
  - ...more... (depend on / synch with challenges)
What's new? (Ideas and promising directions)
- Ubiquitously/pervasively exploit:
  - Cross-layer codesign --- Multi-level tradeoffs (generalization of hardware/software)
    - (incl. tools to support, system abstractions, algorithm design)
  - Design prepared for self-assessment of safety margins and repair
  - Cooperative filtering of errors at multiple levels
  - Strategic, low-overhead redundancy
  - Differential reliability
  - Scalable and adaptive solutions
Why do this?
- reliability matters for everything moving forward...just a matter of how much
- Allow scaling to continue without sacrificing safety
  - Continued reduction in energy/op
  - Continued reduction in $$/op
  - Maintain or extend component lifetimes
  - How much further?
- Allow construction of larger, dependable systems
- Make infrastructural technology worthy of the trust we place in it
- Specific big wins (challenges overcome) from focus groups?
  - feasible to fly commercial? and/or have any access to most advanced technology? (close commercial/aerospace component gap?); allow aerospace to exploit modern electronics?
  - advanced tech safe for drive-by-wire?
  - enable larger (more components, computation) medical devices?
  - enable supercomputers able to solve XXX problems?
  - ??? review focus group output and select appropriate to highlight here ???
- Can't have security w/out getting resilience under control?
- Why government leadership?
Big Science Questions?
- How do we organize, manage, and analyze layering for cooperative fault mitigation?
- How do we best accommodate repair?
- What is the right level of filtering at each level of the hierarchy?
- Can we establish a useful theory and collection of design patterns for lightweight checking?
- What would a theory and framework for expressing and reasoning about differential reliability look like?
- Can we develop a scalable theory and architectures that allow adaptation to various upset rates and system reliability targets?
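As one possible instance of the "lightweight checking" asked about above, the sketch below uses a mod-3 residue check on an adder instead of replicating it. This is a standard residue-code idea chosen for illustration, not a technique named in this outline; `checked_add` and `MODULUS` are hypothetical names, and operands are assumed to be non-negative integers.

```python
MODULUS = 3  # any single-bit flip changes a value by +/- 2**k, which is never 0 mod 3


def checked_add(a, b, adder):
    """Add with `adder`, then verify the result against a cheap residue prediction."""
    result = adder(a, b)
    expected_residue = (a % MODULUS + b % MODULUS) % MODULUS
    if result % MODULUS != expected_residue:
        raise ArithmeticError("residue check failed: adder result is corrupt")
    return result


def good_adder(a, b):
    return a + b


def faulty_adder(a, b):
    # Hypothetical upset: bit 2 of the result is flipped.
    return (a + b) ^ 0x4


print(checked_add(20, 22, good_adder))
try:
    checked_add(20, 22, faulty_adder)
except ArithmeticError as err:
    print("detected:", err)
```

The check detects the error but cannot correct it, so it would need to be paired with retry or repair at another level.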
Critical Questions? (big risk items? ... more strategic questions?)
- enable concurrent research on low-level upset and fatigue effects in ever-changing technology alongside high-level mitigation
- manage developer burden (avoid increasing)
- pushing more complexity into software, when we haven't yet mastered reliable software
- ??? others
Metrics, Goals, Measure and manage programs
- Goals of metrics
  - assess whether proposed research attacks the right problems
  - measure whether research is making progress on solving the problem
- Some possible, primary metrics
  - Energy/Op at noise rate and performance target (Noise rate: defects, variation, wear, transients)
  - Post-fab adaptability to range of noise rates
  - Timeliness and quality of adaptation
- Recommendations from metrics group
Examples and Illustrative Scenarios
- Processor
- SoC
- High-level software
Challenge problems and areas of pain
- common/cross-cutting challenges
- by focus group (5)
Mission Impacts? (perhaps related to the specific challenges overcome summarized above---but the detailed support goes here; these are also the challenges/targets/opportunities for more mission-oriented agencies)
- security
- cyberphysical
- satellite
- supercomputer
- ???
Infrastructure
Layer discussion and community contributions: probably good to have a clear description of the layers and call out the wide range of communities that could participate in this research
Process and Participants
- summarize activities of group (meetings, wiki, focus groups...)
- comprehensive list of participants

Appendix

focus group reports

- Revision 1 as of 2009-10-02 18:58:54
  Size: 6804
  Editor: AndreDeHon
  Comment: initial draft
+ Revision 2 as of 2009-10-02 21:34:24
  Size: 7206
  Editor: AndreDeHon
  Comment: rules between sections
 Line 11:
+-------------------------------------------------------------------------------------------------
 Line 27:
+-------------------------------------------------------------------------------------------------
-Line 36:
+Line 38:
+-------------------------------------------------------------------------------------------------
-Line 163:
+Line 167:
+-------------------------------------------------------------------------------------------------