Differences between revisions 11 and 12

Final Report Outline

(Draft in Progress)

Four Pieces:

Executive Summary
Research Solicitation
Program Vision (main description)
Appendices

Executive Summary

Target: 2p (starting point might be vision document from June)

(It would be good to have a pithy congressional slogan. Extending the technology revolution?)
scaling good, computers everywhere...critical to our lives and economy at many levels
challenges ahead due to reliability and power
hints there is something we can do
need research to develop new solutions
here's what we think we can do for you
need leadership (government leadership)

Research Solicitation

Target: 2p

(This should be a stand-alone piece. There is some question of whether or not it should be up front like this. Also, there are questions about whether or not this steps on the PMs' toes.)

outline key challenge ahead
kind of research solicited
program structure (? program coverage of promising directions?)
evaluation criteria

Program Vision

What are you trying to do?
- Allow continued scaling benefits
  - Reduce energy/operation
  - Reduce $$/gate
  - Increase ops/time with limited power-density budget
- While maintaining or improving safety
- Navigate inflection points in energy and reliability
Why now?
- inflection points in reliability, energy
- critical deployment of computation
- system size?
How is it done today?
- Demand reliable, consistent device operation
  - Margin for worst-case device effects across billions of devices, over a multi-year lifetime
  - Discard components when devices fail
- System-level redundancy
- The niches where above is not good enough are small but important (avionics, medical)
  - Spend considerable $$, energy for reliability
  - E.g. Brute-force replication
  - Many-year performance lag behind commercial systems
Trends?
- power limited trends? ... gap from margining?
- voltage scaling (ITRS)
- decreased dopants --> variability (ITRS)
- roadmap work on rate of variation-induced defects
- increasing transistors/chip
- increasing system sizes (supercomp, data centers)
- decreased opportunity for burn-in
- increased wear-out effects
- decreased critical charge --> increased upset susceptibility
- roadmap work on {intrinsic,extrinsic} upsets?
- GDP in electronics?
- electronics in critical systems
What can we accomplish?
- Build reliable systems from unreliable components
  - Efficiently compensate for unpredictable devices through cooperation at higher levels of system stack
- Ground goals
  - scale how much further?
  - allow how many more ops/Joule?
  - how close to raw scaling?
  - extend component life by how much?
  - ...more... (depend on / synch with challenges)
What's new? (Ideas and promising directions)
- Ubiquitously/pervasively exploit:
  - Cross-layer codesign --- Multi-level tradeoffs (generalization of hardware/software)
    - (incl. tools to support, system abstractions, algorithm design)
  - Design prepared for self-assessment of safety margins and repair
  - Cooperative filtering of errors at multiple levels
  - Strategic, low-overhead redundancy
  - Differential reliability
  - Scalable and adaptive solutions
Why do this?
- Reliability matters for everything moving forward...just a matter of how much (in the past, for a large class of applications and system sizes, the base technology was reliable enough that there was no need to address reliability in the design; as we move forward, between increasing technology noise and increasing base system sizes, it must be addressed in design for all applications.)
- Allow scaling to continue without sacrificing safety
  - Continued reduction in energy/op
  - Continued reduction in $$/op
  - Maintain or extend component lifetimes
  - How much further?
- Allow construction of larger, dependable systems
- Make infrastructural technology worthy of the trust we place in it
- Specific big wins (challenges overcome) from focus groups?
  - feasible to fly commercial? and/or have any access to the most advanced technology? (close commercial/aerospace component gap?); allow aerospace to exploit modern electronics?
  - advanced technology safe for drive-by-wire?
  - enable larger (more components, computation) medical devices?
  - enable supercomputers able to solve XXX problems?
  - ??? review focus group output and select appropriate to highlight here ???
- Can't have security without getting resilience under control
- Why government leadership?
Challenge problems and areas of pain
- Common/cross-cutting challenges
  - Varying demands, workloads, and environment (and uncertainty about the environment) mean that worst-case design is overdesign for most uses. This motivates adaptive solutions.
  - Worst-case design independent of the application and its needs is too expensive. Similarly, worst-case design for uncommon, but potentially avoidable, worst-case scenarios is also a large, unnecessary cost. These motivate cross-layer, application-aware solutions and/or models/middleware that support management of the operational aspects of the application.
  - Fully custom/unique construction of all components is not viable (costs, manpower) for anyone. Some domains see more acute versions of this, but no domain is really able to do everything custom themselves these days. This motivates interfaces/metrics/tools to perform composition/analysis/optimization/validation of separately sourced (sub)components.
  - Across the board, there is considerable conservative overdesign. This motivates a system assessment methodology and tool support for exploring the energy-delay-area-reliability-thermal-mechanical space.
  - Environment, energy demands, deployed system context, and even technology noise and maturity are all late bound, possibly not known during design, and maybe not known until deployment. This motivates modes and configuration options that allow the component to tune what it spends on reliability. This could allow commercial devices to enhance yield or operate at extremely low energy levels while also making the same parts more usable in larger scale systems or harsher environments.
- By focus group (5)
  - Commercial
    - Address growing reliability challenge with small enough overhead to avoid negating benefit of scaling
    - Reduce energy per operation while retaining reliable operation
    - Maintain or extend lifetimes in face of increasing wear effects
    - Economically address demand for components with different reliability needs
    - Navigating complex, multidimensional design space
  - Aerospace
    - System lifetimes >> changes in political and scientific need
    - Navigating complex, multidimensional design space
    - Widening gap between commercial and mil/aero components
    - Design for (uncommon) worst-case environment
    - Bottleneck in testing
    - Focus on part reliability over system reliability
  - Large Scale
    - Overhead required to achieve reliability using current and traditional fault-tolerance approaches is too high.
  - Life Critical
  - Infrastructure
Big Science Questions?
- How do we organize, manage, and analyze layering for cooperative fault mitigation?
- How do we best accommodate repair?
- What is the right level of filtering at each level of the hierarchy?
- Can we establish a useful theory and collection of design patterns for lightweight checking?
- What would a theory and framework for expressing and reasoning about differential reliability look like?
- Can we develop a scalable theory and architectures that allow adaptation to various upset rates and system reliability targets?
Mission Impacts? (perhaps related to stuff trying to summarize above under specific challenges overcome---but, the detail support goes here; this is also the challenges/targets/opportunities for more mission-oriented agencies)
- security
- cyberphysical
- satellite
- supercomputer
- ???
Critical Questions? (big risk items? ... more strategic questions?)
- enable concurrent research in understanding low-level upset, fatigue effects with ever changing technology along with high-level mitigation
- manage developer burden (avoid increasing)
- must be careful pushing more complexity into software when we haven't mastered software reliability
- ??? others
Metrics, Goals, Measure and manage programs
- Goals of metrics
  - assess whether proposed research attacks the right problems
  - measure whether research is making progress on solving the problem
- Some possible, primary metrics
  - Energy/Op at noise rate and performance target (Noise rate: defects, variation, wear, transients)
  - Post-fab adaptability to range of noise rates
  - Timeliness and quality of adaptation
- Recommendations from metrics group

Research Organization and Infrastructure (see discussion starter ResearchOrgInfra)
Examples and Illustrative Scenarios
- Processor
- SoC
- High-level software
Layer discussion and community contributions: probably good to have a clear description of the layers and call out the wide range of communities that could participate in this research
Process and Participants
- summarize activities of group (meetings, wiki, focus groups...)
- comprehensive list of participants

Appendix

focus group reports

- Revision 11 as of 2009-10-06 20:49:38
  Size: 10566
  Editor: AndreDeHon
  Comment: answering question
+ Revision 12 as of 2009-10-06 22:18:23
  Size: 10495
  Editor: AndreDeHon
  Comment: comment about solicitation, trying to fix positioning of last few bullets
 Line 33:
-    (''Is this supposed to be a separate document, because it reads separate''---AMD: it should be stand-alone;  this is the thing that might be close to an actual call for proposals, while the details which follow provides rationale, details, and programatics)
+    (''This should be a stand-alone piece.  There is some question of whether or not it should be up front like this.  Also, there are questions about whether or not this steps on PMs feet.'')
 Line 189:
-  * Layer discussion and community contributions: probably good to have a clear description of the layers and callout the wide range of communities that could participate in this research

  * Process and Participants
+ * Layer discussion and community contributions: probably good to have a clear description of the layers and callout the wide range of communities that could participate in this research

 * Process and Participants