Differences between revisions 1 and 2
Revision 1 as of 2009-10-02 18:58:54
Size: 6804
Editor: AndreDeHon
Comment: initial draft
Revision 2 as of 2009-10-02 21:34:24
Size: 7206
Editor: AndreDeHon
Comment: rules between sections

Final Report Outline

(Draft in Progress)

Four Pieces:

  1. Executive Summary
  2. Research Solicitation
  3. Program Vision (main description)
  4. Appendices


Executive Summary

Target: 2p (starting point might be vision document from June)

  • (Congressional Slogan?)
  • scaling good, computers everywhere...critical to our lives and economy at many levels
  • challenges ahead due to reliability and power
  • hints there is something we can do
  • need research to develop new solutions
  • here's what we think we can do for you
  • need leadership (government leadership)


Research Solicitation

Target: 2p

  • outline key challenge ahead
  • kind of research solicited
  • program structure (? program coverage of promising directions?)
  • evaluation criteria


Program Vision

  • What are you trying to do?
    • Allow continued scaling benefits
      • Reduce energy/operation
      • Reduce $$/gate
      • Increase ops/time with limited power-density budget
    • While maintaining or improving safety
    • Navigate inflection points in energy and reliability
  • Why now?
    • inflection points in reliability, energy
    • critical deployment of computation
    • system size?
  • How is it done today?
    • Demand reliable, consistent device operation
      • Margin for worst-case device effects across billions of devices, over a multi-year lifetime
      • Discard components when devices fail
    • System-level redundancy
    • The niches where above is not good enough are small but important (avionics, medical)
      • Spend considerable $$, energy for reliability
      • E.g. Brute-force replication
      • Many-year performance lag behind commercial systems
  • Trends?
    • power limited trends? ... gap from margining?
    • voltage scaling (ITRS)
    • decreased dopants --> variability (ITRS)

    • roadmap work on rate of variation-induced defects
    • increasing transistors/chip
    • increasing system sizes (supercomp, data centers)
    • decreased opportunity for burn-in
    • increased wearout effects
    • decreased critical charge --> increased upset susceptibility

    • roadmap work on {intrinsic,extrinsic} upsets?
    • GDP in electronics?
    • electronics in critical systems
  • What can we accomplish?
    • Build reliable systems from unreliable components: efficiently compensate for unpredictable devices through cooperation at higher levels of the system stack
    • ground goals
      • scale how much further?
      • allow how many more ops/Joule?
      • how close to raw scaling?
      • extend component life by how much?
      • ...more... (depend on / synch with challenges)
  • What's new? (Ideas and promising directions)
    • Ubiquitously/pervasively exploit:
      • Cross-layer codesign --- Multi-level tradeoffs (generalization of hardware/software)
        • (incl. tools to support, system abstractions, algorithm design)
      • Design prepared for self-assessment of safety margins and repair
      • Cooperative filtering of errors at multiple levels
      • Strategic, low-overhead redundancy
      • Differential reliability
      • Scalable and adaptive solutions
  • Why do this?
    • reliability matters for everything moving forward...just a matter of how much
    • Allow scaling to continue without sacrificing safety
      • Continued reduction in energy/op
      • Continued reduction in $$/op
      • Maintain or extend component lifetimes
      • How much further?
    • Allow construction of larger, dependable systems
    • Make infrastructural technology worthy of the trust we place in it
    • Specific big wins (challenges overcome) from focus groups?
      • feasible to fly commercial? and/or have any access to most advanced technology? (close commercial/aerospace component gap?); allow aerospace to exploit modern electronics?
      • advanced tech safe for drive-by-wire?
      • enable larger (more components, computation) medical devices?
      • enable supercomputers able to solve XXX problems?
      • ??? review focus group output and select appropriate to highlight here ???
    • Can't have security w/out getting resilience under control ?
    • Why government leadership?
  • Big Science Questions?
    • How do we organize, manage, and analyze layering for cooperative fault mitigation?
    • How do we best accommodate repair?
    • What is the right level of filtering at each level of the hierarchy?
    • Can we establish a useful theory and collection of design patterns for lightweight checking?
    • What would a theory and framework for expressing and reasoning about differential reliability look like?
    • Can we develop a scalable theory and architectures that allow adaptation to various upset rates and system reliability targets?
  • Critical Questions? (big risk items? ... more strategic questions?)
    • enable concurrent research in understanding low-level upset, fatigue effects with ever changing technology along with high-level mitigation
    • manage developer burden (avoid increasing)
    • pushing more complexity into software, when we haven't mastered reliable software
    • ??? others
  • Metrics, Goals, Measure and manage programs
    • Goals of metrics
      • assess whether proposed research attacks the right problems
      • measure whether research is making progress on solving the problem
    • Some possible, primary metrics
      • Energy/Op at noise rate and performance target (Noise rate: defects, variation, wear, transients)
      • Post-fab adaptability to range of noise rates
      • Timeliness and quality of adaptation
    • Recommendations from metrics group
  • Examples and Illustrative Scenarios
    • Processor
    • SoC
    • High-level software
  • Challenge problems and areas of pain
    • common/cross-cutting challenges
    • by focus group (5)
  • Mission Impacts? (perhaps related to stuff trying to summarize above under specific challenges overcome---but, the detail support goes here; this is also the challenges/targets/opportunities for more mission oriented agencies)
    • security
    • cyberphysical
    • satellite
    • supercomputer
    • ???
  • Infrastructure
  • Layer discussion and community contributions: probably good to have a clear description of the layers and call out the wide range of communities that could participate in this research
  • Process and Participants
    • summarize activities of group (meetings, wiki, focus groups...)
    • comprehensive list of participants
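
Illustration for the "Brute-force replication" item under "How is it done today?": a minimal sketch, not part of the outline itself, of triple replication with majority voting. It assumes replicas fail independently with the same probability p and that the voter itself never fails; both are simplifying assumptions, and the function names are hypothetical.

```python
def majority_vote(a, b, c):
    """Return the value that at least two replicas agree on, or None if all differ."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


def tmr_failure_probability(p: float) -> float:
    """Probability that a 3-way majority vote is wrong.

    Assumes each replica fails independently with probability p
    and that the voter is perfect (illustrative assumptions).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    # The vote fails when exactly two replicas fail, or all three fail.
    return 3 * p**2 * (1 - p) + p**3


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(f"replica failure {p}: voted failure {tmr_failure_probability(p):.3e}")
```

Under these assumptions the voted result is more reliable than a single replica whenever p < 0.5, but the system pays roughly three times the components and energy per operation. That overhead is the cost the "Strategic, low-overhead redundancy" direction aims to reduce.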


Appendix

  • focus group reports

FinalReportOutline (last edited 2010-04-01 16:41:34 by AndreDeHon)