= Research Organization and Infrastructure =

[''This is a discussion starter rather than any recommendation at this point. It is an issue that we should address in our final recommendations.'']

An issue that came up in discussion with NSF program managers was infrastructure to support this research. Their view was that NSF was not very good at supporting projects that require infrastructural support, and that it might work better if NSF partnered with another agency such as DARPA or SRC.

The issue here is that cross-layer work means we are doing things at high levels (OS, software, architecture) that must address low-level problems (transient upsets, noise). If we had a perfect model of what low-level effects look like, we could comfortably separate the problem; higher-level work could test ideas against the model. However, if we have a poor understanding of the low-level effects of current and future technologies, there may be no way to be confident in a high-level technique other than to measure it in an appropriate system. This is particularly limiting when we are trying to address technology that does not yet exist. It seems clear from the study that:
* we do not have a perfect characterization of what is happening at the device and component level
* almost all of our characterizations are empirical rather than bottom-up phenomenological; we are always playing catch-up with the scientific understanding
* things change as we scale, and we are unlikely to fully know what a technology does until it has had years of use, when it is already too late

Given the above, we cannot afford to simply wait for complete models. Nor, however, can we expect high-level work that is not grounded in experimentation to stay relevant. Low-level experiments require expensive resources and infrastructure:
* fabricating chips to prove out circuit and architecture ideas
* accelerated particle-beam testing of chips and techniques

A related question is how individuals or small groups participate. If we had a common research infrastructure platform for this cross-layer cooperation, groups could work with the common infrastructure and augment or replace components. However, how to design the interfaces and organize such an infrastructure is itself one of the big research questions to address now. Designing, assembling, and packaging the infrastructure framework is part of the initial research needed.

''How can any potential programs allow concurrent effort at various levels while keeping the research relevant?''

* '''monolithic/integrated teams''': A DARPA-style model might force team formation across the layers, so that high-level and low-level people work together and plan experimentation jointly. This gives good validation, but it demands large teams, and only a limited number of designs can be fully fabricated and tested. This model is not consistent with the smaller grants given by organizations like NSF.
* '''hardware research prototyping platforms''': There has been some success with hardware fault-injection platforms based on FPGAs or other hardware simulators. Here, the injection models are validated against beam tests, and architecture and software work can run on the hardware fault-injection platforms (reference Sanda/Power6, Quinn/LANL?, Sass/UNCC). Perhaps a common FPGA platform (maybe the NSF-supported RAMP platform) would be accurate enough to validate high-level research. Some researchers could provide various fault-injection models (single bit flip, correlated, bursty, ...); a minimal sketch of such a model appears at the end of this section. Researchers working directly with beams could validate and correlate these models against beam tests. This allows smaller, more decoupled efforts while providing points of interface between the various experts. The key question here is whether the injection models can be made good enough. These platforms help decouple beam testing from higher-level work, but the full system above the circuit level still has to be developed.
* '''centers''': As noted above, a research infrastructure platform is a key need and itself requires research and experimentation. This, too, is multi-layer and demands close cooperation among a large team of experts. One model to bootstrap this would be to fund a modest number of parallel team efforts to develop such infrastructure stacks. This might be consistent with a focus-center model.
* '''shared databases/repositories''': Provide a common repository for information on transients and aging. Those working at the device and component level provide input to these databases, and those working at higher levels build on them; a sketch of what such a record might contain appears at the end of this section. Another candidate for shared databases may be upsets observed on supercomputers and public grids (e.g. NSF TeraGrid). Organizations like NASA might have databases on component reliability that the community could leverage.
* '''shared grid platforms''': Industry, supercomputer centers, or NSF might also provide grid platforms useful for some classes of experiments. This would be limited to experimentation on current (old) technology and architectures, but could prove out some ideas and facilitate raw data collection. It might also be useful to consider dedicated clusters that allow the use of experimental OSes (and firmware and FPGA configurations, as appropriate). For example, the RAMP effort has proposed a cluster of servers with RAMP-based platforms available for community use.

[''Please suggest additional ideas and considerations.'']
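To illustrate the kind of fault-injection models mentioned under ''hardware research prototyping platforms'', the sketch below shows a minimal software-level injector supporting single-bit-flip and bursty upsets. This is only an illustration of the interface such interchangeable models might expose, not a description of any actual platform; the class name, parameters, and fault-model names are hypothetical assumptions.

<pre>
import random

class FaultInjector:
    """Minimal sketch of a fault-injection model (hypothetical interface).

    Two illustrative upset models:
      - "single": flip one randomly chosen bit in a word
      - "burst":  flip a run of adjacent bits (e.g. a multi-bit upset)
    """

    def __init__(self, word_bits=32, model="single", burst_len=4, seed=None):
        self.word_bits = word_bits
        self.model = model
        self.burst_len = burst_len
        self.rng = random.Random(seed)

    def inject(self, word):
        """Return `word` with the chosen upset model applied."""
        if self.model == "single":
            bit = self.rng.randrange(self.word_bits)
            return word ^ (1 << bit)
        if self.model == "burst":
            start = self.rng.randrange(self.word_bits - self.burst_len + 1)
            mask = ((1 << self.burst_len) - 1) << start
            return word ^ mask
        raise ValueError("unknown fault model: " + self.model)

# Example: inject a single-bit upset into a 32-bit memory word
injector = FaultInjector(word_bits=32, model="single", seed=42)
print(hex(injector.inject(0xDEADBEEF)))
</pre>

On an FPGA platform the equivalent models would presumably be realized in hardware; the point is only that different groups could supply interchangeable fault models behind a common, agreed interface.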
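Similarly, the ''shared databases/repositories'' item would require an agreed record format for upset and aging events. The field names below are purely illustrative assumptions about what such a record might capture; the study has not defined any schema.

<pre>
from dataclasses import dataclass
from typing import Optional

@dataclass
class UpsetRecord:
    """Illustrative (hypothetical) record for a shared transient/aging repository."""
    timestamp_utc: str                   # when the event was observed
    facility: str                        # e.g. supercomputer center, beam facility, field system
    technology_node_nm: float            # process node of the affected part
    component: str                       # e.g. "SRAM", "DRAM", "flip-flop", "logic"
    event_type: str                      # e.g. "SEU", "MBU", "latch-up", "aging degradation"
    bits_affected: int                   # number of bits upset, if known
    altitude_m: Optional[float] = None   # environment data, when available
    workload: Optional[str] = None       # what the system was running
    notes: Optional[str] = None

# Example entry, as a device-level contributor might log it (values are made up)
record = UpsetRecord(
    timestamp_utc="2009-06-01T12:00:00Z",
    facility="example supercomputer center",
    technology_node_nm=45.0,
    component="SRAM",
    event_type="SEU",
    bits_affected=1,
)
</pre>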