= Research Organization and Infrastructure =

[''This is a discussion starter rather than any recommendation at this point. It is an issue that we should address in our final recommendations.'']

An issue that came up in discussion with NSF program managers was infrastructure to support this research. Their view was that NSF was not very good at supporting projects that require infrastructural support, and that it might work better if NSF partnered with another agency such as DARPA or SRC.

The issue here is that cross-layer work means we are doing things at high levels (OS, software, architecture) that must address low-level problems (transient upsets, noise). If we had a perfect model of what low-level effects look like, we could comfortably separate the problem; higher-level work could test ideas against the model. However, if we have a poor understanding of the low-level effects of current and future technologies, there may be no way to be confident in a high-level technique other than to measure it in an appropriate system. This is particularly limiting when we are trying to address technology that does not yet exist. It seems clear from the study that:
* we do not have a perfect characterization of what is happening at the device and component level
* almost all of our characterizations are empirical rather than bottom-up phenomenological; we are always playing catch-up with the scientific understanding
* things change as we scale, and we are unlikely to fully know what a technology does until it has had years of use, when it is already too late

Given the above, we cannot afford to simply wait for complete models. Nor, however, can we expect high-level work that is not grounded in experimentation to stay relevant. Low-level experiments require expensive resources and infrastructure:
* fabricating chips to prove out circuit and architecture ideas
* accelerated particle-beam testing of chips and techniques

A related question is how individuals or small groups participate. If we had a common research infrastructure platform for this cross-layer cooperation, groups could work with the common infrastructure and augment or replace components. However, how to design the interfaces and organize such an infrastructure is itself one of the big research questions to address now. Designing, assembling, and packaging the infrastructure framework is part of the initial research needed.

''How can any potential programs allow concurrent effort at various levels while keeping the research relevant?''

* '''monolithic/integrated teams''': A DARPA-style model might force team formation across the layers, so that high-level and low-level people work together and plan experimentation jointly. This gives good validation, but it demands large teams, and only a limited number of designs can be fully fabricated and tested. This model is not consistent with the smaller grants given by organizations like NSF.
* '''hardware research prototyping platforms''': There has been some success with hardware fault-injection platforms based on FPGAs or other hardware simulators. Here, the injection models are validated against beam tests, and architecture and software work can run on the hardware fault-injection platforms (reference Sanda/Power6, Quinn/LANL?, Sass/UNCC). Perhaps a common FPGA platform (maybe the NSF-supported RAMP platform) would be accurate enough to validate high-level research. Some researchers could provide various fault-injection models (single bit flip, correlated, bursty, ...); a minimal sketch of such a model appears at the end of this section. Researchers working directly with beams could validate and correlate these models against beam tests. This allows smaller, more decoupled efforts while providing points of interface between the various experts. The key question here is whether the injection models can be made good enough. These platforms help decouple beam testing from higher-level work, but the full system above the circuit level still has to be developed.
* '''centers''': As noted above, a research infrastructure platform is a key need and itself requires research and experimentation. This, too, is multi-layer and demands close cooperation among a large team of experts. One model to bootstrap this would be to fund a modest number of parallel team efforts to develop such infrastructure stacks. This might be consistent with a focus-center model.
* '''shared databases/repositories''': Provide a common repository for information on transients and aging. Those working at the device and component level provide input to these databases, and those working at higher levels build on them; a sketch of what such a record might contain appears at the end of this section. Another candidate for shared databases may be upsets observed on supercomputers and public grids (e.g. NSF TeraGrid). Organizations like NASA might have databases on component reliability that the community could leverage.
* '''shared grid platforms''': Industry, supercomputer centers, or NSF might also provide grid platforms useful for some classes of experiments. This would be limited to experimentation on current (old) technology and architectures, but could prove out some ideas and facilitate raw data collection. It might also be useful to consider dedicated clusters that allow the use of experimental OSes (and firmware and FPGA configurations, as appropriate). For example, the RAMP effort has proposed a cluster of servers with RAMP-based platforms available for community use.

[''Please suggest additional ideas and considerations.'']
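To illustrate the kind of fault-injection models mentioned under ''hardware research prototyping platforms'', the sketch below shows a minimal software-level injector supporting single-bit-flip and bursty upsets. This is only an illustration of the interface such interchangeable models might expose, not a description of any actual platform; the class name, parameters, and fault-model names are hypothetical assumptions.

<pre>
import random

class FaultInjector:
    """Minimal sketch of a fault-injection model (hypothetical interface).

    Two illustrative upset models:
      - "single": flip one randomly chosen bit in a word
      - "burst":  flip a run of adjacent bits (e.g. a multi-bit upset)
    """

    def __init__(self, word_bits=32, model="single", burst_len=4, seed=None):
        self.word_bits = word_bits
        self.model = model
        self.burst_len = burst_len
        self.rng = random.Random(seed)

    def inject(self, word):
        """Return `word` with the chosen upset model applied."""
        if self.model == "single":
            bit = self.rng.randrange(self.word_bits)
            return word ^ (1 << bit)
        if self.model == "burst":
            start = self.rng.randrange(self.word_bits - self.burst_len + 1)
            mask = ((1 << self.burst_len) - 1) << start
            return word ^ mask
        raise ValueError("unknown fault model: " + self.model)

# Example: inject a single-bit upset into a 32-bit memory word
injector = FaultInjector(word_bits=32, model="single", seed=42)
print(hex(injector.inject(0xDEADBEEF)))
</pre>

On an FPGA platform the equivalent models would presumably be realized in hardware; the point is only that different groups could supply interchangeable fault models behind a common, agreed interface.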
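Similarly, the ''shared databases/repositories'' item would require an agreed record format for upset and aging events. The field names below are purely illustrative assumptions about what such a record might capture; the study has not defined any schema.

<pre>
from dataclasses import dataclass
from typing import Optional

@dataclass
class UpsetRecord:
    """Illustrative (hypothetical) record for a shared transient/aging repository."""
    timestamp_utc: str                   # when the event was observed
    facility: str                        # e.g. supercomputer center, beam facility, field system
    technology_node_nm: float            # process node of the affected part
    component: str                       # e.g. "SRAM", "DRAM", "flip-flop", "logic"
    event_type: str                      # e.g. "SEU", "MBU", "latch-up", "aging degradation"
    bits_affected: int                   # number of bits upset, if known
    altitude_m: Optional[float] = None   # environment data, when available
    workload: Optional[str] = None       # what the system was running
    notes: Optional[str] = None

# Example entry, as a device-level contributor might log it (values are made up)
record = UpsetRecord(
    timestamp_utc="2009-06-01T12:00:00Z",
    facility="example supercomputer center",
    technology_node_nm=45.0,
    component="SRAM",
    event_type="SEU",
    bits_affected=1,
)
</pre>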