== Question ==
What would a theory and framework for expressing and reasoning about differential reliability look like?

== Summary ==
Not all computations need the same level of protection (e.g., computations on noisy inputs need not produce results more accurate than the input noise level allows). Further, reliability in some parts of a computation may allow other parts to tolerate lower reliability (e.g., a reliable control loop with a lightweight check allows the checked computation to execute at lower reliability; convergent computations may self-correct when errors occur as long as the convergence test is reliable; see the first sketch under Illustrative Sketches below).

== Subquestions ==
* How do we express (identify, analyze) and exploit the allowable noise (error rates) from the substrate for a computation or a piece of a computation?
* What is the value of reflecting/exposing errors to the application level, and what is the proper way to do so?
* How do we express the reliability needs of an application, or how can we analyze the needs of its subcomponents based on the structure of the application?

== Relevant Scenarios ==
* Conventional DRAMs provide a good example. The array core is prone to error. The peripheral circuitry is implemented in a more reliable technology (usually a coarser feature size) and can correct errors in the core (e.g., ECC circuitry corrects bit errors in memory; see the second sketch under Illustrative Sketches below).
* [[Scenarios/S1|Scenario 1]]

== Workshop Materials ==
* [[attachment:Meetings/First/Program/diffrel.pdf|Workshop 1 Slides]]

== Existing Work ==
* ''add additional references here''

== Comments ==
* What is the impact of a bad data bit on a computation? Catastrophic? Does it blow away the application? Only soft state (blow away the state)? Only an optimization (like branch prediction)? Does it change an instruction opcode? Currently, there is no way to type data for the compiler/CAD tool and/or "tag" data in memory to indicate how it will be used (the tag could come from the application, the compiler, or the system). If we had this, we could handle different data with different levels of redundancy, corruption detection, etc. (Compare: large-scale distributed systems and avionics do this by putting different kinds of state in different state stores that embody different failure requirements/assumptions. Can we do this at a finer grain? Automatically? See the third sketch under Illustrative Sketches below.)
* Another degree of freedom for this is temporal: when to turn on/off self-checking mechanisms that have low areal overhead but high power cost (see the last sketch under Illustrative Sketches below).
* ''To comment, please add another bullet to this list.''
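== Illustrative Sketches ==
The sketches below are illustrative only; all code, names, and fault models are assumptions made for this page, not descriptions of any existing system.

First, a minimal Python model of the Summary's convergence argument: the iteration body runs on a simulated low-reliability substrate that occasionally perturbs its result, while the convergence test runs reliably, so errors delay convergence rather than corrupt the answer. The fault model (a random multiplicative perturbation at an assumed `error_rate`) is invented for illustration.

{{{#!python
import random

def unreliable_step(x, a, error_rate=0.05):
    """One Newton iteration for sqrt(a), run on a simulated
    low-reliability substrate that sometimes perturbs the result."""
    x = 0.5 * (x + a / x)
    if random.random() < error_rate:
        x *= random.uniform(0.5, 2.0)  # simulated soft error
    return x

def reliable_sqrt(a, tol=1e-12, max_iter=1000):
    """The convergence test is the reliable part: a perturbed iterate
    simply fails the residual check and the iteration self-corrects."""
    x = a if a > 1 else 1.0
    for _ in range(max_iter):
        x = unreliable_step(x, a)
        if abs(x * x - a) < tol * a:  # reliable residual check
            return x
    raise RuntimeError("did not converge")

print(reliable_sqrt(2.0))  # ~1.4142135623731
}}}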
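Second, a toy error-correcting code in the spirit of the DRAM scenario: a Hamming(7,4) encoder/decoder in which decoding (the "reliable periphery") corrects a single flipped bit in the stored codeword (the "error-prone array core"). Real DRAM ECC uses wider SECDED codes (e.g., 72-bit words holding 64 data bits); this is only a minimal illustration of the principle.

{{{#!python
def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit Hamming codeword
    (positions 1..7 = p1 p2 d0 p3 d1 d2 d3)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]  # covers positions 4,5,6,7
    return [p1, p2, d[0], p3, d[1], d[2], d[3]]

def hamming74_decode(bits):
    """Correct up to one flipped bit, then return the 4 data bits."""
    p1, p2, d0, p3, d1, d2, d3 = bits
    s1 = p1 ^ d0 ^ d1 ^ d3
    s2 = p2 ^ d0 ^ d2 ^ d3
    s3 = p3 ^ d1 ^ d2 ^ d3
    syndrome = s1 | (s2 << 1) | (s3 << 2)  # 1-based error position, 0 if clean
    if syndrome:
        bits = bits[:]
        bits[syndrome - 1] ^= 1  # the reliable periphery corrects the core
    _, _, d0, _, d1, d2, d3 = bits
    return d0 | (d1 << 1) | (d2 << 2) | (d3 << 3)

word = hamming74_encode(0b1011)
word[4] ^= 1  # single soft error in the "array core"
assert hamming74_decode(word) == 0b1011
}}}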
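Third, a hypothetical sketch of the data-tagging idea from the Comments: each value carries a criticality tag, and a toy store routes each tag to a different protection policy, standing in for the separate state stores that distributed systems and avionics use. The names (`Criticality`, `TaggedStore`) and policies are invented for illustration; no such API exists.

{{{#!python
from enum import Enum

class Criticality(Enum):
    CATASTROPHIC = "replicate+ecc"  # a bad bit corrupts control flow or persistent state
    SOFT_STATE = "ecc"              # a bad bit costs a rebuild of the state
    OPTIMIZATION = "none"           # a bad bit only degrades performance (cf. branch prediction)

class TaggedStore:
    """Toy store: one backing store per tag; a real system would instead
    pick a memory region, encoding, or scrubbing rate per tag."""
    def __init__(self):
        self.stores = {c: {} for c in Criticality}

    def put(self, key, value, tag):
        self.stores[tag][key] = value

    def get(self, key, tag):
        return self.stores[tag][key]

store = TaggedStore()
store.put("pc", 0x400080, Criticality.CATASTROPHIC)
store.put("branch_hint", True, Criticality.OPTIMIZATION)
}}}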
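Finally, a sketch of the temporal degree of freedom: a scoped on/off knob for a checker that is cheap in area but costly in power, enabled only around phases whose corruption would be catastrophic. The global `CHECKS_ON` flag is a software stand-in for a hardware checker-enable bit, invented for illustration.

{{{#!python
from contextlib import contextmanager

CHECKS_ON = False  # stand-in for a hardware checker-enable bit

@contextmanager
def self_checking(enabled):
    """Scope the power-hungry checker to the phases that need it."""
    global CHECKS_ON
    prev, CHECKS_ON = CHECKS_ON, enabled
    try:
        yield
    finally:
        CHECKS_ON = prev

def step(x):
    y = x + 1
    if CHECKS_ON:
        assert y > x  # placeholder invariant check
    return y

x = 0
with self_checking(True):   # critical phase: pay the power cost
    x = step(x)
with self_checking(False):  # best-effort phase: checker off
    x = step(x)
}}}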