
Update

I don't use burnMMX anymore. On modern hardware, it's necessary to stabilize CPU cores, integrated memory controllers (IMCs), and RAM. RAM and IMCs are obviously deeply intertwined. Properly stressing the CPU cores requires tight loops that exercise the execution and scheduling units of the CPU, while stressing the memory subsystem requires calculations to be performed on very large datasets (typically sparse matrices in practice). Therefore, it's generally necessary to run at least two different stress programs.
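
To illustrate the distinction, here is a minimal C sketch (not any particular tool, and with arbitrary example sizes) contrasting the two kinds of load: a register-bound loop that mostly exercises the execution and scheduling units, and a pass over a buffer far larger than the caches that mostly exercises the IMC and RAM.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint64_t cpu_stress(uint64_t iters)
{
    /* Stays in registers: mostly exercises the execution and scheduling units. */
    uint64_t x = 0x9e3779b97f4a7c15ULL;
    for (uint64_t i = 0; i < iters; ++i)
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
    return x;
}

static uint64_t mem_stress(volatile uint64_t *buf, size_t n, int passes)
{
    /* Sweeps a buffer far larger than cache: mostly exercises RAM and the IMC.
     * volatile keeps the compiler from eliding the memory traffic. */
    uint64_t sum = 0;
    for (int p = 0; p < passes; ++p)
        for (size_t j = 0; j < n; ++j) {
            buf[j] = buf[j] * 2862933555777941757ULL + j;
            sum += buf[j];
        }
    return sum;
}

int main(void)
{
    /* 1 GiB is an arbitrary example size; a real test would scale this to fill RAM. */
    size_t n = (1UL << 30) / sizeof(uint64_t);
    uint64_t *buf = calloc(n, sizeof *buf);
    if (!buf) { perror("calloc"); return 1; }
    printf("cpu: %llx\n", (unsigned long long)cpu_stress(1UL << 28));
    printf("mem: %llx\n", (unsigned long long)mem_stress(buf, n, 4));
    free(buf);
    return 0;
}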

The ideal methods of stability testing vary by hardware platform, but I generally use both linpack (using very large datasets that fill all of the system RAM) and prime95.

BurnMMX is still a good test, but I have seen a system pass Prime95 and parallel burnMMX, only to quickly fail linpack.

Introduction

PC hardware is often unstable. Even machines built by major manufacturers are often unstable under extremely heavy loads, or when long-running, data-intensive operations are performed. Such problems are often hard to diagnose, manifesting only as random data corruption that may go unnoticed -- it is statistically uncommon for pointers or instructions to be corrupted, so silent data corruption happens much more frequently than program error. Fortunately, it is not difficult to stabilize most machines.

Sometimes it is hopeless

Not all hardware is of good quality; there are some processors, motherboards, chipsets, RAM, etc. that are simply of poor design, or that suffer from manufacturing defects or poor quality control. Sometimes these problems can be corrected by disabling performance features, raising voltage, or downclocking components -- but sometimes hardware just needs to be replaced. If you're unable to stabilize a machine, it's quite possible that some component is simply defective. In these cases, the best that can be done is to isolate the defective component by process of elimination (most often by severely underclocking other components, or by swapping in known-good components to isolate potential problems) and then to replace the defective component.

Method

There are many workloads used to test PC stability; one needs only to look at overclockers to notice that both the definition of stability and the methods used to assure it vary greatly from person to person. Therefore, it's necessary for me to define what I consider to be a stable system. I take an extreme viewpoint: a stable system is one with no perceptible random corruption or crashes that can be traced to hardware problems (driver problems, software bugs, etc. are separate issues). If any of my files fail checksums, or if any of my programs randomly crash, no matter how infrequently, I consider the system untrustworthy and not worth using. There is a simple reason for this: silent data corruption is very deadly to data over time. Data is copied many times, and even rare errors accumulate. Further, if a system is sufficiently unstable, even hash checks may spuriously fail or pass, meaning that it is not possible to trust whether data is valid, even when methods are used to assure data integrity. (Of course, it is often possible to move data to a known-stable machine and test it there, but that assumes one already *has* a known-stable machine, which does not remove the problem of assuring and creating stability in the first place!)

I've found one method of testing for stability that works better than anything else I've tried: multiple parallel sessions of burnMMX. I've seen few people recommend this approach, but in my experience, systems that can pass 24 or more hours of Prime95 or similar programs are often still unstable when subjected to extremely heavy, data-intensive loads (running multiple compiles in parallel is one such test). Therefore, I have largely abandoned Prime95 as a stability test in favor of burnMMX.

Burning in

First, obtain and compile burnMMX. Then run as many "burnMMX P" processes as are necessary to use all available RAM. My approach is to run enough processes that the HDD begins to swap. If swapping continues without end, I terminate processes until swapping stops but RAM remains nearly entirely used. Then, I suggest counting the number of burnMMX processes and allowing the PC to sit for 24h. Failure is defined as any burnMMX process exiting prematurely; burnMMX checksums all data after operations are performed and exits as soon as unexpected data is encountered. Extremely unstable machines will have burnMMX processes crash and exit, and will usually exhibit obvious instability in other applications as well. Most unstable machines fail quickly -- within 1-30m. Even seemingly stable machines may fail within 30m-8h. Such machines are unstable and do silently corrupt data, but not frequently enough to be perceived as unstable by many users and workloads. Any machine that can last 24h with this test is probably stable enough to perform all but the most critical tasks.
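
For illustration only, here is a hedged C sketch of the general shape of this kind of burn-in; it is not burnMMX itself. Several forked workers each fill a large buffer with a deterministic pattern, re-read it, and exit the moment a checksum mismatch appears, and the parent treats any worker exiting as a failure. NPROCS and CHUNK are placeholder values; as described above, they should be sized so that the workers together use nearly all available RAM.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define NPROCS 8                         /* placeholder: enough workers to nearly fill RAM */
#define CHUNK  (256UL * 1024 * 1024)     /* placeholder bytes per worker */

/* Each worker fills a buffer with a deterministic pattern, re-reads it,
 * and exits nonzero the moment the two checksums disagree. */
static int worker(void)
{
    size_t n = CHUNK / sizeof(uint64_t);
    volatile uint64_t *buf = malloc(CHUNK);   /* volatile: force real memory traffic */
    if (!buf)
        return 2;
    for (;;) {
        uint64_t sum1 = 0, sum2 = 0;
        for (size_t i = 0; i < n; ++i) {      /* write the pattern */
            buf[i] = i * 0x9e3779b97f4a7c15ULL;
            sum1 += buf[i];
        }
        for (size_t i = 0; i < n; ++i)        /* re-read and checksum */
            sum2 += buf[i];
        if (sum1 != sum2)
            return 1;                         /* corrupted data: fail immediately */
    }
}

int main(void)
{
    for (int i = 0; i < NPROCS; ++i) {
        if (fork() == 0)
            _exit(worker());
    }
    /* A healthy machine never gets past this wait(); any worker exiting
     * at all means corruption was detected. */
    int status;
    pid_t pid = wait(&status);
    fprintf(stderr, "worker %d exited with status %d: FAILURE\n",
            (int)pid, WEXITSTATUS(status));
    return 1;
}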

An astute reader may ask why so many processes are run in parallel, since it would seem that the CPU could be put under greater load by running a single process that consumed all CPU time, minimizing the number of context switches required. The reason is simple: a single process would indeed load the CPU more heavily, but in most cases, CPUs are much more reliable than RAM or chipsets. The extremely heavy memory and bus load created by the many burnMMX processes seems to be extremely effective at exposing unstable memory subsystems, which are often the source of instability that other methods of CPU stability testing miss.

How do I become stable?

Simply put: downclock, raise voltage (perhaps lower voltage in the case of RAM -- RAM is notoriously picky about voltage), back off on RAM or chipset timings, or buy new, higher-quality hardware as a last resort. I'm not going to detail the exact methodology, since the approach must come either from intuition gained through long experience or from a methodical process of elimination. Choose your approach and solve the problem.

Nicholas J. Kain  | n i c h o l a s | a t | k a i n | d o t | u s |