Friday, November 12, 2010

Enabling ECC memory in Linux without BIOS support

I build computers for reliability and low(er) power, and was doing so long before the somewhat recent green kick. In particular, I want ECC memory, and a lot of it, and a good power supply. I don't care about CPU speed or the video card. I like to leave my Linux box up for months, even a year, and ECC memory is necessary for this. I used to have to buy specific chipsets for Intel processors, but in the past 3 years I have chosen AMD processors largely because they support ECC. The AMD Athlon CPUs have a built-in memory controller, and it has supported unbuffered ECC RAM all this time. So any motherboard is largely fine... or so I thought.

I finally assembled my new system with a Phenom II X4 and a lovely Gigabyte GA-MA785GM-US2H motherboard with nice copper wiring and good capacitors. I chose this M/B since it has the latest AMD 785G video, and it supports DDR2, which was cheaper than DDR3 when buying ECC RAM (I've been buying Kingston ECC RAM, and for this system it was 8 GB of KVR533D2E4K2/4G since it was amazingly cheap). But this stupid mother ***** does not support ECC in the BIOS, which is a bit odd as the CPU talks to the memory directly. Apparently Gigabyte does not provide for this in their BIOS settings; see the response from Gigabyte in this thread: http://forums.amd.com/forum/messageview.cfm?catid=21&threadid=123883

I had the following fails:
  1. Running memtest86+ v4.10, the memory is not recognized as ECC. Argh.
  2. Flashing the latest BIOS for this M/B did not help. Argh.
  3. I tried adding the kernel boot parameter ecc_enable_override in GRUB, but that did not work. Argh.
To make a long story short, the solution is to load the Linux kernel module that handles ECC yourself, with an override that forces ECC on (run as root):

% modprobe -v amd64_edac_mod ecc_enable_override=1
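
The modprobe command above only lasts until the next reboot. One way to keep the option around is a modprobe.d entry, sketched below. This is an assumption on my part rather than something tested on this board: the function name write_edac_options and the file name amd64_edac.conf are made up for illustration, and whether the module is loaded automatically at boot depends on your distribution.

```shell
# Hypothetical helper: writes a modprobe.d options line so that
# ecc_enable_override=1 is applied whenever amd64_edac_mod is loaded.
# The default path is an assumed file name; any *.conf under
# /etc/modprobe.d should be read by modprobe.
write_edac_options() {
    conf="${1:-/etc/modprobe.d/amd64_edac.conf}"
    printf 'options amd64_edac_mod ecc_enable_override=1\n' > "$conf"
}

# Usage (as root):
#   write_edac_options
```

Newer kernels may name the module differently, so check lsmod for the exact name before relying on this.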
To verify that ECC was turned on, run:
% dmesg | grep -i edac
And you should see something like:

[ 658.399849] EDAC amd64_edac: Ver: 3.3.0 Sep 19 2010
[ 658.400082] EDAC amd64: This node reports that Memory ECC is currently disabled, set F3x44[22] (0000:00:18.3).
[ 658.400102] EDAC amd64: Forcing ECC checking on!
[ 658.400198] EDAC MC: F10h CPU detected
[ 658.400230] EDAC MC: DCT0 chip selects:
[ 658.400236] EDAC MC: 0: 1024MB 1: 1024MB
[ 658.400242] EDAC MC: 2: 1024MB 3: 1024MB
[ 658.400246] EDAC MC: 4: 0MB 5: 0MB
[ 658.400251] EDAC MC: 6: 0MB 7: 0MB
[ 658.400254] EDAC MC: DCT1 chip selects:
[ 658.400259] EDAC MC: 0: 1024MB 1: 1024MB
[ 658.400263] EDAC MC: 2: 1024MB 3: 1024MB
[ 658.400267] EDAC MC: 4: 0MB 5: 0MB
[ 658.400271] EDAC MC: 6: 0MB 7: 0MB
[ 658.400333] EDAC amd64: This node reports that DRAM ECC is currently Disabled; ENABLING now
[ 658.400339] EDAC amd64: Hardware accepted DRAM ECC Enable
[ 658.401685] EDAC MC0: Giving out device to 'amd64_edac' 'Family 10h': DEV 0000:00:18.2
[ 658.401731] EDAC PCI0: Giving out device to module 'amd64_edac' controller 'EDAC PCI controller': DEV '0000:00:18.2' (POLLED)
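
If you don't want to eyeball the log every time, the check can be scripted. The sketch below is only a rough illustration: check_edac_log is a made-up name, and it keys on the "Hardware accepted DRAM ECC Enable" line from the output above, whose wording may differ between driver versions.

```shell
# Hypothetical helper: reads kernel log text on stdin and reports
# whether the amd64 EDAC driver said the hardware accepted ECC.
check_edac_log() {
    log=$(cat)   # slurp the whole log from stdin
    case "$log" in
        *"Hardware accepted DRAM ECC Enable"*)
            echo "ECC enabled by driver" ;;
        *EDAC*)
            echo "EDAC messages found, ECC enable not confirmed" ;;
        *)
            echo "no EDAC messages" ;;
    esac
}

# Usage:
#   dmesg | check_edac_log
```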

The Linux modules that deal with ECC are labelled "edac" (Error Detection and Correction). Some other commands you can run are lsmod (to verify that the amd64 edac module is loaded) and dmidecode --type memory (to see how the BIOS is reporting memory, which shows non-ECC RAM in this particular case).
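
Once the EDAC driver is loaded, it also exposes error counters through sysfs, which is a handy way to see whether ECC is actually catching anything over those months of uptime. A minimal sketch, assuming the usual /sys/devices/system/edac/mc layout with ce_count (corrected errors) and ue_count (uncorrected errors) files; edac_counts is a name I made up for the example.

```shell
# Hypothetical helper: prints corrected/uncorrected error counts
# for each memory controller found under the EDAC sysfs directory.
# The base directory can be overridden with the first argument.
edac_counts() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip if no controller is registered (glob did not match)
        [ -r "$mc/ce_count" ] || continue
        printf '%s: corrected=%s uncorrected=%s\n' \
            "$(basename "$mc")" \
            "$(cat "$mc/ce_count")" \
            "$(cat "$mc/ue_count")"
    done
}

# Usage:
#   edac_counts
```

If nothing is printed, no memory controller has been registered with EDAC, which would suggest the module did not load or did not take over the device.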