Single Board Revolution: Preventing Flash Memory Corruption

An SD card is surely not an enterprise-grade storage solution, but single board computers aren't just toys anymore. You find them in applications far beyond the educational purposes they emerged from, and the line between non-critical and critical applications keeps getting blurred.

Laundry notification hacks and arcade machines fail without causing harm. But how about electronic access control, or an automatic pet feeder? Would you rely on the data integrity of a plain micro SD card stuffed into a single board computer to keep your pet fed while you're on vacation? After all, SD card corruption is a well-discussed topic in the Raspberry Pi community. What can we do to keep our favorite single board computers from failing at random, and is there a better solution to the problem of storage than a stack of spare SD cards?

Understanding Flash Memory


The special properties of Flash memory reach deep down to the silicon, where individual memory cells (floating gates) are grouped into pages (areas that are programmed simultaneously), and pages are grouped into blocks (areas that are erased simultaneously). Because entire blocks have to be erased before new data can be written to them, adding data to an existing block is a complex task: at a given block size (e.g. 16 kB), storing a smaller amount of data (e.g. 1 kB) requires reading the existing block, modifying it in cache, erasing the physical block, and writing back the cached version.

This behavior makes Flash memory (including SSDs, SD cards, eMMCs, and USB thumb drives) slightly more susceptible to data corruption than other read-write media: there is always a short moment of free fall between erasing a block and restoring its content.
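To make the cost of that cycle concrete, here is a toy model of a block-level read-modify-write in Python. The page and block sizes, and all names, are illustrative assumptions, not a real controller interface:

```python
PAGE_SIZE = 4096                  # assumed page size
PAGES_PER_BLOCK = 4               # 4 x 4 kB = one 16 kB erase block
ERASED = b"\xff" * PAGE_SIZE      # NAND erases to all-ones

def write_page(block, page_index, data):
    """Store one page, paying the full block-level erase/program cycle."""
    cache = list(block)                     # 1. read the whole block
    cache[page_index] = data                # 2. modify the copy in cache
    block[:] = [ERASED] * PAGES_PER_BLOCK   # 3. erase the physical block
    block[:] = cache                        # 4. write back the cached version

block = [ERASED] * PAGES_PER_BLOCK
write_page(block, 1, b"A" * PAGE_SIZE)
# Writing 4 kB of payload cost a 16 kB erase plus a 16 kB program pass.
```

Note the window between steps 3 and 4: if anything interrupts the device there, the block holds neither the old nor the new data.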

The Flash Translation Layer

The Flash Translation Layer (FTL) is a legacy interface that maps physical memory blocks to a logical block address space for the file system. Both SSDs and removable Flash media typically contain a dedicated Flash memory controller to take care of this task. Because individual Flash memory blocks wear out with every write cycle, this mapping usually happens dynamically. Referred to as wear-leveling, this technique causes physical memory blocks to wander around in the logical address space over time, thus spreading the wear across all available physical blocks. The current map of logical block addresses (LBAs) is stored in a protected region of the Flash memory and updated as necessary. Flash memory controllers in SSDs typically use more effective wear-leveling strategies than SD cards and therefore live significantly longer. During their regular lifetime, however, both may perform just as reliably.
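The remapping idea fits in a few lines. Below is a deliberately minimal sketch of dynamic wear-leveling — rewrites of the same logical address are steered to the least-worn free physical block. The class and its policy are invented for illustration; real FTLs are far more sophisticated:

```python
class TinyFTL:
    """Illustrative LBA-to-physical-block map with naive wear-leveling."""

    def __init__(self, physical_blocks):
        self.erase_counts = [0] * physical_blocks
        self.mapping = {}                       # LBA -> physical block

    def write(self, lba):
        used = set(self.mapping.values())
        free = [b for b in range(len(self.erase_counts)) if b not in used]
        # Steer the write to the least-worn free physical block.
        target = min(free, key=lambda b: self.erase_counts[b])
        self.mapping[lba] = target
        self.erase_counts[target] += 1

ftl = TinyFTL(physical_blocks=4)
for _ in range(6):
    ftl.write(lba=0)       # rewriting the same logical address...
    del ftl.mapping[0]     # ...releases its old physical block each time
# Six rewrites of one LBA end up spread across all four physical blocks,
# instead of hammering a single block six times.
```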

Upper IC: SD card controller (CC BY-SA 2.0 by Uwe Hermann)
USB flash drive controller (CC BY 2.0 by VIA Gallery)

Retroactive Data Corruption

Blocks on fire (CC-BY-SA 3.0 by Minecraft Wiki)

A write operation on Flash typically includes caching, erasing, and reprogramming previously written data. In the case of a write abort, Flash memory can therefore retroactively corrupt existing data entirely unrelated to the data being written.

The amount of corrupted data depends on the device-dependent block size, which can range from 16 kB up to 3 MB. This is bad, but the risk of encountering retroactive data corruption is also relatively low: it takes a highly unusual event to strike right between the erase and the reprogram cycle of a block. The problem is mostly ignored outside of data centers, but it is certainly a threat to critical applications that rely on data integrity.
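Extending the earlier toy model with an abort between the erase and the reprogram step shows the failure mode directly (again, a purely illustrative sketch, not real controller behavior):

```python
PAGE_SIZE = 4096
ERASED = b"\xff" * PAGE_SIZE

# Two pages of long-standing, unrelated data share one erase block.
block = [b"pet feeding schedule".ljust(PAGE_SIZE, b"\0"),
         b"door access codes".ljust(PAGE_SIZE, b"\0")]

def write_page_with_abort(block, page_index, data, power_fails):
    cache = list(block)
    cache[page_index] = data
    block[:] = [ERASED] * len(block)   # the whole block is erased...
    if power_fails:
        return                         # ...and the cache is never written back
    block[:] = cache

new = b"updated schedule".ljust(PAGE_SIZE, b"\0")
write_page_with_abort(block, 0, new, power_fails=True)
# Page 1 was never part of the write, yet its old content is gone too.
```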

Unexpected Power Loss

One of these cables is vital. (CC-BY 2.0 by Sparkfun)

The most likely cause of write-abort-related data corruption is an unexpected power loss, and Flash memory in particular does not take them well. Neither consumer-grade SSDs nor SD cards are built to maintain data integrity under an unsteady power supply, and the more often power losses occur, the higher the chance of data corruption. Industrial SSDs, preferably found in UPS-powered server racks, additionally contain impressive banks of tantalum capacitors (or even batteries), which buy them enough time to flush their large caches to physical memory in case of a power loss.

While laptops, tablets, and smartphones don't particularly have to fear running out of juice before they can safely shut down, SBCs are often left quite vulnerable to power losses. Looking at the wiggly micro USB jack and the absence of a shutdown button on my Pi, power loss is effectively built in. In conjunction with Flash memory, this is a real obstacle to achieving data integrity.

The Role Of The File System

File systems provide a file-based structure on top of the logical block address space and also implement mechanisms to detect and repair corruption of their own. If something goes wrong, a repair program will scan the entire file system and do its best to restore its integrity. Additionally, most modern file systems offer journaling, a technique where write operations are logged before they are executed. In the case of a write abort, the journal can be used to either restore the previous state or to complete the write operation. This speeds up file system repairs and increases the chance that an error can actually be fixed.

Unfortunately, journaling is not free. If every write operation were written to the journal first, the effective write speed would be cut in half while the Flash memory wear would be doubled. Therefore, commonly used file systems like HFS+ and ext4 only keep stripped-down journals, mostly covering metadata. It is this practical tradeoff that makes file systems a particularly bad candidate to stand in for data integrity after a failure in the underlying storage medium: they can restore integrity, but they can also fail. And they can't restore lost data.
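The journaling idea itself can be sketched in a few lines: log the intent, apply it, then clear the log, and on the next mount replay anything still in the log. The names and data structures below are invented for illustration and stand in for on-disk structures:

```python
journal = []                          # append-only intent log (on disk in reality)
metadata = {"file.txt": {"size": 0}}  # the "real" file system structure

def journaled_resize(name, new_size):
    journal.append((name, new_size))      # 1. log the intent first
    # -- a power loss here is recoverable: the journal holds the full record --
    metadata[name] = {"size": new_size}   # 2. apply to the real structure
    journal.clear()                       # 3. mark the transaction complete

def replay(journal, metadata):
    """On mount after a crash, redo any logged-but-unfinished operations."""
    for name, new_size in journal:
        metadata[name] = {"size": new_size}
    journal.clear()

# Simulate a crash between steps 1 and 2, then recovery on the next mount:
journal.append(("file.txt", 42))
replay(journal, metadata)
```

The doubled-write cost is visible here too: every update touches both the journal and the real structure, which is exactly why ext4 defaults to journaling only metadata.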

In the age of Flash memory, the role of the file system is changing, and it's about to absorb the functions of the FTL. The file system JFFS2 is prepared to directly manage raw NAND Flash memory, resulting in more effective wear-leveling and the avoidance of unnecessary write cycles. JFFS2 is commonly used on OpenWRT devices, but despite its advantages, SBCs that run on Flash media with an FTL (SD cards, USB thumb drives, eMMCs) will not benefit from such a file system. It is worth mentioning that the BeagleBone Black actually features a 512 MB portion of raw, directly accessible NAND Flash (in addition to its 4 GB eMMC), which invites experimentation with JFFS2.

The Pi Of Steel

To answer the initial question of how to effectively prevent data corruption on single board computers: the physical layer matters. High-quality SD cards perform better and live longer, especially in single board computers. Employing a larger SD card than the absolute minimum also adds a margin of spare blocks to make up for suboptimal wear-leveling.

The LiFePo4wered/Pi adds a power button and UPS to the Raspberry Pi.

The next step on the way to the Pi Of Steel should deal with unexpected power losses. Adopting a battery-based UPS will reduce them to homeopathic doses, and over at hackaday.io, Patrick Van Oosterwijck has worked out a great UPS solution to keep a Raspberry Pi alive at all times.

For some applications, this may still not be enough, and for others, the added cost and weight of a battery pack may not be practical. In those cases, there is really only one thing you can do: set the Pi's root partition to read-only. This practically gives you the power of an SBC with the reliability and longevity of a microcontroller.
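On a Debian-style Pi image, one common way to do this is in /etc/fstab: mark the boot and root partitions read-only and move the few paths that must stay writable into RAM. The device names below are the Raspberry Pi defaults and an assumption about your setup; adjust them, and the list of tmpfs mounts, to match your system:

```
# /etc/fstab -- root and boot mounted read-only, volatile paths in RAM
/dev/mmcblk0p1  /boot     vfat   defaults,ro    0  2
/dev/mmcblk0p2  /         ext4   defaults,ro    0  1
tmpfs           /tmp      tmpfs  nosuid,nodev   0  0
tmpfs           /var/log  tmpfs  nosuid,nodev   0  0
```

For maintenance, the root file system can be temporarily switched back with `mount -o remount,rw /` and returned to read-only with `mount -o remount,ro /` afterwards.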

Ultimately, a Flash cell in read-write mode can only be so reliable, and just by looking at the facts, I would think twice before employing SD-card-based single board computers in certain applications. But what do you, our readers, think? What's your strategy to keep your SD cards sane? Let us know in the comments!
