Notes on the NTFS on-disc format

I managed to destroy a 20G NTFS partition to the point that the Windows installer crashed when looking for previous installations on that partition, so I decided I had to fdisk and reinstall from scratch. But before this, I noticed that I had 18G free on my Linux partition, so took a compressed copy of the whole partition's raw data, by booting into Linux and (as root) issuing a series of commands like

dd if=/dev/hda1 count=1024 skip=3072 bs=1048576 | gzip -c > block03.gz

SuSE 7.3 is extremely unresponsive when doing very large disc operations in the background, but I hear this will be fixed in the Linux-2.6 kernel, whenever that emerges from its long development effort.

Of course, actually working with this back-up is not entirely trivial. I don't have enough spare disc space to restore it to a partition of its own; indeed, I don't have enough spare disc space to uncompress more than two of the one-gigabyte blocks at a time. So I couldn't use the NTFS functionality in the Linux kernel to work with it, and had to build my own.

Fortunately, I didn't want to do anything particularly complicated; my main aim was to recover a hundred or so Dungeon Siege saved games. These are single files, with names ending .dssave, and sizes of up to about three megabytes; the first eight bytes contain the text 'DSigTank', and a dword at offset 0x10 contains a number quite close to the length of the file, but smaller. So my first draft recovery program searched the entire backup for 'DSigTank' appearing at a sector boundary, read out the approximate length, increased it (I'd checked earlier that the game didn't object to files with garbage at the end, and discovered that it did sanity checks on all the files in its saved-game directory and only listed ones it could read; this made consistency checking much easier), read out the file-name which is stored within the file, in Unicode, at offset 0x1f8, and saved a contiguous block starting with the signature and of the appropriate length.

This recovered one file; my volume was clearly very fragmented, and there was only once space to store a whole file contiguously. If the one file had been from near my current position in the game, I'd have stopped at this stage. But it wasn't, so I needed to figure out how NTFS stores fragmented files. My main references for this were the source code in /usr/src/linux/fs/ntfs and the documentation on file recovery at ntfs.com.

MFT records for files

The main record on the disc of which files are where is called the MFT - Master File Table, presumably. It consists of a number of records, whose length is given in a table at the start of the partition but seems usually to be 1024 bytes. Records for files start, on a sector boundary, with the text FILE. I'm sure the location is also given in a table at the start of the partition; I just scanned through the whole disc for occurences of a couple of (Unicoded) known filenames in sectors beginning "FILE", and when I noticed that all the matches were in the third gigabyte, scanned through that entire block looking for (Unicoded) ".dssave".

The trickiest part of the recovery for me was something that the Linux source code calls fix-ups. The last two bytes of each 512-byte sector are not stored at the end of the sector, but in a separate table at the start of the FILE record; at the end of the sector, a magic code is stored. I'm not quite sure what kind of corruption this is supposed to detect and prevent, but were it not for the comments in /usr/src/linux/fs/ntfs/super.c I'd have been completely lost.

The address of the fix-up table within the FILE record is stored in bytes 4 and 5 of the record; the number of fix-ups is in bytes 6 and 7. The fix-up table contains a collection of 2-byte values; the first is the magic-code which is what's stored on disc at the end of each sector of the record, and the rest are the correct pair of final bytes for each sector. Load the record into memory and apply the fix-ups before proceeding; only the truly conscientious would check that the on-disc final bytes have the claimed value.

The full file name is very often stored at offset 0x168 from the start of the record; the format is one byte counting the number of unicode characters, a padding byte which seems to be 1 for Win32 filenames and 2 for DOS filenames (each file has two names in its FILE record, an 8.3 one stored earlier on and a long one), and then the appropriate number of two-byte characters.

The data is stored on disc as a series of "runs", which are described in a "run list" stored in the record. Just before this run list are stored three eight-byte values, which for all the files I looked at were

  1. The length of the file rounded up to a multiple of 0x1000 bytes.
  2. The exact length of the file
  3. The exact length of the file, again

Also, in all the files I looked at, these eight bytes were all aligned on an eight-byte boundary. So I ended up finding the run list as the first occurence, after the filename, of roundup(N), N, N in three consecutive aligned eight-byte words. This worked for me. I have no idea where the run list would be stored if it were more than a kilobyte long and spanned several sectors.

The structure of the run list

I'm going to give this by example. Say the file consisted of 160 blocks: 100000 -> 100129, 101035 -> 101059, 101005 -> 101009. We compute the differences between the starts of consecutive blocks: 100000, 1035, -30, and the block lengths: 130, 25, 5. 100000 is >2^15 so coded in three bytes, 1035 is coded in two, and -30 in one. 130 is >2^7 so coded in two bytes (since these are signed values). Each run is encoded by one byte whose high nibble is the number of bytes encoding the offset and whose low nibble the number of bytes encoding the run length. So the run list would be (notice I've marked lengths in blue and offsets in orange)

32
82 00 A0 86 01 21
19 0B 04 11
05 E2

Thankfully, NTFS uses a flat addressing system for blocks, so block 0x1086A0 starts 0x1086A0000 bytes from the start of the disc.

Some efficiency notes

The absolutely critical thing, given that I stored each one-gigabyte block separately and it took my computer a couple of minutes to decompress one block, was to avoid decompressing any block more than once. So I construct a blockset type, comprising a length and an offset; as I read the directory, I record every blockset encountered in two places:

vector<blockset> full_list;
map<string, vector<blockset> > file_blocks;
vector<string> sane_files;
map<string, long long> last_blocks;
map<long long, string> terminal;

full_list.push_back(block);
file_blocks[filename].push_back(block);
if (last_block) last_blocks[filename] = offset_of_last_block;

where I add a filename to sane_files if my sanity check -- that the total number of clusters in the list of blocksets equals the length given in the file descriptor -- passes.

After reading the directory, we do

for (int i=0; i<sane_files.size(); i++)
  terminal[last_blocks[sane_files[i]]] = sane_files[i];
sort(full_list.begin(), full_list.end());

where operator< for blocksets simply compares the offsets of the first block. Then we run through full_list loading whole blocksets into memory; if terminal[j] is non-empty for some start-block j, I run through all the blocksets in file_blocks[terminal[j]] and write them to a file named terminal[j].

The result was an almost total success - one file lost because it crossed a gigabyte boundary and I didn't bother handling that correctly - and took less than a quarter as long as it would have to fight back from the Farmlands to the Goblin Inventor's Cave. It's nothing like complete, but I hope it gives anyone with NTFS problems a place to start.