File Systems

Overview

Virtual File System Layer
A Brief Unix/Linux/macOS Example
Directories

Case Study: ext2/ext3/ext4 Filesystem
More inode Details
Extents
References

Overview

Files are a collection of bytes stored on secondary storage, but the question is:

How are they stored and accessed?

The answer is determined by the particulars of the "file system".
A file system provides methods to:

create and delete files
allocate and free blocks of bytes to files.
assign names to files.
organize files into directories.
read and write contents of files.
and lots of other things...

There are probably 100 different types of file systems
Linux (by default) supports more than 40 types!
Some popular file systems:

Windows systems:

FAT (old DOS, smaller USB drives)

8.3 file names
No permissions
RASH attributes (Use the attrib command to view them.)

Try this command: attrib c:\*.* and see what you get.

C:\>attrib c:\*.*
A  SH                C:\DumpStack.log
A  SH                C:\DumpStack.log.tmp
A  SH   I            C:\hiberfil.sys
A  SH                C:\pagefile.sys
A  SH                C:\swapfile.sys

FAT32 (32-bit version for Windows, larger USB drives)

FAT12, FAT16, FAT32, VFAT
Extensions (e.g. longer filenames, larger files)

NTFS (Modern Windows)

Completely different
Very powerful and modern: compression, quotas, journaling, encryption, etc.
All NT-based systems, i.e. NT, 2000, XP, Vista, 7, 8, 10, 11, etc.

ReFS Resilient FS (Future Windows)

Implements a subset of NTFS
Early versions don't support compression, quotas, or encryption
Designed for very large files and huge number of files.
Designed "not to fail"

Linux:

ext2 (second extended file system, standard until several years ago, no journaling; still good for flash drives)
ext3 (ext2+journaling)
ext4 (ext3+extents)
Plus about 40 others including FAT and NTFS
Other popular file systems:

Btrfs B-tree file system using copy-on-write, created by Oracle Corporation for Linux.
JFS Journaled File System, created by IBM for their AIX operating system.
ReiserFS A general-purpose journaled file system created by Hans Reiser.
XFS Extents File System, a high-performance journaling file system created by Silicon Graphics for their IRIX operating system.
ZFS Zettabyte File System, created by Sun Microsystems and so named because it could store up to 2⁷⁰ bytes (about 10²¹ bytes). ZFS focuses on data integrity.
All of these file systems are supported on Linux.

Mac OS X:

APFS (macOS, successor to HFS+)
HFS+ (previously used in Mac OS X, successor to HFS)
Linus Torvalds has an opinion about HFS+. (Caution: Linus uses a lot of colorful words in his opinions!)

Common systems:

CD (ISO 9660)
DVD file systems (Universal Disk Format, UDF)
USB mass storage devices
Others (lots of others)

List of file systems.
Comparison of file systems.

Flash-friendly file systems

F2FS - A new filesystem (developed by Samsung) that is targeted at flash-based media (flash drives, SSDs, SD cards, etc.)
Many optimizations in filesystems are done to deal with head movement and rotational latency, which don't exist in solid-state media.
More detailed information:

Optimizing Linux with cheap flash drives - Introduces F2FS.
An F2FS teardown - Lots more details about the filesystem.

Some features of advanced file systems: (these used to be "add-ons" to some filesystems)

copy-on-write
transparent compression
encryption
built-in RAID
defragmentation
transactions and roll back
variable block sizes
snapshots

Some very simple benchmarks (using rsync and this script) on an SSD: (lower times are better)

The four tables below show, in order:

Copying a 2 GB file (2,147,483,648 bytes) on an SSD
Copying ~500 MB of files (Linux kernel 3.2.34, ~40,000 files) on an SSD
Copying a 16 GB file using cp (NVMe)
Copying a 16 GB file using rsync (NVMe)

          Time   MB/s
---------------------
xfs       0:19   110
btrfs     0:19   110
hfs+      0:19   110
ext4      0:21    95
ext2      0:22    95
jfs       0:24    87
FAT32     0:25    84
reiserfs  0:25    84
ext3      0:27    78
NTFS      0:40    53
zfs       0:47    45

          Time   MB/s
---------------------
btrfs     0:05    79
ext4      0:07    58
hfs+      0:12    35
FAT32     0:14    30
reiserfs  0:14    30
jfs       0:18    25
zfs       0:24    18
ext3      0:25    18
ext2      0:25    17
xfs       0:27    15
NTFS      2:23     3

          Time     MB/s
------------------------
xfs       0:10.1   1700
btrfs     0:10.6   1616
hfs+      0:13.2   1302
ext4      0:14.9   1153
ext3      0:16.9   1014
jfs       0:17.3    996
ext2      0:17.4    987
reiserfs  0:21.5    799
NTFS      1:35.1    181
FAT32     File is too big

          Time     MB/s
-----------------------
xfs       0:51.1   336
hfs+      0:52.0   331
btrfs     0:52.9   325
ext2      0:54.3   317
jfs       0:54.8   313
ext4      0:56.0   307
ext3      0:56.2   306
reiserfs  0:59.2   290
NTFS      2:26.5   117
FAT32     File is too big

ZFS is not part of the script. This was done manually for experimentation using the defaults. Surprisingly (or maybe not), when compression was turned on, the time was about 22.5 seconds to copy the Linux kernel source. In other words, it was slightly faster to compress the data before writing it to disk than just writing uncompressed data. Why do you think that is?
Also, ZFS performs poorly here because I'm using it in FUSE mode (Filesystem in User Space), which has a lot of overhead. A native-Linux port is in progress and will hopefully be widely available soon.

Here are some times required to create a filesystem on a 1 TB external USB3 spinning hard drive. (The output from this script.)

###########################################################################################################################
# Filesystem    Time (secs)       Total bytes             Used              Available             In use (gparted)        #
#-----------------------------------------------------------------------------------------------------------------------  #
#   btrfs          0.452       1,000,203,091,968        17,301,504       998,024,937,472      16.50 MiB       17,301,504  #
#   hfs+           2.214       1,000,203,091,968       104,984,576     1,000,098,107,392     100.12 MiB      104,983,429  #
#   ntfs           2.544       1,000,203,087,872        98,095,104     1,000,104,992,768      93.55 MiB       98,094,284  #
#   f2fs           2.779       1,000,202,043,392    49,994,014,720       947,992,387,584       Unknown         Unknown    #
#   jfs            5.756       1,000,038,141,952       122,331,136       999,915,810,816     273.97 MiB      287,278,366  #
#   ext4           9.814         984,373,075,968        75,124,736       934,271,021,056      14.81 GiB   15,902,116,414  #
#   xfs           11.418         999,714,713,600        35,028,992       999,679,684,608     499.16 MiB      523,407,196  #
#   fat32         18.326         999,958,937,600            32,768       999,958,904,832     232.88 MiB      244,192,378  #
#   reiserfs      79.393       1,000,172,560,384        33,628,160     1,000,138,932,224      61.19 MiB       64,162,365  #
#   ext3         413.841         984,373,075,968        75,259,904       934,287,663,104      14.81 GiB   15,902,116,414  #
#   ext2         416.520         984,507,293,696        75,124,736       934,422,016,000      14.69 GiB   15,773,267,394  #
###########################################################################################################################

Over the decades, researchers have discovered a few "truths" that are common to all systems (as of about 2012?)

"Truths" (T) Data (D)

T: Most files are small D: Roughly 2K is the most common size

T: The average file size is growing D: Almost 200K is the average size

T: Most bytes are stored in large files D: A few big files use the most space

T: File systems contain lots of files D: Almost 100K on average

T: File systems are roughly half full D: Even as disks grow, file systems remain 50% full

T: Directories are typically small D: Many have few entries; most have 20 or fewer

Here are some questions for you to answer about your own systems:

What is the largest file on the system?
How many regular files are present?
How many directories?
What's the average file size?
What's the average directory size?
What percentage of the filesystem is in use?

How would you go about finding the answers to these questions?

Virtual File System Layer

Every file system type provides its own API (Application Program Interface) for operations.

And they're all different!

But we (programmers and users) don't want to deal with those differences.
We want a file to be a file (a collection of bytes) anywhere with a few generic operations (open/read/write/seek/close).
To tame this potential for chaos, the OS provides one generic interface

to ALL files (text/binary, source code, photos, databases, etc.)
on ANY file system (FAT, NTFS, APFS, ext4, etc.)
on ANY device (Hard drive, SSD, NVMe, CD/DVD, floppies, tape, etc.)

We'll call that interface the "Virtual File System".

Operating System Concepts - 8th Edition Silberschatz, Galvin, Gagne ©2009

A top level API presents a unified view of files and operations on files. (open, read, write, etc.)
A "virtual file system" level will translate operations into specific operations for any type of file system.
As we've seen, there are many file system types.

Hardware view of a file

A file is a sequence of blocks.

A block is usually a fixed length, often between 512 and 8192 bytes (powers of 2).

Operations are all block operations: (The smallest read/write)

Allocate or free blocks for a file's sequence of blocks.
Read block X.
Write block X.

But programmers want:

byte oriented, not block oriented files
sequential access (both reads and writes)
random access (both reads and writes)

Think about how memory pages work.

Software view of a file

A file is (logically) a sequence of bytes.
An open file:

Has one or more blocks cached in memory by the OS
Has a current position byte pointer.
Reads, writes, seeks, ... act at the current position on the "in memory" cached blocks.
The OS takes care of reading/writing blocks from/to the hard disk.

The current position is a property of an open file, not of the file itself.

The function call hierarchy for writing:
The user could have simply called printf: printf → fprintf → puts → fwrite → etc...
The function call hierarchy for reading:

Various filesystems: FAT, FAT32, NTFS, APFS, ext4, XFS, Btrfs, ZFS, etc.
Various storage devices: Hard drives, floppies, SSD, NVMe, CD/DVD, tape, etc.

Structure of a file

Somewhere on disk is an FCB (File Control Block), which is a structure containing various pieces of file information.

Unix stores this info in a structure called an inode. (struct inode from the Linux source: include/linux/fs.h)

You can retrieve information about the inode using the stat command.

Windows stores this info as an entry (a row) in the Master File Table (MFT).

The contents of a file are stored in one or more (often MANY more) blocks on the disk.
The FCB must provide some way to record and access those blocks.
There are various ways:

Contiguous allocation - generally a bad idea (array-like behavior)

Growing a file will run into a limit.
Deleting or shrinking a file will fragment the disk.
Size must be declared when created.
Requires a scheme to compact free space (otherwise there is external fragmentation).
However, the advantages are efficient access to a file's contents (random access) and easy disk management.
Kind of like arrays in C/C++. (One-time allocations, contiguous, random access, can't grow)

Linked allocation - generally works well (linked-list-like behavior)

Files can grow or shrink with almost no limit.
File size need not be declared when created.
One disadvantage is that direct access to ANY block is not possible (no random access); only sequential access is supported (just like linked lists).
Some implementations don't need an ending block number because the next block pointer in that block won't point to anything.

However, having a "tail pointer" may make appending to the file faster. (You don't have to walk the entire list looking for the end.)

Think about singly-linked lists.

Indexed allocation - put all pointers together into an index block.

All the advantages of linked allocation plus direct access to any block.
Allows random access to any block/byte of the file when using fixed-size blocks.

A Brief Unix/Linux/macOS Example

An inode has a fixed size and contains all of the information (metadata) about a file.

It essentially includes everything related to the file except for the filename and the actual contents of the file.

It has room to reference a fixed (small) number of data blocks. (Direct pointers to data)
Then it uses a pointer to an index block of pointers to data blocks. (Single-indirection, pointer to pointer to data)
Then it uses a pointer to an index block of pointers to index blocks of pointers to data blocks. (Double-indirection, pointer to pointer to pointer to data)
Then it uses a pointer to an index block of pointers to index blocks of pointers to index blocks of pointers to data blocks. (Triple-indirection, pointer to pointer to pointer to pointer to data)
Simplified view of an inode and its data blocks: (struct inode)

Typically, there are 15 pointers in the inode: 12 direct pointers plus 1 single-, 1 double-, and 1 triple-indirect pointer.
The size of the largest file is dependent on the size of a block and the size of a block pointer.

For example, having 4K (4096 bytes) blocks and 8-byte block pointers, the number of pointers in a block is 4096 / 8 = 512.

Having "only" triple indirection puts size limits on the file system.

Another view: (annotated view)

Note: The multiple levels of indirection shown above are also how B-Trees (a tree-like data structure for very large data sets) work. Many filesystems are implemented using B-Trees or similar data structures that use extents for more efficiency.

The relationship between directory entries, inodes, and data blocks:

You can think of the directory entries as the Table of Contents of the file system. This is how the filesystem "looks up" the file by name and then follows the pointer (12345 in the example) to get to the metadata (inode), which leads to the data blocks.

The size of a pointer and the size of the disk blocks (either blocks of pointers or blocks of data, as they are the same) determine the maximum size of the disk (filesystem) as well as the maximum size of a file. Given this information and the sizes below, answer the question:

What is the maximum size of a file?

Assume these sizes:

Pointer size	Block size	Max filesystem size^*	Max file size
4 bytes	2,048 (2K)	Depends	???
4 bytes	4,096 (4K)	Depends	???
8 bytes	4,096 (4K)	Depends	???
8 bytes	8,192 (8K)	Depends	???

^* This value depends on how many inodes the filesystem has and is sometimes determined when the filesystem is created.

Every file in the system has a number of pointers to its data blocks. Find what the maximum number of pointers is (for a file) and then multiply that by the size of a disk block. That gives the size of the largest file. For a simplistic example, if you had a maximum of 1,000 pointers to data blocks, and each data block was 4,096 (4K) bytes, then the largest file would be:

1,000 x 4,096 = 4,096,000 bytes

It's just simple ~~math~~ arithmetic. This is why you need to know the size of a pointer and the size of a disk block, because this tells you exactly what the maximum number of pointers can be, and hence, the maximum file size.

Self-check - Given all of this information, answer this question: "What is the maximum number of files that a filesystem can hold?" The answer is not simply a number, it's an explanation. Think of it like this: "How many files of zero length can the filesystem hold?" That will give you the answer. (Hint: It's not unlimited or infinite!)

Bonus: What is the command in Linux that will tell you this information?

Self-check - With multiple levels of indirection, filesystems can be implemented efficiently for fragmented files. However, for non-fragmented (i.e. contiguous files), this approach is not very efficient. Explain why that is and how a better method can be used.

Self-check - For very small files (just a few bytes), there is a lot of overhead necessary to keep track of it using this scheme. Can you think of a simple optimization that could reduce the overhead for files that are very small, say, less than 100 bytes? Many systems have many very small files and we call them symbolic links or shortcuts.

For reference, this is somewhat related to how using a doubly-linked list to keep track of a single character causes a lot of overhead. Essentially, with 8-byte pointers, each node in the list would require 24 bytes just to hold the single character (plus 2 pointers and padding/alignment) That's essentially 96% overhead!

A simple filesystem implementation.

Directories

A directory is a file just like any other "plain" file.

It has bytes stored in scattered blocks (like any other file).
It is described by an FCB (like any other file).
However, it is marked as type "directory" rather than "plain file", so it responds to special directory operations rather than read/write operations.

Logically, a directory file contains a list of pairs of file name and file info. (In reality, there is a little more information than just this.)

filename1: pointer to a FCB (inode or MFT entry)
filename2: pointer to a FCB (inode or MFT entry)
The "files" in the diagram below represent the inode. (The inodes point to the actual data blocks on the disk.)

A hierarchical view of a directory structure:

A logical (and simplified) view of the bin directory shown above (with arbitrary inode values):

Name      inode    Type
count     1345     d
find      656323   d
hex       87343    d
reorder   856      d

You can think of a directory as kind of a table of contents or index. In fact, that's kind of the definition of a directory. You've probably seen these directories inside of buildings. You walk in the front door, and the first thing you see (on a wall or sign) is a directory of people/companies in the building. This directory lists the name of the person/company and then a room number (maybe with a floor number, as well). It allows you to quickly and easily find the person you are looking for. The room could be a small closet (small file) or a 10,000 sq. ft. office (large file).

There is an API for reading directory entries and it is similar to reading from a regular file:

opendir - Opens a directory for reading.
readdir - Reads the next directory entry
closedir - Closes the directory.

Here's an example of a simple program that will open a directory, read all of the entries and print out some information about each one. This structure is from the man page:

struct dirent 
{
  ino_t          d_ino;       /* Inode number             */
  off_t          d_off;       /* Not an offset; see below */
  unsigned short d_reclen;    /* Length of this record    */
  unsigned char  d_type;      /* type of file;            */
  char           d_name[256]; /* NUL-terminated filename  */
};

#include <stdio.h>  /* printf                     */
#include <dirent.h> /* opendir, readdir, closedir */
#include <stdlib.h> /* exit                       */

/* For human-readable names */
char *ENT_TYPE[] = {" UNK", "FIFO", "CHAR", "", " DIR", "", 
                    " BLK",     "", " REG", "", " LNK", "", "SOCK"};

int main(int argc, char **argv)
{
  char *dirname = "."; /* Default directory to process       */
  DIR *dir;            /* Like a FILE *, but for directories */
  struct dirent *ent;  /* A directory entry is a file        */
  int count = 0;       /* Number of files processed          */

    /* Optional directory to process, defaults to cwd */
  if (argc > 1)
    dirname = argv[1];

    /* Open the directory */
  dir = opendir(dirname);
  if (!dir)
  {
    perror(dirname);
    exit(1);
  }

    /* Read each directory entry (file) and print out some info */
  while (1)
  {
      /* Get next file */
    ent = readdir(dir);
    if (!ent)
      break;

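      /* Casts make each argument match its format specifier; the type */
      /* lookup is bounds-checked in case d_type is past the table.    */
    printf("%4i: inode: %12lu, offset: %20lu, len: %2hu, type: (%2i) %s, name: %s\n", 
           ++count,
           ent->d_ino == (ino_t)-1 ? 0UL : (unsigned long)ent->d_ino, /* inode number */
           (unsigned long)ent->d_off, /* for internal use        */
           ent->d_reclen,         /* length of this record   */
           ent->d_type,           /* file type (numeric)     */
           ent->d_type < sizeof(ENT_TYPE) / sizeof(ENT_TYPE[0])
             ? ENT_TYPE[ent->d_type] : " UNK", /* file type (text) */
           ent->d_name            /* NUL-terminated filename */
          );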
  }

    /* Done */
  closedir(dir);

  return 0;
}

Most of the information is usually not relevant to application programs. The filename and type are useful. Once you have that information, you can call the stat function to retrieve all of the other information (e.g. file size, permissions, date/time stamps, etc.)

Case Study: ext2/ext3/ext4 Filesystem

The first filesystem developed specifically for Linux was the ext filesystem or extended filesystem, which was based on the Unix filesystem (a.k.a. the Berkeley Fast File System or FFS). Then, the ext2 filesystem enhanced ext further with more features from the FFS.

Next came the ext3 filesystem which added more improvements, especially journaling. After that came the ext4 filesystem, which added several more improvements, most notably, extents. Because the data structures (for the most part) have been compatible between the three filesystems (and we aren't interested in the other features yet), talking about ext4 will be very similar to discussing the structure of ext2/ext3 systems.

The ext4 filesystem is a very stable and mature filesystem used by many Linux distributions. It's not the best (if there exists a "best" filesystem) or fastest or the most feature-rich filesystem, but it's fairly efficient and fairly straightforward to understand and implement (if you're an operating systems implementer). Many more powerful/complex filesystems share attributes with ext4. By understanding the basics of this filesystem, you'll be more likely to understand how other file systems work and what they have done to improve upon ext4.

So, with that said, let's see just how much work the filesystem must do in order to simply display the contents of a simple text file. We'll use this reference system for the demonstration:

chico@nina ~ $ ls -l /
total 258,048
drwxr-xr-x   2 root root  4,096 Apr  9  2019 bin
drwxr-xr-x   3 root root  4,096 Apr  9  2019 boot
drwxr-xr-x   2 root root  4,096 Aug 23  2015 cdrom
drwxr-xr-x  17 root root  4,640 Oct  1 11:56 dev
drwxr-xr-x 213 root root 12,288 Oct  8 13:17 etc
drwxr-xr-x  10 root root  4,096 Oct  8 13:20 home
drwxr-xr-x   8 root root  4,096 Oct  8 13:20 homes
drwxr-xr-x  27 root root  4,096 Apr 16  2019 lib
drwxr-xr-x   2 root root  4,096 Apr  9  2019 lib32
drwxr-xr-x   2 root root  4,096 Apr  9  2019 lib64
drwxr-xr-x   2 root root  4,096 Apr  9  2019 libx32
drwxr-xr-x   2 root root 16,384 Feb 18  2017 lost+found
drwxr-xr-x   6 root root  4,096 Jul  8  2018 media
[several more lines removed . . .]
chico@nina ~ $

We're going to focus on the user named chico. We will search for and display a text file in his own home directory which is /homes/chico. Let's see what we have in the homes directory.

chico@nina ~ $ ls -l /homes
total 24,576
drwxr-xr-x 2 alvin    alvin    4,096 Oct  8 13:20 alvin
drwxr-xr-x 2 betty    betty    4,096 Oct  8 13:20 betty
drwxr-xr-x 8 chico    chico    4,096 Oct  8 13:20 chico
drwxr-xr-x 2 fred     fred     4,096 Oct  8 13:20 fred
drwxr-xr-x 2 veronica veronica 4,096 Oct  8 13:20 veronica
drwxr-xr-x 2 wilma    wilma    4,096 Oct  8 13:20 wilma
chico@nina ~ $

Note: On a typical Linux system, a user's home directory is in the /home (singular) directory. However, for this example (and for technical reasons), I've created some "artificial" users in /homes (plural) which will make the details a little easier to explain and understand. Just keep that in mind if you're trying to find a /homes directory on your system as it's unlikely to exist.

Let's see what's in chico's directory using the tree command:

chico@nina ~ $ tree /homes/chico
/homes/chico
├── bathroom
├── bedroom
├── garage
└── kitchen
    ├── cupboards
    ├── microwave
    ├── oven
    ├── refrigerator
    │   ├── apples
    │   ├── butter
    │   ├── cake
    │   ├── cheese
    │   ├── chicken
    │   ├── coke
    │   ├── eggs
    │   ├── juice
    │   ├── milk
    │   └── pie
    ├── sink
    └── stove

10 directories, 10 files
chico@nina ~ $

The file we're interested in is cake. The full path to cake is:

/homes/chico/kitchen/refrigerator/cake

and the command that we will use to display the contents:

cat /homes/chico/kitchen/refrigerator/cake

and the output:

eggs
butter
milk
flour
vanilla
icing
strawberries
peaches
lettuce
asparagus

Which is presumably all of the things that are in the cake! (Don't knock it until you've tried it!)

Note: I'm attempting to create an analogy/metaphor here. In chico's home (directory) there is a kitchen (directory), and in the kitchen there is a refrigerator (directory) and in the refrigerator there is a cake (file) that contains ingredients (lines of text).

So, the question is, "How many disk reads are required to locate (search), open (read), and display the file?" To answer that question, this is how we proceed.

We have to locate the root directory, / (the forward slash). All files are within the root directory.
Then, we search the root directory looking for a directory called homes.
Next, we search the homes directory looking for a directory called chico.
Next, we search the chico directory looking for a directory called kitchen.
Next, we search the kitchen directory looking for a directory called refrigerator.
Next, we search the refrigerator directory looking for a file called cake.
Then, we open the file called cake and read in all of the data.
Finally, we display the data on the screen.

The first seven steps all require disk reads. That, in a nutshell, is how we would display the contents of the file. As you can see, the longer the path is, the more work is required by the filesystem. So, as you can imagine, locating the file:

/usr/hostname

is going to require significantly less work than locating this file:

/usr/share/icons/foo/bar/baz/bat/one/more/dir/and/were/done/file.txt

The hostname file above only requires searching the root directory (/) and the usr directory. The file.txt requires searching the root directory and 13 other directories before getting to file.txt! That's a lot of work that must be done every time you access that file. Fortunately for the users, it's all hidden behind the filesystem.

The ext4 filesystem accomplishes this work using inodes and data blocks that were described above. Let's go through this step-by-step to see exactly what is going on. I'm going to use real data from one of my systems to show this process.

As you can imagine, there are a bunch of tools on a Linux system that will help us peer into the filesystem's data structures (inodes) and disk blocks. The first and simplest command is our trusty ls command. If you run ls -ld / (on the root directory), it will display something like this:

drwxr-xr-x 29 root root 4,096 Oct  9 13:00 /

If we add -i to the command, it will also show us the inode that contains the information about the root directory:

ls -ldi /

Output:

2 drwxr-xr-x 29 root root 4,096 Oct  9 13:00 /

This tells us that the root directory's inode is inode #2. By the way, the -d option tells ls to just show information about the directory itself, not the contents of the directory. Without that option, ls lists the contents of the directory instead.

Another way we could have found the inode is with the stat command:

stat /

Output:

  File: '/'
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d	Inode: 2           Links: 29
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2019-05-18 12:13:53.741950988 -0700
Modify: 2020-10-09 13:00:01.918312997 -0700
Change: 2020-10-09 13:00:01.918312997 -0700
 Birth: -

There's a lot of other information displayed as well, but for now, we're just concerned with the inode. (The IO Block: 4096 is also important as it tells us how big each logical disk block is.)

OK, so we have the inode, but where on the disk is that inode? This is where the next tool comes in handy. It's called debugfs and it's used to help debug (or simply glean information about) the ext2/ext3/ext4 filesystems. This is the command that will map the inode number into a disk block:

sudo debugfs -R 'imap <2>' /dev/sda1

This command essentially runs debugfs and tells it to map inode #2 to its corresponding disk block on /dev/sda1, which is the first partition on the first hard drive in the system. If you want to see all of the partitions on all of the drives, just run the lsblk command and you'll see something like this:

NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 931.5G  0 disk 
├─sda1   8:1    0  39.1G  0 part /
├─sda2   8:2    0  15.6G  0 part 
├─sda3   8:3    0  39.1G  0 part 
├─sda4   8:4    0     1K  0 part 
├─sda5   8:5    0 781.3G  0 part /home
├─sda6   8:6    0    41G  0 part /opt
└─sda7   8:7    0  15.5G  0 part [SWAP]
sdb      8:16   0   3.7T  0 disk 
└─sdb1   8:17   0   3.7T  0 part /storage
sdc      8:32   0   3.7T  0 disk 
└─sdc1   8:33   0   3.7T  0 part /media/chico/wd-elements1
sdd      8:48   0   3.7T  0 disk 
└─sdd1   8:49   0   3.7T  0 part /media/chico/wd-elements3
sde      8:64   0   3.7T  0 disk 
└─sde1   8:65   0   3.7T  0 part /media/chico/wd-elements2
sr0     11:0    1  1024M  0 rom

The partition that we're interested in is sda1, which is the first partition on the first disk (mounted at /).

This output is also telling me that there are 6 "disks" connected to my computer named sda, sdb, sdc, sdd, sde, and sr0 (which is a DVD drive). It also tells me that the first drive has 7 partitions and each of the other hard drives has only one. Incidentally, these are the types of storage devices in the system:

sda - This is a 1 TB internal solid state mSATA drive.
sdb - This is a 4 TB internal solid state drive (SSD).
sdc - This is a 4 TB external USB drive.
sdd - This is a 4 TB external USB drive.
sde - This is a 4 TB external USB drive.
sr0 - This is an external USB DVD reader/writer.

OK, so back to the command:

sudo debugfs -R 'imap <2>' /dev/sda1

and its output:

debugfs 1.42.9 (4-Feb-2014)
Inode 2 is part of block group 0
	located at block 1057, offset 0x0100

The important information is the last line which tells us that inode #2 is located 256 bytes (0x0100) within disk block #1057. Now, all we have to do is to read the data at that location and we will have read all of the important information about the root directory.
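
To see where the offset 0x0100 comes from, here is a minimal sketch of the arithmetic that a tool like debugfs has to do. The two constants marked "assumed" are not shown anywhere in this section; they are guesses that happen to agree with the debugfs output here, and the real values live in the filesystem's superblock. The function name locate_inode is made up for this illustration.

```python
# Sketch of the arithmetic behind "imap". The first two constants are
# ASSUMPTIONS inferred from the debugfs output in this section; the real
# values are stored in the filesystem's superblock.
INODES_PER_GROUP = 8192   # assumed
INODE_SIZE = 256          # assumed (bytes per on-disk inode)
BLOCK_SIZE = 4096         # from "IO Block: 4096" in the stat output


def locate_inode(inode_number):
    """Return (block group, block within that group's inode table, byte offset)."""
    index = inode_number - 1                 # inode numbers start at 1
    group = index // INODES_PER_GROUP
    byte_in_table = (index % INODES_PER_GROUP) * INODE_SIZE
    return group, byte_in_table // BLOCK_SIZE, byte_in_table % BLOCK_SIZE


group, table_block, offset = locate_inode(2)
print(group, table_block, hex(offset))   # 0 0 0x100
```

Inode #2 is the second inode in group 0, so it sits 256 bytes into the first block of that group's inode table. Turning "first block of the inode table" into an absolute block number (1057 here) needs the table's starting block, which is stored in the group descriptors and is not computed by this sketch.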

To help out with my demonstration, I've written my own program that will read any blocks or partial blocks of data from any partition on any device. It's called readblock and you use it like this:

sudo readblock <partition> <block-number> <offset> <bytes-to-read>

So, to read the raw bytes from inode #2, we do this:

sudo readblock /dev/sda1 1057 0x0100 256

Broken down:

sudo - must be run as root because this is a huge security risk as it allows you to read any file on the system, including other users' files.
readblock - program itself
/dev/sda1 - the partition to read
1057 - the disk block to read
0x0100 - the offset into the disk block (0x0100 is 256 decimal)
256 - how many bytes to read

Note: The readblock program is a work-in-progress. It currently reads the device to find out the size of the disk blocks. Generally, the size of the blocks is 4K (4,096) bytes, which is true for all of my partitions. It is important to have the correct block size because that value is used in all of the calculations. The program also allows the user to specify values in hexadecimal (0x prefix) or decimal.

Note: There are existing tools on Linux that will do something similar to my readblock program. However, I wanted to have total control over the output, so I wrote my own. It's only a few lines of code, actually. One such tool on Linux is dd. Very handy, powerful, and, dangerous! Read up on it before using it! YOU HAVE BEEN WARNED!
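
The source for readblock isn't shown here, so the following is only a sketch of the core idea (seek to the right byte, then read), not the actual program. The function name read_block and its parameters are made up for this illustration, and unlike the real tool it takes the block size as an argument instead of asking the device. The demo uses an in-memory stand-in for a partition so it can be run without root.

```python
import io


def read_block(device, block_number, offset, count, block_size=4096):
    """Read `count` bytes starting `offset` bytes into block `block_number`.

    `device` is any seekable binary file object, e.g. open("/dev/sda1", "rb").
    """
    device.seek(block_number * block_size + offset)
    return device.read(count)


# Demo on an in-memory stand-in for a partition: three 4096-byte blocks,
# filled with 0x00, 0x11, and 0x22 respectively.
fake_device = io.BytesIO(b"\x00" * 4096 + b"\x11" * 4096 + b"\x22" * 4096)
chunk = read_block(fake_device, 1, 0x0100, 4)
print(chunk.hex())   # 11111111
```

Pointing the same function at a real partition would look like read_block(open("/dev/sda1", "rb"), 1057, 0x0100, 256), which needs root for the same reason readblock does.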

So, the actual bytes that will be read are bytes 4,329,728 through 4,329,983. The way we arrived at those numbers was:

BlockNumber * BlockSize + Offset
    1057    *   4096    +  256     = 4,329,472 + 256 = 4,329,728 [starting byte]
                                           + 256 - 1 = 4,329,983 [ending byte]
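
The same calculation as a few lines of Python, in case you want to check it for other blocks. This is just a sketch; byte_range is a made-up helper name, and it assumes the 4,096-byte block size used throughout this section.

```python
BLOCK_SIZE = 4096   # block size of the partitions in this section


def byte_range(block_number, offset, count, block_size=BLOCK_SIZE):
    """Return (first, last) absolute byte positions, both inclusive."""
    first = block_number * block_size + offset
    return first, first + count - 1


first, last = byte_range(1057, 0x0100, 256)
print(f"{first:,} to {last:,}")   # 4,329,728 to 4,329,983
```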

Now, because the information in the inode is mostly binary, when displaying it on the screen it will just look like garbage:

�AqY�\Qπ_Qπ7�!$ �P�ɬP��0尦��X

However, it really did read and display (or try to display) 256 bytes of binary data. One thing you can do is to redirect the output to a file:

sudo readblock /dev/sda1 1057 0x0100 256 > inode2.bin

On the disk you'll see that it's exactly 256 bytes:

ls -l inode2.bin

Output:

-rw------- 1 chico chico 256 Oct  9 14:46 inode2.bin

Now, you can just use any of the bajillion hex viewers to look at it such as hexdump or od (octal dump)

od -x inode2.bin

Output:

0000000 41ed 0000 1000 0000 5971 5ce0 cf51 5f80
0000020 cf51 5f80 0000 0000 0000 001d 0008 0000
0000040 0000 0008 3714 0000 f30a 0001 0004 0000
0000060 0000 0000 0000 0000 0001 0000 2421 0000
0000100 0000 0000 0000 0000 0000 0000 0000 0000
*
0000200 0020 0000 50ac c9c5 50ac c9c5 1830 b0e5
0000220 ffa6 58ac 0000 0000 0000 0000 0000 0000
0000240 0000 0000 0000 0000 0000 0000 0000 0000
*
0000400

Or, better yet, how about the trusty old dumpit program:

dumpit inode2.bin

Output:

inode2.bin:
       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 41 00 00 00 10 00 00  71 59 E0 5C 51 CF 80 5F   .A......qY.\Q.._
000010 51 CF 80 5F 00 00 00 00  00 00 1D 00 08 00 00 00   Q.._............
000020 00 00 08 00 14 37 00 00  0A F3 01 00 04 00 00 00   .....7..........
000030 00 00 00 00 00 00 00 00  01 00 00 00 21 24 00 00   ............!$..
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 20 00 00 00 AC 50 C5 C9  AC 50 C5 C9 30 18 E5 B0    ....P...P..0...
000090 A6 FF AC 58 00 00 00 00  00 00 00 00 00 00 00 00   ...X............
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

This is showing us the actual raw binary data that is stored in the disk block.
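
If you don't have dumpit, a dump in roughly the same style takes only a few lines. This is a sketch and not the dumpit source: hex_dump is a made-up name, and the layout is simplified (no header row and no extra gap after the eighth byte).

```python
def hex_dump(data, width=16):
    """Return a list of lines: offset, hex bytes, then printable ASCII."""
    lines = []
    for start in range(0, len(data), width):
        row = data[start:start + width]
        hex_part = " ".join(f"{b:02X}" for b in row)
        # Show printable ASCII as-is and everything else as a dot.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in row)
        lines.append(f"{start:06X} {hex_part.ljust(width * 3)} {text}")
    return lines


# First 16 bytes of the root inode, copied from the dump above.
sample = bytes.fromhex("ED 41 00 00 00 10 00 00 71 59 E0 5C 51 CF 80 5F")
for line in hex_dump(sample):
    print(line)
```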

In fact, let's skip the temporary file creation and just pipe the output of readblock directly into dumpit:

sudo readblock /dev/sda1 1057 0x0100 256 | dumpit

That will produce the same output! Yeah, pipes are a wonderful thing! (If you don't have access to the dumpit program, you can just use od or something similar.)

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 41 00 00 00 10 00 00  71 59 E0 5C 51 CF 80 5F   .A......qY.\Q.._
000010 51 CF 80 5F 00 00 00 00  00 00 1D 00 08 00 00 00   Q.._............
000020 00 00 08 00 14 37 00 00  0A F3 01 00 04 00 00 00   .....7..........
000030 00 00 00 00 00 00 00 00  01 00 00 00 21 24 00 00   ............!$..
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 20 00 00 00 AC 50 C5 C9  AC 50 C5 C9 30 18 E5 B0    ....P...P..0...
000090 A6 FF AC 58 00 00 00 00  00 00 00 00 00 00 00 00   ...X............
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Most of the entries are zeros, but there is a bunch of other stuff. Specifically, those values represent permissions (read/write/execute and owner/group) as well as time/date of the file, how big it is, what type of file/directory it is, etc. However, what we are interested in is the contents of the root directory. Remember, our goal in all of this is to locate and display this file:

/homes/chico/kitchen/refrigerator/cake

Currently, we've just found the root directory's inode. Now, with this, we need to get the contents of the root directory because that's where the homes directory is located. I've highlighted some bytes in the output above (the last four bytes of the row at offset 000030). The hex number 00 00 24 21 is the one. (My system is little-endian, so the bytes appear reversed in the dump as 21 24 00 00.) That number is a pointer (block number) to another block that contains the contents (i.e. the filenames) of the root directory.

Aside: There is a lot of information encoded in that inode and most of it is not necessary to understand in order to learn how the filesystem works. I will point out some other useful bits of information later. For now, the only piece we are interested in is the location (read: pointer) of the contents of the directory. That's what is highlighted. There are links below that will describe the layout of the inode and all of its data fields in excruciating detail.
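
To make the "pointer" less mysterious, here is a sketch that pulls a few fields out of the raw bytes with Python's struct module. The offsets 0x00 (mode), 0x04 (size), and 0x1A (link count) follow the standard ext2/ext4 inode layout. The offset 0x3C is simply where the pointer sits in the dumps in this section; a file whose data is spread over several runs of blocks would need more work than this.

```python
import struct

# First 64 bytes of the root inode, copied from the dump above.
inode = bytes.fromhex(
    "ED 41 00 00 00 10 00 00 71 59 E0 5C 51 CF 80 5F "
    "51 CF 80 5F 00 00 00 00 00 00 1D 00 08 00 00 00 "
    "00 00 08 00 14 37 00 00 0A F3 01 00 04 00 00 00 "
    "00 00 00 00 00 00 00 00 01 00 00 00 21 24 00 00"
)

# "<" = little-endian, "H" = 2-byte unsigned, "I" = 4-byte unsigned.
mode, = struct.unpack_from("<H", inode, 0x00)        # file type + permission bits
size, = struct.unpack_from("<I", inode, 0x04)        # size in bytes
links, = struct.unpack_from("<H", inode, 0x1A)       # hard-link count
data_block, = struct.unpack_from("<I", inode, 0x3C)  # the pointer discussed above

print(oct(mode), size, links, hex(data_block))   # 0o40755 4096 29 0x2421
```

Compare this with the stat output for /: the size is 4096, there are 29 links, and the low digits of the mode are 0755.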

Ok, so how do we read the contents? Simple. We use the readblock program again:

sudo readblock /dev/sda1 0x2421 0 512 | dumpit

I'm just reading the first 512 bytes from data block #0x2421 (9249 in decimal), as that will contain what we're looking for. Of course, all data blocks are 4,096 bytes in length and if I showed every byte, all of the bytes at the end would be 0.

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 02 00 00 00 0C 00 01 02  2E 00 00 00 02 00 00 00   ................
000010 0C 00 02 02 2E 2E 00 00  0B 00 00 00 14 00 0A 02   ................
000020 6C 6F 73 74 2B 66 6F 75  6E 64 00 00 0C 00 00 00   lost+found......
000030 14 00 0A 07 69 6E 69 74  72 64 2E 69 6D 67 00 00   ....initrd.img..
000040 0D 00 00 00 10 00 07 07  76 6D 6C 69 6E 75 7A 00   ........vmlinuz.
000050 01 00 24 00 0C 00 03 02  62 69 6E 00 01 00 08 00   ..$.....bin.....
000060 0C 00 04 02 62 6F 6F 74  01 00 0C 00 10 00 05 02   ....boot........
000070 63 64 72 6F 6D 00 00 00  01 00 0A 00 0C 00 03 02   cdrom...........
000080 64 65 76 00 01 00 02 00  0C 00 03 02 65 74 63 00   dev.........etc.
000090 01 00 0E 00 0C 00 04 02  68 6F 6D 65 01 00 14 00   ........home....
0000A0 0C 00 03 02 6C 69 62 00  01 00 20 00 10 00 05 02   ....lib... .....
0000B0 6C 69 62 33 32 00 00 00  01 00 10 00 10 00 05 02   lib32...........
0000C0 6C 69 62 36 34 00 00 00  01 00 04 00 10 00 06 02   lib64...........
0000D0 6C 69 62 78 33 32 00 00  01 00 16 00 10 00 05 02   libx32..........
0000E0 6D 65 64 69 61 00 00 00  01 00 06 00 0C 00 03 02   media...........
0000F0 6D 6E 74 00 01 00 22 00  0C 00 03 02 6F 70 74 00   mnt...".....opt.
000100 01 00 18 00 0C 00 04 02  70 72 6F 63 01 00 1C 00   ........proc....
000110 0C 00 04 02 72 6F 6F 74  01 00 1A 00 0C 00 03 02   ....root........
000120 72 75 6E 00 01 00 1E 00  0C 00 04 02 73 62 69 6E   run.........sbin
000130 02 00 08 00 0C 00 03 02  73 72 76 00 02 00 04 00   ........srv.....
000140 0C 00 03 02 73 79 73 00  02 00 06 00 0C 00 03 02   ....sys.........
000150 74 6D 70 00 02 00 0A 00  0C 00 03 02 75 73 72 00   tmp.........usr.
000160 02 00 02 00 0C 00 03 02  76 61 72 00 02 00 0C 00   ........var.....
000170 0C 00 04 02 77 65 62 6D  66 77 06 00 14 00 07 02   ....webmfw......
000180 73 74 6F 72 61 67 65 6F  46 71 57 41 0E 00 00 00   storageoFqWA....
000190 10 00 05 01 2E 68 63 77  64 00 00 00 1A 0E 02 00   .....hcwd.......
0001A0 10 00 07 02 2E 63 6F 6E  66 69 67 74 D2 05 14 00   .....configt....
0001B0 10 00 05 02 68 6F 6D 65  73 31 77 76 0F 00 00 00   ....homes1wv....
0001C0 44 0E 10 01 77 65 62 6D  69 6E 2D 73 65 74 75 70   D...webmin-setup
0001D0 2E 6F 75 74 12 00 00 00  2C 0E 12 01 2E 69 73 6D   .out....,....ism
0001E0 6F 75 6E 74 2D 74 65 73  74 2D 66 69 6C 65 00 00   ount-test-file..
0001F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

I've highlighted the name of the directory we're searching for (homes) as well as a few other things. The 05 is the length of the filename, as these are not NUL-terminated strings (unlike strings in C/C++). Also, the 02 is the type of file (0x02 means it's a directory). Files can be of these types:

Code	Type of file
0	Unknown
1	regular file
2	directory
3	character device
4	block device
5	FIFO
6	socket
7	symbolic link

Lastly, and most importantly, I've highlighted the number D2 05 14 00, as this is the inode (little endian) for the homes directory. Remember, in addition to finding and searching the root directory, we also have to find and search the homes, chico, kitchen, and refrigerator directories. This is what's happening "behind the scenes" every time you try to access any file on the system.

Incidentally, the 2E and 2E 2E values at the top of the output correspond to the current directory (just a single dot, .) and the parent directory (two dots, ..), which are two entries you will find in every directory (even the root, which has no parent!)
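
Here is a sketch of how a program could walk the entries in a directory block. Each entry starts with 8 fixed bytes: the 4-byte inode number, a 2-byte record length (not discussed above; it says how far to skip to reach the next entry), the 1-byte name length, and the 1-byte file type. The name follows. The function name parse_dir_entries is made up for this illustration.

```python
import struct

FILE_TYPES = {0: "unknown", 1: "regular file", 2: "directory", 3: "character device",
              4: "block device", 5: "FIFO", 6: "socket", 7: "symbolic link"}


def parse_dir_entries(block):
    """Yield (inode, name, type) for each entry in a directory data block."""
    pos = 0
    while pos + 8 <= len(block):
        inode, rec_len, name_len, file_type = struct.unpack_from("<IHBB", block, pos)
        if rec_len < 8:          # malformed entry; stop rather than loop forever
            break
        if inode != 0:           # inode 0 marks an unused entry
            name = block[pos + 8:pos + 8 + name_len].decode("utf-8", "replace")
            yield inode, name, FILE_TYPES.get(file_type, "unknown")
        pos += rec_len


# The first 0x40 bytes of the root directory block, copied from the dump above.
root_block = bytes.fromhex(
    "02 00 00 00 0C 00 01 02 2E 00 00 00 02 00 00 00 "
    "0C 00 02 02 2E 2E 00 00 0B 00 00 00 14 00 0A 02 "
    "6C 6F 73 74 2B 66 6F 75 6E 64 00 00 0C 00 00 00 "
    "14 00 0A 07 69 6E 69 74 72 64 2E 69 6D 67 00 00"
)
for entry in parse_dir_entries(root_block):
    print(entry)
```

Run on those 64 bytes, it reports ., .., lost+found (all directories), and initrd.img (a symbolic link). Feeding it a whole 4,096-byte block would list every name in the directory.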

Ok, so now it's time to search through the contents of the homes directory and see if we can locate the directory named chico.

First, we have to read the inode for the homes directory. We know that the inode number is 0x001405D2 because that's what we found in the root directory. Converting the hex to decimal we get 1312210. To verify that we are actually correct, we can simply stat the homes directory:
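
If you'd rather not reverse the bytes by hand, Python can do the little-endian conversion for you. A quick sketch:

```python
raw = bytes.fromhex("D2 05 14 00")          # the four bytes as they appear on disk
inode_number = int.from_bytes(raw, "little")
print(hex(inode_number), inode_number)      # 0x1405d2 1312210
```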

stat /homes

Output:

  File: '/homes'
  Size: 4096      	Blocks: 8          IO Block: 4096   directory
Device: 801h/2049d	Inode: 1312210     Links: 8
Access: (0755/drwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2020-10-08 13:15:10.002908649 -0700
Modify: 2020-10-08 13:20:28.402906431 -0700
Change: 2020-10-08 13:20:28.402906431 -0700
 Birth: -

Of course, we could have done this as well:

ls -ldi /homes

Output:

1312210 drwxr-xr-x 8 root root 4,096 Oct  8 13:20 /homes

Ok, let's dump that inode using readblock. First, we have to find out where (read: in which disk block) the inode resides. Using debugfs again to map the inode number to a disk block:

sudo debugfs -R 'imap <1312210>' /dev/sda1

Output:

Inode 1312210 is part of block group 160
	located at block 5243005, offset 0x0100

Using this information, we can read the block:

sudo readblock /dev/sda1 5243005 0x0100 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 41 00 00 00 10 00 00  4E 73 7F 5F 8C 74 7F 5F   .A......Ns._.t._
000010 8C 74 7F 5F 00 00 00 00  00 00 08 00 08 00 00 00   .t._............
000020 00 00 08 00 07 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 AF 2B 50 00   .............+P.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 2D AD E8 08  00 00 00 00 00 00 00 00   ....-...........
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 FC 74 0F 60  FC 74 0F 60 A4 87 B1 00   .....t.`.t.`....
000090 4E 73 7F 5F A4 87 B1 00  00 00 00 00 00 00 00 00   Ns._............
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

This is the inode for the homes directory. We need to see the content (read: filenames) in the directory. I've highlighted the pointer to the contents above. Now, read that block to get the contents:

sudo readblock /dev/sda1 0x00502BAF 0 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 D2 05 14 00 0C 00 01 02  2E 00 00 00 02 00 00 00   ................
000010 0C 00 02 02 2E 2E 00 00  29 60 14 00 10 00 05 02   ........)`......
000020 63 68 69 63 6F 00 00 00  38 60 14 00 10 00 05 02   chico...8`......
000030 61 6C 76 69 6E 00 00 00  39 60 14 00 10 00 08 02   alvin...9`......
000040 76 65 72 6F 6E 69 63 61  3A 60 14 00 10 00 05 02   veronica:`......
000050 62 65 74 74 79 00 00 00  3B 60 14 00 0C 00 04 02   betty...;`......
000060 66 72 65 64 3C 60 14 00  9C 0F 05 02 77 69 6C 6D   fred<`......wilm
000070 61 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   a...............
000080 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000090 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Aw, yeah! Now we're cookin' with gas! I've highlighted the name (chico) and its corresponding inode (0x00146029). Remember, this is what's in /homes: chico, alvin, veronica, betty, fred, and wilma.

We can find the block that contains this inode for /homes/chico:

sudo debugfs -R 'imap <0x00146029>' /dev/sda1

Output:

Inode 1335337 is part of block group 163
	located at block 5244450, offset 0x0800

Then dump the inode:

sudo readblock /dev/sda1 5244450 0x0800 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 41 EA 03 00 10 00 00  73 73 7F 5F B4 74 7F 5F   .A......ss._.t._
000010 A4 74 7F 5F 00 00 00 00  EB 03 08 00 08 00 00 00   .t._............
000020 00 00 08 00 13 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 B8 2B 50 00   .............+P.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 32 AD E8 08  00 00 00 00 00 00 00 00   ....2...........
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 90 8C 6F E8  68 5E 5E 07 90 A3 64 82   ......o.h^^...d.
000090 73 73 7F 5F 90 A3 64 82  00 00 00 00 00 00 00 00   ss._..d.........
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

To get the contents of the /homes/chico directory, we have to follow the pointer that is highlighted above (0x00502BB8) and dump the first few bytes of the block:

sudo readblock /dev/sda1 0x502BB8 0 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 29 60 14 00 0C 00 01 02  2E 00 00 00 D2 05 14 00   )`..............
000010 0C 00 02 02 2E 2E 00 00  2A 60 14 00 10 00 07 02   ........*`......
000020 2E 63 6F 6E 66 69 67 00  2B 60 14 00 10 00 08 02   .config.+`......
000030 2E 6D 6F 7A 69 6C 6C 61  B7 15 14 00 1C 00 11 01   .mozilla........
000040 2E 63 6F 6D 70 74 6F 6E  2D 74 64 65 2E 63 6F 6E   .compton-tde.con
000050 66 50 30 00 B6 15 14 00  14 00 0C 01 2E 62 61 73   fP0..........bas
000060 68 5F 6C 6F 67 6F 75 74  B9 15 14 00 18 00 0B 01   h_logout........
000070 2E 78 63 6F 6D 70 6D 67  72 72 63 4C 53 63 74 6E   .xcompmgrrcLSctn
000080 B8 15 14 00 10 00 08 01  2E 70 72 6F 66 69 6C 65   .........profile
000090 3D 60 14 00 10 00 07 02  6B 69 74 63 68 65 6E 67   =`......kitcheng
0000A0 3E 60 14 00 10 00 07 02  62 65 64 72 6F 6F 6D 00   >`......bedroom.
0000B0 3F 60 14 00 10 00 08 02  62 61 74 68 72 6F 6F 6D   ?`......bathroom
0000C0 40 60 14 00 40 0F 06 02  67 61 72 61 67 65 00 00   @`..@...garage..
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

As a reminder, this is what's in /homes/chico. There are 4 visible directories there. You'll see there are a few hidden files/directories as well, but we can ignore those.

This time, I've highlighted the kitchen directory and its inode (0x0014603D) because now we need to find the refrigerator directory in the kitchen directory.

Now, find the block that contains the inode:

sudo debugfs -R 'imap <0x0014603D>' /dev/sda1

Output:

Inode 1335357 is part of block group 163
	located at block 5244451, offset 0x0c00

Then, dump the inode:

sudo readblock /dev/sda1 5244451 0x0c00 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 41 00 00 00 10 00 00  A4 74 7F 5F CE 74 7F 5F   .A.......t._.t._
000010 CE 74 7F 5F 00 00 00 00  00 00 08 00 08 00 00 00   .t._............
000020 00 00 08 00 07 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 FE 23 50 00   .............#P.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 D2 AD E8 08  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 CC FD DF 63  CC FD DF 63 68 5E 5E 07   .......c...ch^^.
000090 A4 74 7F 5F 68 5E 5E 07  00 00 00 00 00 00 00 00   .t._h^^.........
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

To get the contents of the /homes/chico/kitchen directory, we need to follow the highlighted pointer (block) above and dump the first few bytes of that block:

sudo readblock /dev/sda1 0x005023FE 0 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 3D 60 14 00 0C 00 01 02  2E 00 00 00 29 60 14 00   =`..........)`..
000010 0C 00 02 02 2E 2E 00 00  41 60 14 00 14 00 0C 02   ........A`......
000020 72 65 66 72 69 67 65 72  61 74 6F 72 42 60 14 00   refrigeratorB`..
000030 0C 00 04 02 73 69 6E 6B  43 60 14 00 14 00 09 02   ....sinkC`......
000040 63 75 70 62 6F 61 72 64  73 00 00 00 44 60 14 00   cupboards...D`..
000050 0C 00 04 02 6F 76 65 6E  45 60 14 00 10 00 05 02   ....ovenE`......
000060 73 74 6F 76 65 00 00 00  46 60 14 00 98 0F 09 02   stove...F`......
000070 6D 69 63 72 6F 77 61 76  65 00 00 00 00 00 00 00   microwave.......
000080 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000090 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Again, this is what's in /homes/chico/kitchen. You'll see 6 visible directories. Now that we've located the refrigerator directory, it's time to find out what's in it.

As usual, find the block that contains the inode for refrigerator:

sudo debugfs -R 'imap <0x00146041>' /dev/sda1

Output:

Inode 1335361 is part of block group 163
	located at block 5244452, offset 0x0000

Then dump the inode:

sudo readblock /dev/sda1 5244452 0x0000 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 41 EA 03 00 10 00 00  CE 74 7F 5F 5E 75 7F 5F   .A.......t._^u._
000010 10 75 7F 5F 00 00 00 00  EB 03 02 00 08 00 00 00   .u._............
000020 00 00 08 00 0B 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 22 24 50 00   ............"$P.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 DD AD E8 08  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 1C 92 58 83  A8 26 C4 13 CC FD DF 63   ......X..&.....c
000090 CE 74 7F 5F CC FD DF 63  00 00 00 00 00 00 00 00   .t._...c........
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Then follow the pointer to get the contents of refrigerator:

sudo readblock /dev/sda1 0x00502422 0 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 41 60 14 00 0C 00 01 02  2E 00 00 00 3D 60 14 00   A`..........=`..
000010 0C 00 02 02 2E 2E 00 00  FB 15 14 00 0C 00 04 01   ................
000020 6D 69 6C 6B FC 15 14 00  0C 00 04 01 65 67 67 73   milk........eggs
000030 FD 15 14 00 10 00 06 01  62 75 74 74 65 72 00 00   ........butter..
000040 FE 15 14 00 10 00 05 01  6A 75 69 63 65 00 00 00   ........juice...
000050 FF 15 14 00 10 00 06 01  63 68 65 65 73 65 00 00   ........cheese..
000060 00 16 14 00 0C 00 04 01  63 6F 6B 65 01 16 14 00   ........coke....
000070 10 00 06 01 61 70 70 6C  65 73 00 00 02 16 14 00   ....apples......
000080 10 00 07 01 63 68 69 63  6B 65 6E 00 03 16 14 00   ....chicken.....
000090 0C 00 04 01 63 61 6B 65  04 16 14 00 68 0F 03 01   ....cake....h...
0000A0 70 69 65 00 00 00 00 00  00 00 00 00 00 00 00 00   pie.............
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Reminder, this is what's in /homes/chico/kitchen/refrigerator. You'll see 10 visible files (not directories) this time. Now that we've located the cake file, it's time to find out what's in it. The process is the same as it is for directories.

We have the inode for cake and it's inode 0x00141603. Let's dump out the inode. First, get the block that contains it:

sudo debugfs -R 'imap <0x00141603>' /dev/sda1

Output:

Inode 1316355 is part of block group 160
	located at block 5243264, offset 0x0200

Then, dump the inode:

sudo readblock /dev/sda1 5243264 0x0200 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 A4 81 EA 03 4C 00 00 00  10 75 7F 5F 79 C7 80 5F   ....L....u._y.._
000010 79 C7 80 5F 00 00 00 00  EB 03 01 00 08 00 00 00   y.._............
000020 00 00 08 00 01 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 08 20 3E 00   ............. >.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 ED AD E8 08  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 54 B3 DE 60  54 B3 DE 60 A8 26 C4 13   ....T..`T..`.&..
000090 10 75 7F 5F A8 26 C4 13  00 00 00 00 00 00 00 00   .u._.&..........
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Additionally, I've highlighted 4 bytes above (4C 00 00 00, at offsets 04 through 07) because they have a specific meaning. Those bytes are the actual size of the file (0x0000004C is 76 in decimal). We'll return to this value shortly.

If we follow the pointer above, it will take us to the contents of the cake file, which is the actual text that is in the file.

sudo readblock /dev/sda1 0x003E2008 0 128 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 65 67 67 73 0A 62 75 74  74 65 72 0A 6D 69 6C 6B   eggs.butter.milk
000010 0A 66 6C 6F 75 72 0A 76  61 6E 69 6C 6C 61 0A 69   .flour.vanilla.i
000020 63 69 6E 67 0A 73 74 72  61 77 62 65 72 72 69 65   cing.strawberrie
000030 73 0A 70 65 61 63 68 65  73 0A 6C 65 74 74 75 63   s.peaches.lettuc
000040 65 0A 61 73 70 61 72 61  67 75 73 0A 00 00 00 00   e.asparagus.....
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

This time, I chose to just dump out 128 bytes, because the file is smaller than that. Because all of the data in the cake file is text, I don't need to use dumpit as I can just display it:

sudo readblock /dev/sda1 0x003E2008 0 128

Output:

eggs
butter
milk
flour
vanilla
icing
strawberries
peaches
lettuce
asparagus

The output stops after asparagus because that's the end of the printable characters; the remaining bytes are all 0, and character 0 doesn't print anything. This is the exact output we got from our original command:

cat /homes/chico/kitchen/refrigerator/cake

This was the entire purpose of this demonstration: To show exactly what is going on behind the scenes. So, to get back to the original question: "How many disk reads were required to locate (search), open (read), and display the file?"

Now, you should be able to answer that.

If we use ls to look at the file:

ls -l /homes/chico/kitchen/refrigerator/cake

Output:

-rw-r--r-- 1 chico chico 76 Oct  9 13:26 /homes/chico/kitchen/refrigerator/cake

We can, in fact, see that the file size is 76 bytes, which is part of the inode (metadata) that is associated with this file that was shown above.
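
Both the size and the permission string that ls prints can be recovered from the first few bytes of the inode. A small sketch, assuming the standard ext2/ext4 layout (2-byte mode at offset 0, the low 2 bytes of the owner's user ID at offset 2, 4-byte size at offset 4) and using the first 8 bytes of the cake inode dump:

```python
import stat
import struct

# First 8 bytes of the cake inode, copied from the dump above.
inode_start = bytes.fromhex("A4 81 EA 03 4C 00 00 00")
mode, uid, size = struct.unpack("<HHI", inode_start)

print(stat.filemode(mode), size)   # -rw-r--r-- 76
```

That is the same -rw-r--r-- and 76 that ls -l reports.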

Remember these two files from before? I said that this file:

/usr/hostname

is going to require significantly less work than locating this file:

/usr/share/icons/foo/bar/baz/bat/one/more/dir/and/were/done/file.txt

It should be clear and obvious why that is. Each directory adds 2 additional disk reads to the process. One read is for the inode and one is for the contents. So, files stored very deep in the hierarchy are much more expensive to read than ones that are shallow. To reach the hostname file, the system only has to read 2 directories (root and usr) but to reach file.txt it has to read 14 directories! (The root directory plus 13 subdirectories)
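
The counting can be sketched in a few lines. This uses the simplified model from this section only: nothing is cached, every directory fits in one block, and each directory (and the file itself) costs one read for its inode and one for its contents. The function names are made up for this illustration.

```python
def directories_on_path(path):
    """Count the directories that must be searched to reach `path` (root included)."""
    parts = [p for p in path.split("/") if p]
    # Every component except the last is a subdirectory; adding the root
    # brings the total back up to len(parts).
    return len(parts)


def uncached_reads(path):
    """Disk reads under this section's model: 2 per directory, plus 2 for the file."""
    return 2 * directories_on_path(path) + 2


shallow = "/usr/hostname"
deep = "/usr/share/icons/foo/bar/baz/bat/one/more/dir/and/were/done/file.txt"
print(directories_on_path(shallow), uncached_reads(shallow))   # 2 6
print(directories_on_path(deep), uncached_reads(deep))         # 14 30
```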

Notes:

Drives are pretty fast today, so a user won't even notice the difference.

Can you imagine doing this on a floppy disk?
The first floppy disks (filesystems) didn't have directories, so it wasn't a problem.
Once filesystems on floppies had directories, you can see how glacially slow the process would be.
Of course, floppies could only contain a small number of files/directories, so it would be very unlikely you would have deep directories.

Finding hundreds or thousands of files in a deep hierarchy will definitely be more noticeable.
This is where caching really comes into play.
Imagine reading these two files in succession:
```
/usr/share/icons/foo/bar/baz/bat/one/more/dir/and/were/done/file1.txt
/usr/share/icons/foo/bar/baz/bat/one/more/dir/and/were/done/file2.txt
```
The first file will incur a steep cost to locate and read all of the directories leading up to the file. However, the second file will likely only require 2 disk reads because all of the directories (inodes and data blocks) are likely to still be cached in memory, making the lookups very fast.
Having a lot of memory in your system can significantly speed up the filesystem operations.

This extra memory will be used to cache tens or hundreds of thousands of inodes and disk blocks.
This means once a disk block (especially directories) has been read, subsequent reads are from memory, not the disk, resulting in a huge performance boost.
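
The effect of the cache on the two lookups above can be sketched with a toy model. This is only an illustration under simplifying assumptions that do not come from a real kernel: every inode and every directory's (or file's) contents costs exactly one disk read, the cache starts out empty, and nothing is ever evicted. The name `count_disk_reads` is made up for this example.

```python
def count_disk_reads(paths):
    """Return the number of simulated disk reads needed for each path, in order."""
    cache = set()            # (object, "inode" or "data") pairs already in memory
    reads_per_path = []
    for path in paths:
        parts = [p for p in path.split("/") if p]
        # Objects visited: the root directory, every directory below it, then the file.
        objects = ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
        reads = 0
        for obj in objects:
            for kind in ("inode", "data"):
                if (obj, kind) not in cache:
                    cache.add((obj, kind))   # simulated disk read, now cached
                    reads += 1
        reads_per_path.append(reads)
    return reads_per_path


deep = "/usr/share/icons/foo/bar/baz/bat/one/more/dir/and/were/done/"
print(count_disk_reads([deep + "file1.txt", deep + "file2.txt"]))   # [30, 2]
```

In this model the first file costs 30 reads (14 directories plus the file, 2 reads each), while the second costs only the 2 reads for the file itself because every directory is already cached.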

It's the most basic rule of computer science: "Throwing more memory at a problem can significantly improve the performance." File systems are no exception. In fact, file systems are a major reason to have lots of memory. (Remember, disks are 10,000 times slower than memory, so avoid reading the disk whenever possible!)
Good operating systems should never have free (i.e. unused) memory. What?

Any memory that isn't used is essentially being wasted.
It should be put to use, such as caching.
If other processes actually need memory to run, the OS can free up some of the cached disk blocks as needed. It's all transparent to the user.
Most modern operating systems (i.e. filesystems) do this.

More Inode Details

We saw that there was quite a bit of information in the inodes that we were ignoring. We were basically just interested in using the inode to find the data block(s) associated with the file/directory. Let's look a little closer at the inode for the cake file:

ls -li /homes/chico/kitchen/refrigerator/cake

Output:

1316355 -rw-r--r-- 1 chico chico 76 Oct 12 14:24 /homes/chico/kitchen/refrigerator/cake

Output annotated:

1316355  -rw-r--r--  1  chico  chico  76  Oct 12 14:24  /homes/chico/kitchen/refrigerator/cake
^^^^^^^  ^^^^^^^^^^  ^  ^^^^^  ^^^^^  ^^  ^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 inode     perms     |    |      |    |     date/time            fullpath of the file
           type      |    |      |    |
                    /     |      |     \
                   /      |      |      \
               links    owner  group    size

The ls command shows us a lot of information, and nearly all of it comes from the inode for the file (the name is the exception; it is stored in the directory). Let's break it down. These are the fields (from left to right) and their meanings:

Field	Description
`1316355`	This is the inode number of the file.
`-rw-r--r--`	These are the permissions for the user, group, and others, as well as the type of file.
`1`	The number of hard links to this file.
`chico`	The owner (user) of the file.
`chico`	The group that the file belongs to.
`76`	The size of the file (in bytes).
`Oct 12 14:24`	The date/time that the contents of the file were last modified.

Using the same technique to find the inode and dump out its contents:

sudo debugfs -R 'imap <1316355>' /dev/sda1

Output:

Inode 1316355 is part of block group 160
	located at block 5243264, offset 0x0200

Dump the inode:

sudo readblock /dev/sda1 5243264 0x200 256 | dumpit

Output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 A4 81 EA 03 4C 00 00 00  10 75 7F 5F 76 C9 84 5F   ....L....u._v.._
000010 76 C9 84 5F 00 00 00 00  EB 03 01 00 08 00 00 00   v.._............
000020 00 00 08 00 01 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 F8 60 24 00   .............`$.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 ED AD E8 08  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 E4 98 9D 1D  E4 98 9D 1D A8 26 C4 13   .............&..
000090 10 75 7F 5F A8 26 C4 13  00 00 00 00 00 00 00 00   .u._.&..........
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

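The 2 bytes (0x81A4) at offset 0x0000 are the permissions and type. We know that the file is a regular file and the permissions are rw-r--r--. In octal, these permissions would be 644. The inode stores the type and the permissions together in one 16-bit value: 0x81A4 is 100644 in octal, where the high bits (octal 100000) mark a regular file and the low bits (octal 644) are the permissions.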
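
To check that decoding by machine, here is a small sketch using Python's standard `stat` module. The two bytes are copied from the dump above; the variable names are only for illustration.

```python
import stat

# The mode field as stored in the inode: bytes A4 81, little-endian.
raw = bytes([0xA4, 0x81])
mode = int.from_bytes(raw, "little")

print(hex(mode))                  # 0x81a4
print(oct(mode))                  # 0o100644
print(stat.S_ISREG(mode))         # True, i.e. a regular file
print(oct(stat.S_IMODE(mode)))    # 0o644, the permission bits only
print(stat.filemode(mode))        # -rw-r--r--
```

The last line is the same string that ls prints in its first column.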

The next 2 bytes (0x03EA) at offset 0x0002 are the owner (user) of the file. In decimal, this is user ID 1002. To verify this, run this command:
```
cat /etc/passwd | grep chico
```
and you'll see this (on my system):
```
chico:x:1002:1003:Chico Escuela:/home/chico:/bin/bash
```
You can plainly see that the third field (1002) is the user ID of chico. The fourth field (1003) is the ID of chico's primary group.
The next 4 bytes (0x0000004C) at offset 0x0004 are the size of the file. In this case 0x0000004C is 76 in decimal, which is what the ls command shows. Actually, these 4 bytes are just the low 32 bits of the size. The inode also has 4 additional bytes for the high 32 bits, which are needed for files too large to fit into 32 bits. That gives a theoretical maximum file size of 2⁶⁴ bytes, which is pretty large, although in practice 16 TiB is the (current) limit.

The 2 bytes at offset 0x0018 (0x03EB) are the group that the file belongs to. In decimal, this is group ID 1003. To verify that group ID 1003 is chico, run this command:
```
cat /etc/group | grep chico
```
and you'll see this (on my system):
```
chico:x:1003:
```
The 2 bytes after the group, at offset 0x001A, are the link count (0x0001). This number is how many references (hard links) there are to the file. There can be more than one because you can give multiple names to the same file. This is somewhat analogous to how references in C++ work.
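
The fields discussed so far can all be pulled out of the raw bytes in one step. The sketch below unpacks the first 28 bytes of the dump (offsets 0x00 through 0x1B) with Python's `struct` module. The field order follows the standard ext4 inode layout; the short dictionary keys are my own labels, not official names. Note that the uid and gid fields here are only the low 16 bits of each ID.

```python
import struct

# Offsets 0x00-0x1B of the cake inode, copied from the hex dump above.
inode_head = bytes.fromhex(
    "A4 81 EA 03 4C 00 00 00 10 75 7F 5F 76 C9 84 5F"
    "76 C9 84 5F 00 00 00 00 EB 03 01 00"
)

# "<" = little-endian, H = 2-byte unsigned, I = 4-byte unsigned.
FIELDS = ("mode", "uid", "size_lo", "atime", "ctime", "mtime", "dtime", "gid", "links")
inode = dict(zip(FIELDS, struct.unpack("<HHIIIIIHH", inode_head)))

print(hex(inode["mode"]))   # 0x81a4
print(inode["uid"])         # 1002
print(inode["size_lo"])     # 76
print(inode["gid"])         # 1003
print(inode["links"])       # 1
```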

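Two more fields hold the time of the file's last modification. The 4 bytes at offset 0x0010 (0x5F84C976) are the date and time, stored as the number of seconds since January 1, 1970 (UTC), and the 4 bytes at offset 0x0088 (0x1D9D98E4) hold the sub-second (nanosecond) part of that timestamp.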
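
As a rough check, the two values can be decoded like this. The sketch assumes the usual ext4 encoding, in which the value at offset 0x0010 is seconds since the Unix epoch and the value at offset 0x0088 packs the nanoseconds into its upper 30 bits, with the lower 2 bits extending the seconds field. The variable names are illustrative.

```python
from datetime import datetime, timezone

mtime_seconds = 0x5F84C976   # 4 bytes at offset 0x0010
mtime_extra = 0x1D9D98E4     # 4 bytes at offset 0x0088

nanoseconds = mtime_extra >> 2      # upper 30 bits
epoch_bits = mtime_extra & 0x3      # lower 2 bits extend the seconds beyond 32 bits

seconds = mtime_seconds + (epoch_bits << 32)
when = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(when.isoformat())   # 2020-10-12T21:24:06+00:00
print(nanoseconds)        # 124216889
```

21:24 UTC is 14:24 in a time zone that is 7 hours behind UTC, which agrees with the Oct 12 14:24 that ls printed (assuming the machine was set to such a zone).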

Here again is the output from the ls command:

1316355 -rw-r--r-- 1 chico chico 76 Oct 12 14:24 /homes/chico/kitchen/refrigerator/cake

The rest of the information encodes things like the creation date/time, last access date/time, checksums, version, high 32 bits of the size, and several other obsolete, reserved, and advanced pieces of information.

So, in a nutshell, the inode stores all of the information about a file with the exception of the filename. We saw that the filenames are stored in a directory's contents. This information in the inode is called metadata.

Extents

What about the actual contents of a file or directory? We saw that every inode has a pointer (block number) to the actual data blocks that store the contents. However, we know that, traditionally, an inode has several block pointers (15 to be exact). Recall the inode diagram and the (partial) inode for the cake file:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 A4 81 EA 03 4C 00 00 00  10 75 7F 5F 76 C9 84 5F   ....L....u._v.._
000010 76 C9 84 5F 00 00 00 00  EB 03 01 00 08 00 00 00   v.._............
000020 00 00 08 00 01 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 F8 60 24 00   .............`$.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 ED AD E8 08  00 00 00 00 00 00 00 00   ................
000070 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000080 1C 00 00 00 E4 98 9D 1D  E4 98 9D 1D A8 26 C4 13   .............&..
000090 10 75 7F 5F A8 26 C4 13  00 00 00 00 00 00 00 00   .u._.&..........
0000A0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000B0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000C0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000D0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000E0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
0000F0 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................

Where are all (15) of the block pointers? Up until now I've just "magically" been saying that the data can be found by following the 4 bytes at offset 0x003C (F8 60 24 00 in the dump above) and that this is the pointer to the data block (singular). Also, since all of our data (contents of blocks) thus far has been less than 4,096 bytes, we've never needed more than one pointer/block. Yes, there appear to be a lot of "empty" pointers that follow it, but there aren't 15 of them. Remember this self-check from above?

Self-check - With multiple levels of indirection, filesystems can be implemented efficiently for fragmented files. However, for non-fragmented (i.e. contiguous files), this approach is not very efficient. Explain why that is and how a better method can be used.

First, we need to see why the "old" inode scheme of many levels of indirection is good for fragmented files, but bad for non-fragmented (contiguous) files. Once we understand this, a better "solution" is obvious, and the solution is extents.

Many older and less sophisticated filesystems suffered from fragmented files, so the original inode scheme made sense. However, many modern filesystems have very few fragmented files, so this scheme is sub-optimal.

As an example, let's assume we have a file called file.txt that is 18,000 bytes in size. With block sizes of 4,096 bytes, the contents of the file will require 5 blocks, with the first 4 blocks being full and the last block containing 1,616 bytes.

  1       2       3       4       5
4,096 + 4,096 + 4,096 + 4,096 + 1,616 = 18,000

With linked allocation, we would have something like this:

With indexed allocation, we would have something like this:

The blue number above the block is the (arbitrary) byte address of the block. The numbers inside the blocks are the size of the data in the block. Because the data blocks are not contiguous, the file is fragmented.

For each of these two different schemes, answer these questions:

How much effort is required to read byte #15,000? (That is byte #2712 within block #4)
How much effort is required to read the entire file into memory?
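
The block arithmetic used in this example (how many blocks the file needs, and which block holds a given byte) is just integer division and remainder. A minimal sketch, with helper names invented for this illustration:

```python
BLOCK_SIZE = 4096


def blocks_needed(file_size, block_size=BLOCK_SIZE):
    """Number of blocks required to hold file_size bytes."""
    full, remainder = divmod(file_size, block_size)
    return full + (1 if remainder else 0)


def locate(byte_number, block_size=BLOCK_SIZE):
    """Return (block number counted from 1, byte position within that block)."""
    index, within = divmod(byte_number, block_size)
    return index + 1, within


print(blocks_needed(18000))      # 5
print(18000 - 4 * BLOCK_SIZE)    # 1616 bytes in the last block
print(locate(15000))             # (4, 2712)
```

Finding the block number is the cheap part. What differs between the allocation schemes is how many disk reads it takes to learn where that block lives on disk.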

Now, suppose the file was not fragmented (e.g. all blocks are contiguous).

Linked allocation:

Indexed allocation:

Answer the same questions:

How much effort is required to read byte #15,000? (That is byte #2712 within block #4)
How much effort is required to read the entire file into memory?

With linked allocation, you still must chase pointers as there is no random access regardless of the fragmentation of the file. With a naive indexed allocation, which is shown above, we also don't get a lot of improvement from the file system. (However, we will get some improvement from the hardware itself due to the locality of the non-fragmented blocks, e.g. prefetching.)

To see just how poorly a fragmented disk can perform, here is a forum post that I made (from July 2005). I've always been a big fan of Microsoft's Flight Simulator and have about 1,000,000 files (photo-realistic textures for parts of the United States). Because the frame rate depends so much on reading many files per second from the slow disk, any fragmentation is going to make things even worse. You can see the significant improvements by 1) defragging the MFT (Master File Table) and 2) moving important files (e.g. textures) to the outside tracks of the spinning disks. This demonstrates that the outer tracks move past the head much faster than the inner tracks (the angular velocity is the same, but the linear velocity is higher), thereby increasing the performance.

Remember, non-fragmented blocks act more like arrays than linked lists because all of the data is contiguous. We can take advantage of this fact by using extents.

Using extents, physical view:

Using extents, logical view:

So, looking back at the (partial) inode for the cake file we can see the extents that are in use:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 A4 81 EA 03 4C 00 00 00  10 75 7F 5F 76 C9 84 5F   ....L....u._v.._
000010 76 C9 84 5F 00 00 00 00  EB 03 01 00 08 00 00 00   v.._............
000020 00 00 08 00 01 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  01 00 00 00 F8 60 24 00   .............`$.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 ED AD E8 08  00 00 00 00 00 00 00 00   ................

The 2 bytes at offset 0x0038 (0x0001) are the number of blocks that are present in this extent. For files less than or equal to 4,096 bytes, it will always be 1. Larger files will have more blocks in the extent.

The 2 bytes at offset 0x003A (0x0000) are the high (upper) 16 bits of the address of the data block and will only be used for very large filesystems. The 4 bytes at offset 0x003C (0x002460F8) are the low 32 bits of that address.
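
To make the layout concrete, here is a sketch that decodes the extent information from the cake inode. It assumes the standard ext4 arrangement: a 12-byte extent header at offset 0x0028 of the inode, followed by 12-byte extent entries. The function name `parse_extents` is made up for this example, and it only handles the simple case where the extents are stored directly in the inode (depth 0).

```python
import struct

# Offsets 0x28-0x3F of the cake inode, copied from the hex dump above.
i_block = bytes.fromhex(
    "0A F3 01 00 04 00 00 00 00 00 00 00"   # extent header (12 bytes)
    "00 00 00 00 01 00 00 00 F8 60 24 00"   # first extent entry (12 bytes)
)


def parse_extents(data):
    """Return (maximum entries, [(logical block, length, physical block), ...])."""
    magic, entries, max_entries, depth, _generation = struct.unpack_from("<HHHHI", data, 0)
    if magic != 0xF30A:
        raise ValueError("extent header magic number not found")
    if depth != 0:
        raise ValueError("entries point to index blocks, not to data")
    extents = []
    for n in range(entries):
        logical, length, start_hi, start_lo = struct.unpack_from("<IHHI", data, 12 + 12 * n)
        extents.append((logical, length, (start_hi << 32) | start_lo))
    return max_entries, extents


print(parse_extents(i_block))   # (4, [(0, 1, 2384120)])
```

The result says the inode has room for 4 extents and currently uses one: logical block 0 of the file, 1 block long, stored at physical block 2,384,120 (0x002460F8).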

As an example, let's look at /usr/bin/zip which is clearly larger than a single block:

ls -li /usr/bin/zip

Output:

661483 -rwxr-xr-x 1 root root 188,296 Oct 21  2013 /usr/bin/zip

Find out which block contains inode 661483:

sudo debugfs -R 'imap <661483>' /dev/sda1

Output:

Inode 661483 is part of block group 80
	located at block 2621854, offset 0x0a00

And then dump the inode:

sudo readblock /dev/sda1 2621854 0xa00 256 | dumpit

Partial output:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 81 00 00 88 DF 02 00  2A 01 AD 58 2A 01 AD 58   ........*..X*..X
000010 DF 38 65 52 00 00 00 00  00 00 01 00 70 01 00 00   .8eR........p...
000020 00 00 08 00 01 00 00 00  0A F3 01 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  2E 00 00 00 ED F2 2A 00   ..............*.
000040 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 CD C7 2E D2  00 00 00 00 00 00 00 00   ................

The inode is telling us that the extent starts with block 0x002AF2ED and extends for 46 (0x002E) blocks. If you do the arithmetic:

4,096 * 46 = 188,416

we can see that there are exactly 120 (188,416 - 188,296) bytes that are unused in the last block. We can also see how many blocks the file used by running the stat command:

stat /usr/bin/zip

  File: '/usr/bin/zip'
  Size: 188296    	Blocks: 368        IO Block: 4096   regular file
Device: 801h/2049d	Inode: 661483      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2017-02-21 19:10:34.696610844 -0800
Modify: 2013-10-21 07:23:27.000000000 -0700
Change: 2017-02-21 19:10:34.696610844 -0800
 Birth: -

It tells us the file consumes 368 blocks. But, wait, the inode said there were only 46 blocks. What gives? The stat command is telling us how many 512-byte blocks are used by the file. Since the filesystem uses 4,096-byte blocks, just divide the value from stat by 8 and you'll get 46.
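
The relationship between the file size, the 4,096-byte filesystem blocks, and the 512-byte units reported by stat can be checked with a few lines of arithmetic. The constants come from the output above; the variable names are just for illustration.

```python
import math

BLOCK_SIZE = 4096     # filesystem block size
STAT_UNIT = 512       # stat reports "Blocks" in 512-byte units

file_size = 188_296   # size of /usr/bin/zip in bytes

fs_blocks = math.ceil(file_size / BLOCK_SIZE)
allocated = fs_blocks * BLOCK_SIZE
unused = allocated - file_size
stat_blocks = allocated // STAT_UNIT

print(fs_blocks)      # 46
print(allocated)      # 188416
print(unused)         # 120
print(stat_blocks)    # 368
```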

OK, but what if, for some reason, the blocks are not all contiguous? Maybe you have a really large file that does have "gaps" between the extents. Let's look at this file on my system:

ls -li /usr/bin/rosegarden

Output:

669122 -rwxr-xr-x 1 root root 15,863,224 Oct 22  2013 /usr/bin/rosegarden

This rosegarden file is over 15 megabytes in size and may not be 100% contiguous. Let's run stat on it first to see the output:

  File: '/usr/bin/rosegarden'
  Size: 15863224  	Blocks: 30984      IO Block: 4096   regular file
Device: 801h/2049d	Inode: 669122      Links: 1
Access: (0755/-rwxr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2018-09-23 12:20:30.000000000 -0700
Modify: 2013-10-22 05:47:12.000000000 -0700
Change: 2018-09-23 12:20:32.525504593 -0700
 Birth: -

We can see that there are 30,984 512-byte blocks or 3,873 I/O blocks (4,096 bytes). Using the debugfs command, we can see some information about the extents:

sudo debugfs -R 'stat /usr/bin/rosegarden' /dev/sda1

Output:

Inode: 669122   Type: regular    Mode:  0755   Flags: 0x80000
Generation: 1883018732    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 15863224
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 30984
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x5ba7e780:7d4a4144 -- Sun Sep 23 12:20:32 2018
 atime: 0x5ba7e77e:00000000 -- Sun Sep 23 12:20:30 2018
 mtime: 0x526673d0:00000000 -- Tue Oct 22 05:47:12 2013
crtime: 0x5ba7e780:60ae0954 -- Sun Sep 23 12:20:32 2018
Size of extra inode fields: 28
EXTENTS:
(0-2047):6352896-6354943, (2048-3872):6356992-6358816

This command produces a lot more information. The lines at the bottom tell us that there are 2 extents (i.e. 2 contiguous sets of blocks). The first extent is 2,048 blocks in length (with the corresponding block addresses) and the second extent is 1,825 blocks in length. If you add those numbers together (2,048 + 1,825) you'll get 3,873, the number of 4,096-byte I/O blocks used by the file.
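
The EXTENTS line also shows why extents make random access cheap: finding the disk block for any position in the file is a short search followed by one addition, with no pointer chasing. The sketch below parses the line printed above and maps a logical block to its physical block. Both function names are invented for this example, and the regular expression only handles the range form shown here.

```python
import re
from bisect import bisect_right

line = "(0-2047):6352896-6354943, (2048-3872):6356992-6358816"


def parse_debugfs_extents(text):
    """Return a list of (first logical block, length, first physical block)."""
    extents = []
    for match in re.finditer(r"\((\d+)-(\d+)\):(\d+)-(\d+)", text):
        first_logical, last_logical, first_physical, _last_physical = map(int, match.groups())
        extents.append((first_logical, last_logical - first_logical + 1, first_physical))
    return extents


def physical_block(extents, logical):
    """Map a logical block number in the file to its block number on disk."""
    starts = [first_logical for first_logical, _length, _physical in extents]
    i = bisect_right(starts, logical) - 1
    if i < 0:
        raise ValueError("logical block is before the first extent")
    first_logical, length, first_physical = extents[i]
    if logical >= first_logical + length:
        raise ValueError("logical block is not covered by any extent")
    return first_physical + (logical - first_logical)


extents = parse_debugfs_extents(line)
print(extents)                         # [(0, 2048, 6352896), (2048, 1825, 6356992)]
print(sum(e[1] for e in extents))      # 3873
print(physical_block(extents, 3872))   # 6358816
```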

Here's the partial inode for the rosegarden file:

       00 01 02 03 04 05 06 07  08 09 0A 0B 0C 0D 0E 0F
--------------------------------------------------------------------------
000000 ED 81 00 00 B8 0D F2 00  7E E7 A7 5B 80 E7 A7 5B   ........~..[...[
000010 D0 73 66 52 00 00 00 00  00 00 01 00 08 79 00 00   .sfR.........y..
000020 00 00 08 00 01 00 00 00  0A F3 02 00 04 00 00 00   ................
000030 00 00 00 00 00 00 00 00  00 08 00 00 00 F0 60 00   ..............`.
000040 00 08 00 00 21 07 00 00  00 00 61 00 00 00 00 00   ....!.....a.....
000050 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ................
000060 00 00 00 00 EC 95 3C 70  00 00 00 00 00 00 00 00   ......<p........

The 2 bytes at offset 0x002A on the third line (0x0002) tell us that there are 2 extents in this file.

The 2 bytes at offset 0x002C on the third line (0x0004) tell us that this inode can hold at most 4 extents.

The file size (the 4 bytes at offset 0x0004 on the first line, 0x00F20DB8) is decimal 15,863,224, which is what the other commands told us.

The first extent starts at block 0x0060F000 (decimal 6,352,896) and covers 2,048 (0x0800) blocks. The second extent starts at block 0x00610000 (decimal 6,356,992). It is tempting to read the 4 bytes at the start of the second extent entry (offset 0x0040, 0x00000800) as another length of 2,048 blocks. Wait. What? That would be 4,096 blocks, but the file is only 3,873 blocks. What gives?

Those 4 bytes are not a length. They are the logical block number, within the file, at which the extent begins. The first extent holds logical blocks 0-2047, so the second one begins at logical block 2,048. (The first extent has the same field at offset 0x0034, where it is 0.) The length of the second extent is the 2 bytes at offset 0x0044 on the fifth line: 0x0721 is decimal 1,825, which matches the debugfs output.

Notice that the two extents are not adjacent on disk. The first ends at block 6,354,943 and the second begins at block 6,356,992, so there is a gap of 2,048 blocks between them that does not belong to this file. That is why the file needs two extents instead of one. ext4 works hard to avoid this kind of fragmentation (for example, by reserving blocks past the end of a file that is still growing), but the details are more complicated and beyond the scope of this introduction. Follow some of the links below, if you're interested.

Keep in mind that the filesystem (via the inode) knows how large the file is, so it isn't going to "accidentally" read the unused bytes at the end of the last block.

Some obvious questions:

What if the file needs more than 4 extents? (Very large files)
What if the file (even a not-so-large file) is badly fragmented?

Notes:

This was just a brief introduction to how many modern filesystems work. There are many more (gory) details that I did not cover because they were well beyond the scope of an introduction.
Many modern file systems use some variation of indexed allocation.
Many file systems, in addition to ext4, are using the idea of extents to improve things. Examples:

APFS (Apple), Btrfs (Oracle), NTFS (Microsoft), JFS (IBM), XFS (SGI)

Many filesystems do automatic defragmentation, so most files are not fragmented.

The filesystems don't necessarily run a "defrag" program but merely don't fragment files to begin with.

Hal Pomeranz has a good in-depth set of blog posts about the ext4 file system.
Those posts also answer the two "obvious questions" above.

References

Links

Popular Linux Filesystems Linux file system types explained, which one should you use.
debugfs This tool can also be run interactively.
List of file systems.
Comparison of file systems.
Ext4 Disk Layout - A lot of (gory) details about inodes and directories.
Ext4: The Next Generation of Ext2/3 Filesystem.
Ext4 Wiki
More Ext4 Information at Kernel Newbies.
The Unix (or Cygwin) stat command.
Nice introduction to BtrFS with lots of good information. Great place to start.
More btrfs information.
A good article about how to Run ZFS on Linux from IBM's website.
ZFS on Linux home page.
Why ZFS and Btrfs are better than ext4.
The Sleuth Kit™ (TSK) is a library and collection of command line tools that allow you to investigate disk images. Overview
Facebook hires several Btrfs developers, starts using it now. openSUSE 13.2 is also using it as the default filesystem.
ReFS - (Resilient File System) This is supposed to be the successor to Microsoft's NTFS.
ZFS article from Linux Journal.
Btrfs article from Linux Journal.
filefrag This is a Linux tool that shows the fragmentation of a given file. Try it on a really large file. (Most files will not be fragmented.)

Files

readblock.c This is used to read raw bytes from the disk blocks.
getblocks.c This will list all of the blocks used by a file.
dumpinode This is a bash script that displays the data in an inode for a file.
dumpit This is the hex viewer used.