The Virtual Filesystem
(VFS) is the subsystem of the kernel that implements the file and filesystem-related interfaces to user-space. All filesystems rely on the VFS not only to coexist but also to interoperate. This enables programs to use standard Unix system calls to read and write to different filesystems, even on different media.
The VFS enables system calls work between different filesystems and media. New filesystems and new Virtual of storage media can find their way into Linux, and programs need not be rewritten or even recompiled.
Ther kernel implements an abstraction layer around its low-level filesystem interface. This abstraction layer works by defining the basic conceptual interfaces and data structures that all filesystems support.
Unix has provided four basic filesystem-related abstractions:
file
: Afile
is an ordered string of bytes. The first byte marks the beginning of the file, and the last byte marks the end of the file. Each file is assigned a human-readable name for identification by both the system and the user.directory entries
: Files are organized in directories. Adirectory
is analogous to a folder and usually contrains related files. Directories can also contain other directories, called subdirectories. In this fashion, directories may be nested to form paths. Each component of a path is called adirectory entry
, all of them calleddentries
. A directory is a file to the VFS.inodes
: Information about a file calledfile metadata
and is stored in a separate data structure form the file, called theinode
.mount points
: filesystems are mounted at a specific mount point in a global hierarchy known as anamespace
. This enables all mounted filesystems to appear as entries in a single tree.
The VFS is object-oriented. A family of data structures represents the common file model. These data structures are skin to objects.
The four primary object types of the VFS are:
- The
superblock
object, which represents a specific mounted filesystem. - The
inode
object, which represents a specific file. - The
dentry
object, which represents a directory entry, which is a single component of a path. - The
file
object, which represents an open file are associated with a process.
An operations
object is contained within each of these primary objects. These objects describe the methods that the kernel invokes agains the primary objects:
- The
super_operations
object, which contains the methods that the kernel can invoke on a specific filesystem, such aswrite_inode()
andsync_fs()
. - The
inode_operations
object, which contains the methods that the kernel can invoke on a specific file, such ascreate()
andlink()
. - The
dentry_operations
object, which contains the methods that the kernel can invoke on a specific directory entry, such asd_compare()
andd_delete()
. - The
file_operations
object, which contains the methods that a process can invoke on an open file, such asread()
andwrite()
.
The operations objects are implemented as a structure of pointers to functions that operate on the parent object.
The superblock object is implemented by each filesystem and is used to store information describing that specific filesystem. This object usually corresponds to the filesystem superblock or the filesystem control block, which is stored in a special sector on disk. Filesystems that are not disk-based generate the superblock on-the-fly and store it in memory.
The superblock object is represented by struct super_block
and defined in <linux/fs.h>
:
struct super_block {
struct list_head s_list; /* list of all superblocks */
dev_t s_dev; /* identifier */
unsigned long s_blocksize; /* block size in bytes */
unsigned char s_blocksize_bits; /* block size in bits */
unsigned char s_dirt; /* dirty flag */
unsigned long long s_maxbytes; /* max file size */
struct file_system_type s_type; /* filesystem type */
struct super_operations s_op; /* superblock methods */
struct dquot_operations *dq_op; /* quota methods */
struct quotactl_ops *s_qcop; /* quota control methods */
struct export_operations *s_export_op; /* export methods */
unsigned long s_flags; /* mount flags */
unsigned long s_magic; /* filesystem’s magic number */
struct dentry *s_root; /* directory mount point */
struct rw_semaphore s_umount; /* unmount semaphore */
struct semaphore s_lock; /* superblock semaphore */
int s_count; /* superblock ref count */
int s_need_sync; /* not-yet-synced flag */
atomic_t s_active; /* active reference count */
void *s_security; /* security module */
struct xattr_handler **s_xattr; /* extended attribute handlers */
struct list_head s_inodes; /* list of inodes */
struct list_head s_dirty; /* list of dirty inodes */
struct list_head s_io; /* list of writebacks */
struct list_head s_more_io; /* list of more writeback */
struct hlist_head s_anon; /* anonymous dentries */
struct list_head s_files; /* list of assigned files */
struct list_head s_dentry_lru; /* list of unused dentries */
int s_nr_dentry_unused; /* number of dentries on list */
struct block_device *s_bdev; /* associated block device */
struct mtd_info *s_mtd; /* memory disk information */
struct list_head s_instances; /* instances of this fs */
struct quota_info s_dquot; /* quota-specific options */
int s_frozen; /* frozen status */
wait_queue_head_t s_wait_unfrozen; /* wait queue on freeze */
char s_id[32]; /* text name */
void *s_fs_info; /* filesystem-specific info */
fmode_t s_mode; /* mount permissions */
struct semaphore s_vfs_rename_sem; /* rename semaphore */
u32 s_time_gran; /* granularity of timestamps */
char *s_subtype; /* subtype name */
char *s_options; /* saved mount options */
};
The code for creating, managing, and destroying superblock objects lives in fs/super.c
. A superblock object is created and initialized via the alloc_super()
function. When mounted, a filesystem invokes this function, reads its superblock off of the disk, and fills in its superblock object.
The most important item in the superblock object is s_op
, which is a pointer to the superblock operations table. The superblock operations table is represented by struct super_operations
and is defined in <linux/fs.h>
, which looks like this:
struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
void (*destroy_inode)(struct inode *);
void (*dirty_inode) (struct inode *);
int (*write_inode) (struct inode *, int);
void (*drop_inode) (struct inode *);
void (*delete_inode) (struct inode *);
void (*put_super) (struct super_block *);
void (*write_super) (struct super_block *);
int (*sync_fs)(struct super_block *sb, int wait);
int (*freeze_fs) (struct super_block *);
int (*unfreeze_fs) (struct super_block *);
int (*statfs) (struct dentry *, struct kstatfs *);
int (*remount_fs) (struct super_block *, int *, char *);
void (*clear_inode) (struct inode *);
void (*umount_begin) (struct super_block *);
int (*show_options)(struct seq_file *, struct vfsmount *);
int (*show_stats)(struct seq_file *, struct vfsmount *);
ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
};
Each item in this structure is a pointer to a function that operates on a superblock object. The superblock operations perform low-level operations on the filesystem and its inodes.
When a filesystem needs to perform an operation on its superblock, it follows the pointers from its superblock object to the desired method. For example:
sb->s_op->write_super(sb);
The following are some of the superblock operations that are specified by super_operations
:
struct inode * alloc_inode(struct super_block *sb)
- Creates and initializes a new inode object under the given superblock.
void destroy_inode(struct inode *inode)
- Deallocates the given inode.
void dirty_inode(struct inode *inode)
- Invoked by the VFS when an inode is dirtied (modified). Journaling filesystems such as ext3 and ext4 use this function to perform journal updates.
void write_inode(struct inode *inode, int wait)
- Writes the given inode to disk. The
wait
parameter specifies whether the operation should be synchronous.
- Writes the given inode to disk. The
void drop_inode(struct inode *inode)
- Called by the VFS when the last reference to an inode is dropped. Normal Unix filesystems do not define this function, in which case the VFS simply deletes the inode.
void delete_inode(struct inode *inode)
- Deletes the given inode from the disk.
void put_super(struct super_block *sb)
- Called by the VFS on unmount to release the given superblock object.The caller must hold the
s_lock
lock.
- Called by the VFS on unmount to release the given superblock object.The caller must hold the
void write_super(struct super_block *sb)
- Updates the on-disk superblock with the specified superblock.The VFS uses this function to synchronize a modified in-memory superblock with the disk. The caller must hold the
s_lock
lock.
- Updates the on-disk superblock with the specified superblock.The VFS uses this function to synchronize a modified in-memory superblock with the disk. The caller must hold the
int sync_fs(struct super_block *sb, int wait)
- Synchronizes filesystem metadata with the on-disk filesystem. The
wait
parameter specifies whether the operation is synchronous.
- Synchronizes filesystem metadata with the on-disk filesystem. The
void write_super_lockfs(struct super_block *sb)
- Prevents changes to the filesystem, and then updates the on-disk superblock with the specified superblock. It is currently used by LVM (the Logical Volume Manager).
void unlockfs(struct super_block *sb)
- Unlocks the filesystem against changes as done by
write_super_lockfs()
.
- Unlocks the filesystem against changes as done by
int statfs(struct super_block *sb, struct statfs *statfs)
- Called by the VFS to obtain filesystem statistics. The statistics related to the given filesystem are placed in
statfs
.
- Called by the VFS to obtain filesystem statistics. The statistics related to the given filesystem are placed in
int remount_fs(struct super_block *sb, int *flags, char *data)
- Called by the VFS when the filesystem is remounted with new mount options.The caller must hold the
s_lock
lock.
- Called by the VFS when the filesystem is remounted with new mount options.The caller must hold the
void clear_inode(struct inode *inode)
- Called by the VFS to release the inode and clear any pages containing related data.
void umount_begin(struct super_block *sb)
- Called by the VFS to interrupt a mount operation. It is used by network filesystems, such as NFS.
The inode Object represents all the information needed by the kernel to manipulate a file or directory. For Unix-style filesystems, this information is simply read from the on-disk inode.
The inode object is represented by struct inode
and is defined in <linux/fs.h>
.
struct inode {
struct hlist_node i_hash; /* hash list */
struct list_head i_list; /* list of inodes */
struct list_head i_sb_list; /* list of superblocks */
struct list_head i_dentry; /* list of dentries */
unsigned long i_ino; /* inode number */
atomic_t i_count; /* reference counter */
unsigned int i_nlink; /* number of hard links */
uid_t i_uid; /* user id of owner */
gid_t i_gid; /* group id of owner */
kdev_t i_rdev; /* real device node */
u64 i_version /* versioning number */
loff_t i_size /* file size in bytes */
seqcount_t i_size_seqcount; /* serializer for i_size */
struct timespec i_atime; /* last access time */
struct timespec i_mtime; /* last modify time */
struct timespec i_ctime; /* last change time */
unsigned int i_blkbits; /* block size in bits */
blkcnt_t i_blocks; /* block size in bits */
unsigned short i_bytes; /* bytes consumed */
umode_t i_mode; /* access permissions */
spinlock_t i_lock; /* spinlock */
struct rw_semaphore i_alloc_sem; /* nests inside of i_sem */
struct semaphore i_sem; /* inode semaphore */
struct inode_operations *i_op; /* inode ops table */
struct file_operations *i_fop; /* default inode ops */
struct super_block *i_sb; /* associated superblock */
struct file_lock *i_flock; /* file lock list */
struct address_space *i_mapping; /* associated mapping */
struct address_space i_data; /* mapping for device */
struct dquot *i_dquot[MAXQUOTAS]; /* disk quotas for inode */
struct list_head i_devices; /* list of block devices */
union {
struct pipe_inode_info *i_pipe; /* pipe information */
struct block_device *i_bdev; /* block device driver */
struct cdev *i_cdev; /* character device driver */
};
unsigned long i_dnotify_mask; /* directory notify mask */
struct dnotify_struct *i_dnotify; /* dnotify */
struct list_head inotify_watches; /* inotify watches */
struct mutex inotify_mutex; /* protects inotify_watches */
unsigned long i_state; /* state flags */
unsigned long dirtied_when; /* first dirtying time */
unsigned int i_flags; /* filesystem flags */
atomic_t i_writecount; /* count of writers */
void *i_security; /* security module */
void *i_private; /* fs private pointer */
};
An inode represents each file on a filesystem, but the inode object is constructed in memory only as file are accessed.
Inode Operations are invoked via:
i->i_op->truncate(i)
i
is a reference to a particular inode, the truncate()
operation defined by the filesystem on which i
exists is called on the given inode.
The inode_operations
structure is defined in <linux/fs.h>
:
struct inode_operations {
int (*create) (struct inode *, struct dentry *, int, struct nameidata *);
struct dentry * (*lookup) (struct inode *, struct dentry *, struct nameidata *);
int (*link) (struct dentry *, struct inode *, struct dentry *);
int (*unlink) (struct inode *, struct dentry *);
int (*symlink) (struct inode *, struct dentry *, const char *);
int (*mkdir) (struct inode *, struct dentry *, int);
int (*rmdir) (struct inode *, struct dentry *);
int (*mknod) (struct inode *, struct dentry *, int, dev_t);
int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *, int);
void * (*follow_link) (struct dentry *, struct nameidata *);
void (*put_link) (struct dentry *, struct nameidata *, void *);
void (*truncate) (struct inode *);
int (*permission) (struct inode *, int);
int (*setattr) (struct dentry *, struct iattr *);
int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
int (*setxattr) (struct dentry *, const char *, const void *, size_t, int);
ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t);
ssize_t (*listxattr) (struct dentry *, const char *);
int (*removexattr) (struct dentry *, const char *);
void (*truncate_range) (struct inode *, loff_t, loff_t);
long (*fallocate) (struct inode *inode, int mode, loff_t offset, loff_t len);
int (*fiemap) (struct inode *, struct fiemap_extent_info *, u64 start, u64 len);
};
The following interfaces constitute the various functions that the VFS may perform, or ask a specific filesystem to perform, on a given inode:
int create(struct inode *dir, struct dentry *dentry, int mode)
- TheVFS calls this function from the
creat()
andopen()
system calls to create a new inode associated with the given dentry object with the specified initial access mode.
- TheVFS calls this function from the
struct dentry * lookup(struct inode *dir, struct dentry *dentry)
- This function searches a directory for an inode corresponding to a filename specified in the given dentry.
int link(struct dentry *old_dentry, struct inode *dir, struct dentry *dentry)
- Invoked by the link() system call to create a hard link of the file
old_dentry
in the directory dir with the new filenamedentry
.
- Invoked by the link() system call to create a hard link of the file
int unlink(struct inode *dir, struct dentry *dentry)
- Called from the
unlink()
system call to remove the inode specified by the directory entry dentry from the directorydir
.
- Called from the
int symlink(struct inode *dir, struct dentry *dentry, const char *symname)
- Called from the
symlink()
system call to create a symbolic link named symname to the file represented by dentry in the directorydir
.
- Called from the
int mkdir(struct inode *dir, struct dentry *dentry, int mode)
- Called from the
mkdir()
system call to create a new directory with the given initial mode.
- Called from the
int rmdir(struct inode *dir, struct dentry *dentry)
- Called by the
rmdir()
system call to remove the directory referenced by dentry from the directorydir
.
- Called by the
int mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t rdev)
- Called by the
mknod()
system call to create a special file (device file, named pipe, or socket). The file is referenced by the device rdev and the directory entry dentry in the directorydir
. The initial permissions are given viamode
.
- Called by the
int rename(struct inode *old_dir, struct dentry *old_dentry, struct inode *new_dir, struct dentry *new_dentry)
- Called by the VFS to move the file specified by
old_dentry
from theold_dir
directory to the directorynew_dir
, with the filename specified bynew_dentry
.
- Called by the VFS to move the file specified by
int readlink(struct dentry *dentry, char *buffer, int buflen)
- Called by the
readlink()
system call to copy at mostbuflen
bytes of the full path associated with the symbolic link specified bydentry
into the specified buffer.
- Called by the
int follow_link(struct dentry *dentry, struct nameidata *nd)
- Called by the VFS to translate a symbolic link to the inode to which it points. The link pointed at by
dentry
is translated, and the result is stored in thenameidata
structure pointed at bynd
.
- Called by the VFS to translate a symbolic link to the inode to which it points. The link pointed at by
int put_link(struct dentry *dentry, struct nameidata *nd)
- Called by the VFS to clean up after a call to
follow_link()
.
- Called by the VFS to clean up after a call to
void truncate(struct inode *inode)
- Called by the VFS to modify the size of the given file. Before invocation, the inode’s
i_size
field must be set to the desired new size.
- Called by the VFS to modify the size of the given file. Before invocation, the inode’s
int permission(struct inode *inode, int mask)
- Checks whether the specified access mode is allowed for the file referenced by
inode
. This function returns zero if the access is allowed and a negative error code otherwise. Most filesystems set this field toNULL
and use the generic VFS method, which simply compares the mode bits in the inode’s objects to the given mask.
- Checks whether the specified access mode is allowed for the file referenced by
int setattr(struct dentry *dentry, struct iattr *attr)
- Called from
notify_change()
to notify a “change event” after an inode has been modified.
- Called from
int getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
- Invoked by the VFS upon noticing that an inode needs to be refreshed from disk. Extended attributes allow the association of key/values pairs with files.
int setxattr(struct dentry *dentry, const char *name, const void *value, size_t size, int flags)
- Used by the VFS to set the extended attribute name to the value value on the file referenced by dentry .
ssize_t getxattr(struct dentry *dentry, const char *name, void *value, size_t size)
- Used by the VFS to copy into value the value of the extended attribute
name
for the specified file.
- Used by the VFS to copy into value the value of the extended attribute
ssize_t listxattr(struct dentry *dentry, char *list, size_t size)
- Copies the list of all attributes for the specified file into the buffer
list
.
- Copies the list of all attributes for the specified file into the buffer
int removexattr(struct dentry *dentry, const char *name)
- Removes the given attribute from the given file.
Dentry objects are all components in a path, including files.
Dentry objects are represented by struct dentry
and defined in <linux/dcache.h>
.
struct dentry {
atomic_t d_count; /* usage count */
unsigned int d_flags; /* dentry flags */
spinlock_t d_lock; /* per-dentry lock */
int d_mounted; /* is this a mount point? */
struct inode *d_inode; /* associated inode */
struct hlist_node d_hash; /* list of hash table entries */
struct dentry *d_parent; /* dentry object of parent */
struct qstr d_name; /* dentry name */
struct list_head d_lru; /* unused list */
union {
struct list_head d_child; /* list of dentries within */
struct rcu_head d_rcu; /* RCU locking */
} d_u;
struct list_head d_subdirs; /* subdirectories */
struct list_head d_alias; /* list of alias inodes */
unsigned long d_time; /* revalidate time */
struct dentry_operations *d_op; /* dentry operations table */
struct super_block *d_sb; /* superblock of file */
void *d_fsdata; /* filesystem-specific data */
unsigned char d_iname[DNAME_INLINE_LEN_MIN]; /* short name */
};
Unlike the previous two objects, the dentry object does not correspond to any sort of on-disk data structure. The VFS creates it on-the-fly from a string representation of a path name. Because the dentry object is not physically stored on the disk, no flag in struct dentry
specifies whether the object is modified.
A valid dentry object can be in one of three states:
used
: A used dentry corresponds to a valid inode and indicates that there are one or more users of the object.unused
: An unused dentry corresponds to a valid inode, but the VFS is not currently using the dentry object.negative
: A negative dentry is not associated with a valid inode because either the inode was deleted or the path name was resolved quickly.
The kernel caches dentry objects in the dentry cache or, simply, the dcache
. The dentry cache consists of three parts:
- List of "used" dentries linked off their associated inode via the
i_dentry
field of the inode object. - A doubly linked "least recently used" list of unused and negative dentry objects.
- A hash table and hashing function used to quickly resolve a given path into the associated dentry object.
The hash table is represented by the dentry_hashtable
array. Each element is a pointer to a list of dentries that hash to the same value. The size of this array depends on the amount of physical RAM in the system. The actual hash value is determined by d_hash)()
. Hash table lookup is performed via d_lookup()
.
The dentry_operations structure specifies the methods that the VFS invokes on directory entries on a given filesystem. The dentry_operations
structure is defined in <linux/dcache.h>
:
struct dentry_operations {
int (*d_revalidate) (struct dentry *, struct nameidata *);
int (*d_hash) (struct dentry *, struct qstr *);
int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
int (*d_delete) (struct dentry *);
void (*d_release) (struct dentry *);
void (*d_iput) (struct dentry *, struct inode *);
char *(*d_dname) (struct dentry *, char *, int);
};
The methods are as follows:
int d_revalidate(struct dentry *dentry, struct nameidata *)
- Determines whether the given dentry object is valid. The VFS calls this function whenever it is preparing to use a dentry from the dcache.
int d_hash(struct dentry *dentry, struct qstr *name)
- Creates a hash value from the given dentry. The VFS calls this function whenever it adds a dentry to the hash table.
int d_compare(struct dentry *dentry, struct qstr *name1, struct qstr *name2)
- Called by the VFS to compare two filenames,
name1
andname2
.
- Called by the VFS to compare two filenames,
int d_delete(struct dentry *dentry)
- Called by the VFS when the specified dentry object's
d_count
reaches zero. This function requires thedcache_lock
and the dentry'sd_lock
.
- Called by the VFS when the specified dentry object's
void d_release(struct dentry *dentry)
- Called by the VFS when the specified dentry is going to be freed.
void d_iput(struct dentry *dentry, struct inode *inode)
- Called by the VFS when a dentry object loses its associated inode.
The file object is used to represent a file opened by a process. The file object is the in-memory representation of an open file. The object is created in response to the open()
system call and destroyed in response to the close()
system call. Because multiple processes can open and manipulate a file at the same time, there can be multiple file objects in existence for the same file. The file object merely represents a process's view of an open file. The object points back to the dentry that actually represents the open file. The inode and dentry objects are unique.
The file object is represented by struct file
and is defined in <linux/sf.h>
:
struct file {
union {
struct list_head fu_list; /* list of file objects */
struct rcu_head fu_rcuhead; /* RCU list after freeing */
} f_u;
struct path f_path; /* contains the dentry */
struct file_operations *f_op; /* file operations table */
spinlock_t f_lock; /* per-file struct lock */
atomic_t f_count; /* file object's usage count */
unsigned int f_flags; /* flags specified on open */
mode_t f_mode; /* file access mode */
loff_t f_pos; /* file offset (file pointer) */
struct fown_struct f_owner; /* owner data for signals */
const struct cred *f_cred; /* file credentials */
struct file_ra_state f_ra; /* read-ahead state */
u64 f_version; /* version number */
void *f_security; /* security module */
void *private_data; /* tty driver hook */
struct list_head f_ep_links; /* list of epoll links */
spinlock_t f_ep_lock; /* epoll lock */
struct address_space *f_mapping; /* page cache mapping */
unsigned long f_mnt_write_state; /* debugging state */
};
The file object does not actually correspond to any on-disk data. Therefore, no flag in the object represents whether the object is dirty and needs to be written back to disk. The file object does point to its associated dentry object via the f_dentry
pointer. The dentry in turn points to the associated inode, which reflects whether the file itself is dirty.
The file object methods are specified in file_operations
and defined in <linux/fs.h>
:
struct file_operations {
struct module *owner;
loff_t (*llseek) (struct file *, loff_t, int);
ssize_t (*read) (struct file *, char __user *, size_t, loff_t *);
ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *);
ssize_t (*aio_read) (struct kiocb *, const struct iovec *,
unsigned long, loff_t);
ssize_t (*aio_write) (struct kiocb *, const struct iovec *,
unsigned long, loff_t);
int (*readdir) (struct file *, void *, filldir_t);
unsigned int (*poll) (struct file *, struct poll_table_struct *);
int (*ioctl) (struct inode *, struct file *, unsigned int,
unsigned long);
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
int (*fsync) (struct file *, struct dentry *, int datasync);
int (*aio_fsync) (struct kiocb *, int datasync);
int (*fasync) (int, struct file *, int);
int (*lock) (struct file *, int, struct file_lock *);
ssize_t (*sendpage) (struct file *, struct page *,
int, size_t, loff_t *, int);
unsigned long (*get_unmapped_area) (struct file *,
unsigned long,
unsigned long,
unsigned long,
unsigned long);
int (*check_flags) (int);
int (*flock) (struct file *, int, struct file_lock *);
ssize_t (*splice_write) (struct pipe_inode_info *,
struct file *,
loff_t *,
size_t,
unsigned int);
ssize_t (*splice_read) (struct file *,
loff_t *,
struct pipe_inode_info *,
size_t,
unsigned int);
int (*setlease) (struct file *, long, struct file_lock **);
};
loff_t llseek(struct file *file, loff_t offset, int origin)
- Updates the file pointer to the given offset. It is called via the
llseek()
system call.
- Updates the file pointer to the given offset. It is called via the
ssize_t read(struct file *file, char *buf, size_t count, loff_t *offset)
- Reads
count
bytes from the given file at positionoffset
intobuf
. The file pointer is then updated. This function is called by theread()
system call.
- Reads
ssize_t aio_read(struct kiocb *iocb, char *buf, size_t count, loff_t offset)
- Begins an asynchronous read of
count
bytes intobuf
of the file described iniocb
. This function is called by theaio_read()
system call.
- Begins an asynchronous read of
ssize_t write(struct file *file, const char *buf, size_t count, loff_t *offset)
- Writes
count
bytes from buf into the given file at positionoffset
. The file pointer is then updated. This function is called by thewrite()
system call.
- Writes
ssize_t aio_write(struct kiocb *iocb, const char *buf, size_t count, loff_t offset)
- Begins an asynchronous write of
count
bytes intobuf
of the file described iniocb
. This function is called by theaio_write()
system call.
- Begins an asynchronous write of
int readdir(struct file *file, void *dirent, filldir_t filldir)
- Returns the next directory in a directory listing. This function is called by the
readdir()
system call.
- Returns the next directory in a directory listing. This function is called by the
unsigned int poll(struct file *file, struct poll_table_struct *poll_table)
- Sleeps, waiting for activity on the given file. It is called by the
poll()
system call.
- Sleeps, waiting for activity on the given file. It is called by the
int ioctl(struct inode *inode, struct file *file, unsigned int cmd, unsigned long arg)
- Sends a command and argument pair to a device. It is used when the file is an open device node. This function is called from the
ioctl()
system call. Callers must hold the BKL.
- Sends a command and argument pair to a device. It is used when the file is an open device node. This function is called from the
int unlocked_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
- Implements the same functionality as
ioctl()
but without needing to hold the BKL. The VFS callsunlocked_ioctl()
if it exists in lieu ofioctl()
when user-space invokes theioctl()
system call. Thus filesystems need implement only one, preferablyunlocked_ioctl()
.
- Implements the same functionality as
int compat_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
- Implements a portable variant of
ioctl()
for use on 64-bit systems by 32-bit applications. This function is designed to be 32-bit safe even on 64-bit architectures, performing any necessary size conversions. New drivers should design their ioctl commands such that all are portable, and thus enablecompat_ioctl()
andunlocked_ioctl()
to point to the same function. Likeunlocked_ioctl()
,compat_ioctl()
does not hold the BKL.
- Implements a portable variant of
int mmap(struct file *file, struct vm_area_struct *vma)
- Memory maps the given file onto the given address space and is called by the
mmap()
system call.
- Memory maps the given file onto the given address space and is called by the
int open(struct inode *inode, struct file *file)
- Creates a new file object and links it to the corresponding inode object. It is called by the
open()
system call.
- Creates a new file object and links it to the corresponding inode object. It is called by the
int flush(struct file *file)
- Called by the VFS whenever the reference count of an open file decreases. Its purpose is filesystem-dependent.
int release(struct inode *inode, struct file *file)
- Called by the VFS when the last remaining reference to the file is destroyed.
int fsync(struct file *file, struct dentry *dentry, int datasync)
- Called by the
fsync()
system call to write all cached data for the file to disk.
- Called by the
int aio_fsync(struct kiocb *iocb, int datasync)
- Called by the aio_fsync() system call to write all cached data for the file associated with
iocb
to disk.
- Called by the aio_fsync() system call to write all cached data for the file associated with
int fasync(int fd, struct file *file, int on)
- Enables or disables signal notification of asynchronous I/O.
int lock(struct file *file, int cmd, struct file_lock *lock)
- Manipulates a file lock on the given file.
ssize_t readv(struct file *file, const struct iovec *vector, unsigned long count, loff_t *offset)
- Called by the
readv()
system call to read from the given file and put the results into the count buffers described byvector
. The file offset is then incremented.
- Called by the
ssize_t writev(struct file *file, const struct iovec *vector, unsigned long count, loff_t *offset)
- Called by the
writev()
system call to write from the count buffers described byvector
into the file specified byfile
. The file offset is then incremented.
- Called by the
ssize_t sendfile(struct file *file, loff_t *offset, size_t size, read_actor_t actor, void *target)
- Called by the
sendfile()
system call to copy data from one file to another. It performs the copy entirely in the kernel and avoids an extraneous copy to user-space.
- Called by the
ssize_t sendpage(struct file *file, struct page *page, int offset, size_t size, loff_t *pos, int more)
- Used to send data from one file to another.
unsigned long get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, unsigned long offset, unsigned long flags)
- Gets unused address space to map the given file.
int check_flags(int flags)
- Used to check the validity of the flags passed to the
fcntl()
system call when the SETFL command is given.
- Used to check the validity of the flags passed to the
int flock(struct file *filp, int cmd, struct file_lock *fl)
- Used to implement the
flock()
system call, which provides advisory locking.
- Used to implement the
The kernel uses other standard data structures to manage data related to filesystems. The first object is used to describe a specific variant of a filesystem. The second data structure describes a mounted instance of a filesystem.
The file_system_type
structure, defined in <linux/fs.h>
:
struct file_system_type {
const char *name; /* filesystem's name */
int fs_flags; /* filesystem type flags */
/* the following is used to read the superblock off the disk */
struct super_block *(*get_sb) (struct file_system_type *, int,
char *, void *);
/* the following is used to terminate access to the superblock */
void (*kill_sb) (struct super_block *);
struct module *owner; /* module owning the filesystem */
struct file_system_type *next; /* next file_system_type in list */
struct list_head fs_supers; /* list of superblock objects */
/* the remaining fields are used for runtime lock validation */
struct lock_class_key s_lock_key;
struct lock_class_key s_umount_key;
struct lock_class_key i_lock_key;
struct lock_class_key i_mutex__key;
struct lock_class_key i_mutex__dir_key;
struct lock_class_key i_alloc_sem_key;
};
The get_sb()
function reads the superblock from the disk and populates the superblock object when the filesystem is loaded. The remaining functions describe the filesystem's properties. There is only one file_system_type
per filesystem, regardless of how many instances of the filesystem are mounted on the system, or whether the filesystem is even mounted at all.
When the filesystem is actually mounted, at which point the vfsmount
structure is created. This structure represents a specific instance of a filesystem. The vfsmount
structure is defined in <linux/mount.h>
:
struct vfsmount {
struct list_head mnt_hash; /* hash table list */
struct vfsmount *mnt_parent; /* parent filesystem */
struct dentry *mnt_mountpoint; /* dentry of this mount point */
struct dentry *mnt_root; /* dentry of root of this fs */
struct super_block *mnt_sb; /* superblock of this filesystem */
struct list_head mnt_mounts; /* list of children */
struct list_head mnt_child; /* list of children */
int mnt_flags; /* mount flags */
char *mnt_devname; /* device file name */
struct list_head mnt_list; /* list of descriptors */
struct list_head mnt_expire; /* entry in expiry list */
struct list_head mnt_share; /* entry in shared mounts list */
struct list_head mnt_slave_list; /* list of slave mounts */
struct list_head mnt_slave; /* entry in slave list */
struct vfsmount *mnt_master; /* slave’s master */
struct mnt_namespace *mnt_namespace; /* associated namespace */
int mnt_id; /* mount identifier */
int mnt_group_id; /* peer group identifier */
atomic_t mnt_count; /* usage count */
int mnt_expiry_mark; /* is marked for expiration */
int mnt_pinned; /* pinned count */
int mnt_ghosts; /* ghosts count */
atomic_t __mnt_writers; /* writers count */
};
The complicated part of maintaining the list of all mount points is the relation between the filesystem and all the other mount points. The various linked lists in vfsmount
keep track of this information. The vfsmount
structure also stores the flags, if any, specified on mount in the mnt_flags
field.
Flag | Description |
---|---|
MNT_NOSUID | Forbids setuid and setgid flags on binaries on this filesystem |
MNT_NODEV | Forbids access to device files on this filesystem |
MNT_NOEXEC | Forbids execution of binaries on this filesystem |
Each process on the system has its own list of open files, root filesystem, current working directory, mount points, and so on. Three data structures tie together the VFS layer and the processes on the system: files_struct
, fs_struct
, and namespace
.
The files_struct
is defined in <linux/fdtable.h>
. This table’s address is pointed to by the files entry in the processor descriptor. All per-process information about open files and file descriptors is contained therein:
struct files_struct {
atomic_t count; /* usage count */
struct fdtable *fdt; /* pointer to other fd table */
struct fdtable fdtab; /* base fd table */
spinlock_t file_lock; /* per-file lock */
int next_fd; /* cache of next available fd */
struct embedded_fd_set close_on_exec_init; /* list of close-on-exec fds */
struct embedded_fd_set open_fds_init /* list of open fds */
struct file *fd_array[NR_OPEN_DEFAULT]; /* base files array */
};
The array fd_array
points to the list of open file objects. If a process opens more than 64 file objects, the kernel allocates a new array and points the fdt
pointer at it.
The second process-ralated structure is fs_struct
, which contains filesystem information related to a process and is pointed at by the fs
field in the processs descriptor. The structure is defined in <linux/fs_struct.h>
.
struct fs_struct {
int users; /* user count */
rwlock_t lock; /* per-structure lock */
int umask; /* umask */
int in_exec; /* currently executing a file */
struct path root; /* root directory */
struct path pwd; /* current working directory */
};
This structure holds the current working directory and root directory of the current process.
The third and final structure is the namespace
structure, which is defined in <linux/mnt_namespace.h>
and pointed at by the mnt_namespace
field in the process descriptor.
struct mnt_namespace {
atomic_t list; /* usage count */
struct vfsmount *root; /* root directory */
struct list_head list; /* list of mount points */
wait_queue_head_t poll; /* polling waitqueue */
int event; /* event count */
};
The list
member specifies a doubly linked list of the mounted filesystems that make up the namespace.