Google File System

Mar 11, 2023

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

According to the paper, Google designed the Google File System (GFS) to meet the rapidly growing demands of its data processing workloads. GFS shares the goals of earlier distributed file systems: performance, scalability, reliability, and availability. Its design, however, starts from a different set of assumptions. The system is built from many inexpensive commodity components that often fail. It stores large files. Workloads consist mainly of large streaming reads and small random reads. Many clients append data to the same file concurrently. And high sustained bandwidth matters more than low latency.

Unlike many other file systems, a GFS cluster consists of a single master and multiple chunkservers, and it can be accessed by multiple clients simultaneously. Files are divided into fixed-size chunks. Each chunk is identified by a chunk handle assigned by the master and is stored on chunkservers. GFS also has operations like snapshot, which creates a copy of a file or directory tree, and record append, which lets multiple clients append records to the same file concurrently. All the metadata about the data is stored and controlled by the master. The authors note that having a single master simplified their design. The master does not have to be involved in reading and writing file data, though: a client contacts the master only to learn which chunkservers to contact, and then exchanges data with those chunkservers directly.
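The read path described above can be sketched roughly as follows. This is a toy model, not the real GFS protocol: the class names `Master` and `ChunkServer`, the `locate` and `read` methods, the server names, and the tiny 8-byte chunk size are all assumptions made for illustration, and the sketch only handles reads that stay inside a single chunk.

```python
CHUNK_SIZE = 8  # hypothetical tiny chunk size so the demo is easy to follow


class ChunkServer:
    """Holds chunk data, keyed by chunk handle."""

    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]


class Master:
    """Holds metadata only: which handle and which servers back each chunk."""

    def __init__(self):
        self.table = {}  # (filename, chunk index) -> (handle, [server names])

    def locate(self, filename, chunk_index):
        return self.table[(filename, chunk_index)]


def read(master, servers, filename, offset, length):
    """Read bytes that lie within one chunk (a simplifying assumption)."""
    index = offset // CHUNK_SIZE
    # Step 1: ask the master where the chunk lives (metadata only).
    handle, replicas = master.locate(filename, index)
    # Step 2: fetch the bytes directly from a chunkserver, bypassing the master.
    server = servers[replicas[0]]
    return server.read(handle, offset % CHUNK_SIZE, length)


servers = {"cs1": ChunkServer(), "cs2": ChunkServer()}
servers["cs1"].chunks["h0"] = b"hello, w"
servers["cs2"].chunks["h1"] = b"orld!"

master = Master()
master.table[("/demo", 0)] = ("h0", ["cs1"])
master.table[("/demo", 1)] = ("h1", ["cs2"])

print(read(master, servers, "/demo", 0, 5))  # b'hello'
print(read(master, servers, "/demo", 8, 5))  # b'orld!'
```

Note that the master never touches file contents in this sketch. It only answers the lookup, which is the reason a single master does not become a data bottleneck.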

The chunk size is large (64 MB in the paper). Because applications mostly read and write large files sequentially, a large chunk size reduces the client's interaction with the master: one initial request for chunk location information covers many later operations on the same chunk. A client can also keep a persistent TCP connection to the chunkserver over an extended period, which reduces network overhead, and the master has less metadata to store. One drawback is that a small file may consist of only a few chunks, so the chunkservers holding them can become hot spots when a large number of clients access the same file. According to the paper, this has rarely been a problem in practice.
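To make the effect of chunk size concrete, here is a small back-of-the-envelope calculation. It assumes one master lookup per distinct chunk touched and ignores caching and batching; the helper names `chunk_index` and `chunks_touched` and the 64 KB comparison size are made up for this example.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size reported in the GFS paper


def chunk_index(offset, chunk_size=CHUNK_SIZE):
    """Translate a byte offset within a file into a chunk index."""
    return offset // chunk_size


def chunks_touched(start, length, chunk_size=CHUNK_SIZE):
    """Number of distinct chunks covered by a read of `length` bytes at `start`."""
    if length <= 0:
        return 0
    first = chunk_index(start, chunk_size)
    last = chunk_index(start + length - 1, chunk_size)
    return last - first + 1


one_gib = 1024 ** 3

# Sequentially reading 1 GiB with 64 MB chunks touches 16 chunks.
print(chunks_touched(0, one_gib))  # 16

# The same read with a hypothetical 64 KB block size touches 16384 blocks.
print(chunks_touched(0, one_gib, chunk_size=64 * 1024))  # 16384
```

Under these assumptions, the larger chunk size cuts the number of location lookups for the same read by a factor of 1024.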

The GFS namespace is a hierarchical directory structure used to organize the files in the file system. It is managed by the master and is designed to be scalable and efficient. The namespace is kept in the master's memory. Changes to it are recorded in an operation log, and the master's state is periodically checkpointed to disk. For fault tolerance, the operation log and checkpoints are replicated on multiple machines, and shadow masters provide read-only access to the file system when the primary master is down.
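The idea of an in-memory namespace that survives restarts can be sketched as below. The paper describes the master recording metadata changes in an operation log and recovering by loading the latest checkpoint and replaying the log; everything else here is an assumption for illustration, including the `Namespace` class, its method names, the JSON checkpoint format, and keeping the log in a Python list instead of on disk.

```python
import json
import os
import tempfile


class Namespace:
    """Toy in-memory namespace: full pathname -> list of chunk handles."""

    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.files = {}
        self.log = []  # mutations made since the last checkpoint

    def create(self, path):
        self.files[path] = []
        self.log.append(["create", path])

    def add_chunk(self, path, handle):
        self.files[path].append(handle)
        self.log.append(["add_chunk", path, handle])

    def checkpoint(self):
        """Write the whole table to disk, then start a fresh log."""
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.files, f)
        self.log.clear()

    @classmethod
    def recover(cls, checkpoint_path, log):
        """Rebuild state: load the last checkpoint, then replay the log."""
        ns = cls(checkpoint_path)
        if os.path.exists(checkpoint_path):
            with open(checkpoint_path) as f:
                ns.files = json.load(f)
        for op, *args in log:
            if op == "create":
                ns.files[args[0]] = []
            elif op == "add_chunk":
                ns.files[args[0]].append(args[1])
        return ns


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
ns = Namespace(path)
ns.create("/logs/web.0")
ns.add_chunk("/logs/web.0", "h1")
ns.checkpoint()
ns.add_chunk("/logs/web.0", "h2")  # happens after the checkpoint

# Simulate a restart: only the checkpoint file and the log survive.
recovered = Namespace.recover(path, list(ns.log))
print(recovered.files)  # {'/logs/web.0': ['h1', 'h2']}
```

Checkpointing keeps the log short, so recovery only has to replay the mutations made since the last checkpoint.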

The GFS client is the code applications use to access the files in the file system. It is designed to be simple and efficient, and it provides a familiar file system interface to its users. The client caches file metadata, such as the location of the chunks, to reduce the number of requests to the master. The client also supports append and record append operations, which serve applications that require high-speed data ingestion.
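A minimal sketch of client-side caching of chunk locations might look like this. The `LocationCache` class, the 60-second expiry, and the `fake_master_lookup` stand-in are all hypothetical; the real client library's interface is not described in this post.

```python
import time


class LocationCache:
    """Caches chunk locations for a limited time to avoid repeated master lookups."""

    def __init__(self, lookup, ttl_seconds=60.0, clock=time.monotonic):
        self.lookup = lookup      # function that asks the master
        self.ttl = ttl_seconds    # how long a cached entry stays valid
        self.clock = clock        # injectable so expiry can be tested
        self.entries = {}         # (filename, chunk index) -> (expiry, locations)

    def get(self, filename, chunk_index):
        key = (filename, chunk_index)
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: no master round trip
        locations = self.lookup(filename, chunk_index)
        self.entries[key] = (now + self.ttl, locations)
        return locations


master_calls = []


def fake_master_lookup(filename, chunk_index):
    """Stand-in for a request to the master; records each call."""
    master_calls.append((filename, chunk_index))
    return ("handle-%d" % chunk_index, ["cs1", "cs2", "cs3"])


cache = LocationCache(fake_master_lookup)
for _ in range(1000):
    cache.get("/logs/web.0", 0)

print(len(master_calls))  # 1
```

A thousand reads of the same chunk cost a single lookup here. The expiry is a design choice in this sketch: it bounds how long a client can keep using stale locations after chunks move.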

On performance, the authors ran micro-benchmarks on a small test cluster and also reported measurements from real clusters in use at Google. They found that GFS handles large datasets efficiently. The experiments showed that GFS can sustain a high rate of data access and can recover from failures, such as the loss of a chunkserver, while the system stays available. They also showed that performance scales: aggregate throughput grows as more clients are added, even though per-client throughput drops somewhat.

In conclusion, “The Google File System” paper describes the design and implementation of GFS, a distributed file system built to store and manage large datasets in a distributed environment. Its design goals are fault tolerance, high performance, and scalability.


Written by Ananya
