Day 1/100

HDFS - [Hadoop Distributed File System]

HDFS architecture looks like this -

  • NameNode holds all the metadata about files and the directory structure.
  • DataNodes hold the file contents, stored as blocks.
  • NameNode keeps the mapping between files and their blocks. Blocks are usually 128MB in size, but this is user-controllable.
  • NameNode represents files and directories as inodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas.
  • NameNode is multithreaded and serves requests from multiple clients at the same time.
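
To make the file-to-block mapping concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks and recorded in a NameNode-style map. The function names and the integer block IDs are made up for illustration and are not HDFS APIs; only the 128MB default comes from the notes above.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, the usual block size mentioned above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        # The last block only occupies as much space as it needs.
        lengths.append(remainder)
    return lengths


def build_block_map(files, block_size=BLOCK_SIZE):
    """Hypothetical NameNode-side mapping: file path -> ordered list of block IDs."""
    block_map = {}
    next_block_id = 0  # illustrative counter, not how HDFS allocates IDs
    for path, size in files.items():
        block_ids = []
        for _ in split_into_blocks(size, block_size):
            block_ids.append(next_block_id)
            next_block_id += 1
        block_map[path] = block_ids
    return block_map
```

For example, a 300MB file ends up as two full 128MB blocks plus one 44MB block.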

Image

  • The image is the set of all inodes plus the list of blocks, which together define the metadata of the entire namespace.
  • NameNode keeps the current image in RAM for quick access. A persistent copy of the image stored on disk is called a checkpoint.

Journal

  • Any change made to HDFS by clients is recorded in a write-ahead log called the journal, which is kept in the local native filesystem of the NameNode.
  • During startup the NN initializes the namespace image from the checkpoint and then replays the changes recorded in the journal.
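
The checkpoint-plus-journal startup can be sketched roughly as below. This is only an illustration: the image is modeled as a plain dict and the journal as a list of operation records, and the operation names (`create`, `add_block`, `delete`) are assumptions, not the real HDFS edit log format.

```python
import copy


def apply_op(image, op):
    """Apply one journal record to the in-memory image (hypothetical op format)."""
    kind = op["op"]
    if kind == "create":
        image[op["path"]] = {"blocks": []}
    elif kind == "add_block":
        image[op["path"]]["blocks"].append(op["block_id"])
    elif kind == "delete":
        image.pop(op["path"], None)
    else:
        raise ValueError(f"unknown journal op: {kind}")


def load_namespace(checkpoint, journal):
    """Rebuild the current image: start from the checkpoint, then replay the journal in order."""
    image = copy.deepcopy(checkpoint)  # leave the stored checkpoint untouched
    for op in journal:
        apply_op(image, op)
    return image
```

Replaying in order matters: a `delete` recorded after a `create` must win, which is why the journal is treated as an ordered log.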

DataNode

  • Each block replica on a DataNode consists of two files: one contains the actual data and the other contains the block's metadata, including checksums.
  • On startup, a DataNode does a handshake with the NN.
  • The namespace ID is unique for a cluster. A DN with a different namespace ID is not allowed to join the cluster. [A newly initialized DN with no namespace ID can join; it is assigned the cluster's namespace ID and onboarded.]
  • The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
  • A block report contains the block ID, the generation stamp and the length for each block replica.
  • The first block report is sent right after DN registration. Subsequent block reports are sent every hour and give the NameNode an up-to-date view of where block replicas are located.
  • A DN sends heartbeats to the NN every 3 sec to signal that it is alive. The NN deregisters a DN if it receives no heartbeat from it for 10 mins.
  • Heartbeats also carry the total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
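
The handshake and heartbeat rules above could be modeled along these lines. This is a toy sketch, not NameNode code: the `HeartbeatMonitor` class and its method names are invented for illustration, and time is passed in explicitly as seconds so the logic is easy to follow. Only the 3 sec interval and the 10 min timeout come from the notes.

```python
HEARTBEAT_INTERVAL_S = 3       # how often a DN is expected to report
HEARTBEAT_TIMEOUT_S = 10 * 60  # silence after which the NN drops the DN


class HeartbeatMonitor:
    """Hypothetical NameNode-side bookkeeping of registered DataNodes."""

    def __init__(self, namespace_id, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.namespace_id = namespace_id
        self.timeout_s = timeout_s
        self.last_seen = {}  # storage ID -> time of last heartbeat (seconds)

    def register(self, storage_id, namespace_id, now):
        """Handshake: return the cluster namespace ID, or None if the DN is refused."""
        # namespace_id=None models a newly initialized DN with no ID yet.
        if namespace_id is not None and namespace_id != self.namespace_id:
            return None
        self.last_seen[storage_id] = now
        return self.namespace_id

    def heartbeat(self, storage_id, now):
        """Record a heartbeat; unknown (or already deregistered) DNs are rejected."""
        if storage_id not in self.last_seen:
            return False
        self.last_seen[storage_id] = now
        return True

    def expire(self, now):
        """Deregister and return every DN that has been silent longer than the timeout."""
        dead = [sid for sid, seen in self.last_seen.items()
                if now - seen > self.timeout_s]
        for sid in dead:
            del self.last_seen[sid]
        return dead
```

A DN that was dropped by `expire` has to register again before its heartbeats are accepted, which mirrors the deregistration described above.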

HDFS Client

  • a library that exports the HDFS filesystem interface.
  • Read request -> the client asks the NameNode for the list of DNs holding replicas of the file's blocks -> the DN list is sorted by network topology distance from the client -> the client reads the block contents directly from a DN.
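
One way to picture the distance-based sorting is to treat node locations as paths in a tree and count the hops up to the closest common ancestor and back down. The sketch below assumes locations written as `/rack/host` strings; the function names and the path format are illustrative assumptions, not the HDFS client API.

```python
def distance(a, b):
    """Hops from a to b through their closest common ancestor in the topology tree."""
    parts_a = a.strip("/").split("/")
    parts_b = b.strip("/").split("/")
    common = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        common += 1
    # Steps up from a to the common ancestor, plus steps down to b.
    return (len(parts_a) - common) + (len(parts_b) - common)


def sort_replicas(client, datanodes):
    """Order DataNode locations from closest to farthest as seen from the client."""
    return sorted(datanodes, key=lambda dn: distance(client, dn))
```

With this model a replica on the client's own host is at distance 0, one on the same rack at distance 2, and one on another rack at distance 4.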

CheckpointNode

  • Besides its primary role of serving client requests, an NN can alternatively run in one of two other roles: CheckpointNode or BackupNode.
  • The role is specified at the node startup.

BackupNode

  • Keeps an in-memory, up-to-date replica of the NN's namespace image, kind of like a secondary NN.
  • Can be viewed as a read-only NameNode: it contains all filesystem metadata information except for block locations.

    In reference to this - aosabook.org/en/hdfs.html