HDFS - [Hadoop Distributed File System]

The NameNode holds all metadata about files and the directory structure. DataNodes hold the file contents, stored as blocks.
The NameNode keeps the mapping from each file to its blocks. Blocks are usually 128 MB, but the size is user-configurable (it can be set per file).
The NameNode represents files and directories as inodes, which record attributes such as permissions, modification and access times, and namespace and disk space quotas.
The NameNode is multithreaded and serves requests from multiple clients simultaneously.
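
To make the file-to-block mapping concrete, here is a minimal sketch of how a file size translates into blocks. The function names and the dict-based namespace are hypothetical and only illustrate the idea; this is not the NameNode's actual data structure.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual default; configurable


def blocks_needed(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks a file of file_size bytes occupies (ceiling division)."""
    return -(-file_size // block_size)


def split_lengths(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """Length of each block; only the last block may be shorter than block_size."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


# Hypothetical namespace: path -> attributes plus the ordered list of block IDs.
namespace = {
    "/logs/app.log": {
        "permissions": "rw-r--r--",
        "block_ids": [1001, 1002, 1003],
    }
}
```

For example, a 300 MB file needs three blocks: two full 128 MB blocks and one 44 MB block.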

The image is the list of all inodes plus the list of blocks belonging to each file; together they make up the namespace metadata.
The NameNode keeps the current image in RAM for quick access. The persistent copy of the image, stored on the NameNode's local filesystem, is called a checkpoint.

Changes made to HDFS by clients are recorded in a write-ahead log called the journal, which is kept on the NameNode's local filesystem (persistent storage, not RAM).
During startup the NN initializes the namespace image from the checkpoint and then replays the changes from the journal.
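
The checkpoint-plus-journal startup can be sketched roughly like this. The record format (tuples of op, path, args) and the op names are assumptions made for illustration; the real journal format is different.

```python
import copy


def replay(checkpoint: dict, journal: list) -> dict:
    """Rebuild the in-memory image from a checkpoint plus journal records."""
    image = copy.deepcopy(checkpoint)  # leave the on-disk checkpoint untouched
    for op, path, *args in journal:
        if op == "create":
            image[path] = {"blocks": []}
        elif op == "add_block":
            image[path]["blocks"].append(args[0])
        elif op == "delete":
            image.pop(path, None)
        else:
            raise ValueError(f"unknown journal op: {op}")
    return image


# Hypothetical state: one file in the checkpoint, three changes logged since.
checkpoint = {"/old.txt": {"blocks": [7]}}
journal = [
    ("create", "/new.txt"),
    ("add_block", "/new.txt", 8),
    ("delete", "/old.txt"),
]
image = replay(checkpoint, journal)
```

The point of the design is that every change is appended to the journal before it is applied, so the in-memory image can always be rebuilt after a crash.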

On a DataNode, each block replica is stored as two files: one contains the actual data and the other contains the block's metadata, including checksums for the data.
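
A rough sketch of what per-chunk checksums in the metadata file buy you. The 512-byte chunk size and the use of zlib's CRC32 are assumptions for this example, not a statement about the exact checksum HDFS uses.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # assumed chunk size for this sketch


def chunk_checksums(data: bytes, chunk: int = BYTES_PER_CHECKSUM) -> list:
    """One CRC32 per fixed-size chunk of the block data."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]


def corrupt_chunks(data: bytes, expected: list, chunk: int = BYTES_PER_CHECKSUM) -> list:
    """Indexes of chunks whose checksum no longer matches the stored one."""
    actual = chunk_checksums(data, chunk)
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]


block_data = bytes(range(256)) * 5           # 1280 bytes -> 3 chunks
stored = chunk_checksums(block_data)          # what the metadata file would hold

damaged = bytearray(block_data)
damaged[600] ^= 0xFF                          # flip bits in the second chunk
bad = corrupt_chunks(bytes(damaged), stored)
```

Keeping one checksum per chunk rather than one per block means a reader can tell which part of the replica is damaged.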
On startup, each DataNode performs a handshake with the NN.
The namespace ID is unique to a cluster. A DN with a different namespace ID is not allowed to join the cluster. [A newly initialized DN with no namespace ID may join; it is assigned the cluster's namespace ID and onboarded.]
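
The namespace ID check during the handshake could look something like this. The function and exception names are hypothetical; only the rule itself comes from the notes above.

```python
from typing import Optional


class HandshakeError(Exception):
    """Raised when a DataNode belongs to a different cluster."""


def handshake(cluster_namespace_id: int, dn_namespace_id: Optional[int]) -> int:
    """Return the namespace ID the DataNode should use after the handshake."""
    if dn_namespace_id is None:
        # Freshly initialized DataNode: adopt the cluster's namespace ID.
        return cluster_namespace_id
    if dn_namespace_id != cluster_namespace_id:
        raise HandshakeError(
            f"namespace ID {dn_namespace_id} does not match cluster {cluster_namespace_id}"
        )
    return dn_namespace_id
```

Rejecting mismatched IDs protects the cluster from a DataNode that carries blocks belonging to some other namespace.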
A storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
A block report contains the block ID, the generation stamp and the length of each block replica the DN hosts.
The first block report is sent right after DN registration. Subsequent block reports are sent every hour and give the NameNode an up-to-date view of where block replicas are located in the cluster.
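
A small sketch of how block reports could feed the NameNode's view of replica locations. The names `ReplicaInfo` and `apply_block_report` are made up for this example, and the real NameNode does considerably more bookkeeping.

```python
from collections import defaultdict
from typing import NamedTuple


class ReplicaInfo(NamedTuple):
    """One entry of a block report."""
    block_id: int
    generation_stamp: int
    length: int


def apply_block_report(locations, storage_id, report):
    """Replace what is known about storage_id with the contents of its report."""
    for holders in locations.values():
        holders.discard(storage_id)          # forget the previous report
    for replica in report:
        locations[replica.block_id].add(storage_id)


locations = defaultdict(set)                 # block ID -> storage IDs holding it
apply_block_report(locations, "DS-1", [ReplicaInfo(1, 100, 10), ReplicaInfo(2, 100, 20)])
apply_block_report(locations, "DS-2", [ReplicaInfo(1, 100, 10)])
```

Block locations are not part of the persistent image; they come from these reports, which is why each DN sends one right after registering.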
A DN sends a heartbeat to the NN every 3 seconds to confirm it is alive. If the NN receives no heartbeat from a DN for 10 minutes, it treats that DN as out of service and its block replicas as unavailable.
Heartbeats also carry the DN's total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
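
The heartbeat timeout logic can be sketched as below. The class and method names are hypothetical, and timestamps are passed in explicitly (in seconds) so the behaviour is easy to follow.

```python
HEARTBEAT_INTERVAL_S = 3        # a DN heartbeats every 3 seconds
DEAD_AFTER_S = 10 * 60          # no heartbeat for 10 minutes -> out of service


class HeartbeatMonitor:
    """Tracks the last heartbeat time per DataNode (illustrative only)."""

    def __init__(self, dead_after: float = DEAD_AFTER_S):
        self.dead_after = dead_after
        self.last_seen = {}

    def heartbeat(self, storage_id: str, now: float) -> None:
        self.last_seen[storage_id] = now

    def dead_nodes(self, now: float) -> list:
        """Storage IDs whose last heartbeat is older than the timeout."""
        return sorted(
            sid for sid, seen in self.last_seen.items()
            if now - seen > self.dead_after
        )


monitor = HeartbeatMonitor()
monitor.heartbeat("DS-1", now=0)
monitor.heartbeat("DS-2", now=500)
```

With these numbers a DN has to miss about 200 consecutive heartbeats (600 s / 3 s) before it is considered gone, so short network hiccups do not trigger re-replication.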

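HDFS client: a library that exports the HDFS filesystem interface to user applications.
Read request -> client asks the NameNode for the list of DNs hosting replicas of the file's blocks -> the DN list for each block is sorted by network topology distance from the client -> the client reads from the closest DN.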
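
Sorting replicas by topology distance can be sketched as follows, assuming a tree-shaped topology where each node is named by a path like /rack1/node1 and distance is the number of hops to the closest common ancestor and back. The path format and function names are assumptions for the example.

```python
def distance(a: str, b: str) -> int:
    """Hops between two nodes in a tree-shaped network topology."""
    pa, pb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    # Steps up from a to the common ancestor, plus steps down to b.
    return (len(pa) - common) + (len(pb) - common)


def sort_by_distance(client: str, datanodes: list) -> list:
    """Closest DataNode first, as seen from the client."""
    return sorted(datanodes, key=lambda dn: distance(client, dn))


replicas = ["/rack2/node3", "/rack1/node2", "/rack1/node1"]
ordered = sort_by_distance("/rack1/node1", replicas)
```

Under this scheme a replica on the same node has distance 0, one on the same rack has distance 2, and one on another rack has distance 4, so local reads are preferred.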

Besides its primary role of serving client requests, an NN can alternatively run in one of two other roles: CheckpointNode or BackupNode.
The role is specified at node startup.

BackupNode: a kind of in-memory replica of the NN, loosely comparable to a secondary NN.
It can be viewed as a read-only NameNode. It contains all filesystem metadata except block locations.

Reference: aosabook.org/en/hdfs.html

Day 1/100