Day 1/100
HDFS - [Hadoop Distributed File System]
HDFS architecture looks like this -
- NameNode holds all the metadata about files and the directory structure.
- DataNodes hold the file contents in the form of blocks.
- NameNode maintains the mapping between files and their blocks. Blocks are usually 128 MB in size, but this is user-controllable.
- NameNode represents files and directories as inodes, which record attributes like permissions, modification and access times, and namespace and disk space quotas.
- NameNode is multithreaded and serves requests from multiple clients at the same time.
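To make the file -> blocks mapping concrete, here is a rough sketch in Python. It is not how Hadoop implements it; `Inode`, `ToyNameNode` and `split_into_blocks` are made-up names for illustration, and the only number taken from the notes is the 128 MB block size.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the usual block size mentioned above


@dataclass
class Inode:
    """Toy stand-in for a NameNode inode (field names are illustrative)."""
    path: str
    permissions: str
    length: int
    block_ids: list = field(default_factory=list)


def split_into_blocks(file_length, block_size=BLOCK_SIZE):
    """Return the length in bytes of each block for a file of file_length bytes."""
    full, tail = divmod(file_length, block_size)
    return [block_size] * full + ([tail] if tail else [])


class ToyNameNode:
    """Holds metadata only: path -> inode, and block ID -> DataNodes."""

    def __init__(self):
        self.inodes = {}
        self.block_locations = {}
        self._next_block_id = 0

    def create_file(self, path, length, permissions="rw-r--r--"):
        inode = Inode(path, permissions, length)
        for _ in split_into_blocks(length):
            inode.block_ids.append(self._next_block_id)
            # Locations start empty; in this sketch DataNodes would fill them in.
            self.block_locations[self._next_block_id] = []
            self._next_block_id += 1
        self.inodes[path] = inode
        return inode
```

With this sketch a 300 MB file ends up as three blocks (128 MB, 128 MB and 44 MB), and only the last block is smaller than the block size.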
Image
- The image is the list of all inodes plus the list of blocks; together they define the metadata of the entire name system.
- NameNode keeps the current image in RAM for quick access. The persistent copy of the image, stored on the NameNode's local filesystem, is called a checkpoint.
Journal
- Every change that clients make to HDFS is recorded in a write-ahead log called the journal, which is kept on the NameNode's local filesystem (on disk, not in memory).
- During startup the NN initializes the namespace image from the checkpoint and then replays the changes recorded in the journal.
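The checkpoint + journal idea can be sketched in a few lines of Python. This is only a toy under my own assumptions: the JSON file formats, the file names and the `ToyNamespace` class are invented for illustration and are not the real HDFS on-disk formats.

```python
import json
import os
import tempfile


class ToyNamespace:
    def __init__(self, directory):
        self.checkpoint_path = os.path.join(directory, "checkpoint.json")
        self.journal_path = os.path.join(directory, "journal.log")
        self.image = {}  # in-RAM image: path -> metadata

    def _apply(self, op):
        if op["type"] == "create":
            self.image[op["path"]] = {"length": op["length"]}
        elif op["type"] == "delete":
            self.image.pop(op["path"], None)

    def log_and_apply(self, op):
        # Write-ahead: persist the change to the journal before touching RAM.
        with open(self.journal_path, "a") as journal:
            journal.write(json.dumps(op) + "\n")
            journal.flush()
            os.fsync(journal.fileno())
        self._apply(op)

    def save_checkpoint(self):
        with open(self.checkpoint_path, "w") as f:
            json.dump(self.image, f)
        # Everything journaled so far is now in the checkpoint: start an empty journal.
        open(self.journal_path, "w").close()

    def startup(self):
        # Load the checkpoint, then replay journaled changes on top of it.
        self.image = {}
        if os.path.exists(self.checkpoint_path):
            with open(self.checkpoint_path) as f:
                self.image = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    if line.strip():
                        self._apply(json.loads(line))
        return self.image


def demo():
    """Write a checkpoint, log two more changes, then 'restart' and recover."""
    with tempfile.TemporaryDirectory() as directory:
        nn = ToyNamespace(directory)
        nn.log_and_apply({"type": "create", "path": "/a", "length": 10})
        nn.save_checkpoint()
        nn.log_and_apply({"type": "create", "path": "/b", "length": 20})
        nn.log_and_apply({"type": "delete", "path": "/a"})
        restarted = ToyNamespace(directory)
        return restarted.startup()
```

In `demo()` the restarted instance never saw the last two operations in RAM, yet it recovers them because they were journaled before being applied.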
DataNode
- Each block replica on a DataNode consists of two files: one contains the actual data and the other contains the block's metadata, including checksums.
- On startup, a DataNode performs a handshake with the NN.
- The namespace ID is unique to a cluster. A DN with a different namespace ID is not allowed to join the cluster. [A newly initialized DN with an empty namespace ID can join; it receives the cluster's namespace ID and is onboarded.]
- The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that.
- A block report contains the block ID, the generation stamp and the length for each block replica.
- The first block report is sent right after DN registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located.
- A DN sends a heartbeat to the NN every 3 seconds (the default) to signal that it is alive. If the NN receives no heartbeat from a DN for 10 minutes, it considers that DN out of service and its block replicas unavailable.
- Heartbeats also carry the total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress.
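A small Python sketch of the handshake and heartbeat bookkeeping described above. The class and method names are hypothetical, timestamps are passed in explicitly to keep it deterministic, and generating the storage ID with `uuid` is my own assumption; only the 3 second and 10 minute figures come from the notes.

```python
import uuid

HEARTBEAT_INTERVAL_S = 3      # default heartbeat interval from the notes
HEARTBEAT_EXPIRY_S = 10 * 60  # no heartbeat for 10 minutes -> out of service


class ToyNameNode:
    def __init__(self, namespace_id):
        self.namespace_id = namespace_id
        self.last_heartbeat = {}  # storage ID -> time of last heartbeat
        self.stats = {}           # storage ID -> (capacity, used fraction, transfers)

    def handshake(self, dn_namespace_id):
        """Return the namespace ID the DataNode should use, or refuse it."""
        if dn_namespace_id is None:
            return self.namespace_id  # fresh DN adopts the cluster's ID
        if dn_namespace_id != self.namespace_id:
            raise PermissionError("namespace ID mismatch, DN may not join")
        return dn_namespace_id

    def register(self, storage_id, now):
        """Assign a storage ID on first registration; keep it afterwards."""
        if storage_id is None:
            storage_id = uuid.uuid4().hex
        self.last_heartbeat[storage_id] = now
        return storage_id

    def heartbeat(self, storage_id, now, capacity, used_fraction, transfers):
        self.last_heartbeat[storage_id] = now
        self.stats[storage_id] = (capacity, used_fraction, transfers)

    def live_datanodes(self, now):
        """DataNodes heard from within the expiry window."""
        return [sid for sid, seen in self.last_heartbeat.items()
                if now - seen < HEARTBEAT_EXPIRY_S]
```

A DN that last reported at t=3 is still listed at t=600 but drops out at t=603, once a full 10 minutes have passed without a heartbeat.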
HDFS Client
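- A library that exports the HDFS filesystem interface to user applications.
- Read request -> NameNode returns the list of DNs that host replicas of the file's blocks -> the DN list is sorted by network topology distance from the client -> the client reads the block data directly from a DN, closest first.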
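Sorting replicas by topology distance can be sketched like this in Python. The `/rack/node` location strings and both function names are assumptions for illustration; the distance here is the number of hops through the closest common ancestor in a tree-shaped topology.

```python
def topology_distance(a, b):
    """Hops from a to b through their closest common ancestor.

    Locations are paths like "/rack1/node3" (a made-up format for this sketch).
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def sort_replicas(client, datanodes):
    """Order the DataNodes hosting a block from closest to farthest."""
    return sorted(datanodes, key=lambda dn: topology_distance(client, dn))
```

Under these assumptions the same node is at distance 0, another node in the same rack at 2, and a node in a different rack at 4, so a client prefers a local replica, then a same-rack one, then a remote one.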
CheckpointNode
- In addition to its primary role of serving client requests, an NN can alternatively run in one of two other roles: CheckpointNode or BackupNode.
- The role is specified at the node startup.
BackupNode
- An in-memory, up-to-date replica of the NN's namespace image, kept in sync with the NN (loosely, a secondary NN).
- It can be viewed as a read-only NameNode: it contains all filesystem metadata information except for block locations.
In reference to this - aosabook.org/en/hdfs.html