weed-fs Source Code Walkthrough

[hhy 2014/12/23 16:17:54]

Weed-fs is a simple, highly scalable distributed file system.

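MasterServer notes:

1. Startup

In the weed directory, the runMaster function in master.go is the entry point.

First, a router is built with gorilla/mux. (It is eventually handed to the weed master, where it handles the incoming HTTP requests.)

Next, the weed master is constructed and initialized. Several structs are involved:

MasterServer struct: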

type MasterServer struct {
    port                    int
    metaFolder              string
    volumeSizeLimitMB       uint   //volume size limit, in MB
    pulseSeconds            int    //heartbeat interval
    defaultReplicaPlacement string //default replication mode
    garbageThreshold        string
    whiteList               []string

    Topo   *topology.Topology     //the topology
    vg     *topology.VolumeGrowth //can be regarded as little more than a lock
    vgLock sync.Mutex

    bounedLeaderChan chan int //buffered channel with capacity 16
}

Topology struct:

type Topology struct {
    NodeImpl //embedded NodeImpl; the Topology and its node refer to each other

    collectionMap *util.ConcurrentReadMap

    pulse int64

    volumeSizeLimit uint64 //volume size limit

    Sequence sequence.Sequencer

    chanDeadDataNodes      chan *DataNode
    chanRecoveredDataNodes chan *DataNode
    chanFullVolumes        chan storage.VolumeInfo

    configuration *Configuration

    RaftServer raft.Server

}

NodeImpl struct:

type NodeImpl struct {
    id                NodeId //hard-coded as Topo in the code
    volumeCount       int
    activeVolumeCount int
    maxVolumeCount    int
    parent            Node
    children          map[NodeId]Node
    maxVolumeId       storage.VolumeId

    //for rack, data center, topology
    nodeType string      //hard-coded as "Topology"
    value    interface{} //points back to the Topology object, so Topology and node refer to each other

}

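Then, the router held by the master server maps each URL to its handler.

After that, the StartRefreshWritableVolumes method of the master server's Topology is called (second-to-last line of NewMasterServer in master_server.go, under weed/weed_server). This method starts three goroutines:

First goroutine: on every heartbeat interval, it checks whether the current master is the leader. If so, it calls CollectDeadNodeAndFullVolumes; otherwise it does nothing.

Second goroutine: every 15 minutes, it checks whether the current master is the leader. If so, it calls the Vacuum method to clean up garbage.

Third goroutine: handles node failure and node recovery.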
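The first two goroutines share one pattern: wake up on a timer and do the work only when this master is the leader. Below is a minimal sketch of that pattern, not the weed-fs code itself. runWhenLeader, countRuns and the callback parameters are hypothetical names introduced for illustration; in weed-fs the task would be CollectDeadNodeAndFullVolumes or Vacuum.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// runWhenLeader fires task once per interval, but only while isLeader()
// reports true. It returns when stop is closed.
// (Hypothetical helper; not taken from the weed-fs source.)
func runWhenLeader(interval time.Duration, isLeader func() bool, task func(), stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if isLeader() {
				task()
			}
		case <-stop:
			return
		}
	}
}

// countRuns drives runWhenLeader for totalMs milliseconds and reports how
// many times the task fired. It exists only to exercise the loop.
func countRuns(leader bool, intervalMs, totalMs int) int64 {
	var runs int64
	stop := make(chan struct{})
	done := make(chan struct{})
	go func() {
		defer close(done)
		runWhenLeader(time.Duration(intervalMs)*time.Millisecond,
			func() bool { return leader },
			func() { atomic.AddInt64(&runs, 1) },
			stop)
	}()
	time.Sleep(time.Duration(totalMs) * time.Millisecond)
	close(stop)
	<-done
	return atomic.LoadInt64(&runs)
}

func main() {
	fmt.Println("leader fired at least once:", countRuns(true, 5, 200) > 0)
	fmt.Println("follower fired:", countRuns(false, 5, 50))
}
```

A follower still wakes up on every tick, it just skips the work, which matches the "otherwise it does nothing" behaviour described above.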

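Finally, a RaftServer is constructed and attached to the weed master (line 89 of master.go). Through the raft protocol, the RaftServer lets multiple weed masters vote for a leader. (See the related code in raft_server.go; the details are covered in the "Leader election" section that follows.)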

type RaftServer struct {
    peers      []string // initial peers to join with
    raftServer raft.Server //uses goraft
    dataDir    string
    httpAddr   string
    router     *mux.Router
    topo       *topology.Topology
}

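2. Leader election overview

When multiple MasterServers are started, they communicate with each other and elect one leader through the raft protocol; the leader manages all the masters.
In weed-fs, this election is carried out by raftServer, which in turn relies on a third-party library, goraft (see http://ayende.com/blog/165858/reviewing-go-raft-part-i).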

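VolumeServer notes:

1. Startup

The runVolume function in volume.go is the entry point.

First, a router is built with Go's standard http package. (It is eventually handed to the volumeServer, where it handles the incoming HTTP requests.)

Next, the volumeServer is constructed and initialized. Several structs are involved: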

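type VolumeServer struct {
    masterNode   string //the master this server is attached to
    pulseSeconds int    //heartbeat interval; must be smaller than the one configured on the master
    dataCenter   string //the data center this server belongs to
    rack         string //the rack this server belongs to
    whiteList    []string
    store        *storage.Store //manages the on-disk data of this volume server

    FixJpgOrientation bool //whether images should be processed when they are stored
}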

Store struct:

type Store struct {
    Port            int
    Ip              string
    PublicUrl       string
    Locations       []*DiskLocation
    dataCenter      string //optional information, overwriting master setting if exists
    rack            string //optional information, overwriting master setting if exists
    connected       bool
    volumeSizeLimit uint64 //read from the master
    masterNodes     *MasterNodes

}

DiskLocation struct:

type DiskLocation struct {
    Directory      string
    MaxVolumeCount int
    volumes        map[VolumeId]*Volume

}

Each DiskLocation holds multiple Volumes, keyed by VolumeId.

Volume struct:

type Volume struct {
    Id         VolumeId
    dir        string       //directory the volume lives in
    Collection string
    dataFile   *os.File     //holds the actual data
    nm         NeedleMapper //operations on the needle mapping (Needle is covered later); NeedleMap is the implementation of NeedleMapper
    readOnly   bool         //whether the volume is read-only

    SuperBlock

    accessLock       sync.Mutex
    lastModifiedTime uint64 //unix time in seconds
}

SuperBlock struct:

type SuperBlock struct {
    version          Version
    ReplicaPlacement *ReplicaPlacement
    Ttl              *TTL
}

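Conclusion:

A running weed-fs volume server has one Store, which holds several DiskLocations; in other words, a single server can be configured with several storage folders. Each DiskLocation in turn holds several volumes. If collections are used, each volume belongs to one collection (the Volume struct carries a single Collection field), so a collection groups several volumes. Each volume is mainly backed by two files: a .dat file that stores the actual data and an .idx file that stores the file index (see the load method in volume.go). With collections enabled, the files are named collectionname_1.dat, collectionname_2.dat, collectionname_1.idx, collectionname_2.idx, and so on, where the trailing number is the volume id.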
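The naming rule can be written down as a small function. This is a sketch under assumptions: volumeFileNames is a hypothetical helper, and the name used when no collection is set (just the volume id) is my assumption, since the text above only describes the collection case.

```go
package main

import (
	"fmt"
	"strconv"
)

// volumeFileNames returns the data and index file names for a volume.
// With a collection the base name is "<collection>_<volume id>"; without
// one it is assumed to be just the volume id.
// (Hypothetical helper; not taken from the weed-fs source.)
func volumeFileNames(collection string, vid uint32) (dat, idx string) {
	base := strconv.FormatUint(uint64(vid), 10)
	if collection != "" {
		base = collection + "_" + base
	}
	return base + ".dat", base + ".idx"
}

func main() {
	dat, idx := volumeFileNames("collectionname", 1)
	fmt.Println(dat, idx)
}
```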

When a volume server starts, it pre-creates 7 volumes in the specified folder by default.

Then, the router held by the volumeServer maps each URL to its handler.
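To make the URL mapping concrete, here is a minimal sketch using only the standard net/http package. The "/status" path, the response bodies, and the helper names newRouter and request are illustrative assumptions; the method switch on "/" mirrors the idea of one store handler that dispatches to a POST handler for uploads.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newRouter registers illustrative handlers on a standard ServeMux.
// (Paths and responses are assumptions, not the real weed-fs routes.)
func newRouter() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "ok")
	})
	// Catch-all: the path is treated as a file id and the HTTP method
	// decides whether the request is a read or a store.
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		switch r.Method {
		case http.MethodGet:
			fmt.Fprint(w, "read "+r.URL.Path)
		case http.MethodPost:
			fmt.Fprint(w, "store "+r.URL.Path)
		default:
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
		}
	})
	return mux
}

// request sends one in-memory request through the router and returns
// the response body, without opening a network socket.
func request(method, path string) string {
	rec := httptest.NewRecorder()
	newRouter().ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Body.String()
}

func main() {
	fmt.Println(request("GET", "/status"))
	fmt.Println(request("POST", "/2,019737f534"))
}
```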

Finally, a goroutine is started that periodically contacts the master (possibly any one of several masters) so that the master performs the join operation.
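One way to read "any one of several masters" is that the volume server tries the configured masters until one accepts the join. The sketch below only illustrates that reading; joinAny, simulate and the try-in-order policy are assumptions, not the weed-fs implementation, and the join callback stands in for the real HTTP call.

```go
package main

import (
	"errors"
	"fmt"
)

// joinAny tries each master in order and returns the first one whose
// join call succeeds. (Hypothetical helper; the ordering policy is an
// assumption.)
func joinAny(masters []string, join func(master string) error) (string, error) {
	lastErr := errors.New("no masters configured")
	for _, m := range masters {
		if err := join(m); err != nil {
			lastErr = fmt.Errorf("join %s: %v", m, err)
			continue
		}
		return m, nil
	}
	return "", lastErr
}

// simulate reports which master joinAny picks when the masters listed
// in down refuse the join; it returns "" if every master refuses.
func simulate(masters []string, down ...string) string {
	refused := make(map[string]bool)
	for _, d := range down {
		refused[d] = true
	}
	picked, err := joinAny(masters, func(m string) error {
		if refused[m] {
			return errors.New("connection refused")
		}
		return nil
	})
	if err != nil {
		return ""
	}
	return picked
}

func main() {
	masters := []string{"m1:9333", "m2:9333", "m3:9333"}
	fmt.Println(simulate(masters, "m1:9333"))
}
```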

2. File upload

Use the following command to upload a file:
curl -F file=@G:\weed\bin\1.jpg http://172.31.210.111:8312/2,019737f534

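The command above issues an HTTP POST request. It is handled by the storeHandler function in volume_server_handlers.go, which ultimately passes it to PostHandler. The rough flow is as follows:

First, parseURLPath parses the URL to obtain vid, from which the volumeId is built.
Next, storage.NewNeedle is called to construct a Needle (see NewNeedle in needle.go under the storage directory). Every uploaded file is thus wrapped in a Needle before weed-fs processes it. The Needle struct is as follows: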

    type Needle struct {
        Cookie uint32 `comment:"random number to mitigate brute force lookups"`
        Id     uint64 `comment:"needle id"`
        Size   uint32 `comment:"sum of DataSize,Data,NameSize,Name,MimeSize,Mime"`

        DataSize     uint32 `comment:"Data size"` //version2
        Data         []byte `comment:"The actual file data"` //holds the actual data content
        Flags        byte   `comment:"boolean flags"` //version2
        NameSize     uint8  //version2
        Name         []byte `comment:"maximum 256 characters"` //version2
        MimeSize     uint8  //version2
        Mime         []byte `comment:"maximum 256 characters"` //version2
        LastModified uint64 //only store LastModifiedBytesLength bytes, which is 5 bytes to disk
        Ttl          *TTL   //time to live

        Checksum CRC    `comment:"CRC32 to check integrity"` //checksum
        Padding  []byte `comment:"Aligned to 8 bytes"`
    }

    type TTL struct {
        count byte
        unit  byte
    }

[In essence, this step converts the uploaded file into a Needle object held in weed-fs's in-memory cache.]
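The file id in the upload URL, "2,019737f534", carries the pieces that end up in the Needle: the volume id before the comma, then the needle id and the cookie in hexadecimal. The sketch below assumes the last 8 hex digits are the 32-bit Cookie and everything before them is the needle Id. parseFid and fileID are hypothetical names, and the split rule is my reading of the format rather than a copy of parseURLPath.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// fileID holds the three parts assumed to be encoded in a fid such as
// "2,019737f534".
type fileID struct {
	VolumeID uint32
	Key      uint64 // needle id
	Cookie   uint32
}

// parseFid splits "<volume id>,<hex key><8 hex digits of cookie>".
// (Hypothetical helper; the split rule is an assumption.)
func parseFid(fid string) (fileID, error) {
	parts := strings.SplitN(fid, ",", 2)
	if len(parts) != 2 {
		return fileID{}, fmt.Errorf("fid %q: missing comma", fid)
	}
	vid, err := strconv.ParseUint(parts[0], 10, 32)
	if err != nil {
		return fileID{}, fmt.Errorf("fid %q: bad volume id: %v", fid, err)
	}
	keyCookie := parts[1]
	if len(keyCookie) <= 8 {
		return fileID{}, fmt.Errorf("fid %q: key/cookie part too short", fid)
	}
	split := len(keyCookie) - 8
	key, err := strconv.ParseUint(keyCookie[:split], 16, 64)
	if err != nil {
		return fileID{}, fmt.Errorf("fid %q: bad key: %v", fid, err)
	}
	cookie, err := strconv.ParseUint(keyCookie[split:], 16, 32)
	if err != nil {
		return fileID{}, fmt.Errorf("fid %q: bad cookie: %v", fid, err)
	}
	return fileID{VolumeID: uint32(vid), Key: key, Cookie: uint32(cookie)}, nil
}

func main() {
	id, err := parseFid("2,019737f534")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("volume=%d key=%d cookie=%x\n", id.VolumeID, id.Key, id.Cookie)
}
```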

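Then, topology.ReplicatedWrite is called to persist the cached Needle to files. The rough flow:

(1) Look up the Volume object for the given volumeId.

(2) On that Volume, call its write method to write the Needle. Before writing, the Volume checks whether it already holds this Needle; the write proceeds only if it does not, or if the stored one is not identical. Two files are written: first, the .dat file that stores the actual data, via the Append method in needle_read_write.go under the storage directory (data is written in [append mode]); second, the .idx file that stores the file index, via the Put method of NeedleMap in needle_map.go under the storage directory.

(3) Finally, Store's Join() method is called. It builds an operation.VolumeInformationMessage and reports the store's updated state to the master server through a request of the form ["http://"+masterNode+"/dir/join"] (the response carries VolumeSizeLimit). This is where the volume server talks to the master server.

(4) If replication is required, the replication is carried out. distributedOperation is the entry function; the steps are, in order:

a. Lookup: given the volumeId, first look for its Location information in the local cache; if it is not there, the do_lookup method sends an HTTP request of the form ["http://"+server+"/dir/lookup"] to the master server to fetch the Location information for that volumeId. (See the Lookup method in lookup.go under the operation directory.)
The objects that cache the Locations of the Volume for a given volumeId are structured as follows: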

                type VidInfo struct {
                    Locations       []Location
                    NextRefreshTime time.Time
                }

                type VidCache struct {
                    cache []VidInfo
                }

                type Location struct {
                    Url       string `json:"url,omitempty"`
                    PublicUrl string `json:"publicUrl,omitempty"`
                }

b. The Location information found by the lookup is combined with the volumeId to build a LookupResult object.

                type LookupResult struct {
                    VolumeId  string     `json:"volumeId,omitempty"`
                    Locations []Location `json:"locations,omitempty"`
                    Error     string     `json:"error,omitempty"`
                }

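selfUrl is built from the volume server's store.Ip and store.Port. The Locations of the LookupResult are then traversed; for each location with location.Url != selfUrl, a goroutine is started to run an anonymous function (see the [ReplicatedWrite] code in store_replicate.go under the topology directory). The anonymous function sends an HTTP request of the form ["http://"+location.Url+r.URL.Path+"?type=replicate&ts="] to that location, that is, to each of the other volume servers that should hold a copy, so that they store the replica as well.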
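The fan-out in step (4) can be sketched as follows. This is an illustration under assumptions, not the code in store_replicate.go: replicateToOthers and replicaURL are hypothetical helpers, the send callback stands in for the real HTTP request, and the ts value is passed in by the caller because the text above does not say what it contains.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Location mirrors the Url field of the Location struct shown above.
type Location struct {
	Url string
}

// replicaURL builds the request URL in the form quoted in the text.
// ts is supplied by the caller; its content is not specified here.
func replicaURL(locationURL, path, ts string) string {
	return "http://" + locationURL + path + "?type=replicate&ts=" + ts
}

// replicateToOthers calls send in its own goroutine for every location
// except selfURL, waits for all of them, and reports whether every call
// succeeded together with the sorted list of contacted URLs.
// (Hypothetical helper; not taken from the weed-fs source.)
func replicateToOthers(selfURL string, locations []Location, send func(url string) bool) (ok bool, targets []string) {
	var (
		wg sync.WaitGroup
		mu sync.Mutex // guards ok and targets
	)
	ok = true
	for _, loc := range locations {
		if loc.Url == selfURL {
			continue // the local copy has already been written
		}
		wg.Add(1)
		go func(url string) {
			defer wg.Done()
			success := send(url)
			mu.Lock()
			defer mu.Unlock()
			targets = append(targets, url)
			if !success {
				ok = false
			}
		}(loc.Url)
	}
	wg.Wait()
	sort.Strings(targets)
	return ok, targets
}

func main() {
	locs := []Location{{Url: "10.0.0.1:8080"}, {Url: "10.0.0.2:8080"}, {Url: "10.0.0.3:8080"}}
	ok, targets := replicateToOthers("10.0.0.1:8080", locs, func(url string) bool {
		_ = replicaURL(url, "/2,019737f534", "0") // a real sender would POST to this URL
		return true
	})
	fmt.Println(ok, targets)
}
```

Waiting for every goroutine before returning lets the caller treat the upload as failed if any replica write fails; whether weed-fs does exactly that is not stated in these notes.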