
MIT Distributed Lab (6.824 $\rightarrow$ 6.5840)

This repo is based on the MIT Spring 2024 6.5840 course and labs.

  • Lab1: MapReduce (SS, 100% pass)
    • passes all tests with -race
    • uses a commit RPC call to make sure the server and the workers exit at the right moment
    • the number of workers is scalable
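The exit handshake above can be sketched like this: the worker keeps asking for tasks until the coordinator answers "exit", so neither side quits too early. All names here (TaskReply, RequestTask) are illustrative, not the lab's actual RPC interface, and the in-process call stands in for a real RPC:

```go
package main

import (
	"fmt"
	"sync"
)

// TaskReply is what the coordinator hands back to a worker.
// Kind "exit" tells the worker it may terminate safely.
type TaskReply struct {
	Kind string // "map", "reduce", or "exit"
	ID   int
}

// Coordinator tracks the remaining tasks; names are hypothetical.
type Coordinator struct {
	mu      sync.Mutex
	pending []int
}

// RequestTask hands out the next task, or "exit" when nothing is left.
func (c *Coordinator) RequestTask(reply *TaskReply) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if len(c.pending) == 0 {
		reply.Kind = "exit"
		return
	}
	reply.Kind = "map"
	reply.ID = c.pending[0]
	c.pending = c.pending[1:]
}

func main() {
	c := &Coordinator{pending: []int{0, 1, 2}}
	done := 0
	for {
		var reply TaskReply
		c.RequestTask(&reply)
		if reply.Kind == "exit" {
			break // coordinator has committed to shutdown
		}
		done++ // pretend the task ran
	}
	fmt.Println("tasks done:", done) // prints "tasks done: 3"
}
```

Because the worker only stops on an explicit "exit" reply, the coordinator fully controls the shutdown moment.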
  • Lab2: Key/Value Server (M, 100% pass)
    • the tests are extremely strict (on both your code and your machine); check this article, it may help a lot if you are stuck on memory or time overhead
  • Lab3: Raft (H, 100% pass)
    • Lab3A: Leader Election (M)
      • Locks are fiddly, but remember that the spawning function does not wait for its goroutines, so it is safe to acquire the lock inside a goroutine even while the function that launched it is itself holding the lock
      • If you follow the framework, some lab requirements differ from earlier years, e.g. avoiding time.Timer/time.Ticker; all periodic work should happen in the ticker goroutine only
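The locking pattern above can be sketched with a vote-counting example: the launching loop holds no lock while spawning, and each goroutine takes the lock only internally. The scenario and all names are illustrative, not the lab's actual code:

```go
package main

import (
	"fmt"
	"sync"
)

// countVotes simulates a candidate collecting votes from peers.
// Each goroutine locks only inside itself; the loop that spawns
// them never blocks waiting under the lock, so there is no deadlock.
func countVotes(grants []bool) int {
	var mu sync.Mutex
	votes := 1 // vote for self
	var wg sync.WaitGroup
	for _, granted := range grants {
		wg.Add(1)
		go func(granted bool) {
			defer wg.Done()
			// here a real implementation would send a RequestVote RPC
			if granted {
				mu.Lock() // lock taken inside the goroutine only
				votes++
				mu.Unlock()
			}
		}(granted)
	}
	wg.Wait()
	return votes
}

func main() {
	// 2 of 4 peers grant their vote; with the self-vote that is 3 of 5.
	fmt.Println(countVotes([]bool{true, true, false, false})) // prints 3
}
```

The same shape works when the spawner itself runs under the lock: the goroutines simply wait their turn once they reach mu.Lock().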
    • Lab3B: log (HH)
      • Not so hard, but very involved; you'd better come to understand the algorithm by running the tests rather than by reading others' code, since different versions can be confusing
      • It cost me a lot of time to debug and to understand the algorithm and its corner cases
    • Lab3C: persistence (SS)
      • TestFigure83C/TestFigure8Unreliable3C cannot pass within the 30s timeout
    • Lab3D: log compaction (H)
      • Don't be confused by the concepts: a Snapshot is just log compaction, not persistence; we cannot recover state from snapshots alone
      • But debugging is hard, and you'd better not modify other files' implementations such as config.go. Here are some hints:
        • If you encounter a one([cmd]) failed to reach agreement error, there are two possible reasons:
          • Your 3A implementation has bugs and no leader is successfully elected
          • The applyCh is not the one config.go set up; note that applierSnap checks rafts[i] before checking applyCh outputs
        • readPersist must reload lastApplied/commitIndex from lastIncludedIndexRd, or a newly added server will fail to find the lastApplied + 1 index
        • We cannot compact the log proactively from a background goroutine; the older tests do not support such a snapshot mechanism (see the concept note above)
  • Lab4: Fault-tolerant Key/Value Service (S/M, 100% pass)
    • Lab4A: Key/value service without snapshots (M)
      • Client Linear RPC Specification, Page 67
      • How to pass TestSpeed4A:
        • We accelerate leader discovery by adding a GetState RPC call; I do not know why the framework code must try every server each time, which wastes a lot of time
        • Without this, TestSpeed4A cannot pass: the requests theoretically cannot meet the Lab 3 timing requirements
      • Do NOT casually modify your Lab 3 Raft implementation: in a multi-process environment even small design changes can cause critical, hard-to-understand bugs. If you are confident in your Lab 3 tests, leave the implementation alone; if you really do find Lab 3 bugs by retesting Raft directly, fix them and make sure Raft still passes everything after each modification
      • Using another RPC call, GetState, to recognize the leader faster adds no extra communication cost
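The leader-caching idea above can be sketched like this: the clerk remembers the last known leader and probes it first, falling back to a scan only when the cached leader has changed. The server/clerk types and the in-process GetState call are illustrative stand-ins for the real RPC:

```go
package main

import "fmt"

// server simulates a KV replica; only the leader answers true.
type server struct{ isLeader bool }

func (s *server) GetState() bool { return s.isLeader }

// clerk caches the last known leader so later requests skip the scan.
type clerk struct {
	servers  []*server
	leaderID int
	probes   int // counts GetState calls, for illustration
}

// findLeader probes the cached leader first, then scans the rest.
func (ck *clerk) findLeader() int {
	ck.probes++
	if ck.servers[ck.leaderID].GetState() {
		return ck.leaderID // cache hit: one probe
	}
	for i, s := range ck.servers {
		if i == ck.leaderID {
			continue // already probed above
		}
		ck.probes++
		if s.GetState() {
			ck.leaderID = i
			return i
		}
	}
	return -1 // no leader right now; caller should retry later
}

func main() {
	ck := &clerk{servers: []*server{{}, {}, {isLeader: true}}}
	ck.findLeader() // first call scans: 3 probes
	p := ck.probes
	ck.findLeader()            // second call hits the cache
	fmt.Println(ck.probes - p) // prints 1
}
```

This is why the extra GetState round adds no meaningful cost: in the steady state every request after the first needs a single probe.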
    • Lab4B: Key/value service with snapshots (S)
      • We should save the kv storage and the client sequence numbers into snapshots, alongside the Raft state
      • We hit a corner case handled roughly back in Lab 3: in AppendEntries, if the leader sends entries that have already been compacted into a snapshot (but the leader's nextIndex array has not learned this yet), then args.PrevLogIndex < rf.getFirstLogIndex(); we return false with FirstIndex set to -1 and let the leader retry later. This modification passes all tests, but may not be as clean as other approaches
      • go test does not cope well with many goroutines; iterations can easily get stuck
      • golang's unsafe.Sizeof is of little help here (it only reports the shallow size)
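The AppendEntries corner case described above can be sketched in isolation. Only the compaction guard is shown; the rest of the handler is elided and the type names are illustrative:

```go
package main

import "fmt"

type appendEntriesArgs struct {
	PrevLogIndex int
}

type appendEntriesReply struct {
	Success    bool
	FirstIndex int
}

type raft struct {
	firstLogIndex int // index of the first entry still in the log after compaction
}

// handleAppendEntries shows only the snapshot corner case: if the
// leader's PrevLogIndex has already been compacted into our snapshot,
// reject with FirstIndex = -1 so the leader backs off and retries
// once its nextIndex catches up with the snapshot.
func (rf *raft) handleAppendEntries(args *appendEntriesArgs, reply *appendEntriesReply) {
	if args.PrevLogIndex < rf.firstLogIndex {
		reply.Success = false
		reply.FirstIndex = -1 // sentinel: "retry later", not a real conflict index
		return
	}
	reply.Success = true // the real consistency check would go here
}

func main() {
	rf := &raft{firstLogIndex: 10}
	var reply appendEntriesReply
	rf.handleAppendEntries(&appendEntriesArgs{PrevLogIndex: 5}, &reply)
	fmt.Println(reply.Success, reply.FirstIndex) // prints "false -1"
}
```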
  • Lab5: Sharded Key/Value Service (Coding: M/H, Debug: HH, 100% pass)
    • Recommend this blog
    • Lab5A: The Controller and Static Sharding (S/M, 100% pass)
      • TC: $O(2n)$, SC: $O(1)$ shard rebalance algorithm (unlike others' costly $O(n^2)$): first compute the average number of shards per gid and the remainder; use gid 0 as temporary storage for the extra shards (those above avg + 1 while the remainder lasts, above avg once it is exhausted); then move the parked shards to gids holding fewer than avg shards
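The two-pass rebalance above can be sketched as follows. This is a minimal illustration, not the repo's actual controller code; determinism across replicas comes from sorting the gids before assigning quotas:

```go
package main

import (
	"fmt"
	"sort"
)

// rebalance redistributes shards among groups in two passes, using
// gid 0 as temporary storage. shards[i] is the gid owning shard i;
// gids lists the live groups.
func rebalance(shards []int, gids []int) {
	sort.Ints(gids) // deterministic order across replicas
	avg := len(shards) / len(gids)
	remains := len(shards) % len(gids)

	count := map[int]int{}
	for _, g := range shards {
		count[g]++
	}

	// Quotas: the first `remains` groups keep avg+1 shards, the rest avg.
	quota := map[int]int{}
	for i, g := range gids {
		quota[g] = avg
		if i < remains {
			quota[g]++
		}
	}

	// Pass 1: park every shard of an overloaded (or dead) group on gid 0.
	for i, g := range shards {
		if g == 0 || count[g] > quota[g] {
			if g != 0 {
				count[g]--
			}
			shards[i] = 0
		}
	}

	// Pass 2: hand the parked shards to groups still below quota.
	for i := range shards {
		if shards[i] != 0 {
			continue
		}
		for _, g := range gids {
			if count[g] < quota[g] {
				shards[i] = g
				count[g]++
				break
			}
		}
	}
}

func main() {
	shards := []int{1, 1, 1, 1, 1, 1, 1, 1, 1, 1} // gid 1 owns everything
	rebalance(shards, []int{1, 2, 3})
	counts := map[int]int{}
	for _, g := range shards {
		counts[g]++
	}
	fmt.Println(counts[1], counts[2], counts[3]) // prints "4 3 3"
}
```

Each shard moves at most once (parked, then placed), which is where the $O(2n)$ bound comes from; the inner scan in pass 2 could be replaced by a cursor over gids to make that strict.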
    • Lab5B (Coding: M/H, Debug: HHH, 100% pass)
      • Idea Reference Blog: https://www.inlighting.org/archives/mit-6.824-notes
      • Bug log:
        • Snapshot Test (3 Days):
          • time.Millisecond typo: time.Microsecond
          • Bug: stuck on rf.log encoding process
            • Why: a newly created server could not sync its log or take correct snapshots, so rf.log kept growing
            • Solution:
              • Separate snapshotting from the main loop engine; do not let a blocked channel stall the snapshot process
              • Check the snapshot index: if it is 0, skip the snapshot
        • UnReliable Test (0.5 Days)
          • repeated ctrler requests should return ErrOK but were typo'd as ErrWrongLeader (the most serious bug)
        • All (1 Day):
          • livelock: help the new leader apply old logs
          • handleOps returns directly when the new SequenceNum $\leq$ the old one
        • Other optimizations
      • Performance: we use conservative time intervals, so our performance is much lower than others'. We may improve this later.
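The SequenceNum guard from the bug log above can be sketched as a small duplicate-detection table. The kvServer/op types are illustrative; the point is that a stale or repeated op must return the remembered outcome (here just ErrOK) instead of re-executing:

```go
package main

import "fmt"

type op struct {
	ClientID    int64
	SequenceNum int
	Key, Value  string
}

// kvServer applies an op only when its SequenceNum is newer than the
// last one seen from that client; duplicates are acknowledged (ErrOK,
// not ErrWrongLeader) without being re-applied.
type kvServer struct {
	data    map[string]string
	lastSeq map[int64]int
	applied int // counts real applications, for illustration
}

func (kv *kvServer) handleOp(o op) string {
	if seq, ok := kv.lastSeq[o.ClientID]; ok && o.SequenceNum <= seq {
		return "ErrOK" // duplicate: already applied, answer from memory
	}
	kv.data[o.Key] = o.Value
	kv.lastSeq[o.ClientID] = o.SequenceNum
	kv.applied++
	return "ErrOK"
}

func main() {
	kv := &kvServer{data: map[string]string{}, lastSeq: map[int64]int{}}
	kv.handleOp(op{ClientID: 1, SequenceNum: 1, Key: "x", Value: "a"})
	kv.handleOp(op{ClientID: 1, SequenceNum: 1, Key: "x", Value: "b"}) // retried duplicate
	kv.handleOp(op{ClientID: 1, SequenceNum: 2, Key: "x", Value: "c"})
	fmt.Println(kv.applied, kv.data["x"]) // prints "2 c"
}
```

Note that lastSeq is exactly the client-sequence table that Lab4B says must go into the snapshot alongside the kv data.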
  • Challenges
    • TestChallenge1Delete: an easy one; send delete-shard requests after installing shards
    • TestChallenge2Unaffected: check shards before common operations; if their state is Serving, they can be used directly
    • TestChallenge2Partial

Difficulty levels (from my own experience, independent of any official assessment):

  • SS: Too Simple
  • S: Simple
  • M: Medium
  • H: Hard
  • HH: Too Hard
  • HHH: Extremely Hard

This repo clarifies some FAQs, but some of its answers need to be reconsidered: https://github.com/niebayes/MIT-6.5840. It uses modularized development, so its code is not very useful as a direct reference.

Good reference: https://github.com/lantin16/MIT6.824-2020-Lab/tree/main

Version

Go: go1.21.3 linux/amd64. Project based on git://g.csail.mit.edu/6.5840-golabs-2024, commit e7aea3e613fdf9814580c97fd56c22c86a798c0e.
