Steins;Lab

  • 项目
  • 折腾
  • 笔记
  • 图册
  • 杂谈
  • 文章索引 - 博主自荐博文
  • 关于/留言
Steins;Lab
某团的自留研究所
  1. 首页
  2. 项目跟踪
  3. 正文

bakemono - 一个 Go 实现的磁盘缓存引擎

2023年4月26日 3239点热度 0人点赞 0条评论

近一个月从底层翻阅了 Apache Traffic Server 的磁盘缓存引擎,俞发觉得精妙。另外自己在开发的 web 代理缓存 hitori 中,也确实需要一款大容量磁盘缓存引擎。便实现了一个磁盘缓存引擎库,名字叫做 bakemono。

https://github.com/bocchi-the-cache/bakemono

经过几十个小时的压力测试后,刚刚 release 了 v0.1。

在博客中直接放了项目文档,供读者批评指正。

Table of Contents

  • bakemono
    • 🏁 Design goals
    • 💾 Cache Storage Engine
    • 🗝️ Usage
      • Install
      • Init
      • Read/Write
      • Note
    • 🤖 Tech Design
      • Data Structure
        • 🧱 Chunk
        • 🔖 Dir
        • 🪣 Segment, Bucket, Freelist
        • 🗂️ Vol
      • Write
      • Read
      • Metadata Persistence
    • Performance
    • Other Information
      • Roadmap
      • Name Origin
      • Logo
      • Who are bocchi-the-cache?

bakemono

Go Reference
Go Report Card
License
ci-bakemono-tests
codecov

bakemonois a cache storage engine implemented in Go.

🏁 Design goals

  • 🪶 Lightweight: easy to embed in your project
  • 🚀 High-performance: high throughput and low latency
  • 😀 Code-readable: simple but powerful storage design, easy to read and understand

It is highly inspired by Apache Traffic Server, implemented for our cache-proxy project hitori.

💾 Cache Storage Engine

What is a cache storage engine?
What is the difference from an embeddable k-v database?

Similarities:
They both are:

  • key-value storage
  • embeddable
  • persistent storage on SSD/HDD

Differences:
Cache storage are:

  • allowed to drop data when conditions are met
  • fault-tolerant (just return a MISS when disk failure happens)

Cache storage is common in CDN (Content Delivery Network). It is used to cache frequently accessed data to reduce the load of backend servers.

The size of cache data is usually ~100TiB per bare-metal server.

🗝️ Usage

Install

You can use bakemono as a pkg in your project.

go get github.com/bocchi-the-cache/bakemono

Init

Then simply import and init a Vol in your code:

func main() {
    cfg, err := bakemono.NewDefaultVolOptions("/tmp/bakemono-test.vol", 1024*512*100000, 1024*1024)
    if err != nil {
        panic(err)
    }

    v := &bakemono.Vol{}
    corrupted, err := v.Init(cfg)
    if err != nil {
        panic(err)
    }
    if corrupted {
        log.Printf("vol is corrupted, but fixed. ignore this if first time running.")
    }

    // ...
}

Read/Write

func main() {
    // ...

    // write
    err = v.Set([]byte("key"), []byte("value"))
    if err != nil {
        panic(err)
    }
    // read
    hit, data, err := v.Get([]byte("key"))
    if err != nil {
        // note: err can be not nil when disk failure happens
        // consider it as a MISS when err != nil, or log it to do further processing
        panic(err)
    }
    if !hit {
        panic("key should be hit")
    }
    if string(data) != "value" {
        panic("value should be 'value'")
    }
    log.Printf("value: %s", data)

    // close
    err = v.Close()
    if err != nil {
        panic(err)
    }
}

Note

Concurrency RW is supported.

In this version, they are sharing several RWLocks. We will give more tuning options in the future.

We highly recommend you to read tech design doc before using it in high-load scenarios.

🤖 Tech Design

... until now.

Data Structure

There are 3 data structures in bakemono:

  • Vol
    • volume, represents a single file on disk.
    • A Vol is what we finally persist on disk.
    • We will support bare block device in the future.
  • Chunk
    • basic unit of your k-v cache data.
    • restored on disk.
  • Dir
    • directory, a meta index of Chunk.
    • All Dirs are loaded in memory. It is compact.

🧱 Chunk

Chunk is the basic unit of cache data.

name data type desc
Magic uint32 fixed: 0x00114514
Checksum uint32 checksum of DataRaw
Key [3000]byte fixed size key bytes
DataLength uint32
HeaderSize uint32 fixed: 4096.
HeaderChecksum uint32 checksum of the above
DataRaw variable bytes raw data

We force set chunk header size to 4KB this version, a sector size.

🔖 Dir

Dir is the meta index of Chunk.

Every Chunk has a Dir to represent it in memory.

Dir is always loaded in memory.

type Dir struct {
    /*
       unsigned int offset : 24;  // (0,1:0-7)
       unsigned int bigInternal : 2;      // (1:8-9)
       unsigned int sizeInternal : 6;     // (1:10-15)
       unsigned int tag : 12;     // (2:0-11)
       unsigned int phase : 1;    // (2:12)
       unsigned int head : 1;     // (2:13)
       unsigned int pinned : 1;   // (2:14)
       unsigned int token : 1;    // (2:15)
       unsigned int next : 16;    // (3)
       unsigned int offset_high : 16;

       if unused, raw[2] is `prev`, represents previous dir in freelist.

       approx_size = sectorSize(512) * (2**3big) * sizeInternal
    */
    raw [5]uint16
}

Dir is organized in raw 10 bytes. Use bit operations to get/set fields. This is a design of Apache Traffic Server.
We must save every bit to reduce memory usage.

Note:

  • If Dir is unused, raw[2] is prev, represents previous dir in freelist. To save 1 byte.
  • Size is not exact. It is sectorSize(512) * (2**3big) * sizeInternal. Max to represent ~16GB using only 8 bits.

Memory usage:

If we have 100TB data with 1MB chunk size, we only need 100TB/1MB*10B = 1GB memory to index Dirs.

🪣 Segment, Bucket, Freelist

Segment and Bucket are logical groups of Dir.

Segment is a collection of Buckets. Bucket is a collection of dirs.

They are logically organized as below:

  • bucket is group of fixed size 4 dirs.

  • segment has max size 2^16=65536 dirs.

  • Dirs is a linked list.

  • All non-bucket-head dirs to freelist when initializing, named freeDirs.

  • map[segmentId]dirId to index the first free dir of each segment.

Note, dirs is an array in memory.

Why segments? We could lock, flush meta per segment.

🗂️ Vol

Vol is the volume on disk. It is the final data structure we persist on disk.

Vol offset calculation:

init options are

type VolOptions struct {
    Fp        OffsetReaderWriterCloser
    FileSize  Offset
    ChunkAvgSize Offset

    FlushMetaInterval time.Duration
}

Max dirs size is nearly FileSize/ChunkAvgSize (There are alignment for bucket and segments).

Meaning, we have max FileSize/ChunkAvgSize chunks in this volume.

Vol Multi meta

Meta A/B are designed to be flush alternately. To avoid data loss when power failure happens.

In this version, only use meta A. Will implement multi meta in the future.

Write

We use md5 to hash key:

  • key -> segmentId
  • key -> bucketId

Once segmentId and bucketId are known:

  • try to use bucket-head dir.
  • if bucket-head is used, try to use non-bucket-head dir in this bucket.
  • if all dirs in this bucket are used, try to grab a free dir from freeDirs.

When no free dir in freeDirs:

  • rebuild freeDirs in this segment.
  • if still no free dir, purge 10% buckets randomly in this segment.

Once get free dir, unlink it from freeDirs and link it as tail of bucket. Write chunk offset, key and other metadata to dir.

One example:

Then, write data to disk.

  • rw lock the Segment
  • write data to data range of Vol

Note: no hard limit for data size. But should avoid too large data, hold write lock too long.

Chunk data is cyclic, we will overwrite old data when disk is full. The advantage is

  • sequential write is faster
  • no need to delete old data
  • no need to compact data in background
  • no write amplification

Read

We use md5 to hash key same as write

  • key -> segmentId
  • key -> bucketId

Once segmentId and bucketId are known:

  • find key dir linked list.
  • if not found, return MISS
  • if found, read Chunk data from disk.
  • Check again the full key in Chunk.

Due to the only 12 bit tag in dir, we have to check key again in Chunk.
Collision is possible, but will be much less when file size is large(>100GB).

There are a little read amplification due to approx size of Dir. But it is acceptable.

Metadata Persistence

Flush/restore meta A with FlushMetaInterval.

Will implement multi meta in the future.

Performance

Still in progress.

On our initial testing with HDD, it is nearly same to do fio test on hard disk.

Other Information

Roadmap

  • We are working on basic caching functions in this stage.
  • When caching functions are stable, we will concentrate on performance tuning.

Name Origin

Bakemono is a Japanese word meaning "monster". In Chinese, it is called "贵物".

We wish it could be a lightweight but high-performance cache storage engine like a "bakemono"!

Logo

The logo is designed by Yige.

Who are bocchi-the-cache?

We are a group of engineers who are interested in storage, networking and Go programming language.

We are excited to build projects using new technologies and share our experience with others.

相关

标签: 缓存引擎
最后更新:2023年4月26日

SPtuan

SPtuan 是一名普通的工程师,最大的愿望是度过平静的时光。 当前从事网络/CDN/对象存储研发。

点赞
< 上一篇
下一篇 >
0 0 votes
文章评分
Subscribe
Login
提醒
guest

guest

0 评论
最新
最旧 得票最多
Inline Feedbacks
View all comments

SPtuan

SPtuan 是一名普通的工程师,最大的愿望是度过平静的时光。
当前从事网络/CDN/对象存储研发。

  • bakemono
    • 🏁 Design goals
    • 💾 Cache Storage Engine
    • 🗝️ Usage
      • Install
      • Init
      • Read/Write
      • Note
    • 🤖 Tech Design
      • Data Structure
        • 🧱 Chunk
        • 🔖 Dir
        • 🪣 Segment, Bucket, Freelist
        • 🗂️ Vol
      • Write
      • Read
      • Metadata Persistence
    • Performance
    • Other Information
      • Roadmap
      • Name Origin
      • Logo
      • Who are bocchi-the-cache?
分类
  • Uncategorized
  • 图册
  • 学习笔记
  • 库
  • 折腾
  • 杂谈
  • 瞎**扯
  • 碎碎念
  • 项目跟踪
最近评论
SPtuan 发布于 2 个月前(03月22日) 书签: 关于 disk-io 的经验, 异步/同步 io 系统设计的经验 https://you...
SPtuan 发布于 2 个月前(03月21日) 如果公司不是对外提供这些服务的,这种岗位都是 infra 部门,平均年龄确实会大一些。尤其构建和维护...
HUA 发布于 2 个月前(03月19日) 想请问博主对于国内CDN行业,以及CDN调度、DNS托管类服务相关岗位的看法,以及是否还推荐校招新人...
SPtuan 发布于 4 个月前(02月03日) 2025 注: 长辈对于只身去深圳的担忧,更多地来自于 80s/90s 治安情况。近几年了解了严打...
SPtuan 发布于 5 个月前(01月16日) 哈哈,100就100吧,新年快乐!
热门主题 & 页面
  • 使用 WSL2 + X11 转发 - 在 Windows10 中打造 GNU/Linux 学习生产环境
  • PYNQ上手体验:以目标检测应用为例
  • [实验]VPS搭建ss服务中转实现纯ipv6访问网络-校园网免流量
  • 分布式存储漫游指南 1: 2025 年了,存储硬件啥样了?
  • VOCALOID4 - Miku V4 Chinese 初音未来V4中文试用 - 附《茉莉花》demo
归档
  • 2025 年 5 月
  • 2025 年 3 月
  • 2024 年 12 月
  • 2024 年 9 月
  • 2024 年 8 月
  • 2024 年 5 月
  • 2024 年 3 月
  • 2024 年 2 月
  • 2023 年 12 月
  • 2023 年 11 月
  • 2023 年 9 月
  • 2023 年 8 月
  • 2023 年 4 月
  • 2023 年 1 月
  • 2022 年 12 月
  • 2022 年 10 月
  • 2022 年 9 月
  • 2022 年 7 月
  • 2022 年 6 月
  • 2022 年 2 月
  • 2021 年 12 月
  • 2021 年 11 月
  • 2021 年 2 月
  • 2021 年 1 月
  • 2020 年 9 月
  • 2020 年 4 月
  • 2020 年 3 月
  • 2020 年 1 月
  • 2019 年 8 月
  • 2019 年 7 月
  • 2019 年 5 月
  • 2019 年 4 月
  • 2019 年 3 月
  • 2019 年 2 月
  • 2018 年 12 月
  • 2018 年 10 月
  • 2018 年 9 月
  • 2018 年 8 月
  • 2018 年 5 月
  • 2018 年 2 月
  • 2018 年 1 月
  • 2017 年 11 月
  • 2017 年 9 月
  • 2017 年 7 月
  • 2017 年 6 月
  • 2017 年 5 月
  • 2017 年 4 月
  • 2017 年 3 月
  • 2017 年 2 月
  • 2017 年 1 月
  • 2016 年 12 月
  • 2016 年 11 月
  • 2016 年 10 月
  • 2016 年 9 月
  • 2016 年 8 月
  • 2016 年 7 月
  • 2016 年 6 月
  • 2016 年 5 月
  • 2016 年 4 月
  • 2016 年 3 月
  • 2016 年 2 月
  • 2016 年 1 月
  • 2015 年 12 月
  • 2015 年 11 月
  • 2015 年 9 月

友情链接:

Blessing Studio hahaschool 绘枫和畅 魔法少女Fandy monsterx Clarke的博客 Luminous’ Home Shintaku's Blog
蓝黑的博客 haruhi.club Yida的博客 Bo2SS 涛叔 TangBao 同和君Hocassian

Steins;Lab 团子神社 zdfmc.net

steinslab.io built with ❤. Thanks for all 2015-2025.

Theme Kratos Made By Seaton Jiang

wpDiscuz