lolly/docs/superpowers/plans/2026-06-04-performance-optimization.md
xfy ebeb258c58 docs(benchmark): add v0.4.0 baseline summary and update gitignore
- Collect baseline benchmark summary across all core modules
- Save key results to benchmarks/v0.4.0/summary.txt
- Update .gitignore to track benchmark summaries/reports
- Include performance optimization design docs and plan
2026-06-11 13:43:28 +08:00

21 KiB
Raw Permalink Blame History

性能热路径优化 Implementation Plan

For agentic workers: REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (- [ ]) syntax for tracking.

Goal: 消除 6 个已确认的热路径性能瓶颈,减少每请求堆分配和锁竞争。

Architecture: 针对 loadbalance filterHealthy每请求分配、RadixTree 堆分配、DNS LRU O(n) 操作、FileInfoCache 双重锁升级、ConsistentHash 双重锁、IsAvailable mutex 逐个进行激进优化。每项优化独立可测,不改变外部接口。

Tech Stack: Go 1.26+, sync.Pool, container/list, atomic operations, unsafe pointer (b2s/s2b)


Task 1: loadbalance — filterHealthy 零分配优化

Files:

  • Modify: internal/loadbalance/balancer.go
  • Test: internal/loadbalance/balancer_test.go
  • Benchmark: internal/loadbalance/balancer_bench_test.go

问题: filterHealthy 每次调用分配 2 个切片(available + backupsfilterHealthyAndExclude 分配 3 个(加 excludeSet mapIPHash.SelectByIP 额外分配 fnv.New64a() 对象。这些在每个请求的负载均衡选择中触发。

方案: 引入 filterContext 结构体持有可复用缓冲区,通过 sync.Pool 管理。filterHealthy 改为写入 filterContext 的预分配切片而非每次 make。IPHash 使用内联 FNV-64a 哈希避免 fnv.New64a() 分配。

  • Step 1: 定义 filterContext 和 Pool

balancer.go 中添加:

type filterContext struct {
	available []*Target
	backups   []*Target
	excludeSet map[string]bool
}

var filterContextPool = sync.Pool{
	New: func() any {
		return &filterContext{
			available:  make([]*Target, 0, 64),
			backups:    make([]*Target, 0, 64),
			excludeSet: make(map[string]bool, 8),
		}
	},
}

func acquireFilterContext() *filterContext {
	fc := filterContextPool.Get().(*filterContext)
	return fc
}

func releaseFilterContext(fc *filterContext) {
	fc.available = fc.available[:0]
	fc.backups = fc.backups[:0]
	for k := range fc.excludeSet {
		delete(fc.excludeSet, k)
	}
	filterContextPool.Put(fc)
}
  • Step 2: 重写 filterHealthy 为 filterInto
func filterInto(fc *filterContext, targets []*Target) []*Target {
	for _, t := range targets {
		if !t.IsAvailable() {
			continue
		}
		if t.IsBackup() {
			fc.backups = append(fc.backups, t)
		} else {
			fc.available = append(fc.available, t)
		}
	}
	if len(fc.available) > 0 {
		return fc.available
	}
	return fc.backups
}
  • Step 3: 重写 filterHealthyAndExclude 为 filterIntoExcluding
func filterIntoExcluding(fc *filterContext, targets []*Target, excluded []*Target) []*Target {
	if len(excluded) > 0 {
		for _, t := range excluded {
			if t != nil {
				fc.excludeSet[t.URL] = true
			}
		}
	}
	for _, t := range targets {
		if !t.IsAvailable() || fc.excludeSet[t.URL] {
			continue
		}
		if t.IsBackup() {
			fc.backups = append(fc.backups, t)
		} else {
			fc.available = append(fc.available, t)
		}
	}
	if len(fc.available) > 0 {
		return fc.available
	}
	return fc.backups
}
  • Step 4: 添加内联 FNV-64a 哈希函数

避免 fnv.New64a() 的堆分配:

func fnvHash64a(key string) uint64 {
	var h uint64 = 14695981039346656037
	for i := 0; i < len(key); i++ {
		h ^= uint64(key[i])
		h *= 1099511628211
	}
	return h
}
  • Step 5: 重写所有 Balancer 的 Select/SelectExcluding 使用 Pool

RoundRobin 示例:

func (r *RoundRobin) Select(targets []*Target) *Target {
	fc := acquireFilterContext()
	defer releaseFilterContext(fc)
	healthy := filterInto(fc, targets)
	if len(healthy) == 0 {
		return nil
	}
	idx := r.counter.Add(1) - 1
	return healthy[idx%uint64(len(healthy))]
}

对所有 6 个算法的 Select/SelectExcluding 方法应用相同模式。 IPHash 中将 fnv.New64a() + h.Write() + h.Sum64() 替换为 fnvHash64a(clientIP)。 ConsistentHash 中 hashKeyString 也替换为 fnvHash64a

  • Step 6: 保留旧函数作为兼容别名(可选)

保留 filterHealthyfilterHealthyAndExclude 函数签名但标记 // Deprecated,内部调用新实现,确保外部调用方不受影响。如果没有外部调用方,可直接删除。

  • Step 7: 运行现有测试验证正确性
go test -v -count=1 ./internal/loadbalance/...

预期:全部 PASS无行为变化。

  • Step 8: 运行基准测试验证性能提升
go test -bench=BenchmarkAllBalancers -benchmem -count=5 ./internal/loadbalance/...

预期allocs/op 从 2-3 降低到 0-1。

  • Step 9: 提交
git add internal/loadbalance/balancer.go internal/loadbalance/random.go internal/loadbalance/consistent_hash.go
git commit -m "perf(loadbalance): eliminate per-request allocations in filterHealthy with sync.Pool"

Task 2: loadbalance — IsAvailable 无锁化

Files:

  • Modify: internal/loadbalance/balancer.go
  • Test: internal/loadbalance/balancer_test.go

问题: IsAvailable()MaxFails > 0 时获取 failMu mutex。这发生在 filterHealthy/filterInto 的每次目标遍历中,意味着每次 LB Select 都会对每个目标加锁一次。

方案: 将 failCountfailedUntil 改为 atomic 操作,消除 failMu mutex。使用 CAS 循环实现 RecordFailure 和冷却重置。

  • Step 1: 修改 Target 字段为 atomic
type Target struct {
	// ... 保留其他字段 ...
	failCount   atomic.Int64
	failedUntil atomic.Int64
	// 删除: failMu sync.Mutex
}
  • Step 2: 重写 IsAvailable 为无锁版本
func (t *Target) IsAvailable() bool {
	if !t.Healthy.Load() || t.Down {
		return false
	}
	if t.MaxConns > 0 && atomic.LoadInt64(&t.Connections) >= t.MaxConns {
		return false
	}
	if t.MaxFails > 0 {
		failCount := t.failCount.Load()
		if failCount >= t.MaxFails {
			failedUntil := t.failedUntil.Load()
			if time.Now().UnixNano() < failedUntil {
				return false
			}
			// 冷却已过期,尝试重置(允许竞争,不影响正确性)
			if failedUntil > 0 {
				t.failCount.Store(0)
				t.failedUntil.Store(0)
			}
		}
	}
	return true
}
  • Step 3: 重写 RecordFailure 和 RecordSuccess 为无锁版本
func (t *Target) RecordFailure() int64 {
	if t.MaxFails <= 0 {
		return 0
	}
	count := t.failCount.Add(1)
	if count >= t.MaxFails {
		timeout := t.FailTimeout
		if timeout <= 0 {
			timeout = 10 * time.Second
		}
		t.failedUntil.Store(time.Now().Add(timeout).UnixNano())
	}
	return count
}

func (t *Target) RecordSuccess() {
	if t.MaxFails <= 0 {
		return
	}
	t.failCount.Store(0)
	t.failedUntil.Store(0)
}
  • Step 4: 运行测试
go test -v -count=1 -run=TestTarget ./internal/loadbalance/...

预期:全部 PASS。

  • Step 5: 运行完整包测试
go test -v -count=1 ./internal/loadbalance/...
  • Step 6: 提交
git add internal/loadbalance/balancer.go
git commit -m "perf(loadbalance): replace failMu mutex with atomic operations in IsAvailable"

Task 3: matcher — RadixTree 零分配搜索

Files:

  • Modify: internal/matcher/radix.go
  • Test: internal/matcher/radix_test.go, internal/matcher/integration_test.go
  • Benchmark: 新建 internal/matcher/radix_bench_test.go

问题: searchLongest 递归搜索中,每次遇到带 handler 的节点都分配 &MatchResult{},一次查找可能分配 N 个 MatchResult 但只保留 1 个。正则匹配器 GetCaptures 每次分配 map[string]string

方案: 使用 sync.Pool 复用 MatchResult。引入 searchState 避免递归中的多次分配,改为栈式迭代或就地更新最佳匹配。

  • Step 1: 添加 MatchResult Pool

radix.go 中添加:

var matchResultPool = sync.Pool{
	New: func() any {
		return &MatchResult{}
	},
}
  • Step 2: 重写 searchLongest 为就地更新最佳匹配

将递归中创建 newMatch 改为直接比较节点字段,仅在最终返回时从池中获取 MatchResult

func (t *RadixTree) searchLongest(node *RadixNode, path string, bestNode *RadixNode, bestPrefixLen int) *RadixNode {
	if node == nil || path == "" {
		return bestNode
	}
	if !strings.HasPrefix(path, node.prefix) {
		return bestNode
	}
	remaining := path[len(node.prefix):]
	if node.handler != nil {
		if bestNode == nil || node.priority < bestNode.priority {
			bestNode = node
		} else if node.priority == bestNode.priority && len(node.prefix) > bestPrefixLen {
			bestNode = node
		}
	}
	for _, child := range node.children {
		bestNode = t.searchLongest(child, remaining, bestNode, bestPrefixLen)
	}
	return bestNode
}
  • Step 3: 修改 FindLongestPrefix 在返回时构建 MatchResult
func (t *RadixTree) FindLongestPrefix(path string) *MatchResult {
	bestNode := t.searchLongest(t.root, path, nil, 0)
	if bestNode == nil {
		return nil
	}
	result := matchResultPool.Get().(*MatchResult)
	result.Handler = bestNode.handler
	result.Path = bestNode.prefix
	result.Priority = bestNode.priority
	result.LocationType = bestNode.locationType
	result.Internal = bestNode.internal
	return result
}

注意:调用方使用完 MatchResult 后需调用 PutMatchResult(result) 归还池。

  • Step 4: 添加 ReleaseMatchResult 函数供调用方使用
func ReleaseMatchResult(r *MatchResult) {
	if r == nil {
		return
	}
	r.Handler = nil
	r.Captures = nil
	r.Path = ""
	r.LocationType = ""
	r.Internal = false
	r.Priority = 0
	matchResultPool.Put(r)
}
  • Step 5: 更新 LocationEngine.Match 调用 FindLongestPrefix 后释放

location.go 中,确保所有 FindLongestPrefix 返回值在函数结束前调用 ReleaseMatchResult(需分析调用链确认所有权)。

  • Step 6: 添加基准测试文件

创建 internal/matcher/radix_bench_test.go

func BenchmarkRadixTreeFindLongestPrefix(b *testing.B) {
	tree := NewRadixTree()
	paths := []string{"/", "/api", "/api/v1", "/api/v1/users", "/api/v1/users/:id", "/static", "/static/css", "/static/js", "/health", "/favicon.ico"}
	for _, p := range paths {
		tree.Insert(p, func(ctx *fasthttp.RequestCtx) {}, 0, "prefix", false)
	}
	tree.MarkInitialized()

	b.ResetTimer()
	b.ReportAllocs()
	for b.Loop() {
		result := tree.FindLongestPrefix("/api/v1/users/123")
		ReleaseMatchResult(result)
	}
}

func BenchmarkRadixTreeFindLongestPrefixParallel(b *testing.B) {
	// 同上但用 b.RunParallel
}
  • Step 7: 运行所有 matcher 测试
go test -v -count=1 ./internal/matcher/...
  • Step 8: 运行基准测试
go test -bench=BenchmarkRadixTree -benchmem ./internal/matcher/...

预期allocs/op 从 N匹配路径上的 handler 节点数)降低到 1仅池获取

  • Step 9: 提交
git add internal/matcher/radix.go internal/matcher/radix_bench_test.go
git commit -m "perf(matcher): eliminate heap allocations in RadixTree search with sync.Pool"

Task 4: resolver — LRU 从 O(n) 切换到 O(1)

Files:

  • Modify: internal/resolver/resolver.go, internal/resolver/cache.go
  • Test: internal/resolver/resolver_test.go, internal/resolver/mock_dns_test.go
  • Benchmark: internal/resolver/resolver_bench_test.go

问题: DNS 缓存的 LRU 使用 []string 切片实现 moveToFrontLocked,每次操作 O(n) 线性扫描 + 切片重组。storeCache 持有写锁执行整个 O(n) 操作,阻塞所有并发读。

方案: 将 LRU 从 []string 切片替换为 container/list + map[string]*list.Element(与 FileCache 和 FileInfoCache 的模式一致。moveToFront 和 eviction 都变为 O(1)。

  • Step 1: 修改 DNSResolver 结构体
type DNSResolver struct {
	config       *config.ResolverConfig
	stopCh       chan struct{}
	refreshHosts map[string]struct{}
	cache        map[string]*DNSCacheEntry
	lruList      *list.List                    // 替代 lruOrder []string
	lruIndex     map[string]*list.Element      // 新增host -> list.Element
	hits         atomic.Int64
	misses       atomic.Int64
	errors       atomic.Int64
	latencyNs    atomic.Int64
	count        atomic.Int64
	mu           sync.RWMutex
	serverIdx    atomic.Uint32
	started      atomic.Bool
}
  • Step 2: 重写 storeCache
func (r *DNSResolver) storeCache(host string, entry *DNSCacheEntry) {
	r.mu.Lock()
	defer r.mu.Unlock()

	if elem, ok := r.lruIndex[host]; ok {
		r.cache[host] = entry
		r.lruList.MoveToFront(elem)
		return
	}

	if r.config.CacheSize > 0 && len(r.cache) >= r.config.CacheSize {
		r.evictLRULocked()
	}

	r.cache[host] = entry
	elem := r.lruList.PushFront(host)
	r.lruIndex[host] = elem
}
  • Step 3: 重写 evictLRULocked
func (r *DNSResolver) evictLRULocked() {
	oldest := r.lruList.Back()
	if oldest == nil {
		return
	}
	host := oldest.Value.(string)
	delete(r.cache, host)
	delete(r.lruIndex, host)
	r.lruList.Remove(oldest)
}
  • Step 4: 删除 moveToFrontLocked(不再需要,由 lruList.MoveToFront 替代)

  • Step 5: 更新 New() 构造函数

return &DNSResolver{
	config:       &configCopy,
	stopCh:       make(chan struct{}),
	refreshHosts: make(map[string]struct{}),
	cache:        make(map[string]*DNSCacheEntry),
	lruList:      list.New(),
	lruIndex:     make(map[string]*list.Element),
}
  • Step 6: 更新 DeleteCacheEntry
func (r *DNSResolver) DeleteCacheEntry(host string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.cache, host)
	if elem, ok := r.lruIndex[host]; ok {
		r.lruList.Remove(elem)
		delete(r.lruIndex, host)
	}
	delete(r.refreshHosts, host)
}
  • Step 7: 更新 ClearCache
func (r *DNSResolver) ClearCache() {
	r.mu.Lock()
	r.cache = make(map[string]*DNSCacheEntry)
	r.lruList = list.New()
	r.lruIndex = make(map[string]*list.Element)
	r.refreshHosts = make(map[string]struct{})
	r.mu.Unlock()
}
  • Step 8: 添加 import "container/list"

  • Step 9: 运行所有 resolver 测试

go test -v -count=1 ./internal/resolver/...
  • Step 10: 运行基准测试验证
go test -bench=BenchmarkDNS -benchmem -count=5 ./internal/resolver/...

预期:BenchmarkDNSResolverCacheWriteLockBenchmarkDNSResolverMixedWorkload 显著提速。

  • Step 11: 提交
git add internal/resolver/resolver.go internal/resolver/cache.go
git commit -m "perf(resolver): replace slice-based LRU with container/list for O(1) operations"

Task 5: handler — FileInfoCache 近似 LRU 消除读锁升级

Files:

  • Modify: internal/handler/fileinfo_cache.go
  • Test: internal/handler/static_test.go(间接,通过现有测试验证)
  • Benchmark: internal/handler/static_bench_test.go

问题: FileInfoCache.Get() 在每次缓存命中时需要 两次锁获取:先 RLock 检查存在性和 TTL然后释放 RLock再 Lock 做 MoveToFront LRU 更新。每次命中都有 RLock→Lock 升级。

方案: 采用近似 LRU 策略——Get 路径跳过 MoveToFront,仅 RLock 快速路径返回。仅在 Set 路径(写操作)时更新 LRU 位置。这与 FileCache 的近似 LRU 策略一致。

  • Step 1: 重写 Get 为纯 RLock 快速路径
func (c *FileInfoCache) Get(filePath string) (os.FileInfo, bool) {
	c.mu.RLock()
	entry, ok := c.entries[filePath]
	if !ok {
		c.mu.RUnlock()
		return nil, false
	}
	if time.Since(entry.cachedAt) > fileInfoCacheTTL {
		c.mu.RUnlock()
		// 过期删除仍需写锁
		c.mu.Lock()
		if e, ok := c.entries[filePath]; ok && time.Since(e.cachedAt) > fileInfoCacheTTL {
			c.lruList.Remove(e.element)
			delete(c.entries, filePath)
		}
		c.mu.Unlock()
		return nil, false
	}
	info := entry.info
	c.mu.RUnlock()
	return info, true
}
  • Step 2: 在 Set 中添加 LRU 位置更新
func (c *FileInfoCache) Set(filePath string, info os.FileInfo) {
	c.mu.Lock()
	defer c.mu.Unlock()

	if entry, ok := c.entries[filePath]; ok {
		entry.info = info
		entry.cachedAt = time.Now()
		c.lruList.MoveToFront(entry.element)
		return
	}
	// ... 淘汰和插入逻辑不变 ...
}
  • Step 3: 添加 FileInfoCache 专项基准测试

internal/handler/static_bench_test.go 中添加:

func BenchmarkFileInfoCacheGetHit(b *testing.B) {
	cache := NewFileInfoCache()
	info, _ := os.Stat("testdata/style.css")
	cache.Set("/style.css", info)

	b.ResetTimer()
	b.ReportAllocs()
	for b.Loop() {
		cache.Get("/style.css")
	}
}

func BenchmarkFileInfoCacheGetHitParallel(b *testing.B) {
	cache := NewFileInfoCache()
	info, _ := os.Stat("testdata/style.css")
	cache.Set("/style.css", info)

	b.ResetTimer()
	b.ReportAllocs()
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			cache.Get("/style.css")
		}
	})
}

注意:需确认 NewFileInfoCache 是否已导出,若未导出则在包内测试。

  • Step 4: 运行所有 handler 测试
go test -v -count=1 ./internal/handler/...
  • Step 5: 运行基准测试
go test -bench=BenchmarkFileInfoCache -benchmem ./internal/handler/...

预期Get hit 路径从 2 次锁操作降到 1 次 RLock并行吞吐显著提升。

  • Step 6: 提交
git add internal/handler/fileinfo_cache.go internal/handler/static_bench_test.go
git commit -m "perf(handler): eliminate read-lock upgrade in FileInfoCache.Get with approximate LRU"

Task 6: loadbalance — ConsistentHash 消除双重锁

Files:

  • Modify: internal/loadbalance/consistent_hash.go
  • Test: internal/loadbalance/balancer_test.go

问题: SelectByKeySelectExcludingByKey 在发现 circle 为空时执行 RLock → RUnlock → rebuildCircle(Lock) → RLock,即释放读锁、获取写锁重建、再获取读锁。在冷启动高并发时,多个 goroutine 可能同时触发 rebuild。

方案: 使用 sync.Onceatomic.Bool 保证 rebuild 只执行一次。在首次 Select 前完成 rebuild后续调用直接 RLock 读取。同时将 hashKeyString 替换为内联 fnvHash64aTask 1 中已定义)。

  • Step 1: 添加 rebuildOnce 字段
type ConsistentHash struct {
	circle       map[uint64]*Target
	hashKey      string
	sortedHashes []uint64
	virtualNodes int
	mu           sync.RWMutex
	rebuilt      atomic.Bool
}
  • Step 2: 重写 SelectByKey 使用 ensureRebuilt
func (c *ConsistentHash) ensureRebuilt(targets []*Target) {
	if c.rebuilt.Load() {
		return
	}
	c.rebuildCircle(targets)
}

func (c *ConsistentHash) SelectByKey(targets []*Target, key string) *Target {
	c.ensureRebuilt(targets)

	c.mu.RLock()
	defer c.mu.RUnlock()

	if len(c.sortedHashes) == 0 {
		return nil
	}

	hash := fnvHash64a(key)
	idx := sort.Search(len(c.sortedHashes), func(i int) bool {
		return c.sortedHashes[i] >= hash
	})
	if idx >= len(c.sortedHashes) {
		idx = 0
	}
	return c.circle[c.sortedHashes[idx]]
}
  • Step 3: 更新 Rebuild 方法重置 rebuilt 标志
func (c *ConsistentHash) Rebuild(targets []*Target) {
	c.rebuilt.Store(false)
	c.rebuildCircle(targets)
}
  • Step 4: 更新 rebuildCircle 设置 rebuilt 标志
func (c *ConsistentHash) rebuildCircle(targets []*Target) {
	c.mu.Lock()
	defer c.mu.Unlock()
	// ... 现有逻辑不变 ...
	c.rebuilt.Store(true)
}
  • Step 5: 同样更新 SelectExcludingByKey

移除内部的 RLock → RUnlock → rebuildCircle → RLock 模式,改为先 ensureRebuiltRLock

  • Step 6: 将 hashKeyString 替换为 fnvHash64a
// 删除 hashKeyString 方法
// 在 PrecomputeHashes 中将 c.hashKeyString(key) 替换为 fnvHash64a(key)
  • Step 7: 运行测试
go test -v -count=1 ./internal/loadbalance/...
  • Step 8: 运行基准测试
go test -bench=BenchmarkConsistentHash -benchmem ./internal/loadbalance/...
  • Step 9: 提交
git add internal/loadbalance/consistent_hash.go
git commit -m "perf(loadbalance): eliminate double-lock in ConsistentHash with atomic rebuild guard"

Task 7: 全局验证与基准对比

Files:

  • 无新文件修改

  • Step 1: 运行完整测试套件

make test
  • Step 2: 运行集成测试
make test-integration
  • Step 3: 运行代码格式化和静态检查
make fmt && make lint
  • Step 4: 保存基准对比结果
make bench-stat
mv benchmark-current.txt bench-after-optimization.txt

如有优化前的基准数据,运行 benchstat bench-before.txt bench-after-optimization.txt 对比。

  • Step 5: 最终提交(如有 lint 修复)
git add -A
git commit -m "chore: lint fixes after performance optimization"

依赖关系

Task 1 (filterHealthy Pool) ──→ Task 6 (ConsistentHash复用 fnvHash64a)
Task 2 (IsAvailable atomic) ──→ 无依赖(可并行)
Task 3 (RadixTree Pool)     ──→ 无依赖(可并行)
Task 4 (Resolver LRU)        ──→ 无依赖(可并行)
Task 5 (FileInfoCache)       ──→ 无依赖(可并行)
Task 7 (全局验证)            ──→ 依赖 Task 1-6 全部完成

推荐并行执行: Task 1+2 可同一批同一文件Task 3/4/5 可并行Task 6 在 Task 1 后执行。