diff --git a/benchmarks/v0.4.0/REPORT.md b/benchmarks/v0.4.0/REPORT.md new file mode 100644 index 0000000..b03a0b0 --- /dev/null +++ b/benchmarks/v0.4.0/REPORT.md @@ -0,0 +1,186 @@ +# Lolly v0.4.0 性能分析报告 + +> **生成日期**: 2026-06-11 +> **测试环境**: Linux amd64, Go 1.26.4 +> **测试负载**: wrk 静态文件 + 代理 (共 ~600 并发) + +--- + +## 1. 基准测试摘要 + +完整基准数据见 `benchmarks/v0.4.0/summary.txt`。 + +关键微基准指标: + +| 模块 | 关键指标 | 数值 | +|------|---------|------| +| `cache` | FileCacheGet | ~45-53 ns/op, 1 alloc/op | +| `cache` | ProxyCacheGet | ~125 ns/op, 1 alloc/op | +| `matcher` | RadixTree 查找 | <100 ns/op | +| `proxy` | WebSocket 检测 | ~30-50 ns/op | +| `middleware/security` | RateLimiter | ~60-100 ns/op | +| `stream` | filterHealthy | ~50-200 ns/op | +| `resolver` | DNS 缓存命中 | ~100-300 ns/op | + +--- + +## 2. CPU 热点 Top 10 + +数据来源:`go tool pprof -top cpu.prof` + +| 排名 | 函数 | flat% | cum% | 分析 | +|------|------|-------|------|------| +| 1 | `internal/runtime/syscall/linux.Syscall6` | 61.64% | 61.64% | 系统调用开销(epoll/read/write),网络 I/O 正常开销 | +| 2 | `runtime.memmove` | 2.13% | 2.13% | 内存拷贝 | +| 3 | `zerolog/internal/json.Encoder.AppendString` | 1.14% | 1.37% | JSON 字符串编码 | +| 4 | `time.runtimeNow` | 1.00% | 1.00% | 时间获取 | +| 5 | `fasthttp.(*Server).serveConn` | 0.87% | 92.67% | 请求处理总入口 | +| 6 | `runtime.futex` | 0.71% | 0.71% | 线程同步 | +| 7 | `runtime.exitsyscall` | 0.66% | 0.73% | 系统调用返回 | +| 8 | `runtime.nanotime` | 0.66% | 0.66% | 纳秒时间 | +| 9 | `runtime.gopark` | 0.48% | 0.51% | goroutine 调度 | +| 10 | **`logging.(*Logger).LogAccess`** | **0.23%** | **16.36%** | **⚠️ 访问日志是巨大的 CPU 热点** | + +**关键洞察**: +- 61.64% 的 CPU 在系统调用中,这是高并发网络 I/O 的基线开销 +- **访问日志 (LogAccess) 占累计 16.36%** — 这是应用层最大的单一热点 +- zerolog JSON 编码也占了一部分(1.37% cum) + +--- + +## 3. 内存分配热点 Top 10 + +数据来源:`go tool pprof -top allocs.prof` + +| 排名 | 函数 | flat% | cum% | 分析 | +|------|------|-------|------|------| +| 1 | **`os.statNolog`** | **74.95%** | **79.78%** | **⚠️ 绝对主导,文件 stat 调用产生大量分配** | +| 2 | `syscall.ByteSliceFromString` | 4.83% | 4.83% | 字符串转字节片 | +| 3 | `net.IP.String` | 4.80% | 4.80% | IP 地址字符串化 | +| 4 | `net.JoinHostPort` | 4.54% | 4.54% | 地址拼接 | +| 5 | `internal/bytealg.MakeNoZero` | 4.51% | 4.51% | 字节分配 | +| 6 | `sync.(*poolChain).pushHead` | 0.77% | 0.77% | sync.Pool 操作 | +| 7 | `bufio.NewReaderSize` | 0.74% | 0.74% | bufio reader 创建 | +| 8 | `net.(*Dialer).DialContext` | 0.38% | 2.39% | 连接创建 | +| 9 | `context.AfterFunc` | 0.27% | 0.68% | context 回调 | +| 10 | `zerolog.(*Event).msg` | 0% | 0.77% | 日志事件处理 | + +**关键洞察**: +- **`os.statNolog` 独占 74.95% 的分配** — 这是最大的优化机会 +- `net.IP.String` + `net.JoinHostPort` 合计 **9.34%** — 地址格式化反复分配 +- 连接建立(DialContext)也占 2.39% cum — 连接池不够高效 + +--- + +## 4. 堆内存占用 Top 10 + +数据来源:`go tool pprof -top heap.prof` + +| 排名 | 函数 | inuse% | 分析 | +|------|------|--------|------| +| 1 | `bufio.NewReaderSize` | 27.31% | bufio reader 长期占用 | +| 2 | `bufio.NewWriterSize` | 27.29% | bufio writer 长期占用 | +| 3 | `fasthttp.(*RequestHeader).parseFirstLine` | 18.14% | 请求头解析 | +| 4 | `runtime.mallocgc` | 9.09% | 一般分配 | +| 5 | `golang.org/x/net/http2/hpack.init` | 4.56% | HTTP/2 hpack 初始化 | +| 6 | `fasthttp.(*Server).acquireCtx` | 4.54% | 请求上下文 | +| 7 | `zerolog.init` | 4.54% | zerolog 事件池 | +| 8 | `sync.(*poolChain).pushHead` | 4.53% | sync.Pool 链 | + +**关键洞察**: +- `bufio.Reader/Writer` 合计 **54.6%** 堆内存 — 可以优化池化复用 +- fasthttp 的请求上下文和头解析占 22.68% + +--- + +## 5. 优化建议与优先级 + +### P0 — 高优先级(预期收益大) + +#### 1. 访问日志 CPU 优化 (cum 16.36%) + +**问题**: `logging.(*Logger).LogAccess` 累计占 16.36% CPU,包括 JSON 编码和时间获取。 + +**方向**: +- 添加访问日志采样配置(如只记录 1% 请求) +- 使用对象池复用日志事件 +- 简化默认访问日志格式(避免复杂 JSON) +- 异步批量写入访问日志 + +**预期收益**: CPU 降低 5-15% + +#### 2. 静态文件 stat 调用优化 (allocs 74.95%) + +**问题**: `os.statNolog` 独占 74.95% 分配,说明静态文件 handler 频繁调用 stat 检查文件存在性。 + +**方向**: +- 增强 `FileInfoCache` 命中率和 TTL +- 对不存在的路径做负缓存(negative cache) +- 避免在代理路径上调用 stat +- 优化 `try_files` 实现,减少重复 stat + +**预期收益**: 分配降低 50-70% + +### P1 — 中优先级 + +#### 3. 网络地址字符串化优化 (allocs 9.34%) + +**问题**: `net.IP.String` 和 `net.JoinHostPort` 合计 9.34% 分配。 + +**方向**: +- 在访问日志等热路径缓存 IP 字符串 +- 使用预格式化的 host:port 缓存 +- 避免重复解析 X-Forwarded-For + +**预期收益**: 分配降低 5-10% + +#### 4. bufio Reader/Writer 池化 (heap 54.6%) + +**问题**: `bufio.NewReaderSize/NewWriterSize` 占 54.6% 堆内存。 + +**方向**: +- 复用 fasthttp HostClient 的 reader/writer +- 在代理路径中池化 bufio 对象 + +**预期收益**: 内存占用降低 20-40% + +### P2 — 低优先级 + +#### 5. 连接池优化 + +**问题**: `net.Dialer.DialContext` 占 2.39% cum。 + +**方向**: +- 增加 fasthttp HostClient 连接池大小 +- 优化空闲连接超时 + +#### 6. 系统调用优化 + +**问题**: syscall 占 61.64% CPU,这是基线。 + +**方向**: +- 考虑 io_uring(Linux)降低 syscall 开销 +- sendfile 优化(已部分实现) + +--- + +## 6. 建议的下一步实施顺序 + +1. **Task A**: 访问日志采样 + 异步化 — 最大 CPU 收益 +2. **Task B**: 静态文件 stat 负缓存 + FileInfoCache 优化 — 最大分配收益 +3. **Task C**: IP/Host:Port 字符串缓存 — 降低分配 +4. **Task D**: bufio 池化 — 降低内存占用 +5. **Task E**: 重新跑 benchmark 验证收益 + +--- + +## 7. 原始数据文件 + +- `benchmarks/v0.4.0/pprof/cpu.prof` — CPU profile +- `benchmarks/v0.4.0/pprof/allocs.prof` — 分配 profile +- `benchmarks/v0.4.0/pprof/heap.prof` — 堆内存 profile +- `benchmarks/v0.4.0/pprof/goroutine.prof` — Goroutine profile +- `benchmarks/v0.4.0/cpu-top.txt` — CPU top 函数 +- `benchmarks/v0.4.0/allocs-top.txt` — 分配 top 函数 +- `benchmarks/v0.4.0/heap-top.txt` — 堆内存 top 函数 +- `benchmarks/v0.4.0/summary.txt` — 基准测试汇总 diff --git a/benchmarks/v0.4.0/allocs-top.txt b/benchmarks/v0.4.0/allocs-top.txt new file mode 100644 index 0000000..58505a7 --- /dev/null +++ b/benchmarks/v0.4.0/allocs-top.txt @@ -0,0 +1,30 @@ +File: lolly +Build ID: 91ef2ed3a6e5dbfe48cc1c157eb0ae7ab47d63c1 +Type: alloc_space +Time: 2026-06-11 13:47:29 CST +Showing nodes accounting for 1626.28MB, 95.80% of 1697.53MB total +Dropped 135 nodes (cum <= 8.49MB) + flat flat% sum% cum cum% + 1272.22MB 74.95% 74.95% 1354.22MB 79.78% os.statNolog + 82MB 4.83% 79.78% 82MB 4.83% syscall.ByteSliceFromString + 81.50MB 4.80% 84.58% 81.50MB 4.80% net.IP.String + 77MB 4.54% 89.11% 77MB 4.54% net.JoinHostPort (inline) + 76.50MB 4.51% 93.62% 76.50MB 4.51% internal/bytealg.MakeNoZero + 13.01MB 0.77% 94.39% 13.01MB 0.77% sync.(*poolChain).pushHead + 12.55MB 0.74% 95.13% 12.55MB 0.74% bufio.NewReaderSize (inline) + 6.50MB 0.38% 95.51% 40.50MB 2.39% net.(*Dialer).DialContext + 4.50MB 0.27% 95.77% 11.50MB 0.68% context.AfterFunc + 0.50MB 0.029% 95.80% 12.50MB 0.74% net.(*netFD).connect + 0 0% 95.80% 12.01MB 0.71% github.com/rs/zerolog.(*Event).Msg (inline) + 0 0% 95.80% 13.01MB 0.77% github.com/rs/zerolog.(*Event).msg + 0 0% 95.80% 13.01MB 0.77% github.com/rs/zerolog.(*Event).write + 0 0% 95.80% 9.01MB 0.53% github.com/rs/zerolog.putEvent + 0 0% 95.80% 48.50MB 2.86% github.com/valyala/fasthttp.(*HostClient).AcquireConn + 0 0% 95.80% 57.53MB 3.39% github.com/valyala/fasthttp.(*HostClient).Do + 0 0% 95.80% 48.50MB 2.86% github.com/valyala/fasthttp.(*HostClient).dialHostHard + 0 0% 95.80% 57.53MB 3.39% github.com/valyala/fasthttp.(*HostClient).do + 0 0% 95.80% 57.53MB 3.39% github.com/valyala/fasthttp.(*HostClient).doNonNilReqResp + 0 0% 95.80% 1684.31MB 99.22% github.com/valyala/fasthttp.(*Server).serveConn + 0 0% 95.80% 48.50MB 2.86% github.com/valyala/fasthttp.(*TCPDialer).Dial (inline) + 0 0% 95.80% 48.50MB 2.86% github.com/valyala/fasthttp.(*TCPDialer).dial + 0 0% 95.80% 48MB 2.83% github.com/valyala/fasthttp.(*TCPDialer).tryDial diff --git a/benchmarks/v0.4.0/cpu-top.txt b/benchmarks/v0.4.0/cpu-top.txt new file mode 100644 index 0000000..e532bd6 --- /dev/null +++ b/benchmarks/v0.4.0/cpu-top.txt @@ -0,0 +1,30 @@ +File: lolly +Build ID: 91ef2ed3a6e5dbfe48cc1c157eb0ae7ab47d63c1 +Type: cpu +Time: 2026-06-11 13:46:57 CST +Duration: 30s, Total samples = 78.99s (263.28%) +Showing nodes accounting for 63.79s, 80.76% of 78.99s total +Dropped 642 nodes (cum <= 0.39s) + flat flat% sum% cum cum% + 48.69s 61.64% 61.64% 48.69s 61.64% internal/runtime/syscall/linux.Syscall6 + 1.68s 2.13% 63.77% 1.68s 2.13% runtime.memmove + 0.90s 1.14% 64.91% 1.08s 1.37% github.com/rs/zerolog/internal/json.Encoder.AppendString + 0.79s 1.00% 65.91% 0.79s 1.00% time.runtimeNow + 0.69s 0.87% 66.78% 73.20s 92.67% github.com/valyala/fasthttp.(*Server).serveConn + 0.56s 0.71% 67.49% 0.56s 0.71% runtime.futex + 0.55s 0.7% 68.19% 0.55s 0.7% indexbytebody + 0.52s 0.66% 68.84% 0.58s 0.73% runtime.exitsyscall + 0.52s 0.66% 69.50% 0.52s 0.66% runtime.nanotime (inline) + 0.38s 0.48% 69.98% 0.40s 0.51% runtime.gopark + 0.30s 0.38% 70.36% 0.43s 0.54% runtime.reentersyscall + 0.26s 0.33% 70.69% 47.80s 60.51% syscall.RawSyscall6 + 0.24s 0.3% 71.00% 3.45s 4.37% runtime.schedule + 0.23s 0.29% 71.29% 1.22s 1.54% runtime.netpoll + 0.22s 0.28% 71.57% 1.03s 1.30% github.com/valyala/fasthttp.(*RequestHeader).parseHeaders + 0.20s 0.25% 71.82% 0.58s 0.73% internal/strconv.genericFtoa + 0.19s 0.24% 72.06% 8.46s 10.71% internal/poll.(*FD).Read + 0.18s 0.23% 72.29% 8.74s 11.06% bufio.(*Reader).fill + 0.18s 0.23% 72.52% 7.74s 9.80% github.com/rs/zerolog.(*Event).msg + 0.18s 0.23% 72.74% 12.92s 16.36% rua.plus/lolly/internal/logging.(*Logger).LogAccess + 0.18s 0.23% 72.97% 0.42s 0.53% sync.(*Pool).Get + 0.17s 0.22% 73.19% 6.56s 8.30% github.com/rs/zerolog.(*Event).write diff --git a/benchmarks/v0.4.0/heap-top.txt b/benchmarks/v0.4.0/heap-top.txt new file mode 100644 index 0000000..3bd067a --- /dev/null +++ b/benchmarks/v0.4.0/heap-top.txt @@ -0,0 +1,30 @@ +File: lolly +Build ID: 91ef2ed3a6e5dbfe48cc1c157eb0ae7ab47d63c1 +Type: inuse_space +Time: 2026-06-11 13:47:29 CST +Showing nodes accounting for 11291.78kB, 100% of 11291.78kB total + flat flat% sum% cum cum% + 3084.01kB 27.31% 27.31% 3084.01kB 27.31% bufio.NewReaderSize (inline) + 3082.04kB 27.29% 54.61% 3082.04kB 27.29% bufio.NewWriterSize (inline) + 2048.03kB 18.14% 72.74% 2048.03kB 18.14% github.com/valyala/fasthttp.(*RequestHeader).parseFirstLine + 1026kB 9.09% 81.83% 1026kB 9.09% runtime.mallocgc + 514.63kB 4.56% 86.39% 514.63kB 4.56% golang.org/x/net/http2/hpack.init + 512.75kB 4.54% 90.93% 512.75kB 4.54% github.com/valyala/fasthttp.(*Server).acquireCtx + 512.25kB 4.54% 95.47% 512.25kB 4.54% github.com/rs/zerolog.init.func3 + 512.06kB 4.53% 100% 512.06kB 4.53% sync.(*poolChain).pushHead + 0 0% 100% 512.25kB 4.54% github.com/rs/zerolog.(*Logger).Info (inline) + 0 0% 100% 512.25kB 4.54% github.com/rs/zerolog.(*Logger).newEvent + 0 0% 100% 512.25kB 4.54% github.com/rs/zerolog.newEvent + 0 0% 100% 514kB 4.55% github.com/valyala/fasthttp.(*HostClient).AcquireReader + 0 0% 100% 512.06kB 4.53% github.com/valyala/fasthttp.(*HostClient).CloseConn + 0 0% 100% 1026.06kB 9.09% github.com/valyala/fasthttp.(*HostClient).Do + 0 0% 100% 1026.06kB 9.09% github.com/valyala/fasthttp.(*HostClient).do + 0 0% 100% 1026.06kB 9.09% github.com/valyala/fasthttp.(*HostClient).doNonNilReqResp + 0 0% 100% 2048.03kB 18.14% github.com/valyala/fasthttp.(*RequestHeader).Read (inline) + 0 0% 100% 2048.03kB 18.14% github.com/valyala/fasthttp.(*RequestHeader).parse + 0 0% 100% 2048.03kB 18.14% github.com/valyala/fasthttp.(*RequestHeader).readLoop + 0 0% 100% 2048.03kB 18.14% github.com/valyala/fasthttp.(*RequestHeader).tryRead + 0 0% 100% 9751.15kB 86.36% github.com/valyala/fasthttp.(*Server).serveConn + 0 0% 100% 1026.06kB 9.09% github.com/valyala/fasthttp.(*transport).RoundTrip + 0 0% 100% 9751.15kB 86.36% github.com/valyala/fasthttp.(*workerPool).getCh.func1 + 0 0% 100% 9751.15kB 86.36% github.com/valyala/fasthttp.(*workerPool).workerFunc