lolly/docs/nginx/31-nginx-observability.md
xfy 972eab4267 refactor(docs): 重构文档目录结构,nginx 文档移至子目录
将 docs/ 根目录下的 nginx 相关文档统一移动到 docs/nginx/ 子目录,
提高文档组织性和可维护性。

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-16 10:48:14 +08:00

1509 lines
35 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# NGINX OpenTelemetry 可观测性指南
本文档介绍如何在 NGINX 中使用 OpenTelemetry 模块实现分布式追踪和可观测性。
## 目录
1. [OpenTelemetry 概述](#opentelemetry-概述)
2. [模块指令参考](#模块指令参考)
3. [分布式追踪配置](#分布式追踪配置)
4. [与 Jaeger/Zipkin 集成](#与-jaegerzipkin-集成)
5. [自定义属性和事件](#自定义属性和事件)
6. [完整配置示例](#完整配置示例)
7. [最佳实践](#最佳实践)
---
## OpenTelemetry 概述
### 什么是 OpenTelemetry
OpenTelemetry 是一个开源的可观测性框架,提供标准化的 API、库和工具来收集分布式追踪、指标和日志数据。它由 Cloud Native Computing Foundation (CNCF) 托管,是 Prometheus、Jaeger 和 OpenCensus 等项目合并后的统一解决方案。
### 核心概念
| 概念 | 描述 |
|------|------|
| **Trace** | 分布式追踪,表示请求在系统中的完整调用链路 |
| **Span** | 追踪中的基本工作单元,包含操作名称、起止时间、属性等 |
| **Context** | 追踪上下文用于在服务间传播追踪信息traceparent/tracestate |
| **Resource** | 描述产生遥测数据的实体(如服务名称、版本、主机) |
| **Exporter** | 将遥测数据发送到后端存储(如 OTLP、gRPC |
### 架构流程
```
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│ Client │───▶│ NGINX │───▶│ Backend │───▶│Database │
└─────────┘ └────┬────┘ └─────────┘ └─────────┘
┌────────────────┐
│ ngx_otel_module │
└───────┬────────┘
┌───────────────┐ ┌───────────┐ ┌──────────┐
│OTEL Collector │───▶│ Jaeger │ │ Zipkin │
└───────────────┘ └───────────┘ └──────────┘
```
### 模块版本要求
- NGINX Plus R28 或更高版本
- `ngx_otel_module` 动态模块(从源码编译或 NGINX Plus 包含)
---
## 模块指令参考
### otel_exporter
配置 OpenTelemetry 数据导出参数。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_exporter` | `{ ... }` | — | `http` |
**子指令:**
| 子指令 | 语法 | 默认值 | 描述 |
|--------|------|--------|------|
| `endpoint` | `[(http\|https)://]host:port;` | — | OTLP/gRPC 端点地址 |
| `trusted_certificate` | `path;` | 系统 CA | PEM 格式 CA 证书文件v0.1.2+ |
| `header` | `name value;` | — | 自定义 HTTP 请求头 |
| `interval` | `time;` | `5s` | 导出最大间隔时间 |
| `batch_size` | `number;` | `512` | 每批次最大 Span 数量 |
| `batch_count` | `number;` | `4` | 每个 worker 的待处理批次数 |
**示例:**
```nginx
http {
otel_exporter {
endpoint otel-collector:4317;
interval 5s;
batch_size 512;
batch_count 4;
trusted_certificate /etc/nginx/certs/ca.pem;
header X-API-Key secret_key;
}
}
```
### otel_service_name
设置 OTel Resource 的 `service.name` 属性。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_service_name` | `name;` | `unknown_service:nginx` | `http` |
**示例:**
```nginx
http {
otel_service_name nginx-gateway;
}
```
### otel_resource_attr
设置自定义 OTel Resource 属性v0.1.2+)。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_resource_attr` | `name value;` | — | `http` |
**示例:**
```nginx
http {
otel_resource_attr deployment.environment production;
otel_resource_attr service.version 1.2.3;
otel_resource_attr host.name $hostname;
}
```
### otel_trace
启用或禁用 OpenTelemetry 追踪。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_trace` | `on \| off \| $variable;` | `off` | `http`, `server`, `location` |
**示例:**
```nginx
http {
otel_trace off;
server {
listen 80;
otel_trace on;
location /api {
otel_trace on;
}
location /health {
otel_trace off; # 健康检查不记录
}
}
}
```
### otel_trace_context
配置 traceparent/tracestate 头的传播方式。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_trace_context` | `extract \| inject \| propagate \| ignore;` | `ignore` | `http`, `server`, `location` |
**选项说明:**
| 值 | 描述 |
|----|------|
| `extract` | 从入站请求中提取追踪上下文,继承上游标识符 |
| `inject` | 向出站请求注入新的追踪上下文,覆盖现有上下文 |
| `propagate` | 更新现有上下文(先 extract 再 inject保持追踪链完整 |
| `ignore` | 忽略上下文头处理 |
**示例:**
```nginx
server {
location / {
# 作为入口网关,注入新追踪上下文
otel_trace_context inject;
proxy_pass http://backend;
}
location /api/ {
# 作为中间代理,传播上游追踪上下文
otel_trace_context propagate;
proxy_pass http://api_backend;
}
}
```
### otel_span_name
定义 OTel Span 的名称。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_span_name` | `name;` | location 名称 | `http`, `server`, `location` |
**示例:**
```nginx
server {
location /api/users {
otel_span_name "GET /api/users";
# 或使用变量
otel_span_name "$request_method $uri";
}
}
```
### otel_span_attr
添加自定义 OTel Span 属性。
| 指令 | 语法 | 默认值 | 上下文 |
|------|------|--------|--------|
| `otel_span_attr` | `name value;` | — | `http`, `server`, `location` |
**示例:**
```nginx
server {
location /api/ {
otel_span_attr http.route "/api/*";
otel_span_attr user.id $remote_user;
otel_span_attr client.ip $remote_addr;
}
}
```
### 嵌入式变量
| 变量 | 描述 |
|------|------|
| `$otel_trace_id` | 追踪标识符 |
| `$otel_span_id` | 当前 Span 标识符 |
| `$otel_parent_id` | 父 Span 标识符 |
| `$otel_parent_sampled` | 父 Span 的采样标志(`1``0` |
---
## 分布式追踪配置
### Trace 上下文传播
追踪上下文传播是分布式追踪的核心,确保请求在多个服务间保持相同的追踪标识。
#### W3C Trace Context 标准
NGINX 使用 W3C Trace Context 标准:
- **traceparent**: `00-{trace-id}-{parent-id}-{flags}`
- **tracestate**: 厂商特定的上下文信息
```
Traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
│ │ │ │ │
│ │ │ │ └── 标志位( sampled: 01)
│ │ │ └── 父 Span ID
│ │ └── Trace ID
│ └── 版本
└── 固定前缀
```
#### 传播模式配置
**场景 1: 边缘网关(追踪入口)**
```nginx
http {
otel_service_name nginx-edge-gateway;
otel_trace on;
server {
listen 80;
server_name api.example.com;
location / {
# 注入新的追踪上下文
otel_trace_context inject;
# 将追踪 ID 传递给后端
proxy_set_header X-Trace-ID $otel_trace_id;
proxy_set_header X-Span-ID $otel_span_id;
proxy_pass http://backend_cluster;
}
}
}
```
**场景 2: 中间代理(追踪传播)**
```nginx
server {
listen 8080;
location / {
# 传播上游追踪上下文
otel_trace_context propagate;
# 将追踪头传递给下游服务
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
proxy_pass http://internal_services;
}
}
```
**场景 3: 混合模式**
```nginx
server {
location /public/ {
# 公共 API: 创建新追踪
otel_trace_context inject;
proxy_pass http://public_backend;
}
location /internal/ {
# 内部服务: 传播已有追踪
otel_trace_context propagate;
proxy_pass http://internal_backend;
}
location /health {
# 健康检查: 忽略追踪
otel_trace off;
return 200 "healthy\n";
}
}
```
### Span 配置
#### 标准 Span 属性
NGINX 自动记录的 Span 属性:
| 属性 | 描述 | 示例值 |
|------|------|--------|
| `http.method` | HTTP 方法 | GET, POST, PUT |
| `http.url` | 请求 URL | `https://api.example.com/users` |
| `http.scheme` | 协议 | http, https |
| `http.host` | 主机名 | `api.example.com` |
| `http.status_code` | 响应状态码 | 200, 404, 500 |
| `http.user_agent` | 用户代理 | Mozilla/5.0... |
| `http.request_content_length` | 请求体大小 | 1024 |
| `http.response_content_length` | 响应体大小 | 2048 |
| `net.peer.ip` | 客户端 IP | 192.168.1.100 |
| `net.peer.port` | 客户端端口 | 54321 |
#### 自定义 Span 名称
```nginx
map $request_method $span_name {
default "$request_method $uri";
GET "get_request";
POST "create_resource";
}
server {
location /api/ {
otel_span_name $span_name;
proxy_pass http://backend;
}
}
```
#### 条件性 Span 属性
```nginx
map $status $error_type {
~^[45] "client_or_server_error";
default "";
}
server {
location / {
otel_span_attr error.class $error_type;
otel_span_attr request.id $request_id;
otel_span_attr tenant.id $http_x_tenant_id;
proxy_pass http://backend;
}
}
```
### 采样策略
采样控制追踪数据的收集量,平衡可观测性和性能开销。
#### 采样类型
| 采样类型 | 描述 | 使用场景 |
|----------|------|----------|
| **Head-Based** | 在追踪开始时决定采样 | 低延迟、低资源开销 |
| **Tail-Based** | 基于完整追踪数据决定 | 捕获错误/慢请求 |
| **Parent-Based** | 继承父 Span 的采样决定 | 保持追踪完整性 |
#### 配置示例
**1. 始终采样(开发/测试环境)**
```nginx
http {
otel_trace on;
# 所有请求都记录
}
```
**2. 比例采样(基于变量)**
```nginx
# 使用 Lua 或外部模块实现比例采样
# 这里展示基于 Nginx 变量的实现
split_clients "$remote_addr$request_id" $trace_sampled {
10% "1"; # 10% 采样率
* "0"; # 90% 不采样
}
server {
location / {
otel_trace $trace_sampled;
proxy_pass http://backend;
}
}
```
**3. 基于请求特征采样**
```nginx
map $uri $should_trace {
default "0";
~*\.html$ "1"; # 采样 HTML 页面
/api/critical/ "1"; # 采样关键 API
/api/payment/ "1"; # 采样支付相关
}
map $http_x_debug $force_trace {
default "";
true "1";
}
server {
location / {
# 优先使用 debug header其次基于 URI
otel_trace $force_trace$should_trace;
proxy_pass http://backend;
}
}
```
**4. 错误/慢请求采样(结合 OpenTelemetry Collector**
```yaml
# otel-collector-config.yaml
processors:
tail_sampling:
policies:
- name: slow_requests
type: latency
latency: {threshold_ms: 500}
- name: errors
type: status_code
status_code: {status_codes: [500, 502, 503, 504]}
- name: probabilistic
type: probabilistic
probabilistic: {sampling_percentage: 10}
```
---
## 与 Jaeger/Zipkin 集成
### Jaeger 集成
#### 方法 1: Jaeger 原生 OTLP推荐
Jaeger 1.35+ 原生支持 OTLP 协议。
**docker-compose.yaml:**
```yaml
version: "3.8"
services:
jaeger:
image: jaegertracing/all-in-one:1.60.0
container_name: jaeger
ports:
- "16686:16686" # Jaeger UI
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
environment:
- COLLECTOR_OTLP_ENABLED=true
networks:
- observability
nginx:
image: nginx:alpine
container_name: nginx
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
ports:
- "80:80"
depends_on:
- jaeger
networks:
- observability
networks:
observability:
driver: bridge
```
**nginx.conf:**
```nginx
load_module modules/ngx_otel_module.so;
events {
worker_connections 1024;
}
http {
# OTLP 导出器配置
otel_exporter {
endpoint jaeger:4317;
interval 5s;
batch_size 512;
}
# 服务标识
otel_service_name nginx-gateway;
otel_resource_attr deployment.environment production;
otel_resource_attr host.name $hostname;
# 启用追踪
otel_trace on;
server {
listen 80;
server_name localhost;
location / {
otel_trace_context inject;
otel_span_name "$request_method $uri";
# 传递追踪上下文给后端
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
proxy_pass http://backend;
}
location /jaeger {
# 返回当前追踪信息(调试用途)
default_type application/json;
return 200 '{"trace_id":"$otel_trace_id","span_id":"$otel_span_id"}';
}
}
}
```
#### 方法 2: 通过 OpenTelemetry Collector
用于需要额外处理的场景(过滤、转换、批处理)。
**docker-compose.yaml:**
```yaml
version: "3.8"
services:
otel-collector:
image: otel/opentelemetry-collector-contrib:0.117.0
container_name: otel-collector
command: ["--config=/etc/otel-collector-config.yaml"]
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
ports:
- "4317:4317" # OTLP gRPC
- "4318:4318" # OTLP HTTP
- "9464:9464" # Prometheus metrics
networks:
- observability
jaeger:
image: jaegertracing/all-in-one:1.60.0
container_name: jaeger
ports:
- "16686:16686"
environment:
- COLLECTOR_OTLP_ENABLED=true
networks:
- observability
nginx:
image: nginx:alpine
container_name: nginx
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
ports:
- "80:80"
depends_on:
- otel-collector
networks:
- observability
networks:
observability:
driver: bridge
```
**otel-collector-config.yaml:**
```yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
resource:
attributes:
- key: environment
value: production
action: upsert
tail_sampling:
policies:
- name: slow_requests
type: latency
latency: {threshold_ms: 500}
- name: errors
type: status_code
status_code: {status_codes: [500, 502, 503, 504]}
exporters:
otlp/jaeger:
endpoint: jaeger:4317
tls:
insecure: true
debug:
verbosity: detailed
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch, resource, tail_sampling]
exporters: [otlp/jaeger, debug]
```
### Zipkin 集成
#### 方法 1: 通过 OpenTelemetry Collector
**docker-compose.yaml:**
```yaml
version: "3.8"
services:
zipkin:
image: openzipkin/zipkin:3
container_name: zipkin
ports:
- "9411:9411"
environment:
- STORAGE_TYPE=mem
networks:
- observability
otel-collector:
image: otel/opentelemetry-collector-contrib:0.117.0
container_name: otel-collector
command: ["--config=/etc/otel-collector-config.yaml"]
volumes:
- ./otel-collector-config.yaml:/etc/otel-collector-config.yaml:ro
ports:
- "4317:4317"
- "4318:4318"
depends_on:
- zipkin
networks:
- observability
nginx:
image: nginx:alpine
container_name: nginx
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
ports:
- "80:80"
depends_on:
- otel-collector
networks:
- observability
networks:
observability:
driver: bridge
```
**otel-collector-config.yaml:**
```yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
processors:
batch:
timeout: 1s
send_batch_size: 1024
exporters:
zipkin:
endpoint: http://zipkin:9411/api/v2/spans
format: json
debug:
verbosity: detailed
service:
pipelines:
traces:
receivers: [otlp]
processors: [batch]
exporters: [zipkin, debug]
```
#### 方法 2: Zipkin 直接接收
如果您的系统已使用 Zipkin可以让 Collector 同时接收 OTLP 和 Zipkin 格式。
```yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
exporters:
zipkin:
endpoint: http://zipkin:9411/api/v2/spans
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [zipkin]
```
---
## 自定义属性和事件
### 自定义 Span 属性
#### 静态属性
```nginx
http {
otel_resource_attr service.namespace ecommerce;
otel_resource_attr service.version 2.1.0;
server {
location /api/ {
otel_span_attr api.version v1;
otel_span_attr team backend;
}
}
}
```
#### 动态属性(使用变量)
```nginx
map $request_time $latency_bucket {
~^0\.[0-4] "fast";
~^0\.[5-9] "medium";
default "slow";
}
server {
location / {
otel_span_attr http.latency_bucket $latency_bucket;
otel_span_attr request.size $request_length;
otel_span_attr response.size $bytes_sent;
otel_span_attr upstream.addr $upstream_addr;
otel_span_attr upstream.response_time $upstream_response_time;
proxy_pass http://backend;
}
}
```
#### 条件属性
```nginx
map $upstream_status $upstream_error {
~^[45] "true";
default "false";
}
map $upstream_cache_status $cache_hit {
HIT "true";
default "false";
}
server {
location / {
otel_span_attr upstream.error $upstream_error;
otel_span_attr cache.hit $cache_hit;
otel_span_attr cache.status $upstream_cache_status;
proxy_pass http://backend;
proxy_cache my_cache;
}
}
```
### 业务属性
```nginx
server {
location /api/orders {
# 业务相关属性
otel_span_attr business.domain orders;
otel_span_attr business.criticality high;
otel_span_attr business.region $geoip_country_code;
# 用户相关属性(注意:避免 PII
otel_span_attr user.type $http_x_user_type;
otel_span_attr user.tier $http_x_user_tier;
proxy_pass http://order_service;
}
}
```
### 使用 Lua 扩展(需要 lua-nginx-module
```nginx
server {
location / {
access_by_lua_block {
local otel = require("opentelemetry")
local span = otel.get_current_span()
-- 添加自定义属性
span:set_attribute("custom.timestamp", ngx.now())
span:set_attribute("custom.request_hash", ngx.md5(ngx.var.request_uri))
-- 添加事件
span:add_event("request_processing_started", {
["http.method"] = ngx.var.request_method,
["client.ip"] = ngx.var.remote_addr
})
}
proxy_pass http://backend;
}
}
```
---
## 完整配置示例
### 示例 1: 基础配置
```nginx
# 加载动态模块
load_module modules/ngx_otel_module.so;
user nginx;
worker_processes auto;
error_log /var/log/nginx/error.log notice;
pid /var/run/nginx.pid;
events {
worker_connections 1024;
}
http {
include /etc/nginx/mime.types;
default_type application/octet-stream;
# OpenTelemetry 导出器配置
otel_exporter {
endpoint otel-collector:4317;
interval 5s;
batch_size 512;
batch_count 4;
}
# 服务标识
otel_service_name nginx-proxy;
otel_resource_attr deployment.environment production;
otel_resource_attr host.name $hostname;
# 全局启用追踪
otel_trace on;
log_format main '$remote_addr - $remote_user [$time_local] "$request" '
'$status $body_bytes_sent "$http_referer" '
'"$http_user_agent" "$http_x_forwarded_for" '
'trace_id=$otel_trace_id span_id=$otel_span_id';
access_log /var/log/nginx/access.log main;
sendfile on;
keepalive_timeout 65;
upstream backend {
server backend1:8080 weight=5;
server backend2:8080 weight=5;
keepalive 32;
}
server {
listen 80;
server_name localhost;
# 健康检查:禁用追踪
location /health {
otel_trace off;
access_log off;
return 200 "healthy\n";
}
# 静态资源:采样
location /static/ {
otel_trace $http_x_trace_sampled;
alias /var/www/static/;
expires 1d;
}
# API 请求:完整追踪
location /api/ {
otel_trace on;
otel_trace_context propagate;
otel_span_name "$request_method $uri";
otel_span_attr http.route /api/*;
otel_span_attr api.version v1;
otel_span_attr request.id $request_id;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Request-ID $request_id;
# 传递追踪上下文
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
proxy_pass http://backend;
}
# 默认位置
location / {
otel_trace_context inject;
proxy_pass http://backend;
}
}
}
```
### 示例 2: 多环境配置
```nginx
load_module modules/ngx_otel_module.so;
events {
worker_connections 1024;
}
http {
# 根据环境变量配置
env NGINX_ENV;
env OTEL_ENDPOINT;
# 动态采样率配置
split_clients "$remote_addr$request_id" $trace_sampled {
10% "1";
* "0";
}
map $http_x_b3_sampled $b3_sampled {
default "";
"1" "1";
"0" "";
"true" "1";
"false" "";
"d" "1";
}
map $b3_sampled$trace_sampled $final_trace {
default "0";
~.*1.* "1";
}
# OTLP 导出器
otel_exporter {
endpoint ${OTEL_ENDPOINT};
interval 5s;
batch_size 512;
}
otel_service_name nginx-${NGINX_ENV};
otel_resource_attr deployment.environment ${NGINX_ENV};
# 生产环境:按比例采样
# 测试环境:全量采样
otel_trace ${NGINX_ENV} == "prod" ? $final_trace : on;
# 上游配置
upstream api_backend {
server api1.internal:8080;
server api2.internal:8080;
}
upstream web_backend {
server web1.internal:8080;
server web2.internal:8080;
}
# API 网关
server {
listen 8080;
server_name api.example.com;
location / {
otel_trace_context propagate;
otel_span_name "api:$request_method $uri";
otel_span_attr upstream.service api;
otel_span_attr rate.limit.bucket $limit_req_status;
proxy_pass http://api_backend;
}
}
# Web 网关
server {
listen 80;
server_name www.example.com;
location / {
otel_trace_context inject;
otel_span_name "web:$request_method $uri";
otel_span_attr upstream.service web;
otel_span_attr cache.status $upstream_cache_status;
proxy_pass http://web_backend;
proxy_cache web_cache;
}
}
}
```
### 示例 3: 微服务网关配置
```nginx
load_module modules/ngx_otel_module.so;
events {
worker_connections 4096;
}
http {
# OpenTelemetry 配置
otel_exporter {
endpoint otel-collector:4317;
interval 3s;
batch_size 256;
header X-Scope-OrgID tenant-1;
}
otel_service_name nginx-microgateway;
otel_resource_attr service.namespace platform;
otel_resource_attr service.version 1.0.0;
otel_resource_attr deployment.environment production;
# 追踪配置
otel_trace on;
# 日志格式包含追踪信息
log_format trace '$remote_addr - $remote_user [$time_iso8601] '
'"$request" $status $body_bytes_sent '
'"$http_referer" "$http_user_agent" '
'"trace_id":"$otel_trace_id",'
'"span_id":"$otel_span_id",'
'"parent_id":"$otel_parent_id"';
access_log /var/log/nginx/access.log trace;
# 服务发现(使用 resolver
resolver 127.0.0.11 valid=30s;
# 服务定义
upstream user_service {
server user-service:8080 resolve;
keepalive 64;
}
upstream order_service {
server order-service:8080 resolve;
keepalive 64;
}
upstream inventory_service {
server inventory-service:8080 resolve;
keepalive 64;
}
# 通用追踪配置
map $request_method $trace_operation {
GET "read";
POST "create";
PUT "update";
DELETE "delete";
PATCH "patch";
default "unknown";
}
server {
listen 80;
server_name gateway.internal;
# 追踪上下文传播
otel_trace_context propagate;
# User Service
location /api/users/ {
otel_span_name "users:$trace_operation";
otel_span_attr service.name user-service;
otel_span_attr service.operation $trace_operation;
otel_span_attr service.resource users;
proxy_pass http://user_service/;
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
}
# Order Service
location /api/orders/ {
otel_span_name "orders:$trace_operation";
otel_span_attr service.name order-service;
otel_span_attr service.operation $trace_operation;
otel_span_attr service.resource orders;
proxy_pass http://order_service/;
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
}
# Inventory Service
location /api/inventory/ {
otel_span_name "inventory:$trace_operation";
otel_span_attr service.name inventory-service;
otel_span_attr service.operation $trace_operation;
otel_span_attr service.resource inventory;
proxy_pass http://inventory_service/;
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
}
# 健康检查(无追踪)
location /health {
otel_trace off;
access_log off;
return 200 '{"status":"healthy","service":"nginx"}';
}
# 追踪信息端点(调试)
location /debug/trace {
otel_trace on;
default_type application/json;
return 200 '{
"trace_id": "$otel_trace_id",
"span_id": "$otel_span_id",
"parent_id": "$otel_parent_id",
"sampled": "$otel_parent_sampled"
}';
}
}
}
```
### 示例 4: Kubernetes 环境配置
```nginx
load_module modules/ngx_otel_module.so;
events {
worker_connections 1024;
}
http {
# 从环境变量读取 K8s 信息
env KUBERNETES_NAMESPACE;
env KUBERNETES_POD_NAME;
env KUBERNETES_NODE_NAME;
env OTEL_COLLECTOR_SERVICE;
# OTLP 导出器
otel_exporter {
endpoint ${OTEL_COLLECTOR_SERVICE}:4317;
interval 5s;
batch_size 512;
}
# 丰富的资源属性
otel_service_name nginx-ingress;
otel_resource_attr k8s.namespace.name ${KUBERNETES_NAMESPACE};
otel_resource_attr k8s.pod.name ${KUBERNETES_POD_NAME};
otel_resource_attr k8s.node.name ${KUBERNETES_NODE_NAME};
otel_resource_attr host.name ${KUBERNETES_POD_NAME};
# 启用追踪
otel_trace on;
# 上游配置K8s Service
resolver kube-dns.kube-system.svc.cluster.local valid=10s;
server {
listen 80;
location / {
otel_trace_context propagate;
otel_span_name "$request_method $uri";
otel_span_attr k8s.destination.service $proxy_host;
otel_span_attr k8s.destination.namespace ${KUBERNETES_NAMESPACE};
# 传递 K8s 相关的追踪头
proxy_set_header X-Request-ID $request_id;
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
proxy_pass http://backend-service;
}
}
}
```
---
## 最佳实践
### 1. 采样策略
**生产环境建议:**
```nginx
# 使用 Head-Based 采样降低开销
split_clients "$request_id" $trace_decision {
5% "1"; # 5% 基础采样
* "";
}
# 关键路径始终采样
map $uri $is_critical {
default "";
~*payment "1";
~*order "1";
~*auth "1";
}
map $trace_decision$is_critical $should_trace {
default "0";
~.*1.* "1";
}
otel_trace $should_trace;
```
**关键原则:**
- 错误率高的服务:提高采样率
- 高流量服务降低采样率0.1% - 1%
- 关键业务路径:全量采样
- 使用 Parent-Based 采样保持追踪链完整
### 2. 敏感数据处理
**禁止在 Span 属性中包含:**
- 密码、API Key
- 信用卡号、身份证号
- 个人身份信息 (PII)
- 会话令牌
**安全实践:**
```nginx
# 正确:使用安全的标识符
otel_span_attr user.id $http_x_user_id; # 用户 ID
otel_span_attr session.hash $cookie_session_hash; # 会话哈希
# 错误:不要记录敏感信息
# otel_span_attr user.email $http_x_user_email; # 禁止!
# otel_span_attr auth.token $http_authorization; # 禁止!
# 敏感路径禁用追踪
location /auth/login {
otel_span_attr auth.endpoint login;
# 不记录请求体
proxy_pass http://auth_service;
}
```
### 3. Span 命名规范
使用清晰、一致的命名:
```nginx
# 推荐:包含 HTTP 方法和路径
otel_span_name "$request_method $uri";
# 或按服务分类
otel_span_name "nginx:$request_method $uri";
# 避免:过于笼统或过于详细
# otel_span_name "request"; # 太笼统
# otel_span_name "GET /api/v1/users/12345"; # 包含动态 ID
```
### 4. 上下文传播
**服务边界处理:**
```nginx
# 入口服务:注入新上下文
server {
location /api/ {
otel_trace_context inject;
# 向后传递
proxy_set_header traceparent $http_traceparent;
proxy_pass http://backend;
}
}
# 中间服务:传播上下文
server {
location / {
otel_trace_context propagate;
# 既提取上游上下文,又注入到下游
proxy_set_header traceparent $http_traceparent;
proxy_pass http://next_service;
}
}
# 出口服务:提取上下文
server {
location / {
otel_trace_context extract;
# 只使用上游传入的上下文,不向后传播
proxy_pass http://final_backend;
}
}
```
### 5. 性能优化
**减少开销的配置:**
```nginx
http {
# 增大批处理大小减少网络开销
otel_exporter {
endpoint otel-collector:4317;
interval 10s; # 增大导出间隔
batch_size 1024; # 增大批大小
batch_count 8; # 增加队列深度
}
# 选择性启用追踪
map $request_uri $trace_enabled {
~*\.(css|js|png|jpg|gif|ico)$ ""; # 静态资源不追踪
/health ""; # 健康检查不追踪
/metrics ""; # 指标端点不追踪
default "1"; # 其他请求追踪
}
otel_trace $trace_enabled;
}
```
### 6. 监控 Collector 健康
```nginx
# 监控 OTLP 导出器状态
server {
location /nginx_status {
stub_status on;
allow 10.0.0.0/8;
deny all;
}
location /otel_status {
default_type application/json;
return 200 '{
"module": "ngx_otel_module",
"service_name": "${otel_service_name}",
"trace_enabled": "${otel_trace}"
}';
}
}
```
### 7. 故障排查
**常见问题及解决方案:**
| 问题 | 可能原因 | 解决方案 |
|------|----------|----------|
| 没有追踪数据 | Collector 不可达 | 检查网络连通性和端口 |
| 追踪链断裂 | 上下文传播配置错误 | 检查 `otel_trace_context` 设置 |
| Span 名称重复 | 未使用变量 | 使用 `$uri``$request_uri` |
| 采样率异常 | 变量配置错误 | 检查 `split_clients` 或 map |
| 属性缺失 | 变量未定义 | 使用 `map` 提供默认值 |
**调试配置:**
```nginx
# 临时开启详细日志
error_log /var/log/nginx/error.log debug;
# 添加调试端点
server {
location /debug/otel {
default_type application/json;
return 200 '{
"trace_id": "$otel_trace_id",
"span_id": "$otel_span_id",
"parent_id": "$otel_parent_id",
"parent_sampled": "$otel_parent_sampled",
"request_id": "$request_id",
"http_traceparent": "$http_traceparent",
"http_tracestate": "$http_tracestate"
}';
}
}
```
### 8. 多协议支持
如果后端服务使用不同协议:
```nginx
# W3C Trace Context (标准)
proxy_set_header traceparent $http_traceparent;
proxy_set_header tracestate $http_tracestate;
# B3 Propagation (Zipkin)
proxy_set_header X-B3-TraceId $otel_trace_id;
proxy_set_header X-B3-SpanId $otel_span_id;
proxy_set_header X-B3-ParentSpanId $otel_parent_id;
proxy_set_header X-B3-Sampled $otel_parent_sampled;
# Jaeger Propagation
proxy_set_header uber-trace-id "$otel_trace_id:$otel_span_id:$otel_parent_id:$otel_parent_sampled";
```
---
## 参考资源
- [NGINX OpenTelemetry Module 官方文档](https://nginx.org/en/docs/ngx_otel_module.html)
- [OpenTelemetry 官方文档](https://opentelemetry.io/docs/)
- [W3C Trace Context 规范](https://www.w3.org/TR/trace-context/)
- [Jaeger 文档](https://www.jaegertracing.io/docs/)
- [Zipkin 文档](https://zipkin.io/)
---
*文档版本: 1.0 | 最后更新: 2025-01*