監視・ロギング：観測の3原則

0 / 6 完了（0%）

「AI機能、動いてる？」と聞かれて即答できない組織は、いずれ大きな障害を起こします。本レッスンでは、AI機能の観測（Observability）を学びます。

観測の3原則

柱	役割	例
メトリクス	「どれくらい？」を集計値で	呼び出し数、エラー率、レイテンシP95
トレース	「1リクエストで何が起きたか」	リクエストID毎の処理フロー
ログ	「具体的に何が起きたか」	個別エラーメッセージ、入出力

メトリクスの設計

SLI（Service Level Indicator）の定義

# 重要なメトリクスを SLI として定義

SLI = {
    # 可用性
    "availability": "成功リクエスト / 全リクエスト",
    "error_rate": "5xx エラー / 全リクエスト",

    # パフォーマンス
    "latency_p50": "中央値レイテンシ",
    "latency_p95": "95パーセンタイル",
    "latency_p99": "99パーセンタイル",
    "ttft_p95": "Time to First Token P95",

    # 品質
    "task_completion_rate": "タスク完了 / 全試行",
    "hallucination_rate": "ハルシネーション率",
    "user_satisfaction": "ユーザー評価平均",

    # コスト
    "cost_per_request_usd": "リクエストあたりコスト",
    "cost_daily_total_usd": "日次総コスト",

    # AI 特有
    "tokens_per_request": "リクエストあたりトークン",
    "cache_hit_rate": "プロンプトキャッシュヒット率",
}

Prometheus / DataDog への送信

from prometheus_client import Counter, Histogram, Gauge

# カウンター
api_calls_total = Counter('claude_api_calls_total', 'Total API calls', ['model', 'status'])
api_errors_total = Counter('claude_api_errors_total', 'Total errors', ['error_type'])

# ヒストグラム（レイテンシ分布）
latency_seconds = Histogram(
    'claude_api_latency_seconds',
    'API call latency',
    ['model'],
    buckets=[0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

# ゲージ（現在値）
active_streams = Gauge('claude_active_streams', 'Currently active streaming connections')


# 計測コード
def call_claude(model, messages):
    with latency_seconds.labels(model=model).time():
        try:
            response = client.messages.create(model=model, messages=messages)
            api_calls_total.labels(model=model, status='success').inc()
            return response
        except Exception as e:
            api_calls_total.labels(model=model, status='error').inc()
            api_errors_total.labels(error_type=type(e).__name__).inc()
            raise

トレースの実装（OpenTelemetry）

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# 初期化
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)

tracer = trace.get_tracer(__name__)


# トレースの埋め込み
def handle_user_query(user_id: str, query: str):
    with tracer.start_as_current_span("handle_user_query") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("query.length", len(query))

        with tracer.start_as_current_span("fetch_user_context"):
            context = fetch_context(user_id)

        with tracer.start_as_current_span("claude.api.call") as ai_span:
            ai_span.set_attribute("model", "claude-sonnet-4-7")
            response = call_claude(context, query)
            ai_span.set_attribute("input.tokens", response.usage.input_tokens)
            ai_span.set_attribute("output.tokens", response.usage.output_tokens)

        with tracer.start_as_current_span("post_process"):
            return process_response(response)

ログの構造化

ログは 必ず JSON形式。検索・集計・分析がしやすい。

import json
import logging
from pythonjsonlogger import jsonlogger

# JSON ロガーセットアップ
logger = logging.getLogger("ai_ops")
handler = logging.StreamHandler()
handler.setFormatter(jsonlogger.JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)


def call_claude_with_logging(model, messages, request_id):
    logger.info("api_call_start", extra={
        "request_id": request_id,
        "model": model,
        "user_id": current_user(),
    })

    try:
        response = client.messages.create(model=model, messages=messages)

        logger.info("api_call_success", extra={
            "request_id": request_id,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "stop_reason": response.stop_reason,
        })

        return response
    except Exception as e:
        logger.error("api_call_error", extra={
            "request_id": request_id,
            "error_type": type(e).__name__,
            "error_message": str(e),
        }, exc_info=True)
        raise

個人情報を含むログの注意

# PII を含むログは漏れないよう注意

def redact_for_logging(data):
    """ログ用にセンシティブ情報をマスキング"""
    if isinstance(data, dict):
        return {k: redact_for_logging(v) if k not in SENSITIVE_KEYS else "[REDACTED]"
                for k, v in data.items()}
    if isinstance(data, str):
        # メール・電話・カード番号をマスキング
        return mask_pii(data)
    return data

SENSITIVE_KEYS = {"password", "token", "api_key", "ssn", "credit_card"}

# ログ出力時
logger.info("user_query", extra={
    "query_redacted": redact_for_logging(query),
    "user_id": user_id,
})

ダッシュボードの構成（Grafana）

パネル	表示内容
1. API健全性	呼び出し数・エラー率・成功率（直近24h）
2. レイテンシ	P50/P95/P99 の時系列
3. コスト	日次コスト・モデル別ブレークダウン
4. トークン消費	入力/出力/キャッシュ別
5. ユーザー単位	Top10 ヘビーユーザー
6. エラー詳細	エラー種別Top・最近のエラーログ
7. ツール使用	ツール別呼び出し数・エラー率
8. ユーザー満足度	フィードバックスコア・サムズアップ率

アラートルール

# Prometheus AlertManager 設定例

groups:
  - name: ai_ops_alerts
    rules:
      - alert: HighErrorRate
        expr: rate(claude_api_errors_total[5m]) / rate(claude_api_calls_total[5m]) > 0.05
        for: 5m
        annotations:
          summary: "Claude API エラー率が 5% を超過"

      - alert: HighLatency
        expr: histogram_quantile(0.95, claude_api_latency_seconds) > 10
        for: 10m
        annotations:
          summary: "P95 レイテンシが 10秒を超過"

      - alert: CostAnomaly
        expr: rate(claude_cost_usd[1h]) > 100
        for: 5m
        annotations:
          summary: "1時間で $100 以上のコスト"

      - alert: CacheMissRateHigh
        expr: claude_cache_hit_rate < 0.3
        for: 30m
        annotations:
          summary: "キャッシュヒット率が 30% を下回る"

このレッスンのまとめ

「メトリクス（Prometheus）→ トレース（OpenTelemetry）→ 構造化ログ → ダッシュボード → アラート」の組み合わせで、AI機能を 継続観測 できる体制が作れます。次のレッスンでは、セキュリティを学びます。

よくある質問

この記事に関連する質問と答えをまとめました。

Q.AI 機能で観測すべき必須メトリクスは？

可用性・エラー率・P50/P95/P99レイテンシ・タスク完了率・コスト/リクエスト・トークン消費・キャッシュヒット率の7つは最低限見るべきです。Prometheus + Grafana が定番。

Q.構造化ログの効果は？

検索・集計・分析のしやすさが段違いです。JSON 形式でログを出すことで、特定エラーパターンの集計、異常検知、パフォーマンス分析が劇的に効率化します。