🦞
perf-profiler

Profile and optimize application performance.
SKILL.md
---
name: perf-profiler
description: Profile and optimize application performance. Use when diagnosing slow code, measuring CPU/memory usage, generating flame graphs, benchmarking functions, load testing APIs, finding memory leaks, or optimizing database queries.
metadata: {"clawdbot":{"emoji":"⚡","requires":{"anyBins":["node","python3","go","curl","ab"]},"os":["linux","darwin","win32"]}}
---

# Performance Profiler

Measure, profile, and optimize application performance. Covers CPU profiling, memory analysis, flame graphs, benchmarking, load testing, and language-specific optimization patterns.

## When to Use

- Diagnosing why an application or function is slow
- Measuring CPU and memory usage
- Generating flame graphs to visualize hot paths
- Benchmarking functions or endpoints
- Load testing APIs before deployment
- Finding and fixing memory leaks
- Optimizing database query performance
- Comparing performance before and after changes

## Quick Timing

### Command-line timing

```bash
# Time any command
time my-command --flag

# More precise: multiple runs with stats
for i in $(seq 1 10); do
  /usr/bin/time -f "%e" my-command 2>&1
done | awk '{sum+=$1; sumsq+=$1*$1; count++} END {
  avg=sum/count;
  stddev=sqrt(sumsq/count - avg*avg);
  printf "runs=%d avg=%.3fs stddev=%.3fs\n", count, avg, stddev
}'

# Hyperfine (better benchmarking tool)
# Install: https://github.com/sharkdp/hyperfine
hyperfine 'command-a' 'command-b'
hyperfine --warmup 3 --runs 20 'my-command'
hyperfine --export-json results.json 'old-version' 'new-version'
```

### Inline timing (any language)

```javascript
// Node.js
console.time('operation');
await doExpensiveThing();
console.timeEnd('operation'); // "operation: 142.3ms"

// High-resolution
const start = performance.now();
await doExpensiveThing();
const elapsed = performance.now() - start;
console.log(`Elapsed: ${elapsed.toFixed(2)}ms`);
```

```python
# Python
import time

start = time.perf_counter()
do_expensive_thing()
elapsed = time.perf_counter() - start
print(f"Elapsed: {elapsed:.4f}s")

# Context manager
from contextlib import contextmanager

@contextmanager
def timer(label=""):
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f}s")

with timer("data processing"):
    process_data()
```

```go
// Go
start := time.Now()
doExpensiveThing()
fmt.Printf("Elapsed: %v\n", time.Since(start))
```

## Node.js Profiling

### CPU profiling with V8 inspector

```bash
# Generate CPU profile (writes .cpuprofile file)
node --cpu-prof app.js
# Open the .cpuprofile in Chrome DevTools > Performance tab

# Profile for a specific duration
node --cpu-prof --cpu-prof-interval=100 app.js

# Inspect running process
node --inspect app.js
# Open chrome://inspect in Chrome, click "inspect"
# Go to Performance tab, click Record
```

### Heap snapshots (memory)

```bash
# Generate heap snapshot
node --heap-prof app.js

# Take snapshots programmatically
node -e "
const v8 = require('v8');
const fs = require('fs');

// Take snapshot
const snapshotStream = v8.writeHeapSnapshot();
console.log('Heap snapshot written to:', snapshotStream);
"

# Compare heap snapshots to find leaks:
# 1. Take snapshot A (baseline)
# 2. Run operations that might leak
# 3. Take snapshot B
# 4. In Chrome DevTools > Memory, load both and use "Comparison" view
```

### Memory usage monitoring

```javascript
// Print memory usage periodically
setInterval(() => {
  const usage = process.memoryUsage();
  console.log({
    rss: `${(usage.rss / 1024 / 1024).toFixed(1)}MB`,
    heapUsed: `${(usage.heapUsed / 1024 / 1024).toFixed(1)}MB`,
    heapTotal: `${(usage.heapTotal / 1024 / 1024).toFixed(1)}MB`,
    external: `${(usage.external / 1024 / 1024).toFixed(1)}MB`,
  });
}, 5000);

// Detect memory growth
let lastHeap = 0;
setInterval(() => {
  const heap = process.memoryUsage().heapUsed;
  const delta = heap - lastHeap;
  if (delta > 1024 * 1024) { // > 1MB growth
    console.warn(`Heap grew by ${(delta / 1024 / 1024).toFixed(1)}MB`);
  }
  lastHeap = heap;
}, 10000);
```

### Node.js benchmarking

```javascript
// Simple benchmark function
function benchmark(name, fn, iterations = 10000) {
  // Warmup
  for (let i = 0; i < 100; i++) fn();

  const start = performance.now();
  for (let i = 0; i < iterations; i++) fn();
  const elapsed = performance.now() - start;

  console.log(`${name}: ${(elapsed / iterations).toFixed(4)}ms/op (${iterations} iterations in ${elapsed.toFixed(1)}ms)`);
}

benchmark('JSON.parse', () => JSON.parse('{"key":"value","num":42}'));
benchmark('regex match', () => /^\d{4}-\d{2}-\d{2}$/.test('2026-02-03'));
```

## Python Profiling

### cProfile (built-in CPU profiler)

```bash
# Profile a script
python3 -m cProfile -s cumulative my_script.py

# Save to file for analysis
python3 -m cProfile -o profile.prof my_script.py

# Analyze saved profile
python3 -c "
import pstats
stats = pstats.Stats('profile.prof')
stats.sort_stats('cumulative')
stats.print_stats(20)
"

# Profile a specific function
python3 -c "
import cProfile
from my_module import expensive_function

cProfile.run('expensive_function()', sort='cumulative')
"
```

### line_profiler (line-by-line)

```bash
# Install
pip install line_profiler

# Add @profile decorator to functions of interest, then:
kernprof -l -v my_script.py
```

```python
# Programmatic usage
from line_profiler import LineProfiler

def process_data(data):
    result = []
    for item in data:           # Is this loop the bottleneck?
        transformed = transform(item)
        if validate(transformed):
            result.append(transformed)
    return result

profiler = LineProfiler()
profiler.add_function(process_data)
profiler.enable()
process_data(large_dataset)
profiler.disable()
profiler.print_stats()
```

### Memory profiling (Python)

```bash
# memory_profiler
pip install memory_profiler

# Profile memory line-by-line
python3 -m memory_profiler my_script.py
```

```python
from memory_profiler import profile

@profile
def load_data():
    data = []
    for i in range(1000000):
        data.append({'id': i, 'value': f'item_{i}'})
    return data

# Track memory over time
import tracemalloc

tracemalloc.start()

# ... run code ...

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics('lineno')
for stat in top_stats[:10]:
    print(stat)
```

### Python benchmarking

```python
import timeit

# Time a statement
result = timeit.timeit('sorted(range(1000))', number=10000)
print(f"sorted: {result:.4f}s for 10000 iterations")

# Compare two approaches
setup = "data = list(range(10000))"
t1 = timeit.timeit('list(filter(lambda x: x % 2 == 0, data))', setup=setup, number=1000)
t2 = timeit.timeit('[x for x in data if x % 2 == 0]', setup=setup, number=1000)
print(f"filter: {t1:.4f}s  |  listcomp: {t2:.4f}s  |  speedup: {t1/t2:.2f}x")

# pytest-benchmark
# pip install pytest-benchmark
# def test_sort(benchmark):
#     benchmark(sorted, list(range(1000)))
```

## Go Profiling

### Built-in pprof

```go
// Add to main.go for HTTP-accessible profiling
import (
    "net/http"
    _ "net/http/pprof"
)

func main() {
    go func() {
        http.ListenAndServe("localhost:6060", nil)
    }()
    // ... rest of app
}
```

```bash
# CPU profile (30 seconds)
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30

# Memory profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine profile
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Inside pprof interactive mode:
# top 20          - top functions by CPU/memory
# list funcName   - source code with annotations
# web             - open flame graph in browser
# png > out.png   - save call graph as image
```

### Go benchmarks

```go
// math_test.go
func BenchmarkAdd(b *testing.B) {
    for i := 0; i < b.N; i++ {
        Add(42, 58)
    }
}

func BenchmarkSort1000(b *testing.B) {
    data := make([]int, 1000)
    for i := range data {
        data[i] = rand.Intn(1000)
    }
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        sort.Ints(append([]int{}, data...))
    }
}
```

```bash
# Run benchmarks
go test -bench=. -benchmem ./...

# Compare before/after
go test -bench=. -count=5 ./... > old.txt
# ... make changes ...
go test -bench=. -count=5 ./... > new.txt
go install golang.org/x/perf/cmd/benchstat@latest
benchstat old.txt new.txt
```

## Flame Graphs

### Generate flame graphs

```bash
# Node.js: 0x (easiest)
npx 0x app.js
# Opens interactive flame graph in browser

# Node.js: clinic.js (comprehensive)
npx clinic flame -- node app.js
npx clinic doctor -- node app.js
npx clinic bubbleprof -- node app.js

# Python: py-spy (sampling profiler, no code changes needed)
pip install py-spy
py-spy record -o flame.svg -- python3 my_script.py

# Profile running Python process
py-spy record -o flame.svg --pid 12345

# Go: built-in
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/profile?seconds=30
# Navigate to "Flame Graph" view

# Linux (any process): perf + flamegraph
perf record -g -p PID -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
```

### Reading flame graphs

```
Key concepts:
- X-axis: NOT time. It's alphabetical sort of stack frames. Width = % of samples.
- Y-axis: Stack depth. Top = leaf function (where CPU time is spent).
- Wide bars at the top = hot functions (optimize these first).
- Narrow tall stacks = deep call chains (may indicate excessive abstraction).

What to look for:
1. Wide plateaus at the top → function that dominates CPU time
2. Multiple paths converging to one function → shared bottleneck
3. GC/runtime frames taking significant width → memory pressure
4. Unexpected functions appearing wide → performance bug
```

## Load Testing

### curl-based quick test

```bash
# Single request timing
curl -o /dev/null -s -w "HTTP %{http_code} | Total: %{time_total}s | TTFB: %{time_starttransfer}s | Connect: %{time_connect}s\n" https://api.example.com/endpoint

# Multiple requests in s

... (truncated)