Back to Skills
    🦞

    ab-test-setup

    When the user wants to plan

    By @jchopard69
    View on GitHub
    SKILL.md
    ---
    name: ab-test-setup
    description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking.
    ---
    
    # A/B Test Setup
    
    You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.
    
    ## Initial Assessment
    
    Before designing a test, understand:
    
    1. **Test Context**
       - What are you trying to improve?
       - What change are you considering?
       - What made you want to test this?
    
    2. **Current State**
       - Baseline conversion rate?
       - Current traffic volume?
       - Any historical test data?
    
    3. **Constraints**
       - Technical implementation complexity?
       - Timeline requirements?
       - Tools available?
    
    ---
    
    ## Core Principles
    
    ### 1. Start with a Hypothesis
    - Not just "let's see what happens"
    - Specific prediction of outcome
    - Based on reasoning or data
    
    ### 2. Test One Thing
    - Single variable per test
    - Otherwise you don't know what worked
    - Save MVT for later
    
    ### 3. Statistical Rigor
    - Pre-determine sample size
    - Don't peek and stop early
    - Commit to the methodology
    
    ### 4. Measure What Matters
    - Primary metric tied to business value
    - Secondary metrics for context
    - Guardrail metrics to prevent harm
    
    ---
    
    ## Hypothesis Framework
    
    ### Structure
    
    ```
    Because [observation/data],
    we believe [change]
    will cause [expected outcome]
    for [audience].
    We'll know this is true when [metrics].
    ```
    
    ### Examples
    
    **Weak hypothesis:**
    "Changing the button color might increase clicks."
    
    **Strong hypothesis:**
    "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."
    
    ### Good Hypotheses Include
    
    - **Observation**: What prompted this idea
    - **Change**: Specific modification
    - **Effect**: Expected outcome and direction
    - **Audience**: Who this applies to
    - **Metric**: How you'll measure success
    
    ---
    
    ## Test Types
    
    ### A/B Test (Split Test)
    - Two versions: Control (A) vs. Variant (B)
    - Single change between versions
    - Most common, easiest to analyze
    
    ### A/B/n Test
    - Multiple variants (A vs. B vs. C...)
    - Requires more traffic
    - Good for testing several options
    
    ### Multivariate Test (MVT)
    - Multiple changes in combinations
    - Tests interactions between changes
    - Requires significantly more traffic
    - Complex analysis
    
    ### Split URL Test
    - Different URLs for variants
    - Good for major page changes
    - Easier implementation sometimes
    
    ---
    
    ## Sample Size Calculation
    
    ### Inputs Needed
    
    1. **Baseline conversion rate**: Your current rate
    2. **Minimum detectable effect (MDE)**: Smallest change worth detecting
    3. **Statistical significance level**: Usually 95%
    4. **Statistical power**: Usually 80%
    
    ### Quick Reference
    
    | Baseline Rate | 10% Lift | 20% Lift | 50% Lift |
    |---------------|----------|----------|----------|
    | 1% | 150k/variant | 39k/variant | 6k/variant |
    | 3% | 47k/variant | 12k/variant | 2k/variant |
    | 5% | 27k/variant | 7k/variant | 1.2k/variant |
    | 10% | 12k/variant | 3k/variant | 550/variant |
    
    ### Formula Resources
    - Evan Miller's calculator: https://www.evanmiller.org/ab-testing/sample-size.html
    - Optimizely's calculator: https://www.optimizely.com/sample-size-calculator/
    
    ### Test Duration
    
    ```
    Duration = Sample size needed per variant Γ— Number of variants
               ───────────────────────────────────────────────────
               Daily traffic to test page Γ— Conversion rate
    ```
    
    Minimum: 1-2 business cycles (usually 1-2 weeks)
    Maximum: Avoid running too long (novelty effects, external factors)
    
    ---
    
    ## Metrics Selection
    
    ### Primary Metric
    - Single metric that matters most
    - Directly tied to hypothesis
    - What you'll use to call the test
    
    ### Secondary Metrics
    - Support primary metric interpretation
    - Explain why/how the change worked
    - Help understand user behavior
    
    ### Guardrail Metrics
    - Things that shouldn't get worse
    - Revenue, retention, satisfaction
    - Stop test if significantly negative
    
    ### Metric Examples by Test Type
    
    **Homepage CTA test:**
    - Primary: CTA click-through rate
    - Secondary: Time to click, scroll depth
    - Guardrail: Bounce rate, downstream conversion
    
    **Pricing page test:**
    - Primary: Plan selection rate
    - Secondary: Time on page, plan distribution
    - Guardrail: Support tickets, refund rate
    
    **Signup flow test:**
    - Primary: Signup completion rate
    - Secondary: Field-level completion, time to complete
    - Guardrail: User activation rate (post-signup quality)
    
    ---
    
    ## Designing Variants
    
    ### Control (A)
    - Current experience, unchanged
    - Don't modify during test
    
    ### Variant (B+)
    
    **Best practices:**
    - Single, meaningful change
    - Bold enough to make a difference
    - True to the hypothesis
    
    **What to vary:**
    
    Headlines/Copy:
    - Message angle
    - Value proposition
    - Specificity level
    - Tone/voice
    
    Visual Design:
    - Layout structure
    - Color and contrast
    - Image selection
    - Visual hierarchy
    
    CTA:
    - Button copy
    - Size/prominence
    - Placement
    - Number of CTAs
    
    Content:
    - Information included
    - Order of information
    - Amount of content
    - Social proof type
    
    ### Documenting Variants
    
    ```
    Control (A):
    - Screenshot
    - Description of current state
    
    Variant (B):
    - Screenshot or mockup
    - Specific changes made
    - Hypothesis for why this will win
    ```
    
    ---
    
    ## Traffic Allocation
    
    ### Standard Split
    - 50/50 for A/B test
    - Equal split for multiple variants
    
    ### Conservative Rollout
    - 90/10 or 80/20 initially
    - Limits risk of bad variant
    - Longer to reach significance
    
    ### Ramping
    - Start small, increase over time
    - Good for technical risk mitigation
    - Most tools support this
    
    ### Considerations
    - Consistency: Users see same variant on return
    - Segment sizes: Ensure segments are large enough
    - Time of day/week: Balanced exposure
    
    ---
    
    ## Implementation Approaches
    
    ### Client-Side Testing
    
    **Tools**: PostHog, Optimizely, VWO, custom
    
    **How it works**:
    - JavaScript modifies page after load
    - Quick to implement
    - Can cause flicker
    
    **Best for**:
    - Marketing pages
    - Copy/visual changes
    - Quick iteration
    
    ### Server-Side Testing
    
    **Tools**: PostHog, LaunchDarkly, Split, custom
    
    **How it works**:
    - Variant determined before page renders
    - No flicker
    - Requires development work
    
    **Best for**:
    - Product features
    - Complex changes
    - Performance-sensitive pages
    
    ### Feature Flags
    
    - Binary on/off (not true A/B)
    - Good for rollouts
    - Can convert to A/B with percentage split
    
    ---
    
    ## Running the Test
    
    ### Pre-Launch Checklist
    
    - [ ] Hypothesis documented
    - [ ] Primary metric defined
    - [ ] Sample size calculated
    - [ ] Test duration estimated
    - [ ] Variants implemented correctly
    - [ ] Tracking verified
    - [ ] QA completed on all variants
    - [ ] Stakeholders informed
    
    ### During the Test
    
    **DO:**
    - Monitor for technical issues
    - Check segment quality
    - Document any external factors
    
    **DON'T:**
    - Peek at results and stop early
    - Make changes to variants
    - Add traffic from new sources
    - End early because you "know" the answer
    
    ### Peeking Problem
    
    Looking at results before reaching sample size and stopping when you see significance leads to:
    - False positives
    - Inflated effect sizes
    - Wrong decisions
    
    **Solutions:**
    - Pre-commit to sample size and stick to it
    - Use sequential testing if you must peek
    - Trust the process
    
    ---
    
    ## Analyzing Results
    
    ### Statistical Significance
    
    - 95% confidence = p-value < 0.05
    - Means: <5% chance result is random
    - Not a guaranteeβ€”just a threshold
    
    ### Practical Significance
    
    Statistical β‰  Practical
    
    - Is the effect size meaningful for business?
    - Is it worth the implementation cost?
    - Is it sustainable over time?
    
    ### What to Look At
    
    1. **Did you reach sample size?**
       - If not, result is preliminary
    
    2. **Is it statistically significant?**
       - Check confidence intervals
       - Check p-value
    
    3. **Is the effect size meaningful?**
       - Compare to your MDE
       - Project business impact
    
    4. **Are secondary metrics consistent?**
       - Do they support the primary?
       - Any unexpected effects?
    
    5. **Any guardrail concerns?**
       - Did anything get worse?
       - Long-term risks?
    
    6. **Segment differences?**
       - Mobile vs. desktop?
       - New vs. returning?
       - Traffic source?
    
    ### Interpreting Results
    
    | Result | Conclusion |
    |--------|------------|
    | Significant winner | Implement variant |
    | Significant loser | Keep control, learn why |
    | No significant difference | Need more traffic or bolder test |
    | Mixed signals | Dig deeper, maybe segment |
    
    ---
    
    ## Documenting and Learning
    
    ### Test Documentation
    
    ```
    Test Name: [Name]
    Test ID: [ID in testing tool]
    Dates: [Start] - [End]
    Owner: [Name]
    
    Hypothesis:
    [Full hypothesis statement]
    
    Variants:
    - Control: [Description + screenshot]
    - Variant: [Description + screenshot]
    
    Results:
    - Sample size: [achieved vs. target]
    - Primary metric: [control] vs. [variant] ([% change], [confidence])
    - Secondary metrics: [summary]
    - Segment insights: [notable differences]
    
    Decision: [Winner/Loser/Inconclusive]
    Action: [What we're doing]
    
    Learnings:
    [What we learned, what to test next]
    ```
    
    ### Building a Learning Repository
    
    - Central location for all tests
    - Searchable by page, element, outcome
    - Prevents re-running failed tests
    - Builds institutional knowledge
    
    ---
    
    ## Output Format
    
    ### Test Plan Document
    
    ```
    # A/B Test: [Name]
    
    ## Hypothesis
    [Full hypothesis using framework]
    
    ## Test Design
    - Type: A/B / A/B/n / MVT
    - Duration: X weeks
    - Sample size: X per variant
    - Traffic allocation: 50/50
    
    ## Variants
    [Control and variant descriptions with visuals]
    
    ## Metrics
    - Primary: [metric and definition]
    - Secondary: [list]
    - Guardrails: [list]
    
    ## Implementation
    - Method: Client-side / Server-side
    - Tool: [Tool name]
    - Dev requirements: [If any]
    
    ## Analysis Plan
    - Success criteria: [What constitutes a win]
    - Segmen
    
    ... (truncated)