    google-gemini-media

    Use the Gemini API

    By @xsir0
    ---
    name: google-gemini-media
    description: Use the Gemini API (Nano Banana image generation, Veo video, Gemini TTS speech and audio understanding) to deliver end-to-end multimodal media workflows and code templates for "generation + understanding".
    license: MIT
    ---
    
    # Gemini Multimodal Media (Image/Video/Speech) Skill
    
    ## 1. Goals and scope
    
    This Skill consolidates six Gemini API capabilities into reusable workflows and implementation templates:
    
    - Image generation (Nano Banana: text-to-image, image editing, multi-turn iteration)
    - Image understanding (caption/VQA/classification/comparison, multi-image prompts; supports inline and Files API)
    - Video generation (Veo 3.1: text-to-video, aspect ratio/resolution control, reference-image guidance, first/last frames, video extension, native audio)
    - Video understanding (upload/inline/YouTube URL; summaries, Q&A, timestamped evidence)
    - Speech generation (Gemini native TTS: single-speaker and multi-speaker; controllable style/accent/pace/tone)
    - Audio understanding (upload/inline; description, transcription, time-range transcription, token counting)
    
    > Convention: This Skill follows the official Google Gen AI SDK and REST API; only Node.js and REST examples are provided for now. If your project wraps the API in another language or framework, map this Skill's request structure, model selection, and I/O spec onto your wrapper layer.
    
    ---
    
    ## 2. Quick routing (decide which capability to use)
    
    1) **Do you need to produce images?**
    - Need to generate images from scratch or edit based on an image -> use **Nano Banana image generation** (see Section 5)
    
    2) **Do you need to understand images?**
    - Need recognition, description, Q&A, comparison, or info extraction -> use **Image understanding** (see Section 6)
    
    3) **Do you need to produce video?**
    - Need to generate an 8-second video (optionally with native audio) -> use **Veo 3.1 video generation** (see Section 7)
    
    4) **Do you need to understand video?**
    - Need summaries/Q&A/segment extraction with timestamps -> use **Video understanding** (see Section 8)
    
    5) **Do you need to read text aloud?**
    - Need controllable narration, podcast/audiobook style, etc. -> use **Speech generation (TTS)** (see Section 9)
    
    6) **Do you need to understand audio?**
    - Need audio descriptions, transcription, time-range transcription, token counting -> use **Audio understanding** (see Section 10)
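
    The six routing questions above reduce to a two-key lookup. A minimal sketch, assuming a hypothetical `routeCapability` helper (the function name and the `direction`/`media` parameters are illustrative and not part of the SDK):

    ```js
    // Hypothetical helper: maps (direction, media) to the capability and
    // section named in the routing list above. Not part of @google/genai.
    const ROUTES = {
      "generate:image": { capability: "Nano Banana image generation", section: 5 },
      "understand:image": { capability: "Image understanding", section: 6 },
      "generate:video": { capability: "Veo 3.1 video generation", section: 7 },
      "understand:video": { capability: "Video understanding", section: 8 },
      "generate:audio": { capability: "Speech generation (TTS)", section: 9 },
      "understand:audio": { capability: "Audio understanding", section: 10 },
    };

    function routeCapability(direction, media) {
      const route = ROUTES[`${direction}:${media}`];
      if (!route) {
        throw new Error(`No capability for direction="${direction}", media="${media}"`);
      }
      return route;
    }
    ```

    For example, `routeCapability("understand", "video")` points at Video understanding in Section 8.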
    
    ---
    
    ## 3. Unified engineering constraints and I/O spec (must read)
    
    ### 3.0 Prerequisites (dependencies and tools)
    
    - Node.js 18+ (match your project version)
    - Install SDK (example):
    ```bash
    npm install @google/genai
    ```
    - REST examples only need `curl`; to extract Base64 image data from JSON responses, also install `jq` (optional).
    
    ### 3.1 Authentication and environment variables
    
    - Put your API key in `GEMINI_API_KEY`
    - REST requests use `x-goog-api-key: $GEMINI_API_KEY`
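
    As a sketch of how these two rules combine in Node.js, the helper below builds the URL and `fetch` options for a `generateContent` call without sending anything. `buildGenerateContentRequest` is a hypothetical name; the URL pattern is the one used by the REST templates in Section 5.

    ```js
    // Hypothetical helper: builds the URL and fetch() options for a
    // generateContent REST call. Not part of @google/genai.
    function buildGenerateContentRequest(model, body, apiKey = process.env.GEMINI_API_KEY) {
      if (!apiKey) {
        throw new Error("GEMINI_API_KEY is not set");
      }
      return {
        url: `https://generativelanguage.googleapis.com/v1beta/models/${model}:generateContent`,
        init: {
          method: "POST",
          headers: {
            "x-goog-api-key": apiKey,
            "Content-Type": "application/json",
          },
          body: JSON.stringify(body),
        },
      };
    }
    ```

    The result can be passed to the global `fetch` in Node.js 18+, for example `const res = await fetch(url, init)`. Failing early on a missing key avoids a less obvious authentication error from the API.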
    
    ### 3.2 Two file input modes: Inline vs Files API
    
    **Inline (embedded bytes/Base64)**
    - Pros: shorter call chain, good for small files.
    - Key constraint: total request size (text prompt + system instructions + embedded bytes) typically has a ~20MB ceiling.
    
    **Files API (upload then reference)**
    - Pros: good for large files, reusing the same file, or multi-turn conversations.
    - Typical flow:
      1. `files.upload(...)` (SDK) or `POST /upload/v1beta/files` (REST resumable)
      2. Use `file_data` / `file_uri` in `generateContent`
    
    > Engineering suggestion: implement `ensure_file_uri()` so that when a file exceeds a threshold (for example 10-15MB warning) or is reused, you automatically route through the Files API.
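
    One possible shape for that helper in Node.js is sketched below. Everything here is an assumption for illustration: the camelCase name `ensureFilePart`, the 15MB cutoff (the upper end of the suggested warning range), the `file` object layout, and the expectation that `ai.files.upload({ file })` resolves to an object with `uri` and `mimeType`, as in the Files API template in Section 6.3.

    ```js
    // Assumed threshold: upper end of the 10-15MB warning range suggested above.
    const INLINE_LIMIT_BYTES = 15 * 1024 * 1024;

    // Hypothetical helper: returns a content part, uploading through the
    // Files API when the file is large or reused, and embedding Base64 otherwise.
    // `ai` is a GoogleGenAI client; `file` is { path, mimeType, bytes, reused },
    // where `bytes` is a Buffer holding the file contents.
    async function ensureFilePart(ai, file) {
      const useFilesApi = file.reused || file.bytes.length > INLINE_LIMIT_BYTES;
      if (useFilesApi) {
        const uploaded = await ai.files.upload({ file: file.path });
        return {
          fileData: {
            fileUri: uploaded.uri,
            mimeType: uploaded.mimeType ?? file.mimeType,
          },
        };
      }
      return {
        inlineData: { mimeType: file.mimeType, data: file.bytes.toString("base64") },
      };
    }
    ```

    Keeping the decision in one place means callers never have to choose between the two input modes themselves.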
    
    ### 3.3 Unified handling of binary media outputs
    
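    - **Images**: usually returned as Base64 `inlineData` (`inline_data` in snake_case payloads) in response parts; in the Node.js SDK read `part.inlineData.data`, decode the Base64, and save as PNG/JPG (`part.as_image()` is a Python SDK helper and is not available in Node.js).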
    - **Speech (TTS)**: usually returns **PCM** bytes (Base64); save as `.pcm` or wrap into `.wav` (commonly 24kHz, 16-bit, mono).
    - **Video (Veo)**: long-running async task; poll the operation; download the file (or use the returned URI).
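
    Wrapping PCM into `.wav` only requires prepending a 44-byte RIFF header. A minimal sketch, assuming the 24kHz, 16-bit, mono format noted above (`pcmToWav` is a hypothetical helper name; confirm the sample format against the actual model output):

    ```js
    // Wraps raw PCM samples (a Buffer) in a 44-byte RIFF/WAVE header.
    // Defaults follow the commonly returned format: 24kHz, 16-bit, mono.
    function pcmToWav(pcm, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
      const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
      const byteRate = sampleRate * blockAlign;          // bytes per second
      const header = Buffer.alloc(44);
      header.write("RIFF", 0, "ascii");
      header.writeUInt32LE(36 + pcm.length, 4); // file size minus the first 8 bytes
      header.write("WAVE", 8, "ascii");
      header.write("fmt ", 12, "ascii");
      header.writeUInt32LE(16, 16);             // fmt chunk size for PCM
      header.writeUInt16LE(1, 20);              // audio format 1 = linear PCM
      header.writeUInt16LE(channels, 22);
      header.writeUInt32LE(sampleRate, 24);
      header.writeUInt32LE(byteRate, 28);
      header.writeUInt16LE(blockAlign, 32);
      header.writeUInt16LE(bitsPerSample, 34);
      header.write("data", 36, "ascii");
      header.writeUInt32LE(pcm.length, 40);     // size of the sample data
      return Buffer.concat([header, pcm]);
    }
    ```

    Given the Base64 audio string from a response part, `fs.writeFileSync("out.wav", pcmToWav(Buffer.from(base64Audio, "base64")))` produces a playable file, where `base64Audio` is a placeholder for that string.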
    
    ---
    
    ## 4. Model selection matrix (choose by scenario)
    
    > Important: model names, versions, limits, and quotas can change over time. Verify against official docs before use. Last updated: 2026-01-22.
    
    ### 4.1 Image generation (Nano Banana)
    - **gemini-2.5-flash-image**: optimized for speed/throughput; good for frequent, low-latency generation/editing.
    - **gemini-3-pro-image-preview**: stronger instruction following and high-fidelity text rendering; better for professional assets and complex edits.
    
    ### 4.2 General image/video/audio understanding
    - The official docs use `gemini-3-flash-preview` in their image, video, and audio understanding examples; choose a stronger model when quality matters more than cost.
    
    ### 4.3 Video generation (Veo)
    - Example model: `veo-3.1-generate-preview` (generates 8-second video and can natively generate audio).
    
    ### 4.4 Speech generation (TTS)
    - Example model: `gemini-2.5-flash-preview-tts` (native TTS, currently in preview).
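
    Because these names can change, it may help to keep them in one table rather than scattering string literals through the code. A sketch, with the hypothetical `DEFAULT_MODELS` table and `pickModel` helper (the capability keys are illustrative; the model IDs are the ones listed above):

    ```js
    // Example model IDs copied from the matrix above (as of 2026-01-22).
    // They may be renamed or retired; verify against the official docs.
    const DEFAULT_MODELS = {
      imageGeneration: "gemini-2.5-flash-image",
      imageGenerationPro: "gemini-3-pro-image-preview",
      understanding: "gemini-3-flash-preview",
      videoGeneration: "veo-3.1-generate-preview",
      speechGeneration: "gemini-2.5-flash-preview-tts",
    };

    // Hypothetical helper: resolves a model ID, letting callers override any
    // entry (for example from project configuration).
    function pickModel(capability, overrides = {}) {
      const model = overrides[capability] ?? DEFAULT_MODELS[capability];
      if (!model) {
        throw new Error(`Unknown capability: ${capability}`);
      }
      return model;
    }
    ```

    A model upgrade then becomes a one-line change or an override such as `pickModel("understanding", { understanding: "my-stronger-model" })`, where the override value is a placeholder.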
    
    ---
    
    ## 5. Image generation (Nano Banana)
    
    ### 5.1 Text-to-Image
    
    **SDK (Node.js) minimal template**
    ```js
    import { GoogleGenAI } from "@google/genai";
    import * as fs from "node:fs";
    
    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
    
    const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-image",
      contents:
        "Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme",
    });
    
    const parts = response.candidates?.[0]?.content?.parts ?? [];
    for (const part of parts) {
      if (part.text) console.log(part.text);
      if (part.inlineData?.data) {
        fs.writeFileSync("out.png", Buffer.from(part.inlineData.data, "base64"));
      }
    }
    ```
    
    **REST (with imageConfig) minimal template**
    ```bash
    curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
      -H "x-goog-api-key: $GEMINI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{
        "contents":[{"parts":[{"text":"Create a picture of a nano banana dish in a fancy restaurant with a Gemini theme"}]}],
        "generationConfig": {"imageConfig": {"aspectRatio":"16:9"}}
      }'
    ```
    
    **REST image parsing (Base64 decode)**
    ```bash
    curl -s -X POST "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash-image:generateContent" \
      -H "x-goog-api-key: $GEMINI_API_KEY" \
      -H "Content-Type: application/json" \
      -d '{"contents":[{"parts":[{"text":"A minimal studio product shot of a nano banana"}]}]}' \
      | jq -r '.candidates[0].content.parts[] | (.inlineData // .inline_data).data // empty' \
      | base64 --decode > out.png
    
    # macOS can use: base64 -D > out.png
    ```
    
    ### 5.2 Text-and-Image-to-Image
    
    Use case: given an image, **add/remove/modify elements**, change style, color grading, etc.
    
    **SDK (Node.js) minimal template**
    ```js
    import { GoogleGenAI } from "@google/genai";
    import * as fs from "node:fs";
    
    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
    
    const prompt =
      "Add a nano banana on the table, keep lighting consistent, cinematic tone.";
    const imageBase64 = fs.readFileSync("input.png").toString("base64");
    
    const response = await ai.models.generateContent({
      model: "gemini-2.5-flash-image",
      contents: [
        { text: prompt },
        { inlineData: { mimeType: "image/png", data: imageBase64 } },
      ],
    });
    
    const parts = response.candidates?.[0]?.content?.parts ?? [];
    for (const part of parts) {
      if (part.inlineData?.data) {
        fs.writeFileSync("edited.png", Buffer.from(part.inlineData.data, "base64"));
      }
    }
    ```
    
    ### 5.3 Multi-turn image iteration (Multi-turn editing)
    
    Best practice: use chat for continuous iteration (for example: generate first, then "only edit a specific region/element", then "make variants in the same style").  
    To output mixed "text + image" results, set `responseModalities` in the SDK `config` (REST: `generationConfig.responseModalities`) to `["TEXT", "IMAGE"]`.
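
    A sketch of chat-based iteration follows. It assumes the SDK's chat interface (`ai.chats.create(...)` and `chat.sendMessage({ message })`) behaves as described in the official SDK docs; `iterateImage` is a hypothetical helper name, and the default model is the one from Section 5.1.

    ```js
    // Hypothetical helper: sends each prompt as one chat turn and collects the
    // Base64-decoded images returned along the way.
    // `ai` is a GoogleGenAI client created as in Section 5.1.
    async function iterateImage(ai, prompts, model = "gemini-2.5-flash-image") {
      const chat = ai.chats.create({
        model,
        config: { responseModalities: ["TEXT", "IMAGE"] },
      });
      const images = [];
      for (const message of prompts) {
        // The chat object carries earlier turns, so each prompt can refer
        // to the image produced by the previous one.
        const response = await chat.sendMessage({ message });
        const parts = response.candidates?.[0]?.content?.parts ?? [];
        for (const part of parts) {
          if (part.inlineData?.data) {
            images.push(Buffer.from(part.inlineData.data, "base64"));
          }
        }
      }
      return images;
    }
    ```

    Each returned Buffer can be written to disk with `fs.writeFileSync`, one file per turn.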
    
    ### 5.4 ImageConfig
    
    You can set in `generationConfig.imageConfig` or the SDK config:
    - `aspectRatio`: e.g. `16:9`, `1:1`.
    - `imageSize`: e.g. `2K`, `4K` (higher resolution is usually slower/more expensive and model support can vary).
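
    A small builder can keep these options consistent across calls. This is a sketch with a hypothetical `buildImageConfig` helper; its format checks are deliberately loose and do not encode which ratios or sizes a given model actually accepts.

    ```js
    // Hypothetical helper: builds the `config` object for an image request.
    // Only the general shape of each value is checked, not model support.
    function buildImageConfig({ aspectRatio, imageSize } = {}) {
      const imageConfig = {};
      if (aspectRatio !== undefined) {
        if (!/^\d+:\d+$/.test(aspectRatio)) {
          throw new Error(`aspectRatio must look like "16:9", got "${aspectRatio}"`);
        }
        imageConfig.aspectRatio = aspectRatio;
      }
      if (imageSize !== undefined) {
        if (!/^\d+K$/.test(imageSize)) {
          throw new Error(`imageSize must look like "2K", got "${imageSize}"`);
        }
        imageConfig.imageSize = imageSize;
      }
      return { imageConfig };
    }
    ```

    In the SDK the result goes in the `config` field of `ai.models.generateContent`; in REST the same object goes under `generationConfig`, as in the Section 5.1 template.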
    
    ---
    
    ## 6. Image understanding
    
    ### 6.1 Two ways to provide input images
    
    - **Inline image data**: suitable for small files (total request size < 20MB).
    - **Files API upload**: better for large files or reuse across multiple requests.
    
    ### 6.2 Inline images (Node.js) minimal template
    ```js
    import { GoogleGenAI } from "@google/genai";
    import * as fs from "node:fs";
    
    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
    
    const imageBase64 = fs.readFileSync("image.jpg").toString("base64");
    
    const response = await ai.models.generateContent({
      model: "gemini-3-flash-preview",
      contents: [
        { inlineData: { mimeType: "image/jpeg", data: imageBase64 } },
        { text: "Caption this image, and list any visible brands." },
      ],
    });
    
    console.log(response.text);
    ```
    
    ### 6.3 Upload and reference with Files API (Node.js) minimal template
    ```js
    import { GoogleGenAI, createPartFromUri, createUserContent } from "@google/genai";
    
    const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
    const uploaded = await ai.files.upload({ file: "image.jpg" });
    
    const response = await ai.models.generateContent({
      model: "gemini-3-flash-preview",
      contents: createUserContent([
        createPartFromUri(uploaded.uri, uploaded.mimeType),
        "Caption this image.",
      ]),
    });
    
    console.log(response.text);
    ```
    
    ### 6.4 Multi-image prompts
    
    Append multiple images as multiple `Part` entries in the same `contents`; you can mix uploaded references and inline bytes.
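
    As a sketch of that mixing, the hypothetical `buildMultiImageParts` helper below turns a list of uploaded references and in-memory images into one parts array. The input object layout (`uri`, `bytes`, `mimeType`) is an assumption made for this example.

    ```js
    // Hypothetical helper: converts a mixed list of images into content parts
    // and appends the text prompt last.
    // Each image is either { uri, mimeType } (already uploaded via the Files API)
    // or { bytes, mimeType } (a Buffer to embed inline).
    function buildMultiImageParts(images, prompt) {
      const parts = images.map((image) => {
        if (image.uri) {
          return { fileData: { fileUri: image.uri, mimeType: image.mimeType } };
        }
        if (Buffer.isBuffer(image.bytes)) {
          return {
            inlineData: { mimeType: image.mimeType, data: image.bytes.toString("base64") },
          };
        }
        throw new Error("Each image needs either `uri` or `bytes`");
      });
      parts.push({ text: prompt });
      return parts;
    }
    ```

    The returned array can be passed as `contents` to `ai.models.generateContent`, in the same way as the two-part array in Section 6.2. Inline images still count toward the request size limit, so large files are better passed as uploaded references.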
    
    ---
    
    ## 7. Video generation (Veo 3.1)
    
    ### 7.1 Core features (must know)
    - Generates **8-second** high-fidelity video, optionally 720p / 1080p / 4k, and supports native audio generation (dialogue, ambience, SFX).
    - Supports:
      - Aspect ratio (16:9 / 9:16)
      - Video extension (extend a generated video; typically limited to 720p)
      - First/last frame control (frame-specific)
      - Up to 3 reference images (image-based direction)
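
    These limits can be checked before submitting a long-running job. The sketch below uses a hypothetical `checkVeoOptions` helper whose option names are illustrative; it encodes only the constraints listed above, and the API remains the source of truth.

    ```js
    // Hypothetical pre-flight check based only on the limits listed above.
    // Returns a list of problems; an empty list means nothing was flagged.
    function checkVeoOptions({
      aspectRatio,
      resolution,
      referenceImages = [],
      isExtension = false,
    } = {}) {
      const problems = [];
      if (aspectRatio !== undefined && !["16:9", "9:16"].includes(aspectRatio)) {
        problems.push(`unsupported aspect ratio: ${aspectRatio}`);
      }
      if (resolution !== undefined && !["720p", "1080p", "4k"].includes(resolution)) {
        problems.push(`unsupported resolution: ${resolution}`);
      }
      if (referenceImages.length > 3) {
        problems.push(`too many reference images: ${referenceImages.length} (max 3)`);
      }
      if (isExtension && resolution !== undefined && resolution !== "720p") {
        problems.push("video extension is typically limited to 720p");
      }
      return problems;
    }
    ```

    Returning a list instead of throwing lets the caller report every issue at once before any generation cost is incurred.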
    
    ### 7.2 SDK (Node.js) minimal t
    
    ... (truncated)