Compaction

Compaction is the process of merging multiple small segments into larger, optimized blocks. This is essential for maintaining query performance and controlling metadata index size.

Why compaction matters

The ingestion pipeline creates many small segments, potentially millions of objects per hour at scale. Without compaction:

Read amplification: Queries must fetch many small objects
API costs: More calls to object storage
Metadata bloat: The metastore index grows unboundedly
Performance degradation: Impacts both read and write paths

How it works

Compaction in Pyroscope v2 is coordinated by the metastore and executed by compaction-workers.

sequenceDiagram
    participant W as Compaction Worker
    participant M as Metastore
    participant S as Object Storage

    loop Continuous
        W->>M: Poll for jobs
        M->>W: Assign job with source blocks
        W->>S: Download source segments
        W->>W: Merge segments into block
        W->>S: Upload compacted block
        W->>M: Report completion
        M->>M: Update metadata index
    end
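
The loop in the diagram can be sketched in Go as follows. This is a minimal illustration, not Pyroscope's actual worker: the `Metastore` and `ObjectStorage` interfaces, their method names, the `blocks/` key prefix, and the in-memory fakes are all assumptions made for the example, and byte concatenation stands in for the real merge.

```go
package main

import "fmt"

// Job is a hypothetical compaction job: an ID plus the keys of its source segments.
type Job struct {
	ID      string
	Sources []string
}

// Metastore and ObjectStorage are hypothetical interfaces standing in for the
// real services; the method names are assumptions for this sketch.
type Metastore interface {
	PollJob() (*Job, error) // returns nil when no job is available
	ReportCompletion(jobID, blockKey string) error
}

type ObjectStorage interface {
	Download(key string) ([]byte, error)
	Upload(key string, data []byte) error
}

// RunOnce performs one loop iteration: poll, download, merge, upload, report.
// It returns false when the metastore had no job to hand out.
func RunOnce(m Metastore, s ObjectStorage) (bool, error) {
	job, err := m.PollJob()
	if err != nil || job == nil {
		return false, err
	}
	var merged []byte
	for _, key := range job.Sources {
		data, err := s.Download(key)
		if err != nil {
			return true, fmt.Errorf("download %s: %w", key, err)
		}
		merged = append(merged, data...) // placeholder for the real merge
	}
	blockKey := "blocks/" + job.ID
	if err := s.Upload(blockKey, merged); err != nil {
		return true, err
	}
	return true, m.ReportCompletion(job.ID, blockKey)
}

// memStorage is an in-memory fake of object storage.
type memStorage map[string][]byte

func (s memStorage) Download(key string) ([]byte, error) {
	data, ok := s[key]
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}
	return data, nil
}

func (s memStorage) Upload(key string, data []byte) error {
	s[key] = data
	return nil
}

// memMetastore is an in-memory fake of the metastore.
type memMetastore struct {
	pending   []*Job
	completed map[string]string // job ID -> compacted block key
}

func (m *memMetastore) PollJob() (*Job, error) {
	if len(m.pending) == 0 {
		return nil, nil
	}
	job := m.pending[0]
	m.pending = m.pending[1:]
	return job, nil
}

func (m *memMetastore) ReportCompletion(jobID, blockKey string) error {
	m.completed[jobID] = blockKey
	return nil
}

func main() {
	storage := memStorage{"seg/a": []byte("A"), "seg/b": []byte("B")}
	meta := &memMetastore{
		pending:   []*Job{{ID: "job-1", Sources: []string{"seg/a", "seg/b"}}},
		completed: map[string]string{},
	}
	for {
		worked, err := RunOnce(meta, storage)
		if err != nil {
			panic(err)
		}
		if !worked {
			break
		}
	}
	fmt.Println(meta.completed["job-1"], string(storage["blocks/job-1"]))
}
```

Note that the sketch uploads the compacted block before reporting completion, mirroring the order in the diagram: the metadata index is only updated once the new block exists in object storage.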

Compaction service

The compaction service runs within the metastore and is responsible for:

Job planning: Creating compaction jobs when enough segments are available
Job scheduling: Assigning jobs to workers based on capacity
Job tracking: Monitoring progress and handling failures
Index updates: Replacing source block entries with compacted block entries

Raft consistency

The compaction service relies on Raft to guarantee consistency:

Plan preparation: The leader prepares job state changes (read-only).
Plan proposal: Changes are committed to the Raft log.
State update: All replicas apply the changes atomically.

This ensures all replicas maintain consistent views of compaction state.
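
The three steps can be illustrated with a toy model. This sketch does not use a real Raft library: a plain slice stands in for the committed log, and the `State`, `Plan`, `Prepare`, and `Apply` names are hypothetical. It only shows why a read-only preparation step followed by a deterministic apply step keeps replicas identical.

```go
package main

import "fmt"

// State is a hypothetical replica view: job ID -> status.
type State map[string]string

// Plan is the set of job state changes the leader wants to commit.
type Plan struct {
	Updates map[string]string
}

// Prepare builds a plan from the leader's current state without mutating it.
func Prepare(s State, newJobs []string) Plan {
	p := Plan{Updates: map[string]string{}}
	for _, id := range newJobs {
		if _, exists := s[id]; !exists {
			p.Updates[id] = "unassigned"
		}
	}
	return p
}

// Apply mutates a replica's state with a committed plan. It depends only on
// the plan contents, so every replica that applies it ends in the same state.
func Apply(s State, p Plan) {
	for id, status := range p.Updates {
		s[id] = status
	}
}

func main() {
	replicas := []State{{}, {}, {}}
	leader := replicas[0]

	// 1. Plan preparation (read-only on the leader).
	plan := Prepare(leader, []string{"job-1", "job-2"})

	// 2. Plan proposal: this slice stands in for the committed Raft log.
	committed := []Plan{plan}

	// 3. State update: every replica applies the same entries in order.
	for _, r := range replicas {
		for _, entry := range committed {
			Apply(r, entry)
		}
	}
	fmt.Println(replicas[0], replicas[1], replicas[2])
}
```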

Job planner

The job planner maintains a queue of blocks eligible for compaction:

Queue structure: FIFO queue, segmented by tenant, shard, and level
Job creation: Jobs are created when enough blocks are queued
Boundaries: Compaction never crosses tenant, shard, or level boundaries
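
A minimal sketch of such a planner is shown below, assuming one FIFO queue per (tenant, shard, level) key. The type names and the `batchSize` threshold are hypothetical; the source does not state how many blocks count as "enough".

```go
package main

import "fmt"

// QueueKey identifies one compaction queue. Jobs never mix blocks
// from different keys, which enforces the boundary rule.
type QueueKey struct {
	Tenant string
	Shard  uint32
	Level  uint32
}

// Job groups blocks that share a single QueueKey.
type Job struct {
	Key    QueueKey
	Blocks []string
}

// Planner keeps a FIFO queue of block IDs per key.
// batchSize is a hypothetical "enough blocks" threshold.
type Planner struct {
	batchSize int
	queues    map[QueueKey][]string
}

func NewPlanner(batchSize int) *Planner {
	if batchSize < 1 {
		batchSize = 1
	}
	return &Planner{batchSize: batchSize, queues: map[QueueKey][]string{}}
}

// Enqueue appends a block to its queue and returns a job once enough
// blocks are queued for that key; otherwise it returns nil.
func (p *Planner) Enqueue(key QueueKey, blockID string) *Job {
	q := append(p.queues[key], blockID)
	if len(q) < p.batchSize {
		p.queues[key] = q
		return nil
	}
	// Take the oldest blocks first (FIFO) and leave the rest queued.
	job := &Job{Key: key, Blocks: append([]string(nil), q[:p.batchSize]...)}
	p.queues[key] = q[p.batchSize:]
	return job
}

func main() {
	p := NewPlanner(3)
	key := QueueKey{Tenant: "tenant-a", Shard: 1, Level: 0}
	other := QueueKey{Tenant: "tenant-b", Shard: 1, Level: 0}
	for i, k := range []QueueKey{key, other, key, key} {
		if job := p.Enqueue(k, fmt.Sprintf("block-%d", i)); job != nil {
			fmt.Printf("job for %+v: %v\n", job.Key, job.Blocks)
		}
	}
}
```

Using the full key as the map key makes the boundary rule structural: there is no code path that could place blocks from two tenants, shards, or levels in the same job.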

Data layout

Profiling data from each service is stored as a separate dataset within a block. During compaction:

Matching datasets from source blocks are merged
TSDB indexes are combined
Symbols and profile tables are merged and rewritten
Output block contains optimized, non-overlapping datasets
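
The dataset-level merge can be pictured with a heavily simplified model. In the sketch below, a dataset is reduced to a service name, a list of symbol names, and a sample count; these fields are illustrative assumptions and do not reflect the real block format, which also involves TSDB indexes and profile tables.

```go
package main

import (
	"fmt"
	"sort"
)

// Dataset is a simplified stand-in for one service's data in a block.
type Dataset struct {
	Service string
	Symbols []string // symbol names referenced by the dataset
	Samples int      // stand-in for the profile tables
}

// Block holds one dataset per service.
type Block struct {
	Datasets []Dataset
}

// MergeBlocks merges datasets that share a service name, deduplicating
// symbols, so the output holds exactly one dataset per service.
func MergeBlocks(sources []Block) Block {
	byService := map[string]*Dataset{}
	seen := map[string]map[string]bool{}
	for _, b := range sources {
		for _, ds := range b.Datasets {
			out, ok := byService[ds.Service]
			if !ok {
				out = &Dataset{Service: ds.Service}
				byService[ds.Service] = out
				seen[ds.Service] = map[string]bool{}
			}
			out.Samples += ds.Samples
			for _, sym := range ds.Symbols {
				if !seen[ds.Service][sym] {
					seen[ds.Service][sym] = true
					out.Symbols = append(out.Symbols, sym)
				}
			}
		}
	}
	// Sort by service name so the output is deterministic.
	names := make([]string, 0, len(byService))
	for name := range byService {
		names = append(names, name)
	}
	sort.Strings(names)
	var merged Block
	for _, name := range names {
		ds := byService[name]
		sort.Strings(ds.Symbols)
		merged.Datasets = append(merged.Datasets, *ds)
	}
	return merged
}

func main() {
	a := Block{Datasets: []Dataset{
		{Service: "api", Symbols: []string{"main", "handle"}, Samples: 10},
	}}
	b := Block{Datasets: []Dataset{
		{Service: "api", Symbols: []string{"main", "encode"}, Samples: 5},
		{Service: "worker", Symbols: []string{"run"}, Samples: 7},
	}}
	fmt.Printf("%+v\n", MergeBlocks([]Block{a, b}))
}
```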

Job scheduler

The scheduler uses a Small Job First strategy:

Lower-level blocks are prioritized because they are smaller and contribute more to read amplification.
Within a level, unassigned jobs are processed first.
Jobs with fewer failures are prioritized.
Jobs with earlier lease expiration are considered first.
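
One way to express these rules is as a single comparison function that applies them in the order listed. The sketch below assumes that each rule acts as a tie-breaker for the previous one; the `Job` fields and function names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// Job carries the attributes the scheduler compares. Field names are illustrative.
type Job struct {
	ID             string
	Level          uint32
	Assigned       bool
	Failures       int
	LeaseExpiresAt int64 // Unix seconds; zero for unassigned jobs
}

// less reports whether a should be scheduled before b.
func less(a, b Job) bool {
	if a.Level != b.Level {
		return a.Level < b.Level // lower levels first
	}
	if a.Assigned != b.Assigned {
		return !a.Assigned // unassigned jobs first
	}
	if a.Failures != b.Failures {
		return a.Failures < b.Failures // fewer failures first
	}
	return a.LeaseExpiresAt < b.LeaseExpiresAt // earlier expiration first
}

// Prioritize sorts jobs in place, keeping the original order for full ties.
func Prioritize(jobs []Job) {
	sort.SliceStable(jobs, func(i, j int) bool { return less(jobs[i], jobs[j]) })
}

func main() {
	jobs := []Job{
		{ID: "d", Level: 0, Assigned: true, LeaseExpiresAt: 200},
		{ID: "e", Level: 1},
		{ID: "c", Level: 0, Assigned: true, LeaseExpiresAt: 100},
		{ID: "b", Level: 0, Assigned: true, Failures: 2, LeaseExpiresAt: 50},
		{ID: "a", Level: 0},
	}
	Prioritize(jobs)
	for _, j := range jobs {
		fmt.Print(j.ID, " ")
	}
	fmt.Println()
}
```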

Adaptive capacity

Workers specify available capacity when polling for jobs. The scheduler:

Creates jobs based on reported worker capacity
Balances queue size with worker utilization
Adapts to available resources automatically
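
The core of the capacity contract is that a worker never receives more jobs than it reported room for. A minimal sketch, with a hypothetical `Scheduler` type and `Poll` method:

```go
package main

import "fmt"

// Scheduler holds jobs that are ready to be handed out. Names are hypothetical.
type Scheduler struct {
	ready []string
}

// Poll returns at most `capacity` jobs, so a worker never receives
// more work than it reported room for.
func (s *Scheduler) Poll(capacity int) []string {
	if capacity <= 0 || len(s.ready) == 0 {
		return nil
	}
	n := capacity
	if n > len(s.ready) {
		n = len(s.ready)
	}
	assigned := append([]string(nil), s.ready[:n]...)
	s.ready = s.ready[n:]
	return assigned
}

func main() {
	s := &Scheduler{ready: []string{"job-1", "job-2", "job-3"}}
	fmt.Println(s.Poll(2)) // a worker with room for two jobs
	fmt.Println(s.Poll(0)) // a saturated worker receives nothing
	fmt.Println(s.Poll(5)) // a large worker drains what is left
}
```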

Job ownership

Jobs are assigned using a lease-based model:

Lease duration: Workers are granted ownership for a limited time
Fencing tokens: The Raft log index serves as a unique token
Lease refresh: Workers must refresh leases before expiration
Reassignment: Expired leases allow job reassignment
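
The following sketch shows how leases and fencing tokens interact. It is an illustration under stated assumptions: a simple counter simulates the Raft log index, timestamps are plain Unix seconds passed in by the caller, and the `Scheduler`, `Assign`, `Refresh`, and `Complete` names are hypothetical.

```go
package main

import "fmt"

// Lease records who owns a job, until when, and under which fencing token.
type Lease struct {
	Owner     string
	Token     uint64 // fencing token; stands in for the Raft log index
	ExpiresAt int64  // Unix seconds
}

// Scheduler tracks one lease per job. raftIndex is a simulated log index
// that grows with every assignment, so newer leases get larger tokens.
type Scheduler struct {
	raftIndex    uint64
	leaseSeconds int64
	leases       map[string]Lease
}

func NewScheduler(leaseSeconds int64) *Scheduler {
	return &Scheduler{leaseSeconds: leaseSeconds, leases: map[string]Lease{}}
}

// Assign grants a lease unless another worker still holds a live one.
func (s *Scheduler) Assign(jobID, worker string, now int64) (Lease, bool) {
	if cur, ok := s.leases[jobID]; ok && now < cur.ExpiresAt {
		return Lease{}, false
	}
	s.raftIndex++
	l := Lease{Owner: worker, Token: s.raftIndex, ExpiresAt: now + s.leaseSeconds}
	s.leases[jobID] = l
	return l, true
}

// Refresh extends a lease if the token is current and the lease has not expired.
func (s *Scheduler) Refresh(jobID string, token uint64, now int64) bool {
	cur, ok := s.leases[jobID]
	if !ok || cur.Token != token || now >= cur.ExpiresAt {
		return false
	}
	cur.ExpiresAt = now + s.leaseSeconds
	s.leases[jobID] = cur
	return true
}

// Complete accepts a result only from the holder of the current token,
// which rejects late reports from a worker whose job was reassigned.
func (s *Scheduler) Complete(jobID string, token uint64) bool {
	cur, ok := s.leases[jobID]
	if !ok || cur.Token != token {
		return false
	}
	delete(s.leases, jobID)
	return true
}

func main() {
	s := NewScheduler(10)
	l1, _ := s.Assign("job-1", "worker-1", 0)
	// worker-1 stops refreshing; after expiry the job is reassigned.
	l2, ok := s.Assign("job-1", "worker-2", 10)
	fmt.Println("reassigned:", ok, "new token is larger:", l2.Token > l1.Token)
	fmt.Println("stale completion accepted:", s.Complete("job-1", l1.Token))
	fmt.Println("current completion accepted:", s.Complete("job-1", l2.Token))
}
```

The token check in `Complete` is what makes reassignment safe: even if the original worker finishes late, its report carries an outdated token and is ignored.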

Failure handling

When a worker fails:

The job lease expires.
The metastore detects the expired lease.
The job is reassigned to another worker.
Source blocks remain until compaction succeeds.

Jobs that repeatedly fail are deprioritized to prevent blocking the queue.

Job status lifecycle

stateDiagram-v2
    [*] --> Unassigned : Create Job
    Unassigned --> InProgress : Assign Job
    InProgress --> Success : Job Completed
    InProgress --> LeaseExpired: Job Lease Expires
    LeaseExpired: Abandoned Job

    LeaseExpired --> Excluded: Failure Threshold Exceeded
    Excluded: Faulty Job

    Success --> [*] : Remove Job from Schedule
    LeaseExpired --> InProgress : Reassign Job
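
The diagram can be encoded as a transition table. The sketch below is one possible representation, with the state names taken from the diagram and everything else (the `Status` type, the `transitions` map, `CanTransition`) assumed for illustration.

```go
package main

import "fmt"

// Status mirrors the states in the diagram.
type Status string

const (
	Unassigned   Status = "Unassigned"
	InProgress   Status = "InProgress"
	Success      Status = "Success"
	LeaseExpired Status = "LeaseExpired"
	Excluded     Status = "Excluded"
)

// transitions lists the allowed next states for each state.
// Success and Excluded have no outgoing transitions in this table.
var transitions = map[Status][]Status{
	Unassigned:   {InProgress},            // assign job
	InProgress:   {Success, LeaseExpired}, // job completed, or lease expires
	LeaseExpired: {InProgress, Excluded},  // reassign, or failure threshold exceeded
}

// CanTransition reports whether the diagram allows moving from one state to another.
func CanTransition(from, to Status) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	path := []Status{Unassigned, InProgress, LeaseExpired, InProgress, Success}
	for i := 1; i < len(path); i++ {
		fmt.Printf("%s -> %s allowed: %v\n", path[i-1], path[i], CanTransition(path[i-1], path[i]))
	}
}
```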