
HPX Parallel Scheduler Implementation

Based on the P2079R10 Parallel Scheduler Proposal (2025)

Sai Charan Arvapally
University of Alberta, Canada

Abstract

This article explains how we implemented a parallel scheduler for HPX based on the P2079R10 proposal. The parallel scheduler provides a shared execution context for parallel work, solving critical problems like oversubscription and poor composability in traditional thread-based parallelism. We show how HPX's work-stealing runtime naturally maps to the parallel scheduler model, delivering high performance and excellent scalability.

📖 Deep Dive: Want to understand the internal architecture and how scheduling actually works? Check out the Parallel Scheduler Internals page for a comprehensive explanation of the type hierarchy, execution policies, task scheduling flow, and domain transformation pipeline.

Introduction

Modern C++ applications need efficient parallel execution. For years, developers struggled with manual thread management, thread pools, and executors that didn't work well together. The C++ execution model has evolved significantly, especially with the introduction of senders and receivers in P2300. The parallel scheduler proposal (P2079R10) takes this further by defining a standard way to execute parallel work efficiently.

This article explains how we built a parallel scheduler for HPX that follows the P2079R10 design. We'll show you what problems it solves, how it works, and why it matters for real-world C++ applications.


Motivation

Traditional Thread-Based Parallelism

Before modern execution models, C++ developers had limited options for parallel execution:

// Old approach: Manual thread management
void process_data(std::vector<int>& data) {
  std::vector<std::thread> threads;
  int num_threads = std::thread::hardware_concurrency();
  
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&, i]() {
      for (size_t j = i; j < data.size(); j += num_threads) {
        data[j] *= 2; // Process each element
      }
    });
  }
  
  for (auto& t : threads) t.join();
}

Major Problems:

  - Oversubscription: every component that spawns its own threads multiplies the total thread count
  - Manual chunking and load balancing that must be tuned by hand
  - Boilerplate for synchronization and error handling
  - Poor composability: code like this cannot cooperate with other parallel libraries

Why Earlier Executor Designs Failed

Early attempts at standardizing executors (like static_thread_pool) had limitations:

// Old static_thread_pool approach
static_thread_pool pool(4); // Fixed number of threads
auto ex = pool.executor();

// Problem 1: Each library creates its own pool
static_thread_pool library_a_pool(4);
static_thread_pool library_b_pool(4);
// Now you have 8 threads competing for 4 cores!

// Problem 2: Nested parallelism doesn't work well
for (int i = 0; i < 100; ++i) {
  ex.execute([&]() {
    // Inner parallel work blocks outer threads
    parallel_algorithm(data); // Deadlock risk!
  });
}

Key Issues:

  - Every library owns a private pool, so thread counts multiply across the process
  - Fixed thread counts cannot adapt to the machine or the workload
  - Nested parallelism can exhaust the pool and deadlock

The Evolution: Senders and Receivers

The C++ execution model evolved with P2300, introducing senders and receivers. This provided better composability, but still lacked a standard parallel execution context:

// P2300 senders/receivers - better, but still missing something
auto work = stdexec::just(42)
  | stdexec::then([](int x) { return x * 2; })
  | stdexec::bulk(1000, [](int i, int val) { /* work */ });

// Question: WHERE does this bulk work execute?
// Answer: We need a parallel scheduler!

What Was Still Missing:

  - A standard, shared execution context that answers "where does parallel work run?"
  - A forward progress guarantee strong enough for nested parallel algorithms
  - A way for every library in a process to share one pool instead of inventing its own


The Parallel Scheduler Concept (P2079R10)

Core Idea

P2079R10 proposes a parallel scheduler that provides a shared execution context for parallel work. Think of it as a system-wide thread pool that all components can use:

// Get the system's parallel scheduler
auto sched = stdexec::get_parallel_scheduler();

// Use it for parallel work
auto work = stdexec::schedule(sched)
  | stdexec::bulk(10000, [](int i) { /* parallel work */ });

// All libraries share the same execution context!

Key Design Goals

  1. Avoid Oversubscription: Single shared thread pool prevents too many threads
  2. Parallel Forward Progress: Guarantees that parallel work makes progress without deadlocks
  3. System Integration: Can use OS thread pools (like Windows thread pool, Grand Central Dispatch)
  4. Portability: Works across different platforms and hardware
  5. Replaceability: Applications can provide custom implementations

What is "Parallel Forward Progress"?

This is a key concept from P2079R10. It means that when you submit parallel work, the scheduler guarantees that all tasks will eventually make progress. This prevents deadlocks from nested parallelism:

// Nested parallelism - outer parallel work calls inner parallel work
auto sched = stdexec::get_parallel_scheduler();

stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(100, [sched](int i) {
        // Outer parallel work
        auto inner = stdexec::schedule(sched)
          | stdexec::bulk(50, [](int j) { /* inner work */ });
        stdexec::sync_wait(std::move(inner));
      }));

// With parallel forward progress: No deadlock!
// The scheduler ensures inner work can make progress
// even when outer work is using all threads

How It Fits with Senders/Receivers

The parallel scheduler integrates seamlessly with the P2300 sender/receiver model. It's just another scheduler, but with special properties that make it suitable for parallel algorithms:

// Parallel scheduler is a regular scheduler
auto par_sched = stdexec::get_parallel_scheduler();

// Works with all sender algorithms
auto work = stdexec::schedule(par_sched)
  | stdexec::then([]() { return load_data(); })
  | stdexec::bulk(1000, [](int i, auto data) { process(i, data); })
  | stdexec::then([](auto data) { return finalize(data); });

stdexec::sync_wait(std::move(work));

Our Implementation Using HPX

Why HPX?

HPX (High Performance ParalleX) is a C++ runtime system with a sophisticated work-stealing scheduler. It's a perfect fit for implementing the parallel scheduler concept because:

  - Its global thread pool already provides a shared execution context
  - Its lightweight user-level threads can be suspended and resumed cheaply, which is exactly what parallel forward progress requires
  - Its work-stealing scheduler balances load across cores automatically
  - It has been proven to scale to thousands of cores

Core Architecture

Our implementation has three main components:

// 1. parallel_scheduler - The main scheduler interface
class parallel_scheduler {
  std::shared_ptr<parallel_scheduler_backend> backend_;
public:
  // Standard scheduler interface
  auto schedule() const noexcept;
};

// 2. parallel_scheduler_backend - Abstract backend interface
class parallel_scheduler_backend {
public:
  virtual ~parallel_scheduler_backend() = default;
  
  // Core operations
  virtual void schedule(/*...*/) = 0;
  virtual void schedule_bulk_chunked(/*...*/) = 0;
  virtual void schedule_bulk_unchunked(/*...*/) = 0;
};

// 3. hpx_parallel_scheduler_backend - HPX implementation
class hpx_parallel_scheduler_backend : public parallel_scheduler_backend {
  thread_pool_policy_scheduler underlying_; // HPX work-stealing scheduler
};

This layered design provides replaceability - applications can provide custom backends while maintaining the same interface.

How HPX Maps to the Parallel Scheduler Model

HPX's features naturally align with P2079R10 requirements:

P2079R10 Requirement        | HPX Feature
----------------------------|-------------------------------------------
Shared execution context    | Global thread pool with work-stealing
Parallel forward progress   | Lightweight threads + task suspension
Efficient bulk operations   | Optimized chunking strategies
Scalability                 | Proven to scale to thousands of cores

Key Implementation Details

1. Scheduler Abstraction

// Users get the parallel scheduler through a factory function
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Internally, this creates a scheduler wrapping HPX's thread pool
inline parallel_scheduler get_parallel_scheduler() {
  static auto backend = query_parallel_scheduler_backend();
  return parallel_scheduler{backend};
}

2. Task Submission

// When you schedule work, it goes through the backend
auto sender = sched.schedule(); // Returns a sender

// For bulk operations, we use optimized HPX bulk execution
auto bulk_work = sender | stdexec::bulk(10000, [](int i) {
  // This work runs on HPX's thread pool
  // with work-stealing and chunking optimizations
});

3. Domain Customization for Performance

// The parallel_scheduler has a custom domain
// that intercepts bulk operations and optimizes them
struct parallel_scheduler_domain {
  template<bulk_sender Sender>
  auto transform_sender(Sender&& snd) const {
    // Transform to HPX's optimized bulk sender
    // Uses work-stealing, chunking, NUMA awareness
    return hpx_optimized_bulk_sender{/*...*/};
  }
};

Before vs After: Key Changes

Let's compare how parallel execution works before and after implementing the parallel scheduler:

Aspect             | Before (Manual Threads)                                     | After (Parallel Scheduler)
-------------------|-------------------------------------------------------------|---------------------------------------------------
Resource Usage     | Each library creates its own thread pool → oversubscription | Shared thread pool → optimal resource usage
Composability      | Poor - different systems don't work together                | Excellent - all use the same execution context
Nested Parallelism | Deadlock prone, thread pool exhaustion                      | Handled correctly with forward progress guarantee
Abstraction Level  | Low - manual thread management                              | High - declarative sender/receiver model
Scalability        | Limited by fixed thread counts                              | Scales to thousands of cores with work-stealing
Performance        | Inconsistent, depends on manual tuning                      | Optimized automatically with chunking & work-stealing

Detailed Comparison: Resource Usage

Before (Manual Thread Management):

// Library A creates its own thread pool
std::vector<std::thread> library_a_threads(4);

// Library B creates its own thread pool
std::vector<std::thread> library_b_threads(4);

// Your application creates threads too
std::vector<std::thread> app_threads(4);

// Result: 12 threads competing for 4 cores!
// Massive context switching overhead

After (Parallel Scheduler):

// Everyone uses the same parallel scheduler
auto sched = stdexec::get_parallel_scheduler();

// Library A uses it
stdexec::schedule(sched) | library_a_work();

// Library B uses it
stdexec::schedule(sched) | library_b_work();

// Your application uses it
stdexec::schedule(sched) | app_work();

// Result: Shared thread pool with optimal thread count
// Work-stealing ensures efficient load balancing

Detailed Comparison: Nested Parallelism

Before (Deadlock Risk):

static_thread_pool pool(4); // Only 4 threads

// Outer parallel work uses all 4 threads
for (int i = 0; i < 4; ++i) {
  pool.execute([&]() {
    // Inner parallel work needs threads too
    for (int j = 0; j < 4; ++j) {
      pool.execute([&]() { inner_work(); });
    }
    wait_for_inner_work(); // DEADLOCK! No threads available
  });
}

After (Parallel Forward Progress):

auto sched = stdexec::get_parallel_scheduler();

// Outer parallel work
stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(4, [sched](int i) {
        // Inner parallel work
        auto inner = stdexec::schedule(sched)
          | stdexec::bulk(4, [](int j) { inner_work(); });
        stdexec::sync_wait(std::move(inner));
        // No deadlock! HPX suspends the outer task and runs the inner work
      }));

Abstraction Level Improvement

The parallel scheduler raises the abstraction level significantly. Instead of managing threads manually, you work with high-level sender/receiver operations:

Before (Low-Level):

// Manual thread management, synchronization, error handling
std::vector<std::thread> threads;
std::mutex mtx;
std::exception_ptr error;

for (int i = 0; i < num_threads; ++i) {
  threads.emplace_back([&, i]() {
    try {
      process_chunk(i);
    } catch (...) {
      std::lock_guard lock(mtx);
      error = std::current_exception();
    }
  });
}

for (auto& t : threads) t.join();
if (error) std::rethrow_exception(error);

After (High-Level):

// Declarative, composable, automatic error propagation
auto sched = stdexec::get_parallel_scheduler();

stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(num_tasks, [](int i) { process_chunk(i); }));

// Errors automatically propagated, resources automatically managed

How It Works in Practice

Basic Usage Workflow

Step 1: Get the Parallel Scheduler

// Get the system's parallel scheduler
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// This returns a shared scheduler backed by HPX's thread pool
// All calls to get_parallel_scheduler() return the same logical scheduler

Step 2: Submit Parallel Work

// Create a sender that schedules work on the parallel scheduler
auto work = stdexec::schedule(sched)
  | stdexec::bulk(10000, [](int i) {
       // Process element i
       process_element(i);
     });

// The work is described but not yet executed

Step 3: Execute and Synchronize

// Execute the work and wait for completion
stdexec::sync_wait(std::move(work));

// Or chain more operations before executing
auto result = stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(1000, process_fn)
    | stdexec::then([]() { return finalize(); }));

Real-World Usage Scenarios

Scenario 1: CPU-Bound Data Processing

// Process a large dataset in parallel
std::vector<double> data(1000000);
auto sched = hpx::execution::experimental::get_parallel_scheduler();

stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(data.size(), [&data](size_t i) {
        // Complex computation on each element
        data[i] = std::sqrt(data[i] * data[i] + 1.0);
      }));

// HPX automatically chunks the work and distributes it efficiently

Scenario 2: Parallel Algorithms

// Use parallel scheduler with standard algorithms
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Parallel transform
stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(input.size(), [&](size_t i) {
        output[i] = transform_fn(input[i]);
      }));

Scenario 3: Large-Scale Scientific Computing

// Simulate particles in parallel
struct Particle { /* ... */ };
std::vector<Particle> particles(1000000);

auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Run simulation steps
for (int step = 0; step < 1000; ++step) {
  stdexec::sync_wait(
    stdexec::schedule(sched)
      | stdexec::bulk(particles.size(), [&](size_t i) {
          particles[i].update_position();
          particles[i].update_velocity();
        }));
}

// Scales efficiently to thousands of cores

Under the Hood: How Scheduling is Implemented

HPX's parallel_scheduler is a thin domain-aware wrapper around the existing thread_pool_policy_scheduler<hpx::launch>. Rather than building an entirely new execution backend, it reuses HPX's mature thread pool infrastructure and plugs into stdexec's domain-based sender transformation pipeline.

Class Hierarchy & Object Model

Here is a breakdown of how the different components relate to each other:

// The parallel_scheduler wrapper
parallel_scheduler
  └── shared_ptr<parallel_scheduler_backend>   (type-erased, heap-allocated)

// The abstract backend interface
parallel_scheduler_backend (abstract)
  ├── schedule(receiver_proxy&)
  ├── schedule_bulk_chunked(n, bulk_proxy&)
  ├── schedule_bulk_unchunked(n, bulk_proxy&)
  ├── equal_to(other&) → bool
  ├── get_underlying_scheduler() → const thread_pool_policy_scheduler*
  └── get_pu_mask() → const mask_type*

// The concrete default implementation
hpx_parallel_scheduler_backend (concrete default)
  └── wraps thread_pool_policy_scheduler + mask_type

// The factory function
get_parallel_scheduler()
  └── queries query_parallel_scheduler_backend()
  └── wraps the returned shared_ptr in a parallel_scheduler

// Operation state for connecting senders and receivers
operation_state<Receiver>
  ├── receiver_
  ├── backend_  (shared_ptr to backend)
  ├── proxy_    (concrete_receiver_proxy, member — not local)
  └── start() → backend_->schedule(proxy_)

// Scheduler equality comparison
operator== → backend_->equal_to(*other.backend_)

Execution Flow: How Policies Are Handled

When you use the parallel scheduler, here is the exact flow of how policies (like bulk dispatching) are implemented under the hood:

get_parallel_scheduler()
    |
    v
parallel_scheduler  (owns thread_pool_policy_scheduler + cached PU mask)
    |
    |-- schedule() --> sender<parallel_scheduler>
    |                    |-- env exposes: get_completion_scheduler<set_value_t>
    |                    |                get_domain --> parallel_scheduler_domain
    |                    |-- connect() --> operation_state
    |                                        |-- start() --> checks stop_token
    |                                                        delegates to thread_pool.execute()
    |
    |-- bulk / bulk_chunked / bulk_unchunked
         |-- parallel_scheduler_domain::transform_sender()
              |-- extracts underlying thread_pool_policy_scheduler
              |-- creates thread_pool_bulk_sender<..., IsChunked, IsParallel>
                   |-- uses work-stealing index queues
                   |-- NUMA-aware thread placement
                   |-- main-thread participation

Sequential vs Parallel Execution Policies (seq / par)

A crucial part of scheduling tasks effectively is deciding whether execution should happen sequentially (hpx::execution::seq) or in parallel (hpx::execution::par). Within the parallel_scheduler infrastructure, this choice maps onto the IsParallel template parameter of thread_pool_bulk_sender: par selects the work-stealing parallel path across all worker threads, while seq runs the iterations in order on a single worker.


Why This Matters

Performance Benefits

1. Eliminates Oversubscription

By sharing a single thread pool across all libraries and components, the parallel scheduler prevents the performance degradation caused by too many threads competing for CPU cores.

2. Work-Stealing Efficiency

HPX's work-stealing scheduler automatically balances load across threads. When one thread finishes its work, it steals tasks from busy threads.

3. Optimized Bulk Operations

The parallel scheduler uses intelligent chunking strategies to minimize overhead.

Maintainability Benefits

Simpler Code

The high-level sender/receiver model eliminates boilerplate.

Better Composability

Different libraries can work together seamlessly.

Portability Benefits

Platform Independence

The parallel scheduler abstraction works across different platforms.

Alignment with Future C++ Concurrency

The parallel scheduler is part of the evolution toward structured concurrency in C++.

By implementing the parallel scheduler now, HPX users are ready for the future of C++ concurrency.


Design Decisions and Replaceability

Why Replaceability Matters

P2079R10 emphasizes that the parallel scheduler should be replaceable. This means applications can provide their own implementation while maintaining the same interface.

Our Replaceability Architecture

We implemented replaceability through a backend abstraction layer:

// Abstract backend interface
class parallel_scheduler_backend {
public:
  virtual ~parallel_scheduler_backend() = default;
  
  // Core scheduling operations
  virtual void schedule(/*...*/) = 0;
  virtual void schedule_bulk_chunked(/*...*/) = 0;
  virtual void schedule_bulk_unchunked(/*...*/) = 0;
  
  // Equality and introspection
  virtual bool equal_to(const parallel_scheduler_backend*) const = 0;
};

// Default HPX implementation
class hpx_parallel_scheduler_backend : public parallel_scheduler_backend {
  thread_pool_policy_scheduler underlying_;
public:
  // Implements all virtual methods using HPX's thread pool
};

How to Provide a Custom Backend

Applications can replace the default backend:

// 1. Implement your custom backend
class my_custom_backend : public parallel_scheduler_backend {
  // Implement all virtual methods
  // Could use a different thread pool, GPU executor, etc.
};

// 2. Set your backend as the factory
hpx::execution::experimental::set_parallel_scheduler_backend_factory(
  []() { return std::make_shared<my_custom_backend>(); }
);

// 3. Now all calls to get_parallel_scheduler() use your backend
auto sched = hpx::execution::experimental::get_parallel_scheduler();
// Uses my_custom_backend internally!

Benefits of This Design

1. Flexibility

A custom backend can target different execution resources - a different thread pool, a GPU executor, or an OS-provided scheduler - without changing any user code.

2. Testability

Tests can inject a deterministic or instrumented backend to verify scheduling behavior in isolation.

3. Platform Integration

Platform vendors can route work to native facilities such as the Windows thread pool or Grand Central Dispatch, as P2079R10 envisions.

Key Design Decisions

Shared Pointer for Backend

The parallel scheduler stores a shared_ptr<parallel_scheduler_backend>. This allows:

  - Cheap copies: scheduler objects are small handles that share one backend
  - Type erasure: the concrete backend type stays hidden behind a stable interface
  - Lifetime safety: the backend outlives every scheduler copy that refers to it

Virtual Interface for Operations

Using virtual methods allows runtime polymorphism: the backend can be selected or replaced at runtime without recompiling user code, and the virtual dispatch cost is paid once per submission rather than once per element.

Domain Customization for Performance

The parallel scheduler has a custom domain that optimizes bulk operations: transform_sender intercepts bulk, bulk_chunked, and bulk_unchunked and lowers them to HPX's thread_pool_bulk_sender, which applies chunking, work-stealing, and NUMA-aware placement automatically.


Best Practices

Using the Parallel Scheduler

Do:

  - Obtain the scheduler through get_parallel_scheduler() so every component shares the same execution context
  - Use bulk for data-parallel loops and let HPX handle chunking and load balancing
  - Compose work with sender adaptors; errors and cancellation propagate automatically

Don't:

  - Create private thread pools alongside the parallel scheduler - that reintroduces oversubscription
  - Block worker threads with raw joins, sleeps, or busy waits

Performance Tips

Do:

  - Give each bulk iteration enough work to amortize scheduling overhead
  - Reuse one scheduler handle instead of re-querying it inside hot loops

Don't:

  - Launch one tiny schedule() task per element; use bulk so iterations are chunked
  - Hard-code thread counts; the shared pool already matches the machine


Conclusion

The parallel scheduler represents a major step forward in C++ parallel programming. By providing a shared execution context with parallel forward progress guarantees, it solves fundamental problems that have plagued parallel C++ code for years.

What we've achieved:

  - A P2079R10-style parallel scheduler built on HPX's work-stealing runtime
  - A shared execution context that eliminates oversubscription across libraries
  - Deadlock-free nested parallelism through the parallel forward progress guarantee
  - Optimized bulk execution via domain-based sender transformation
  - A replaceable backend architecture for custom execution contexts

Our HPX-based implementation demonstrates that the P2079R10 parallel scheduler concept is practical and delivers real performance benefits. The combination of HPX's proven runtime with the standard sender/receiver model creates a powerful foundation for modern C++ parallel programming.

As C++ continues to evolve toward structured concurrency, the parallel scheduler will play a central role. By adopting it now, you're preparing your codebase for the future of C++ while gaining immediate performance and maintainability benefits.


Special Thanks

My deepest thanks go to Hartmut Kaiser and Isidoros Tsaousis-Seiras for their mentorship and invaluable insights throughout this project.


References

  1. [P2079R10] System Execution Context - Parallel Scheduler Proposal
  2. [P2300R10] std::execution - Senders and Receivers
  3. [HPX] HPX - High Performance ParalleX Runtime System
  4. [stdexec] NVIDIA stdexec - Reference Implementation
  5. [Project #6655] HPX Parallel Scheduler Implementation