Sai Charan Arvapally
University of Alberta, Canada
This article explains how we implemented a parallel scheduler for HPX based on the P2079R10 proposal. The parallel scheduler provides a shared execution context for parallel work, solving critical problems like oversubscription and poor composability in traditional thread-based parallelism. We show how HPX's work-stealing runtime naturally maps to the parallel scheduler model, delivering high performance and excellent scalability.
📖 Deep Dive: Want to understand the internal architecture and how scheduling actually works? Check out the Parallel Scheduler Internals page for a comprehensive explanation of the type hierarchy, execution policies, task scheduling flow, and domain transformation pipeline.
Modern C++ applications need efficient parallel execution. For years, developers struggled with manual thread management, thread pools, and executors that didn't work well together. The C++ execution model has evolved significantly, especially with the introduction of senders and receivers in P2300. The parallel scheduler proposal (P2079R10) takes this further by defining a standard way to execute parallel work efficiently.
This article explains how we built a parallel scheduler for HPX that follows the P2079R10 design. We'll show you what problems it solves, how it works, and why it matters for real-world C++ applications.
Before modern execution models, C++ developers had limited options for parallel execution:
```cpp
// Old approach: Manual thread management
void process_data(std::vector<int>& data)
{
    std::vector<std::thread> threads;
    int num_threads = std::thread::hardware_concurrency();

    for (int i = 0; i < num_threads; ++i)
    {
        threads.emplace_back([&, i]() {
            for (size_t j = i; j < data.size(); j += num_threads)
            {
                data[j] *= 2;    // Process each element
            }
        });
    }

    for (auto& t : threads)
        t.join();
}
```
Major Problems:

- Oversubscription: every component spins up its own threads, so threads outnumber cores
- Manual lifetime management: every thread must be tracked and joined by hand
- No composability: independent components cannot share execution resources
- Error handling: exceptions thrown on worker threads must be captured and rethrown manually
Early attempts at standardizing executors (like static_thread_pool) had limitations:
```cpp
// Old static_thread_pool approach
static_thread_pool pool(4);    // Fixed number of threads
auto ex = pool.executor();

// Problem 1: Each library creates its own pool
static_thread_pool library_a_pool(4);
static_thread_pool library_b_pool(4);
// Now you have 8 threads competing for 4 cores!

// Problem 2: Nested parallelism doesn't work well
for (int i = 0; i < 100; ++i)
{
    ex.execute([&]() {
        // Inner parallel work blocks outer threads
        parallel_algorithm(data);    // Deadlock risk!
    });
}
```
Key Issues:

- Each library creates its own fixed-size pool, leading to oversubscription
- Nested parallelism risks deadlock when inner work waits on threads the outer work already occupies
- Fixed thread counts cannot adapt to the actual workload
The C++ execution model evolved with P2300, introducing senders and receivers. This provided better composability, but still lacked a standard parallel execution context:
```cpp
// P2300 senders/receivers - better, but still missing something
auto work = stdexec::just(42)
          | stdexec::then([](int x) { return x * 2; })
          | stdexec::bulk(1000, [](int i, int val) { /* work */ });

// Question: WHERE does this bulk work execute?
// Answer: We need a parallel scheduler!
```
What Was Still Missing:

- A standard, shared execution context where bulk work actually runs
- A forward progress guarantee that makes nested parallelism safe
- A way for independent libraries to agree on the same scheduler
P2079R10 proposes a parallel scheduler that provides a shared execution context for parallel work. Think of it as a system-wide thread pool that all components can use:
```cpp
// Get the system's parallel scheduler
auto sched = stdexec::get_parallel_scheduler();

// Use it for parallel work
auto work = stdexec::schedule(sched)
          | stdexec::bulk(10000, [](int i) { /* parallel work */ });

// All libraries share the same execution context!
```
This is a key concept from P2079R10. It means that when you submit parallel work, the scheduler guarantees that all tasks will eventually make progress. This prevents deadlocks from nested parallelism:
```cpp
// Nested parallelism - outer parallel work calls inner parallel work
auto sched = stdexec::get_parallel_scheduler();

stdexec::sync_wait(
    stdexec::schedule(sched)
  | stdexec::bulk(100, [sched](int i) {
        // Outer parallel work
        auto inner = stdexec::schedule(sched)
                   | stdexec::bulk(50, [](int j) { /* inner work */ });
        stdexec::sync_wait(std::move(inner));
    }));

// With parallel forward progress: No deadlock!
// The scheduler ensures inner work can make progress
// even when outer work is using all threads
```
The parallel scheduler integrates seamlessly with the P2300 sender/receiver model. It's just another scheduler, but with special properties that make it suitable for parallel algorithms:
```cpp
// Parallel scheduler is a regular scheduler
auto par_sched = stdexec::get_parallel_scheduler();

// Works with all sender algorithms
auto work = stdexec::schedule(par_sched)
          | stdexec::then([]() { return load_data(); })
          | stdexec::bulk(1000, [](int i, auto& data) { process(i, data); })
          | stdexec::then([](auto data) { return finalize(data); });

stdexec::sync_wait(std::move(work));
```
HPX (High Performance ParalleX) is a C++ runtime system with a sophisticated work-stealing scheduler. It's a natural fit for implementing the parallel scheduler concept because:

- Its global thread pool already provides a shared execution context
- Lightweight (user-level) threads and task suspension deliver parallel forward progress
- Its bulk execution machinery implements optimized chunking strategies
- Its runtime has been proven to scale to thousands of cores
Our implementation has three main components:
```cpp
// 1. parallel_scheduler - The main scheduler interface
class parallel_scheduler
{
    std::shared_ptr<parallel_scheduler_backend> backend_;

public:
    // Standard scheduler interface
    auto schedule() const noexcept;
};

// 2. parallel_scheduler_backend - Abstract backend interface
class parallel_scheduler_backend
{
public:
    virtual ~parallel_scheduler_backend() = default;

    // Core operations
    virtual void schedule(/*...*/) = 0;
    virtual void schedule_bulk_chunked(/*...*/) = 0;
    virtual void schedule_bulk_unchunked(/*...*/) = 0;
};

// 3. hpx_parallel_scheduler_backend - HPX implementation
class hpx_parallel_scheduler_backend : public parallel_scheduler_backend
{
    thread_pool_policy_scheduler underlying_;    // HPX work-stealing scheduler
};
```
This layered design provides replaceability - applications can provide custom backends while maintaining the same interface.
HPX's features naturally align with P2079R10 requirements:
| P2079R10 Requirement | HPX Feature |
|---|---|
| Shared execution context | Global thread pool with work-stealing |
| Parallel forward progress | Lightweight threads + task suspension |
| Efficient bulk operations | Optimized chunking strategies |
| Scalability | Proven to scale to thousands of cores |
1. Scheduler Abstraction
```cpp
// Users get the parallel scheduler through a factory function
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Internally, this creates a scheduler wrapping HPX's thread pool
inline parallel_scheduler get_parallel_scheduler()
{
    static auto backend = query_parallel_scheduler_backend();
    return parallel_scheduler{backend};
}
```
2. Task Submission
```cpp
// When you schedule work, it goes through the backend
auto sender = sched.schedule();    // Returns a sender

// For bulk operations, we use optimized HPX bulk execution
auto bulk_work = sender | stdexec::bulk(10000, [](int i) {
    // This work runs on HPX's thread pool
    // with work-stealing and chunking optimizations
});
```
3. Domain Customization for Performance
```cpp
// The parallel_scheduler has a custom domain
// that intercepts bulk operations and optimizes them
struct parallel_scheduler_domain
{
    template <bulk_sender Sender>
    auto transform_sender(Sender&& snd) const
    {
        // Transform to HPX's optimized bulk sender
        // Uses work-stealing, chunking, NUMA awareness
        return hpx_optimized_bulk_sender{/*...*/};
    }
};
```
Let's compare how parallel execution works before and after implementing the parallel scheduler:
| Aspect | Before (Manual Threads) | After (Parallel Scheduler) |
|---|---|---|
| Resource Usage | Each library creates own thread pool → oversubscription | Shared thread pool → optimal resource usage |
| Composability | Poor - different systems don't work together | Excellent - all use same execution context |
| Nested Parallelism | Deadlock prone, thread pool exhaustion | Handled correctly with forward progress guarantee |
| Abstraction Level | Low - manual thread management | High - declarative sender/receiver model |
| Scalability | Limited by fixed thread counts | Scales to thousands of cores with work-stealing |
| Performance | Inconsistent, depends on manual tuning | Optimized automatically with chunking & work-stealing |
Before (Manual Thread Management):
```cpp
// Library A creates its own thread pool
std::vector<std::thread> library_a_threads(4);

// Library B creates its own thread pool
std::vector<std::thread> library_b_threads(4);

// Your application creates threads too
std::vector<std::thread> app_threads(4);

// Result: 12 threads competing for 4 cores!
// Massive context switching overhead
```
After (Parallel Scheduler):
```cpp
// Everyone uses the same parallel scheduler
auto sched = stdexec::get_parallel_scheduler();

// Library A uses it
stdexec::schedule(sched) | library_a_work();

// Library B uses it
stdexec::schedule(sched) | library_b_work();

// Your application uses it
stdexec::schedule(sched) | app_work();

// Result: Shared thread pool with optimal thread count
// Work-stealing ensures efficient load balancing
```
Before (Deadlock Risk):
```cpp
static_thread_pool pool(4);    // Only 4 threads

// Outer parallel work uses all 4 threads
for (int i = 0; i < 4; ++i)
{
    pool.execute([&]() {
        // Inner parallel work needs threads too
        for (int j = 0; j < 4; ++j)
        {
            pool.execute([&]() { inner_work(); });
        }
        wait_for_inner_work();    // DEADLOCK! No threads available
    });
}
```
After (Parallel Forward Progress):
```cpp
auto sched = stdexec::get_parallel_scheduler();

// Outer parallel work
stdexec::sync_wait(
    stdexec::schedule(sched)
  | stdexec::bulk(4, [sched](int i) {
        // Inner parallel work
        auto inner = stdexec::schedule(sched)
                   | stdexec::bulk(4, [](int j) { inner_work(); });
        stdexec::sync_wait(std::move(inner));
        // No deadlock! HPX suspends the outer task and runs the inner work
    }));
```
The parallel scheduler raises the abstraction level significantly. Instead of managing threads manually, you work with high-level sender/receiver operations:
Before (Low-Level):
```cpp
// Manual thread management, synchronization, error handling
std::vector<std::thread> threads;
std::mutex mtx;
std::exception_ptr error;

for (int i = 0; i < num_threads; ++i)
{
    threads.emplace_back([&, i]() {
        try
        {
            process_chunk(i);
        }
        catch (...)
        {
            std::lock_guard lock(mtx);
            error = std::current_exception();
        }
    });
}

for (auto& t : threads)
    t.join();
if (error)
    std::rethrow_exception(error);
```
After (High-Level):
```cpp
// Declarative, composable, automatic error propagation
auto sched = stdexec::get_parallel_scheduler();

stdexec::sync_wait(
    stdexec::schedule(sched)
  | stdexec::bulk(num_tasks, [](int i) { process_chunk(i); }));

// Errors automatically propagated, resources automatically managed
```
Step 1: Get the Parallel Scheduler
```cpp
// Get the system's parallel scheduler
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// This returns a shared scheduler backed by HPX's thread pool
// All calls to get_parallel_scheduler() return the same logical scheduler
```
Step 2: Submit Parallel Work
```cpp
// Create a sender that schedules work on the parallel scheduler
auto work = stdexec::schedule(sched)
          | stdexec::bulk(10000, [](int i) {
                // Process element i
                process_element(i);
            });

// The work is described but not yet executed
```
Step 3: Execute and Synchronize
```cpp
// Execute the work and wait for completion
stdexec::sync_wait(std::move(work));

// Or chain more operations before executing
auto result = stdexec::sync_wait(
    stdexec::schedule(sched)
  | stdexec::bulk(1000, process_fn)
  | stdexec::then([]() { return finalize(); }));
```
Scenario 1: CPU-Bound Data Processing
```cpp
// Process a large dataset in parallel
std::vector<double> data(1000000);
auto sched = hpx::execution::experimental::get_parallel_scheduler();

stdexec::sync_wait(
    stdexec::schedule(sched)
  | stdexec::bulk(data.size(), [&data](size_t i) {
        // Complex computation on each element
        data[i] = std::sqrt(data[i] * data[i] + 1.0);
    }));

// HPX automatically chunks the work and distributes it efficiently
```
Scenario 2: Parallel Algorithms
```cpp
// Use the parallel scheduler with standard algorithms
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Parallel transform
stdexec::sync_wait(
    stdexec::schedule(sched)
  | stdexec::bulk(input.size(), [&](size_t i) {
        output[i] = transform_fn(input[i]);
    }));
```
Scenario 3: Large-Scale Scientific Computing
```cpp
// Simulate particles in parallel
struct Particle { /* ... */ };
std::vector<Particle> particles(1000000);
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Run simulation steps
for (int step = 0; step < 1000; ++step)
{
    stdexec::sync_wait(
        stdexec::schedule(sched)
      | stdexec::bulk(particles.size(), [&](size_t i) {
            particles[i].update_position();
            particles[i].update_velocity();
        }));
}

// Scales efficiently to thousands of cores
```
HPX's parallel_scheduler is a thin domain-aware wrapper around the
existing
thread_pool_policy_scheduler<hpx::launch>. Rather than building an entirely new
execution backend, it reuses HPX's mature thread pool infrastructure and plugs
into stdexec's domain-based sender transformation pipeline.
Here is a breakdown of how the different components relate to each other:
```
// The parallel_scheduler wrapper
parallel_scheduler
└── shared_ptr<parallel_scheduler_backend>   (type-erased, heap-allocated)

// The abstract backend interface
parallel_scheduler_backend (abstract)
├── schedule(receiver_proxy&)
├── schedule_bulk_chunked(n, bulk_proxy&)
├── schedule_bulk_unchunked(n, bulk_proxy&)
├── equal_to(other&) → bool
├── get_underlying_scheduler() → const thread_pool_policy_scheduler*
└── get_pu_mask() → const mask_type*

// The concrete default implementation
hpx_parallel_scheduler_backend (concrete default)
└── wraps thread_pool_policy_scheduler + mask_type

// The factory function
get_parallel_scheduler()
└── queries query_parallel_scheduler_backend()
    └── wraps the returned shared_ptr in a parallel_scheduler

// Operation state for connecting senders and receivers
operation_state<Receiver>
├── receiver_
├── backend_   (shared_ptr to backend)
├── proxy_     (concrete_receiver_proxy, member, not local)
└── start() → backend_->schedule(proxy_)

// Scheduler equality comparison
operator== → backend_->equal_to(*other.backend_)
```
When you use the parallel scheduler, here is the exact flow of how policies (like bulk dispatching) are implemented under the hood:
```
get_parallel_scheduler()
    |
    v
parallel_scheduler (owns thread_pool_policy_scheduler + cached PU mask)
    |
    |-- schedule() --> sender<parallel_scheduler>
    |       |-- env exposes: get_completion_scheduler<set_value_t>
    |       |                get_domain --> parallel_scheduler_domain
    |       |-- connect() --> operation_state
    |               |-- start() --> checks stop_token
    |                               delegates to thread_pool.execute()
    |
    |-- bulk / bulk_chunked / bulk_unchunked
            |-- parallel_scheduler_domain::transform_sender()
                    |-- extracts underlying thread_pool_policy_scheduler
                    |-- creates thread_pool_bulk_sender<..., IsChunked, IsParallel>
                            |-- uses work-stealing index queues
                            |-- NUMA-aware thread placement
                            |-- main-thread participation
```
A crucial part of scheduling tasks effectively is deciding whether the execution should happen
sequentially (hpx::execution::seq) or in parallel (hpx::execution::par).
Within the parallel_scheduler infrastructure, this policy directly controls the underlying
bulk sender characteristics:
- Parallel (`hpx::execution::par`): When scheduled concurrently (mapping to `IsParallel = true` in the backend), work is distributed across the work-stealing queues of the HPX thread pool. Chunks of iterations are dispatched asynchronously to different worker threads, maximizing CPU utilization and fully leveraging NUMA-aware task placement.
- Sequential (`hpx::execution::seq`): When scheduled sequentially (`IsParallel = false`), the bulk operation executes inline. All iterations run synchronously on the calling thread without spawning auxiliary HPX tasks or incurring the synchronization overhead of parallel dispatch, which makes it very fast for tiny workloads.

1. Eliminates Oversubscription
By sharing a single thread pool across all libraries and components, the parallel scheduler prevents the performance degradation caused by too many threads competing for CPU cores.
2. Work-Stealing Efficiency
HPX's work-stealing scheduler automatically balances load across threads. When one thread finishes its own work, it steals tasks from busy threads.
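As a toy illustration of the idea in standard C++ (all names here are hypothetical, and HPX's real scheduler is far more sophisticated, using lock-free deques and NUMA-aware victim selection), each worker owns a task queue and idle workers steal from the back of other workers' queues:

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Toy work-stealing pool: each worker owns a deque of tasks; an idle
// worker steals from the back of another worker's deque. Illustrative
// only; this is not HPX's actual implementation.
class toy_stealing_pool {
    struct worker_queue {
        std::mutex mtx;
        std::deque<std::function<void()>> tasks;
    };
    std::vector<worker_queue> queues_;
    std::atomic<int> pending_{0};

public:
    explicit toy_stealing_pool(std::size_t n) : queues_(n) {}

    // Place a task on a specific worker's queue (call before run()).
    void push(std::size_t worker, std::function<void()> task) {
        std::lock_guard lock(queues_[worker].mtx);
        queues_[worker].tasks.push_back(std::move(task));
        pending_.fetch_add(1);
    }

    // Run workers until every task, including stolen ones, has executed.
    void run() {
        std::vector<std::thread> workers;
        for (std::size_t id = 0; id < queues_.size(); ++id) {
            workers.emplace_back([this, id] {
                while (pending_.load() > 0) {
                    std::function<void()> task;
                    // 1) Try the worker's own queue first (locality).
                    {
                        std::lock_guard lock(queues_[id].mtx);
                        if (!queues_[id].tasks.empty()) {
                            task = std::move(queues_[id].tasks.front());
                            queues_[id].tasks.pop_front();
                        }
                    }
                    // 2) Otherwise steal from the back of another queue.
                    if (!task) {
                        for (std::size_t v = 0; v < queues_.size() && !task; ++v) {
                            if (v == id) continue;
                            std::lock_guard lock(queues_[v].mtx);
                            if (!queues_[v].tasks.empty()) {
                                task = std::move(queues_[v].tasks.back());
                                queues_[v].tasks.pop_back();
                            }
                        }
                    }
                    if (task) {
                        task();
                        pending_.fetch_sub(1);
                    } else {
                        std::this_thread::yield();
                    }
                }
            });
        }
        for (auto& t : workers) t.join();
    }
};
```

Even when all tasks are initially pushed onto one worker's queue, the other workers drain it by stealing, which is exactly the load-balancing effect described above.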
3. Optimized Bulk Operations
The parallel scheduler uses intelligent chunking strategies to minimize overhead.
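One common chunking heuristic (sketched here for illustration; this is not HPX's actual policy) is to create a few chunks per worker rather than one, so the work-stealing scheduler has slack to rebalance unevenly sized iterations:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Split [0, n) into half-open chunks [begin, end). Creating more chunks
// than workers (here 4x) leaves slack for work-stealing to rebalance.
// Illustrative heuristic only, not HPX's actual chunking policy.
std::vector<std::pair<std::size_t, std::size_t>>
make_chunks(std::size_t n, std::size_t workers,
            std::size_t chunks_per_worker = 4) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (n == 0 || workers == 0) return chunks;

    std::size_t target = std::min(n, workers * chunks_per_worker);
    std::size_t base = n / target;    // minimum chunk size
    std::size_t rem = n % target;     // first `rem` chunks get one extra
    std::size_t begin = 0;
    for (std::size_t c = 0; c < target; ++c) {
        std::size_t size = base + (c < rem ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}
```

For example, 1000 iterations on 4 workers yield 16 chunks of 62 or 63 iterations each, covering the range exactly once.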
Simpler Code
The high-level sender/receiver model eliminates boilerplate.
Better Composability
Different libraries can work together seamlessly.
Platform Independence
The parallel scheduler abstraction works across different platforms.
The parallel scheduler is part of the evolution toward structured concurrency in C++.
By implementing the parallel scheduler now, HPX users are ready for the future of C++ concurrency.
P2079R10 emphasizes that the parallel scheduler should be replaceable. This means applications can provide their own implementation while maintaining the same interface:
We implemented replaceability through a backend abstraction layer:
```cpp
// Abstract backend interface
class parallel_scheduler_backend
{
public:
    virtual ~parallel_scheduler_backend() = default;

    // Core scheduling operations
    virtual void schedule(/*...*/) = 0;
    virtual void schedule_bulk_chunked(/*...*/) = 0;
    virtual void schedule_bulk_unchunked(/*...*/) = 0;

    // Equality and introspection
    virtual bool equal_to(const parallel_scheduler_backend*) const = 0;
};

// Default HPX implementation
class hpx_parallel_scheduler_backend : public parallel_scheduler_backend
{
    thread_pool_policy_scheduler underlying_;

public:
    // Implements all virtual methods using HPX's thread pool
};
```
Applications can replace the default backend:
```cpp
// 1. Implement your custom backend
class my_custom_backend : public parallel_scheduler_backend
{
    // Implement all virtual methods
    // Could use a different thread pool, GPU executor, etc.
};

// 2. Set your backend as the factory
hpx::execution::experimental::set_parallel_scheduler_backend_factory(
    []() { return std::make_shared<my_custom_backend>(); });

// 3. Now all calls to get_parallel_scheduler() use your backend
auto sched = hpx::execution::experimental::get_parallel_scheduler();
// Uses my_custom_backend internally!
```
1. Flexibility: applications can swap in a different thread pool, a GPU executor, or another custom execution backend.

2. Testability: tests can inject a mock or instrumented backend without changing the code that uses the scheduler.

3. Platform Integration: platform-native execution contexts can be plugged in behind the same interface.
Shared Pointer for Backend
The parallel scheduler stores a `shared_ptr<parallel_scheduler_backend>`. This keeps the backend alive for as long as any scheduler copy references it and makes copying a scheduler cheap.
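A minimal sketch of this ownership pattern in standard C++ (the names here are hypothetical, not HPX's actual classes): scheduler copies share one backend, and equality delegates to it:

```cpp
#include <memory>

// Hypothetical stand-in for parallel_scheduler_backend.
struct backend {
    virtual ~backend() = default;
    // Default equality: two backends are equal if they are the same object.
    virtual bool equal_to(backend const& other) const { return this == &other; }
};

// Hypothetical stand-in for parallel_scheduler.
class scheduler {
    std::shared_ptr<backend> backend_;

public:
    explicit scheduler(std::shared_ptr<backend> b) : backend_(std::move(b)) {}

    // Copying a scheduler only bumps the reference count; the backend
    // stays alive as long as any copy (or external owner) exists.
    friend bool operator==(scheduler const& a, scheduler const& b) {
        return a.backend_->equal_to(*b.backend_);
    }

    long use_count() const { return backend_.use_count(); }
};
```

Two copies of the same scheduler compare equal because they share one backend, while schedulers created from different backends do not.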
Virtual Interface for Operations
Using virtual methods allows runtime polymorphism: the backend can be replaced at runtime without recompiling code that uses the scheduler.
Domain Customization for Performance
The parallel scheduler has a custom domain that optimizes bulk operations by transforming them into HPX's optimized bulk senders.
Do:

- Use `get_parallel_scheduler()` to obtain the shared scheduler
- Express data-parallel loops with `bulk` so HPX can chunk and balance the work

Don't:

- Create your own thread pools alongside the parallel scheduler; that reintroduces oversubscription
The parallel scheduler represents a major step forward in C++ parallel programming. By providing a shared execution context with parallel forward progress guarantees, it solves fundamental problems that have plagued parallel C++ code for years.
What we've achieved:

- A P2079R10-style parallel scheduler built on HPX's work-stealing runtime
- A shared execution context that eliminates oversubscription
- Parallel forward progress that makes nested parallelism safe
- Optimized bulk execution through domain-based sender transformation
- A replaceable backend abstraction for custom execution contexts
Our HPX-based implementation demonstrates that the P2079R10 parallel scheduler concept is practical and delivers real performance benefits. The combination of HPX's proven runtime with the standard sender/receiver model creates a powerful foundation for modern C++ parallel programming.
As C++ continues to evolve toward structured concurrency, the parallel scheduler will play a central role. By adopting it now, you're preparing your codebase for the future of C++ while gaining immediate performance and maintainability benefits.
My deepest thanks go to Hartmut Kaiser and Isidoros Tsaousis-Seiras for their mentorship and invaluable insights throughout this project.
std::execution - Senders and Receivers