
HPX Parallel Scheduler Implementation

Based on the P2079R10 Parallel Scheduler Proposal (2025)

Sai Charan Arvapally
University of Alberta, Canada

Abstract

This article explains how we implemented a parallel scheduler for HPX based on the P2079R10 proposal. The parallel scheduler provides a shared execution context for parallel work, solving critical problems like oversubscription and poor composability in traditional thread-based parallelism. We show how HPX's work-stealing runtime naturally maps to the parallel scheduler model, delivering high performance and excellent scalability.

📖 Deep Dive: Want to understand the internal architecture and how scheduling actually works? Check out the Parallel Scheduler Internals page for a comprehensive explanation of the type hierarchy, execution policies, task scheduling flow, and domain transformation pipeline.

Introduction

Modern C++ applications need efficient parallel execution. For years, developers struggled with manual thread management, thread pools, and executors that didn't work well together. The C++ execution model has evolved significantly, especially with the introduction of senders and receivers in P2300. The parallel scheduler proposal (P2079R10) takes this further by defining a standard way to execute parallel work efficiently.

This article explains how we built a parallel scheduler for HPX that follows the P2079R10 design. We'll show you what problems it solves, how it works, and why it matters for real-world C++ applications.


Motivation

Traditional Thread-Based Parallelism

Before modern execution models, C++ developers had limited options for parallel execution:

// Old approach: Manual thread management
void process_data(std::vector<int>& data) {
  std::vector<std::thread> threads;
  int num_threads = std::thread::hardware_concurrency();
  
  for (int i = 0; i < num_threads; ++i) {
    threads.emplace_back([&, i]() {
      for (size_t j = i; j < data.size(); j += num_threads) {
        data[j] *= 2; // Process each element
      }
    });
  }
  
  for (auto& t : threads) t.join();
}

Major Problems:

  - Oversubscription: every component that spawns its own threads multiplies the total thread count
  - Manual chunking and load balancing that must be tuned by hand
  - Boilerplate for synchronization and error handling
  - Poor composability: code like this cannot cooperate with other parallel libraries

Why Earlier Executor Designs Failed

Early attempts at standardizing executors (like static_thread_pool) had limitations:

// Old static_thread_pool approach
static_thread_pool pool(4); // Fixed number of threads
auto ex = pool.executor();

// Problem 1: Each library creates its own pool
static_thread_pool library_a_pool(4);
static_thread_pool library_b_pool(4);
// Now you have 8 threads competing for 4 cores!

// Problem 2: Nested parallelism doesn't work well
for (int i = 0; i < 100; ++i) {
  ex.execute([&]() {
    // Inner parallel work blocks outer threads
    parallel_algorithm(data); // Deadlock risk!
  });
}

Key Issues:

  - Every library owns a private pool, so thread counts multiply across the process
  - Fixed thread counts cannot adapt to the machine or the workload
  - Nested parallelism can exhaust the pool and deadlock

The Evolution: Senders and Receivers

The C++ execution model evolved with P2300, introducing senders and receivers. This provided better composability, but still lacked a standard parallel execution context:

// P2300 senders/receivers - better, but still missing something
auto work = stdexec::just(42)
  | stdexec::then([](int x) { return x * 2; })
  | stdexec::bulk(1000, [](int i, int val) { /* work */ });

// Question: WHERE does this bulk work execute?
// Answer: We need a parallel scheduler!

What Was Still Missing:

  - A standard, shared execution context that answers "where does parallel work run?"
  - A forward progress guarantee strong enough for nested parallel algorithms
  - A way for every library in a process to share one pool instead of inventing its own


The Parallel Scheduler Concept (P2079R10)

Core Idea

P2079R10 proposes a parallel scheduler that provides a shared execution context for parallel work. Think of it as a system-wide thread pool that all components can use:

// Get the system's parallel scheduler
auto sched = stdexec::get_parallel_scheduler();

// Use it for parallel work
auto work = stdexec::schedule(sched)
  | stdexec::bulk(10000, [](int i) { /* parallel work */ });

// All libraries share the same execution context!

Key Design Goals

  1. Avoid Oversubscription: Single shared thread pool prevents too many threads
  2. Parallel Forward Progress: Guarantees that parallel work makes progress without deadlocks
  3. System Integration: Can use OS thread pools (like Windows thread pool, Grand Central Dispatch)
  4. Portability: Works across different platforms and hardware
  5. Replaceability: Applications can provide custom implementations

What is "Parallel Forward Progress"?

This is a key concept from P2079R10. It means that when you submit parallel work, the scheduler guarantees that all tasks will eventually make progress. This prevents deadlocks from nested parallelism:

// Nested parallelism - outer parallel work calls inner parallel work
auto sched = stdexec::get_parallel_scheduler();

stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(100, [sched](int i) {
        // Outer parallel work
        auto inner = stdexec::schedule(sched)
          | stdexec::bulk(50, [](int j) { /* inner work */ });
        stdexec::sync_wait(std::move(inner));
      }));

// With parallel forward progress: No deadlock!
// The scheduler ensures inner work can make progress
// even when outer work is using all threads

How It Fits with Senders/Receivers

The parallel scheduler integrates seamlessly with the P2300 sender/receiver model. It's just another scheduler, but with special properties that make it suitable for parallel algorithms:

// Parallel scheduler is a regular scheduler
auto par_sched = stdexec::get_parallel_scheduler();

// Works with all sender algorithms
auto work = stdexec::schedule(par_sched)
  | stdexec::then([]() { return load_data(); })
  | stdexec::bulk(1000, [](int i, auto data) { process(i, data); })
  | stdexec::then([](auto data) { return finalize(data); });

stdexec::sync_wait(std::move(work));

Our Implementation Using HPX

Why HPX?

HPX (High Performance ParalleX) is a C++ runtime system with a sophisticated work-stealing scheduler. It's a perfect fit for implementing the parallel scheduler concept because:

  - Its global thread pool already provides a shared execution context
  - Its lightweight user-level threads can be suspended and resumed cheaply, which is exactly what parallel forward progress requires
  - Its work-stealing scheduler balances load across cores automatically
  - It has been proven to scale to thousands of cores

Core Architecture

Our implementation has three main components:

// 1. parallel_scheduler - The main scheduler interface
class parallel_scheduler {
  std::shared_ptr<parallel_scheduler_backend> backend_;
public:
  // Standard scheduler interface
  auto schedule() const noexcept;
};

// 2. parallel_scheduler_backend - Abstract backend interface
class parallel_scheduler_backend {
public:
  virtual ~parallel_scheduler_backend() = default;
  
  // Core operations
  virtual void schedule(/*...*/) = 0;
  virtual void schedule_bulk_chunked(/*...*/) = 0;
  virtual void schedule_bulk_unchunked(/*...*/) = 0;
};

// 3. hpx_parallel_scheduler_backend - HPX implementation
class hpx_parallel_scheduler_backend : public parallel_scheduler_backend {
  thread_pool_policy_scheduler underlying_; // HPX work-stealing scheduler
};

This layered design provides replaceability - applications can provide custom backends while maintaining the same interface.

How HPX Maps to the Parallel Scheduler Model

HPX's features naturally align with P2079R10 requirements:

P2079R10 Requirement        | HPX Feature
----------------------------|-------------------------------------------
Shared execution context    | Global thread pool with work-stealing
Parallel forward progress   | Lightweight threads + task suspension
Efficient bulk operations   | Optimized chunking strategies
Scalability                 | Proven to scale to thousands of cores

Key Implementation Details

1. Scheduler Abstraction

// Users get the parallel scheduler through a factory function
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Internally, this creates a scheduler wrapping HPX's thread pool
inline parallel_scheduler get_parallel_scheduler() {
  static auto backend = query_parallel_scheduler_backend();
  return parallel_scheduler{backend};
}

2. Task Submission

// When you schedule work, it goes through the backend
auto sender = sched.schedule(); // Returns a sender

// For bulk operations, we use optimized HPX bulk execution
auto bulk_work = sender | stdexec::bulk(10000, [](int i) {
  // This work runs on HPX's thread pool
  // with work-stealing and chunking optimizations
});

3. Domain Customization for Performance

// The parallel_scheduler has a custom domain
// that intercepts bulk operations and optimizes them
struct parallel_scheduler_domain {
  template<bulk_sender Sender>
  auto transform_sender(Sender&& snd) const {
    // Transform to HPX's optimized bulk sender
    // Uses work-stealing, chunking, NUMA awareness
    return hpx_optimized_bulk_sender{/*...*/};
  }
};

Before vs After: Key Changes

Let's compare how parallel execution works before and after implementing the parallel scheduler:

Aspect             | Before (Manual Threads)                                     | After (Parallel Scheduler)
-------------------|-------------------------------------------------------------|---------------------------------------------------
Resource Usage     | Each library creates its own thread pool → oversubscription | Shared thread pool → optimal resource usage
Composability      | Poor - different systems don't work together                | Excellent - all use the same execution context
Nested Parallelism | Deadlock prone, thread pool exhaustion                      | Handled correctly with forward progress guarantee
Abstraction Level  | Low - manual thread management                              | High - declarative sender/receiver model
Scalability        | Limited by fixed thread counts                              | Scales to thousands of cores with work-stealing
Performance        | Inconsistent, depends on manual tuning                      | Optimized automatically with chunking & work-stealing

Detailed Comparison: Resource Usage

Before (Manual Thread Management):

// Library A creates its own thread pool
std::vector<std::thread> library_a_threads(4);

// Library B creates its own thread pool
std::vector<std::thread> library_b_threads(4);

// Your application creates threads too
std::vector<std::thread> app_threads(4);

// Result: 12 threads competing for 4 cores!
// Massive context switching overhead

After (Parallel Scheduler):

// Everyone uses the same parallel scheduler
auto sched = stdexec::get_parallel_scheduler();

// Library A uses it
stdexec::schedule(sched) | library_a_work();

// Library B uses it
stdexec::schedule(sched) | library_b_work();

// Your application uses it
stdexec::schedule(sched) | app_work();

// Result: Shared thread pool with optimal thread count
// Work-stealing ensures efficient load balancing

Detailed Comparison: Nested Parallelism

Before (Deadlock Risk):

static_thread_pool pool(4); // Only 4 threads

// Outer parallel work uses all 4 threads
for (int i = 0; i < 4; ++i) {
  pool.execute([&]() {
    // Inner parallel work needs threads too
    for (int j = 0; j < 4; ++j) {
      pool.execute([&]() { inner_work(); });
    }
    wait_for_inner_work(); // DEADLOCK! No threads available
  });
}

After (Parallel Forward Progress):

auto sched = stdexec::get_parallel_scheduler();

// Outer parallel work
stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(4, [sched](int i) {
        // Inner parallel work
        auto inner = stdexec::schedule(sched)
          | stdexec::bulk(4, [](int j) { inner_work(); });
        stdexec::sync_wait(std::move(inner));
        // No deadlock! HPX suspends the outer task and runs the inner work
      }));

Abstraction Level Improvement

The parallel scheduler raises the abstraction level significantly. Instead of managing threads manually, you work with high-level sender/receiver operations:

Before (Low-Level):

// Manual thread management, synchronization, error handling
std::vector<std::thread> threads;
std::mutex mtx;
std::exception_ptr error;

for (int i = 0; i < num_threads; ++i) {
  threads.emplace_back([&, i]() {
    try {
      process_chunk(i);
    } catch (...) {
      std::lock_guard lock(mtx);
      error = std::current_exception();
    }
  });
}

for (auto& t : threads) t.join();
if (error) std::rethrow_exception(error);

After (High-Level):

// Declarative, composable, automatic error propagation
auto sched = stdexec::get_parallel_scheduler();

stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(num_tasks, [](int i) { process_chunk(i); }));

// Errors automatically propagated, resources automatically managed

How It Works in Practice

Basic Usage Workflow

Step 1: Get the Parallel Scheduler

// Get the system's parallel scheduler
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// This returns a shared scheduler backed by HPX's thread pool
// All calls to get_parallel_scheduler() return the same logical scheduler

Step 2: Submit Parallel Work

// Create a sender that schedules work on the parallel scheduler
auto work = stdexec::schedule(sched)
  | stdexec::bulk(10000, [](int i) {
       // Process element i
       process_element(i);
     });

// The work is described but not yet executed

Step 3: Execute and Synchronize

// Execute the work and wait for completion
stdexec::sync_wait(std::move(work));

// Or chain more operations before executing
auto result = stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(1000, process_fn)
    | stdexec::then([]() { return finalize(); }));

Real-World Usage Scenarios

Scenario 1: CPU-Bound Data Processing

// Process a large dataset in parallel
std::vector<double> data(1000000);
auto sched = hpx::execution::experimental::get_parallel_scheduler();

stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(data.size(), [&data](size_t i) {
        // Complex computation on each element
        data[i] = std::sqrt(data[i] * data[i] + 1.0);
      }));

// HPX automatically chunks the work and distributes it efficiently

Scenario 2: Parallel Algorithms

// Use parallel scheduler with standard algorithms
auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Parallel transform
stdexec::sync_wait(
  stdexec::schedule(sched)
    | stdexec::bulk(input.size(), [&](size_t i) {
        output[i] = transform_fn(input[i]);
      }));

Scenario 3: Large-Scale Scientific Computing

// Simulate particles in parallel
struct Particle { /* ... */ };
std::vector<Particle> particles(1000000);

auto sched = hpx::execution::experimental::get_parallel_scheduler();

// Run simulation steps
for (int step = 0; step < 1000; ++step) {
  stdexec::sync_wait(
    stdexec::schedule(sched)
      | stdexec::bulk(particles.size(), [&](size_t i) {
          particles[i].update_position();
          particles[i].update_velocity();
        }));
}

// Scales efficiently to thousands of cores

Under the Hood: How Scheduling is Implemented

HPX's parallel_scheduler is a thin domain-aware wrapper around the existing thread_pool_policy_scheduler<hpx::launch>. Rather than building an entirely new execution backend, it reuses HPX's mature thread pool infrastructure and plugs into stdexec's domain-based sender transformation pipeline.

Class Hierarchy & Object Model

Here is a breakdown of how the different components relate to each other:

// The parallel_scheduler wrapper
parallel_scheduler
  └── shared_ptr<parallel_scheduler_backend>   (type-erased, heap-allocated)

// The abstract backend interface
parallel_scheduler_backend (abstract)
  ├── schedule(receiver_proxy&)
  ├── schedule_bulk_chunked(n, bulk_proxy&)
  ├── schedule_bulk_unchunked(n, bulk_proxy&)
  ├── equal_to(other&) → bool
  ├── get_underlying_scheduler() → const thread_pool_policy_scheduler*
  └── get_pu_mask() → const mask_type*

// The concrete default implementation
hpx_parallel_scheduler_backend (concrete default)
  └── wraps thread_pool_policy_scheduler + mask_type

// The factory function
get_parallel_scheduler()
  └── queries query_parallel_scheduler_backend()
  └── wraps the returned shared_ptr in a parallel_scheduler

// Operation state for connecting senders and receivers
operation_state<Receiver>
  ├── receiver_
  ├── backend_  (shared_ptr to backend)
  ├── proxy_    (concrete_receiver_proxy, member — not local)
  └── start() → backend_->schedule(proxy_)

// Scheduler equality comparison
operator== → backend_->equal_to(*other.backend_)

Execution Flow: How Policies Are Handled

When you use the parallel scheduler, here is the exact flow of how policies (like bulk dispatching) are implemented under the hood:

get_parallel_scheduler()
    |
    v
parallel_scheduler  (owns thread_pool_policy_scheduler + cached PU mask)
    |
    |-- schedule() --> sender<parallel_scheduler>
    |                    |-- env exposes: get_completion_scheduler<set_value_t>
    |                    |                get_domain --> parallel_scheduler_domain
    |                    |-- connect() --> operation_state
    |                                        |-- start() --> checks stop_token
    |                                                        delegates to thread_pool.execute()
    |
    |-- bulk / bulk_chunked / bulk_unchunked
         |-- parallel_scheduler_domain::transform_sender()
              |-- extracts underlying thread_pool_policy_scheduler
              |-- creates thread_pool_bulk_sender<..., IsChunked, IsParallel>
                   |-- uses work-stealing index queues
                   |-- NUMA-aware thread placement
                   |-- main-thread participation

Sequential vs Parallel Execution Policies (seq / par)

A crucial part of scheduling tasks effectively is deciding whether execution should happen sequentially (hpx::execution::seq) or in parallel (hpx::execution::par). Within the parallel_scheduler infrastructure, this choice maps onto the IsParallel template parameter of thread_pool_bulk_sender: par selects the work-stealing parallel path across all worker threads, while seq runs the iterations in order on a single worker.


Why This Matters

Performance Benefits

1. Eliminates Oversubscription

By sharing a single thread pool across all libraries and components, the parallel scheduler prevents the performance degradation caused by too many threads competing for CPU cores.

2. Work-Stealing Efficiency

HPX's work-stealing scheduler automatically balances load across threads. When one thread finishes its work, it steals tasks from busy threads.

3. Optimized Bulk Operations

The parallel scheduler uses intelligent chunking strategies to minimize overhead.

Maintainability Benefits

Simpler Code

The high-level sender/receiver model eliminates boilerplate.

Better Composability

Different libraries can work together seamlessly.

Portability Benefits

Platform Independence

The parallel scheduler abstraction works across different platforms.

Alignment with Future C++ Concurrency

The parallel scheduler is part of the evolution toward structured concurrency in C++.

By implementing the parallel scheduler now, HPX users are ready for the future of C++ concurrency.


Design Decisions and Replaceability

Why Replaceability Matters

P2079R10 emphasizes that the parallel scheduler should be replaceable. This means applications can provide their own implementation while maintaining the same interface.

Our Replaceability Architecture

We implemented replaceability through a backend abstraction layer:

// Abstract backend interface
class parallel_scheduler_backend {
public:
  virtual ~parallel_scheduler_backend() = default;
  
  // Core scheduling operations
  virtual void schedule(/*...*/) = 0;
  virtual void schedule_bulk_chunked(/*...*/) = 0;
  virtual void schedule_bulk_unchunked(/*...*/) = 0;
  
  // Equality and introspection
  virtual bool equal_to(const parallel_scheduler_backend*) const = 0;
};

// Default HPX implementation
class hpx_parallel_scheduler_backend : public parallel_scheduler_backend {
  thread_pool_policy_scheduler underlying_;
public:
  // Implements all virtual methods using HPX's thread pool
};

How to Provide a Custom Backend

Applications can replace the default backend:

// 1. Implement your custom backend
class my_custom_backend : public parallel_scheduler_backend {
  // Implement all virtual methods
  // Could use a different thread pool, GPU executor, etc.
};

// 2. Set your backend as the factory
hpx::execution::experimental::set_parallel_scheduler_backend_factory(
  []() { return std::make_shared<my_custom_backend>(); }
);

// 3. Now all calls to get_parallel_scheduler() use your backend
auto sched = hpx::execution::experimental::get_parallel_scheduler();
// Uses my_custom_backend internally!

Benefits of This Design

1. Flexibility

A custom backend can target different execution resources - a different thread pool, a GPU executor, or an OS-provided scheduler - without changing any user code.

2. Testability

Tests can inject a deterministic or instrumented backend to verify scheduling behavior in isolation.

3. Platform Integration

Platform vendors can route work to native facilities such as the Windows thread pool or Grand Central Dispatch, as P2079R10 envisions.

Key Design Decisions

Shared Pointer for Backend

The parallel scheduler stores a shared_ptr<parallel_scheduler_backend>. This allows:

  - Cheap copies: scheduler objects are small handles that share one backend
  - Type erasure: the concrete backend type stays hidden behind a stable interface
  - Lifetime safety: the backend outlives every scheduler copy that refers to it

Virtual Interface for Operations

Using virtual methods allows runtime polymorphism: the backend can be selected or replaced at runtime without recompiling user code, and the virtual dispatch cost is paid once per submission rather than once per element.

Domain Customization for Performance

The parallel scheduler has a custom domain that optimizes bulk operations: transform_sender intercepts bulk, bulk_chunked, and bulk_unchunked and lowers them to HPX's thread_pool_bulk_sender, which applies chunking, work-stealing, and NUMA-aware placement automatically.


Best Practices

Using the Parallel Scheduler

Do:

  - Obtain the scheduler through get_parallel_scheduler() so every component shares the same execution context
  - Use bulk for data-parallel loops and let HPX handle chunking and load balancing
  - Compose work with sender adaptors; errors and cancellation propagate automatically

Don't:

  - Create private thread pools alongside the parallel scheduler - that reintroduces oversubscription
  - Block worker threads with raw joins, sleeps, or busy waits

Performance Tips

Do:

  - Give each bulk iteration enough work to amortize scheduling overhead
  - Reuse one scheduler handle instead of re-querying it inside hot loops

Don't:

  - Launch one tiny schedule() task per element; use bulk so iterations are chunked
  - Hard-code thread counts; the shared pool already matches the machine


Conclusion

The parallel scheduler represents a major step forward in C++ parallel programming. By providing a shared execution context with parallel forward progress guarantees, it solves fundamental problems that have plagued parallel C++ code for years.

What we've achieved:

  - A P2079R10-style parallel scheduler built on HPX's work-stealing runtime
  - A shared execution context that eliminates oversubscription across libraries
  - Deadlock-free nested parallelism through the parallel forward progress guarantee
  - Optimized bulk execution via domain-based sender transformation
  - A replaceable backend architecture for custom execution contexts

Our HPX-based implementation demonstrates that the P2079R10 parallel scheduler concept is practical and delivers real performance benefits. The combination of HPX's proven runtime with the standard sender/receiver model creates a powerful foundation for modern C++ parallel programming.

As C++ continues to evolve toward structured concurrency, the parallel scheduler will play a central role. By adopting it now, you're preparing your codebase for the future of C++ while gaining immediate performance and maintainability benefits.


Special Thanks

My deepest thanks go to Hartmut Kaiser and Isidoros Tsaousis-Seiras for their mentorship and invaluable insights throughout this project.


References

  1. [P2079R10] System Execution Context - Parallel Scheduler Proposal
  2. [P2300R10] std::execution - Senders and Receivers
  3. [HPX] HPX - High Performance ParalleX Runtime System
  4. [stdexec] NVIDIA stdexec - Reference Implementation
  5. [Project #6655] HPX Parallel Scheduler Implementation