Breaking I/O bottlenecks in scientific workflows: a new EPCC–University of Turin collaboration

13 November 2025

Our new paper, “Overcoming Dynamic I/O Boundaries: a Double-Sided Streaming Methodology with dispel4py and CAPIO”, will be presented at WORKS 2025, part of the SC25 Workshops series.

How this collaboration started

This work began earlier this year, when I was invited to give a keynote at PDP 2025 in Turin, Italy, hosted by Prof. Marco Aldinucci’s group at the University of Turin. After the keynote we had a long technical discussion, particularly with one of Marco’s PhD students, Marco Edoardo Santimaria. That led to follow-up meetings, which led to prototypes, which led to a summer of remote collaboration, and finally to this paper.

And this is only the beginning. Marco Edoardo will visit EPCC from January to April 2026 to continue this research here in Edinburgh.

What problem are we solving?

Most scientific workflows have many steps. A classic pattern looks like this: 

Step A writes thousands of files → wait → Step B reads them → next steps continue. 

The “wait” is the bottleneck. Even if 50 files are already complete, many systems will wait until the final file has finished writing before the next step starts. This pause is called an I/O boundary, and it wastes time, especially in data-heavy HPC workloads.
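To make the cost of the boundary concrete, here is a minimal timing sketch. It assumes one producer and one consumer, and the per-file times are made-up illustrative values, not measurements from the paper. It compares waiting for every file with starting on each file as soon as it is written.

```python
def barrier_makespan(produce_times, consume_times):
    """Step B starts only after Step A has written every file."""
    return sum(produce_times) + sum(consume_times)


def streaming_makespan(produce_times, consume_times):
    """Step B handles each file as soon as it is written (one producer, one consumer)."""
    written_at = 0.0        # time at which the current file is fully written
    consumer_free_at = 0.0  # time at which the consumer finishes its previous file
    for produce, consume in zip(produce_times, consume_times):
        written_at += produce
        start = max(written_at, consumer_free_at)
        consumer_free_at = start + consume
    return consumer_free_at


# Hypothetical workload: 4 files, 1 time unit to write and 1 to process each.
produce = [1.0] * 4
consume = [1.0] * 4
print(barrier_makespan(produce, consume))    # 8.0
print(streaming_makespan(produce, consume))  # 5.0
```

In this toy model the first result is available after 2 time units with streaming, compared with 5 when the boundary is enforced.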

Two technologies, one new idea

Our approach builds on two existing systems: dispel4py and CAPIO.

In short, dispel4py is strong on the control side (when to run the next step), while CAPIO is strong on the data side (making file I/O non-blocking). Until now, these systems were not used together.

The real limitation

Streaming inside dispel4py works well for item-by-item (map-like) stages. But some workflow stages are progressive all-pairs: they must combine each new result with all results seen so far (e.g. streaming cross-correlation, incremental clustering, pairwise graph construction). That pattern isn’t item-granular: it needs a materialised set that grows over time. The simplest correct way to express that set in practice is to split the workflow: let the first pipeline produce a directory of results, and let the second consume that directory.
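The progressive all-pairs pattern can be sketched in a few lines of plain Python. This is an illustration of the access pattern only, not dispel4py code; the function name and the station labels are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def progressive_pairs(stream: Iterable[T]) -> Iterator[Tuple[T, T]]:
    """Pair each newly arrived item with every item seen so far."""
    seen: List[T] = []  # the materialised set that grows over time
    for item in stream:
        for earlier in seen:
            yield (earlier, item)
        seen.append(item)


# Hypothetical station identifiers arriving one at a time.
stations = ["A", "B", "C"]
print(list(progressive_pairs(stations)))
# [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

The `seen` list is the reason the stage cannot be purely item-by-item: every new arrival needs access to all earlier results, which is what the directory of files provides once the workflow is split.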

And here’s the barrier: POSIX gives a reader no standard way to tell which files in a directory are complete, or whether more are still to come. Without CAPIO, the downstream step therefore can’t safely consume new elements until all files are written, creating a hard synchronisation wall that breaks the streaming.

What CAPIO changes

CAPIO removes that wall. It exposes file content and directory growth incrementally while they’re still being produced, and marks files safe-to-read before the final close. As a result, dispel4py can treat a directory of outputs like a stream, just as it treats in-memory messages.
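To illustrate what treating a directory like a stream means, here is a rough application-level emulation in plain Python. It is not CAPIO's API or mechanism. It assumes a convention invented for this sketch: the producer writes each file under a temporary `.tmp` name and renames it when complete, then publishes a `_DONE` sentinel, and the consumer polls the directory. All names (`publish`, `stream_directory`, `_DONE`, `trace_*.dat`) are hypothetical.

```python
import os
import tempfile
import threading
import time
from pathlib import Path
from typing import Iterator, List


def publish(directory: Path, name: str, data: bytes) -> None:
    """Write under a temporary name, then rename, so readers never see a partial file."""
    tmp = directory / (name + ".tmp")
    tmp.write_bytes(data)
    os.replace(tmp, directory / name)


def stream_directory(directory: Path, sentinel: str = "_DONE",
                     poll: float = 0.05) -> Iterator[Path]:
    """Yield each finished file once, until the producer publishes the sentinel."""
    seen = set()
    while True:
        # Check the sentinel before listing, so files published before it are not missed.
        done = (directory / sentinel).exists()
        for path in sorted(directory.iterdir()):
            if path.name == sentinel or path.suffix == ".tmp" or path.name in seen:
                continue
            seen.add(path.name)
            yield path
        if done:
            return
        time.sleep(poll)


def demo() -> List[str]:
    """Run a producer thread and consume its files while it is still writing."""
    with tempfile.TemporaryDirectory() as tmpdir:
        out = Path(tmpdir)

        def producer() -> None:
            for i in range(3):
                publish(out, f"trace_{i}.dat", b"x")
                time.sleep(0.01)
            publish(out, "_DONE", b"")

        worker = threading.Thread(target=producer)
        worker.start()
        names = [path.name for path in stream_directory(out)]
        worker.join()
        return names


print(demo())
```

The sketch requires both sides to agree on the naming convention and relies on polling. The point of the approach described above is that CAPIO provides the incremental view itself, so the workflow steps do not have to be rewritten around such a protocol.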

Our contribution (as described in the WORKS paper) is to make dispel4py and CAPIO cooperate so streaming continues through the file boundary. We call this double-sided streaming.

Why this matters

Workflows must often write files mid-pipeline for logging, provenance, debugging, or publishing intermediate artefacts. Our solution does not remove materialisation; it removes the forced pause that used to come with it.

We validated this solution on a real seismic workflow that cross-correlates seismic traces between stations to study patterns related to earthquakes and volcanic activity. 

With our new double-sided streaming, Phase 2 (cross-correlation) starts as soon as the first valid files from Phase 1 (preprocessing) appear. This reduced total runtime by 23–40% and, more importantly for users, first results started appearing within seconds instead of only after the whole preprocessing phase had finished. CAPIO overlaps reading ↔ writing; dispel4py overlaps scheduling ↔ computing, eliminating the artificial wait between phases.

What comes next?

Our collaboration will continue, including during Marco Edoardo’s four-month visit to EPCC in early 2026. Next steps include:

  • Tighter integration between CAPIO and dispel4py
  • Exploring CAPIO’s in-memory mode for even higher performance
  • Applying this approach to other scientific domains and workflow patterns

This is a nice example of how a research visit to give a talk led to a new international collaboration, and to a concrete technical result with real performance gains for large-scale scientific workflows.

Links

SC25 presentation: Overcoming Dynamic I/O Boundaries: a Double-Sided Streaming Methodology with dispel4py and CAPIO

Related paper: Large-scale HPC Approaches and Applications on Highly Distributed Platforms

Author

Dr Rosa Filgueira