When building integrations, you have to deal with the difficulties of distributed systems: APIs that are sometimes unavailable, transient failures, rate limiting. Integrations are usually made up of multiple steps, each of which can fail, and each of which might have side effects. If the process crashes halfway through, you need some way to pick up where you left off without re-executing the steps that already succeeded.

This is the problem that durable execution solves: persist the result of each step, so that on recovery the engine restores completed work from storage instead of re-executing it. Your workflow becomes crash-proof.

Why it matters for integrations

Integration pipelines are a sequence of steps that talk to external systems: webhooks, API calls, database queries. Each one is a network boundary where things can go wrong.

Without durability, a failure at step five of a six-step pipeline means you either re-run the whole thing (and hope running steps one through four again doesn't cause problems) or you build your own checkpointing logic.

Durable execution gives you checkpointing as a primitive. Each step persists its result before moving on. If the process dies, the next run skips what's already done and picks up from the failure point.
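The idea can be sketched in a few lines (this is an illustrative model, not memable's implementation): a step consults a persisted store keyed by name before doing any work, and a re-run returns the stored result instead of executing again. A `HashMap` stands in for durable storage here.

```rust
use std::collections::HashMap;

// Illustrative checkpoint store keyed by step name. A real engine
// persists this map to disk; a HashMap stands in for it here.
struct Checkpoints {
    store: HashMap<String, String>,
}

impl Checkpoints {
    fn new() -> Self {
        Checkpoints { store: HashMap::new() }
    }

    // Run `work` only if no result is recorded for `key`; otherwise
    // return the persisted result without re-executing the step.
    fn step(&mut self, key: &str, work: impl FnOnce() -> String) -> String {
        if let Some(cached) = self.store.get(key) {
            return cached.clone();
        }
        let result = work();
        self.store.insert(key.to_string(), result.clone());
        result
    }
}

fn main() {
    let mut cp = Checkpoints::new();
    let mut executions = 0;

    // First run: the step executes and its result is persisted.
    cp.step("extract:v1", || { executions += 1; "data".to_string() });

    // Simulated recovery: the same step is skipped, the cached value returned.
    let replayed = cp.step("extract:v1", || { executions += 1; "data".to_string() });

    assert_eq!(executions, 1);
    assert_eq!(replayed, "data");
    println!("executions = {executions}");
}
```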

Why it matters for agents

LLM calls are expensive, slow, and non-deterministic. Tool calls have side effects: sending emails, writing to databases, charging credit cards. An agent loop might run dozens of steps over several minutes.

If your agent crashes after a planning step, you don't want to re-run the same prompt and pay for the same tokens again. And if a tool call already executed, you definitely don't want it to run twice.

Durable execution solves both problems. The LLM response is cached. The tool call result is cached. On recovery, those steps return their stored values instantly. Only the incomplete step re-executes.

The platform problem

Temporal and Restate are the two established tools here. Temporal gives you a full workflow orchestration platform with its own server, workers, and SDK. Restate takes a lighter approach with a server that sits next to your service and journals your function invocations. Both support multiple languages, have sophisticated scheduling, and handle distributed coordination across services.

But they're platforms. Temporal means running a Temporal server (or paying for their cloud). Restate means running a Restate server alongside your service. Both bring their own deployment story, their own operational burden, and their own opinions about how your system should be structured.

For a team running dozens of microservices that need cross-service orchestration, that overhead is worth it. But I was building something simpler: single-process Rust services that run integration pipelines and agent workflows. I didn't need distributed coordination. I needed crash recovery and checkpointing, and I didn't want to deploy a platform to get it.

Replay and the versioning problem

Both Temporal and Restate use replay-based recovery that depends on deterministic re-execution. The engine re-runs your workflow function from the top, matching each step to its recorded result by its position in the event history. Step one returns result one, step two returns result two, and so on.

This works well until you need to change your code while workflows are in flight. Insert a new step between steps two and three? Every running workflow's history is now misaligned. Reorder steps? Same problem. You need versioning strategies, migration logic, or careful deployment choreography to avoid breaking active workflows.

Temporal handles this with workflow versioning and patching APIs. Restate has its own approach to compatible changes. You have to think about replay compatibility every time you change a workflow.
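The misalignment is easy to see in a toy model (illustrative only, not either engine's real history format): positional replay hands each step the result recorded at its index, so inserting a step shifts every subsequent lookup by one.

```rust
fn main() {
    // Recorded history from v1 of the workflow, results by position.
    // v1 steps were: ["fetch", "transform", "store"]
    let history = vec!["fetched-data", "cleaned-data", "row-id-7"];

    // v2 inserts a "validate" step between fetch and transform.
    let v2_steps = ["fetch", "validate", "transform", "store"];

    // Positional replay gives step i the result recorded at position i.
    for (i, step) in v2_steps.iter().enumerate() {
        match history.get(i) {
            Some(result) => println!("{step} <- {result}"),
            None => println!("{step} <- (no recorded result)"),
        }
    }
    // "validate" receives "cleaned-data" and "transform" receives "row-id-7":
    // every step after the insertion reads the wrong cached value.
}
```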

Keys, not positions

memable takes a different approach. Instead of matching steps by position, it matches them by key.

let data: Vec<Record> = ctx
    .step("extract:v1")
    .run(async || {
        fetch_from_api().await
    })
    .await?;

let cleaned = ctx
    .step("transform:v1")
    .run(async || {
        normalise(&data)
    })
    .await?;

Each step has a name you choose. On recovery, the engine looks up what's already done by key and returns the cached result. It doesn't care what order your code runs in.

This means you can deploy new code while workflows are in flight:

  • Reorder steps: the engine matches by key, not position
  • Insert a new step: it has a new key, so it runs fresh
  • Bump a version: change "transform:v1" to "transform:v2" and the old cached result is ignored; the step re-executes
  • Remove a step: the cached result sits there harmlessly in storage

No replay. No versioning API. No migration ceremonies.

A library, not a platform

memable is a Rust crate. You add it to Cargo.toml and it runs inside your process.

let mut engine = Engine::builder()
    .open("workflows.redb")?
    .build();

engine.register("sync-pipeline", sync_pipeline);
engine.start().await?;

engine.invoke("sync-pipeline").await?;

Storage is redb, an embedded key-value store written in pure Rust. No C dependencies, no external database, no network calls to a coordinator. Your workflow state lives in a single file next to your application.

For tests, swap in the in-memory backend:

let engine = Engine::builder().in_memory().build();

Same API, no disk, no cleanup.

Suspend and resume

Some workflows need to wait for something external. A human approval, a webhook, a callback from a third-party service. You don't want to hold a task in memory while you wait.

async fn approval_workflow(ctx: Context) -> Result<(), EngineError> {
    let record_count: u32 = ctx
        .step("fetch-data:v1")
        .run(async || {
            fetch_from_source().await
        })
        .await?;

    let approved: bool = ctx
        .suspend("approval:v1")
        .status("Waiting for manager approval")
        .await?;

    if !approved {
        return Ok(());
    }

    ctx.step("process-data:v1")
        .run(async || {
            process(record_count).await
        })
        .await?;

    Ok(())
}

ctx.suspend() drops the workflow entirely. Nothing held in memory. When the signal arrives (from an HTTP handler, a CLI command, whatever), the engine writes the payload to storage and re-runs the workflow. Memoised steps return instantly, the suspend step resolves with the payload, and execution continues.

The signal delivery is one line:

engine.signal("approval", &instance_id, "approval:v1", true).await?;

Durable timers

Same mechanism, but for time-based waits. ctx.timer() suspends the workflow until a deadline, then a background poller auto-resumes it.

ctx.timer("cooldown:v1", Duration::from_secs(300))?;

No task held in memory. No external scheduler. If the process restarts, the poller picks up the expired timer and fires it.
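A rough model of the mechanism (illustrative, not memable's internals): the engine persists a deadline record, drops the task, and a poller compares stored deadlines against the clock on each tick. Because the record is in storage rather than memory, an expired timer is still visible after a restart.

```rust
use std::time::{Duration, SystemTime};

// Illustrative persisted timer record: in a real engine this row lives
// in storage, so it survives process restarts.
struct TimerRecord {
    key: String,
    fires_at: SystemTime,
}

// The poller's check on each tick: has this timer's deadline passed?
fn is_due(timer: &TimerRecord, now: SystemTime) -> bool {
    now >= timer.fires_at
}

fn main() {
    let now = SystemTime::now();
    let timer = TimerRecord {
        key: "cooldown:v1".to_string(),
        fires_at: now + Duration::from_secs(300),
    };

    // Immediately after creation the timer is not due...
    assert!(!is_due(&timer, now));

    // ...but a poller running after the deadline, even in a freshly
    // restarted process, sees it as expired and resumes the workflow.
    assert!(is_due(&timer, now + Duration::from_secs(301)));
    println!("timer {} fires after its deadline", timer.key);
}
```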

What it looks like for an agent

To give you an idea of how this might look for an AI agent, here's an example that plans, searches, summarises, and writes a report. Each step gets a dynamic key based on the content it's processing. Notice that the durable execution machinery doesn't complicate the code: you write the workflow, and the engine handles checkpointing intermediate results and retrying flaky steps.

async fn research_agent(ctx: Context) -> Result<(), EngineError> {
    let plan: Plan = ctx
        .step("plan:v1")
        .run(async || {
            llm("What topics should we cover?").await
        })
        .await?;

    for topic in &plan.topics {
        let results = ctx
            .step(&format!("search:{topic}:v1"))
            .run(async || {
                web_search(topic).await
            })
            .await?;

        ctx.step(&format!("summarise:{topic}:v1"))
            .run(async || {
                llm(&format!("Summarise: {results}")).await
            })
            .await?;
    }

    ctx.step("report:v1")
        .run(async || {
            llm("Write the final report").await
        })
        .await?;

    Ok(())
}

If this crashes at step 30 of 40, the resume re-runs the function from the top. Steps 1 through 29 return their cached results in microseconds. Only step 30 actually executes. No wasted LLM calls, no repeated side effects.

The keys being strings rather than positions means the loop works naturally. Each iteration gets its own key. Add a new topic to the plan and it runs fresh while everything else stays cached.
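To make the loop behaviour concrete, here's a self-contained simulation (not the memable API) of per-topic keys: re-running the loop with an extra topic executes only the new iteration, because every previous key already has a cached result.

```rust
use std::collections::HashMap;

// Simulated keyed cache: run `work` for `key` only if no result is stored.
fn step(cache: &mut HashMap<String, String>, key: &str, work: impl FnOnce() -> String) -> String {
    cache.entry(key.to_string()).or_insert_with(work).clone()
}

// One pass over the topics, recording which keys actually executed.
fn run(cache: &mut HashMap<String, String>, topics: &[&str], executed: &mut Vec<String>) {
    for topic in topics {
        let key = format!("search:{topic}:v1");
        step(cache, &key, || {
            executed.push(key.clone());
            format!("results for {topic}")
        });
    }
}

fn main() {
    let mut cache = HashMap::new();
    let mut executed = Vec::new();

    // First run covers two topics.
    run(&mut cache, &["rust", "wasm"], &mut executed);

    // Crash, redeploy with a third topic, re-run from the top:
    // only the new key executes; the first two return cached results.
    run(&mut cache, &["rust", "wasm", "redb"], &mut executed);

    assert_eq!(executed.len(), 3);
    assert_eq!(executed[2], "search:redb:v1");
    println!("executed keys: {executed:?}");
}
```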

Trade-offs

This is early-stage software.

memable is single-process. There's no distributed coordination, no cross-service orchestration, no task queue across a fleet of workers. If you need that, Temporal and Restate are the right tools.

The storage is local. If your disk dies, your workflow state dies with it. For my use cases that's fine. For mission-critical production workflows, I'd probably still use Restate.

The API is still unstable. I'm using it in my own projects and iterating on the design, but it's not tested beyond my own side projects yet. If you're interested, check out memable.daz.is.