Hooks for Pipelines

I decided to write my own version of React, but for the purposes of building ETL backends for small to medium workloads.

A couple of weeks ago, Acko.net posted Reconcile All the Things and I read it with great interest. It's the second in a trilogy of articles (1, 2, 3) that lay out how React's hooks API can be used for things other than frontend web interfaces. That's because React is actually a runtime implemented in userland, one that lets you swap in different render targets. That's how you can have both React for the web and React Native for mobile on top of the same core runtime.

Acko wrote his own version of React and used it to declaratively program 3D scenes with WebGPU as the render target. In fact, he connects it to dataflow programming, which is not only used for graphics rendering, but also for streaming backends like Apache Spark, ETL systems like Apache Airflow, and interactive programs like Excel.

This is surprising. A framework for constructing frontends is not normally thought of as transferable to problems elsewhere in the stack.

I've been wanting a framework for chainable background workers for a long time. I kept running into this problem at various points in my career. Every time, the off-the-shelf tools were either a poor fit or overkill for the job. And each time, I would write something specific to the domain, only to run into the same problems in my homegrown solution: retries, backfilling, timeouts, versioning, observability, and maintainability. I've tried to write a dataflow programming DSL on four separate occasions, but the affordance of the API was always off.

So when a hooks-like API was connected with dataflow programming in Reconcile All the Things, it lit a lightbulb in my head that this might be the thing that bridges the gap.

While a hooks-like API won't solve every problem here, it does introduce two important concepts that a large swath of programmers are now comfortable with, often without realizing it: declarative programming and managed side effects (effects).

While the body of each hook is still imperative, the construction of the dataflow is declarative. With a hooks-like API, you get to program in immediate mode instead of retained mode. In retained mode, you have to tell the computer the exact steps to take to get from the previous state to the next state. In immediate mode, you just tell the computer what the next state should look like, and it figures out how to get there. This vastly simplifies frontend programming, and it's promising in its potential to simplify backend ETL and streaming systems.
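As a toy illustration (my own example, not code from any framework), here is the same running-totals computation written both ways:

```typescript
type Totals = Map<string, number>;

// Retained mode: the caller applies one delta at a time and must get
// every transition right (missing keys, removals, ordering, ...).
function applyDelta(totals: Totals, account: string, amount: number): void {
  totals.set(account, (totals.get(account) ?? 0) + amount);
}

// Immediate mode: describe the end state as a function of all inputs;
// there is no transition logic to get wrong.
function computeTotals(entries: Array<[string, number]>): Totals {
  const totals: Totals = new Map();
  for (const [account, amount] of entries) {
    totals.set(account, (totals.get(account) ?? 0) + amount);
  }
  return totals;
}
```

The retained version is fine until inputs can change out from under you; the immediate version makes the result a pure function of its inputs, which is the property the hooks API is built around.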

Managing state is a large source of complexity in programs. That's why you get the Hexagonal architecture, the Out of the Tar Pit paper, state monads, and the joke that one of the two hard things in computer science is cache invalidation (the other being naming things). All are arguments for developers to be deliberate in containing and minimizing the code that maintains state.

Managed side effects are how pure functional programming languages cope with state: the programmer is not allowed to call anything with a side effect directly; they can only request that the underlying runtime make the actual side-effecting call. React takes a middle road, inspired by the algebraic effects found in Eff and Koka. Side effects are declared where they are used, but the effect handler is left for the programmer to implement in a reusable and composable way.
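A toy sketch of the idea (my own names, not React's or Koka's actual machinery): the effect is declared as plain data, and a separately supplied handler decides how, or whether, to perform it.

```typescript
// An effect is declared as plain data; nothing runs when it is created.
type Effect =
  | { kind: "log"; message: string }
  | { kind: "write"; key: string; value: number };

// The handler decides how (and whether) to perform each effect, so it
// can be swapped out for testing, batching, retries, and so on.
function runEffects(effects: Effect[], handler: (e: Effect) => void): void {
  for (const effect of effects) {
    handler(effect);
  }
}
```

Because the effect is just a value, a test handler can record effects instead of performing them, and a production handler can wrap them in whatever cross-cutting policy it likes.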

Excited by the prospects, I decided to write my own version of React, but for the purposes of building ETL backends for small to medium workloads. For a use case, I wanted to ingest the borrow and supply APY rates from Compound's smart contracts on the Ethereum blockchain. This is the main component of the ingestion pipeline written in a hooks-like API (with the accompanying custom hooks):

export const ApyIngestor = props => {
  let { providerUri, tokenListPath, mainnetPath, mainnetAbiPath } = props;

  let web3 = useWeb3(providerUri);

  let tokenList = useFetchJson(tokenListPath);

  let mainnet = useFetchJson(mainnetPath);
  let mainnetAbi = useFetchJson(mainnetAbiPath);

  let contractGenerator = useCompoundContract(web3, mainnetAbi, mainnet);

  let blockHeader = useBlockHeader(web3);

  // chainId 1 is Ethereum mainnet, and Compound's cToken symbols start
  // with "c" (cDAI, cUSDC, ...), so anchor the regex to the first character.
  let filteredTokens =
    tokenList?.tokens
      ?.filter(tokenRecord => tokenRecord.chainId === 1)
      ?.filter(tokenRecord => /^c/.test(tokenRecord.symbol)) || [];

  return elem(
    Div,
    {
      filteredTokens: filteredTokens?.length,
      blockHeader: blockHeader?.number
    },
    () => {
      return filteredTokens.map((tokenRecord, i) => {
        return elem(TokenRates, {
          key: i,
          tokenRecord,
          contract: contractGenerator(tokenRecord.symbol),
          blockHeader
        });
      });
    }
  );
};
The main ingestion component of the pipeline

const TokenRates = props => {
  let { tokenRecord, contract, blockHeader } = props;

  const knex = useKnex(config.dev.db);

  const [borrowRate, supplyRate] = useBorrowAndSupplyRate(contract, blockHeader);

  let record: AnnualPercentageYieldRecord | null = null;
  if (blockHeader && tokenRecord && borrowRate && supplyRate) {
    record = {
      project_name: "compound",
      blocknumber: blockHeader.number,
      market_name: tokenRecord.symbol,
      data: {
        borrow_rate: borrowRate,
        supply_rate: supplyRate
      },
      block_at: new Date(blockHeader.timestamp * 1000),
      created_at: new Date(),
      updated_at: new Date()
    };
  }

  usePersist(knex, record, ["project_name", "blocknumber", "market_name"]);

  return elem(Div, record);
};
Each compound market has its own component to persist its record to the DB

Compared to writing the ingestor in vanilla JS, there were a couple of nice things about doing it this way, most of which are the same nice things about writing React. First, the transformation of the data, from the fetch at the source to the final result, is declarative in construction. It's now possible to think of the result as a function of a component's props and state, and any change to the inputs, whether through props or through an effect, will incrementally change the result. Only the hooks that need to run again will be run to obtain the new result.
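The "only re-run what changed" behavior hinges on comparing each hook's dependency array against the one from the previous run, the way React's useEffect does. A minimal sketch of that check (my own helper, not part of any library):

```typescript
// An effect re-runs only when its dependency array changes, compared
// element-by-element against the previous run's array with Object.is.
function depsChanged(prev: unknown[] | undefined, next: unknown[]): boolean {
  if (prev === undefined) return true; // first run: always execute
  if (prev.length !== next.length) return true;
  return next.some((dep, i) => !Object.is(dep, prev[i]));
}
```

This is why a hook like useFetchJson below only re-reads the file when filePath actually changes, rather than on every pass through the component.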

Second, the hooks compose. useFetchJson is a custom hook that is made of useState and useEffect hooks. This hook can now be reused across different components and pipelines.

// fs here is Node's promise-based API: import { promises as fs } from "fs";
// Refract is my homegrown hooks runtime.
export const useFetchJson = filePath => {
  let [json, setJson] = Refract.useState("useFetchJson", null);

  Refract.useEffect(
    "useFetchJson",
    () => {
      // Read and parse the file asynchronously; the effect re-runs
      // only when filePath changes.
      (async () => {
        const result = await fs.readFile(filePath);
        const parsed = JSON.parse(String(result));
        setJson(parsed);
      })();
    },
    [filePath]
  );

  return json;
};
A custom hook composed of two basic hooks

Third, each hook is like a managed effect. When you use a hook, it's not executed immediately. The hook generates a request that is dispatched to the underlying runtime to perform the side effect for you. This means the complexities of maintaining state for retries, timeouts, backfills, and versioning can all be encapsulated by the underlying runtime, in a reusable and composable way.
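For instance, a runtime that receives effect requests can wrap every one of them in a single shared retry policy. A minimal sketch (a hypothetical helper, not something lifted from my runtime):

```typescript
// The hook only *describes* the side effect as an async request; the
// runtime performs it, so cross-cutting concerns like retries live in
// exactly one place instead of in every pipeline component.
async function performWithRetry<T>(
  request: () => Promise<T>,
  attempts: number
): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await request();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

Timeouts, backoff, and backfill bookkeeping can be layered into the same wrapper without any component knowing about them.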

Lastly, one advantage of the system I built is that it's interruptible. Like React Fiber, it runs at a specific frame rate, so it has a time budget on every frame to execute as many units of work in the pipeline as it can. These units of work are the effect handlers for the effects that were triggered when an input changed. Right now, I'm not using this for anything, but I think it'll help support interactivity, debugging, and observability in a running pipeline in the future.
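A budgeted work loop of this kind can be sketched as follows (simplified from the React Fiber idea; the names and the injectable clock are mine, the latter purely to make the loop testable):

```typescript
// Each unit of work is a small closure; the loop runs as many as fit
// in the frame's time budget and leaves the rest queued for the next
// frame, which is what makes the pipeline interruptible.
function workLoop(
  queue: Array<() => void>,
  budgetMs: number,
  now: () => number = Date.now
): number {
  const deadline = now() + budgetMs;
  let done = 0;
  while (queue.length > 0 && now() < deadline) {
    queue.shift()!();
    done++;
  }
  return done; // remaining work stays in the queue
}
```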

That said, in React, the managed effects aren't just the hooks; they're also the JSX elements. React is actually a runtime that takes the JSX elements as managed effects and figures out how to manipulate the DOM using its reconciliation algorithm.

So far, I have no such equivalent of a render target for the resulting data, unless I consider the database a "render target". That might make sense: I just describe the updated database state, and the runtime figures out the insert/upsert/delete operations to get there. However, that requires a bit more experimentation, because unlike React's usual render targets, a data pipeline can have more than one sink, and I'm not sure how that would be represented in a tree-like JSX or its equivalent.
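To make the database-as-render-target idea concrete, a diff-based reconciler for keyed rows might look like this (a hypothetical sketch assuming a single table keyed by one column, not something I've built):

```typescript
// Describe the desired rows; the reconciler derives the operations.
type Row = { key: string; value: number };
type Op =
  | { op: "insert"; row: Row }
  | { op: "update"; row: Row }
  | { op: "delete"; key: string };

function reconcileRows(current: Row[], desired: Row[]): Op[] {
  const currentByKey = new Map(current.map(r => [r.key, r]));
  const ops: Op[] = [];
  for (const row of desired) {
    const existing = currentByKey.get(row.key);
    if (existing === undefined) ops.push({ op: "insert", row });
    else if (existing.value !== row.value) ops.push({ op: "update", row });
    currentByKey.delete(row.key); // anything left over is stale
  }
  for (const key of currentByKey.keys()) ops.push({ op: "delete", key });
  return ops;
}
```

The open question from above remains: this handles one sink, and it's not obvious how a tree-shaped element hierarchy would express several of them.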

For now, I'll keep exploring this avenue, and if you have comments, thoughts, or considerations, let me know.

Photo by ATBO from Pexels