Easy Backfilling with Hooks and Components

I implemented a backfill mechanism with basic hooks, so when the pipeline comes back up after some downtime, it not only does its job of ingesting new blocks for interest rates, but also checks which of the last 128 blocks it missed and ingests those.

Wil

Aug 21, 2021 • 3 min read

Ever since I realized that React's hook API is useful in other contexts, I've been excited about applying it to workflow software like Apache Airflow. Riding on that momentum, I built a bare-bones React for data pipelines, which, for now, I'll call by its codename, Refract.

Beyond just getting it to work, I wanted to address the backfilling problem I've had when building ETL pipelines in the past. When you fetch incoming data and run it through a transformation, something inevitably goes wrong, resulting in downtime.

When you bring the service back up, you will have missed event data and need to backfill it. Backfilling is extra work that I often didn't have time for while getting a pipeline up and running. But then when something breaks, I spend even more time backfilling data manually.

Using Refract, I built an example app that fetches interest rates from the Compound smart contract for all their markets directly from an Ethereum node.

Most visual programming tools, where nodes are connected by wires, make reuse hard because they don't distinguish between a component and its instance. But because Refract uses components and hooks much like React does, the main pipeline and the missing-block hook are both reusable.

export const ApyIngestor = props => {
  let { providerUri, tokenListPath, mainnetPath, mainnetAbiPath } = props;

  // ...various hooks...

  return [
    elem(MainLine, {
      key: "0",
      knex,
      tokens: filteredTokens,
      contractGenerator,
      blockHeader
    }),

    elem(Backfill, {
      key: "1",
      web3,
      knex,
      tokens: filteredTokens,
      contractGenerator,
      latestBlockHeader: blockHeader
    })
  ];
};


export const Backfill = props => {
  let { web3, knex, tokens, contractGenerator, latestBlockHeader, children } =
    props;

  const { missingKeys, numRemainingKeys, setNumRemainingKeys } = useMissingKeys(
    knex,
    "annual_percentage_yields",
    "blocknumber",
    latestBlockHeader?.number
  );

  let [pastBlockHeader, setPastBlockHeader] = Refract.useState(
    "past blockheader",
    null
  );

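  Refract.useEffect(
    "get past block",
    () => {
      // Nothing to do until web3 is ready and keys remain to be backfilled.
      if (!web3 || !missingKeys || !numRemainingKeys) return;

      // Fetch one block per run, oldest first. This assumes missingKeys is
      // sorted ascending and numRemainingKeys starts at missingKeys.length,
      // so each decrement re-triggers this effect for the next block.
      const blocknumber = missingKeys[missingKeys.length - numRemainingKeys];
      if (blocknumber === undefined) return;

      (async () => {
        const bh = await web3.eth.getBlock(blocknumber);
        setPastBlockHeader(bh);
        setNumRemainingKeys(numRemainingKeys - 1);
      })();
    },
    [missingKeys, numRemainingKeys]
  );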

  return [
    elem(MainLine, {
      key: "0",
      knex,
      tokens,
      contractGenerator,
      blockHeader: pastBlockHeader
    })
  ];
};

Backfilling is specific to the domain, so I decided to leave it as something the programmer constructs, with reusable hooks to make it easier.

For example, we backfill in ascending order, processing the oldest block first. Otherwise, the Ethereum node might flush the old block before we get around to it. [1] Also, we try to generate a new missing-keys list on every new block, but only once we've finished processing the last generated backfill list. That's what the numRemainingKeys count is for. The hook returns it along with its setter, setNumRemainingKeys, so the client can update it.
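To make the ordering concrete, here is a minimal sketch of how a missing-keys list could be computed. It is not the actual useMissingKeys hook: findMissingBlocks, its parameters, and the in-memory array standing in for the database query are all hypothetical, and the 128-block window is taken from the footnote.

```javascript
// Hypothetical helper, not part of Refract: list the block numbers in the
// trailing window that are absent from storage, oldest first.
function findMissingBlocks(latestBlock, storedBlocks, windowSize = 128) {
  const stored = new Set(storedBlocks);
  // Never look further back than the window the node still serves.
  const oldest = Math.max(0, latestBlock - windowSize + 1);
  const missing = [];
  for (let n = oldest; n <= latestBlock; n++) {
    if (!stored.has(n)) missing.push(n);
  }
  return missing; // ascending, so the oldest block is processed first
}

// Usage with a small window of 5 blocks ending at block 1000.
// Blocks 996, 998, and 1000 are already stored; the gaps are 997 and 999.
console.log(findMissingBlocks(1000, [996, 998, 1000], 5));
```

Building the list in ascending order means the blocks closest to falling out of the node's window come first.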

So far, I've liked the experience of writing Backfill with this API. When I refactor and move either components or hooks around, I don't really have to change much; I can just cut and paste. And I can reuse components because each place I use them is a different instantiation.
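Here is a toy sketch of why separate instantiations matter. It is not how Refract is implemented: this elem, the render function, the stores map, and passing useState through props are all assumptions made for illustration. The point is only that the same component, used in two places, gets two independent pieces of state.

```javascript
// Hypothetical sketch, not Refract's internals.
// An element is just a description: which component, with which props.
const elem = (component, props) => ({ component, props });

// Each instance gets its own state store, addressed by its path of keys.
const stores = new Map();

function render(element, path = "root") {
  const instancePath = `${path}/${element.props.key}`;
  if (!stores.has(instancePath)) stores.set(instancePath, new Map());
  const store = stores.get(instancePath);

  // A named-state hook in the spirit of Refract.useState(name, initial).
  const useState = (name, initial) => {
    if (!store.has(name)) store.set(name, initial);
    return [store.get(name), value => store.set(name, value)];
  };

  const children = element.component({ ...element.props, useState }) || [];
  children.forEach(child => render(child, instancePath));
}

// One component definition...
const Counter = ({ useState, step }) => {
  const [count, setCount] = useState("count", 0);
  setCount(count + step);
  return [];
};

// ...used twice, as two instances with different keys.
const App = () => [
  elem(Counter, { key: "0", step: 1 }),
  elem(Counter, { key: "1", step: 10 })
];

render(elem(App, { key: "app" }));
render(elem(App, { key: "app" }));

console.log(stores.get("root/app/0").get("count")); // 2
console.log(stores.get("root/app/1").get("count")); // 20
```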

Another advantage is that I don't really have to worry about cache invalidation any longer. In the initial non-Refract implementation, I noticed I was reloading the ABI files from disk on every new block. It was wasteful, but I didn't change it because it didn't seem to affect run time. In Refract, avoiding the reload just worked. The caveat is that I have to get the dependency array right in useEffect hooks, which I sometimes miss. I also tend to forget that some variables are undefined at the beginning, so I have to use optional chaining.
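A rough sketch of the idea behind dependency arrays, assuming they work the way React's do: the work reruns only when some dependency changes. createMemo, depsChanged, loadAbi, and the "abi.json" path are hypothetical stand-ins, not Refract's API.

```javascript
// Hypothetical sketch of dependency-array caching, not Refract's real code.
function depsChanged(prevDeps, nextDeps) {
  if (!prevDeps || prevDeps.length !== nextDeps.length) return true;
  return nextDeps.some((dep, i) => !Object.is(dep, prevDeps[i]));
}

function createMemo() {
  let lastDeps = null;
  let lastValue;
  return (compute, deps) => {
    // Recompute only when a dependency differs from the previous call.
    if (depsChanged(lastDeps, deps)) {
      lastValue = compute();
      lastDeps = deps;
    }
    return lastValue;
  };
}

// Stand-in for reading an ABI file from disk; counts how often it runs.
let loads = 0;
const loadAbi = path => {
  loads += 1;
  return { path };
};

const memoAbi = createMemo();
let blockHeader; // undefined until the first block arrives

for (const header of [undefined, { number: 100 }, { number: 101 }]) {
  blockHeader = header;
  // The path never changes, so the file is "loaded" once across all blocks.
  const abi = memoAbi(() => loadAbi("abi.json"), ["abi.json"]);
  // Optional chaining yields undefined instead of throwing on the first pass.
  console.log(blockHeader?.number, abi.path, loads);
}
```

If the dependency array left out a value the computation actually reads, the cached result would go stale, which is the mistake I mentioned above.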

So far so good.

[1] Unless you're talking to a full archive node, Infura only keeps the last 128 blocks.