Data Platform

Halcyon

A predicate-pushdown query engine that rewrites execution plans across a 64-node cluster, reading Parquet at the SSD level to cut latency by an order of magnitude.

Role Lead EngineerClient Confidential (fintech)Timeline 8 monthsYear 2025

Latency12× faster

Daily users600+

Cluster64 nodes

Warehouse4 PB

RustApache ArrowParquetgRPCKubernetes

The problem

Analysts were waiting minutes for dashboards that should have rendered in seconds. The existing engine scanned far more data than each query needed, and the planner had no awareness of the physical layout on disk.

With a 4 PB warehouse and 600+ daily users, every wasted scan multiplied into real cost and real frustration.

The approach

I rebuilt the execution layer around predicate pushdown — filters are evaluated as close to the storage as possible, skipping entire row groups before they ever leave the SSD.

A cost-based planner rewrites queries across the cluster, balancing shuffle cost against parallelism so the 64 nodes stay saturated without thrashing the network.

Row-group and page-level skipping driven by Parquet statistics.
Vectorized execution on Apache Arrow record batches.
Adaptive plan rewriting based on live cardinality estimates.

Queries that took 90 seconds now return in under 8. The team stopped pre-aggregating data just to make dashboards usable.

The outcome

P99 latency dropped 12×, and the engine now serves the entire analytics org from a single cluster. The cost-based planner removed an entire class of hand-tuned materialized views.

More work

Related projects

CRM Platform

Meridian CRM2025

A custom CRM with pipelines, automations, and reporting.

Next.jsNestJSPostgreSQL

ERP System

Atlas ERP2024

End-to-end operations: inventory, accounting, and HR.

LaravelVueMySQL

Website

Lumen2024

A high-performance marketing site with motion design.

Next.jsGSAPThree.js