Show HN: Quack-Cluster – A serverless distributed SQL engine with DuckDB and Ray

github.com

2 points by kristian1232 4 hours ago

Hi HN,

I'm excited to share a project I've been working on: Quack-Cluster.

I love the speed and simplicity of DuckDB for analytics, but I often work with datasets spread across hundreds of files in object storage (like S3). I wanted a way to run distributed queries across all that data without the complexity of setting up and managing a full-blown Spark or Presto cluster. I'm also a big fan of Ray for its simplicity in distributed Python, so I decided to combine them.

How it works: You send a standard SQL query to a central coordinator. It uses SQLGlot to parse the query and identify the target files (e.g., s3://bucket/data/*.parquet). It then generates a distributed plan and sends tasks to a cluster of Ray actors. Each Ray actor runs an embedded DuckDB instance to process a subset of the files in parallel. The partial results (as Arrow tables) are then aggregated and returned to the user.
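A minimal sketch of the scatter-gather flow described above, not Quack-Cluster's actual code. To stay runnable with only the standard library, it swaps in a thread pool for the Ray actors and plain Python loops for the embedded DuckDB instances; `FAKE_FILES`, `split_files`, `partial_aggregate`, `merge_partials`, and `run_avg_query` are hypothetical names invented for illustration. It shows why workers return partial state (sum and count) rather than finished averages: partial averages cannot be merged correctly.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for Parquet files in object storage: path -> rows of (key, value).
FAKE_FILES = {
    "data/part-0.parquet": [("a", 1.0), ("b", 2.0)],
    "data/part-1.parquet": [("a", 3.0)],
    "data/part-2.parquet": [("b", 4.0), ("b", 6.0)],
}

def split_files(files, n_workers):
    """Coordinator step: assign files to workers round-robin, dropping empty batches."""
    batches = [files[i::n_workers] for i in range(n_workers)]
    return [b for b in batches if b]

def partial_aggregate(batch):
    """Worker step: per-key partial (sum, count) over one batch of files."""
    partial = {}
    for path in batch:
        for key, value in FAKE_FILES[path]:
            s, c = partial.get(key, (0.0, 0))
            partial[key] = (s + value, c + 1)
    return partial

def merge_partials(partials):
    """Coordinator step: combine partial states, then finalize AVG = sum / count."""
    merged = {}
    for partial in partials:
        for key, (s, c) in partial.items():
            ms, mc = merged.get(key, (0.0, 0))
            merged[key] = (ms + s, mc + c)
    return {key: s / c for key, (s, c) in merged.items()}

def run_avg_query(files, n_workers=2):
    """Roughly: SELECT key, AVG(value) FROM files GROUP BY key."""
    batches = split_files(sorted(files), n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_aggregate, batches))
    return merge_partials(partials)

if __name__ == "__main__":
    print(run_avg_query(list(FAKE_FILES)))
```

The final answer does not depend on how files are split across workers, which is the property the merge step has to preserve for any aggregate it supports.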

The goal is to provide a lightweight, high-performance, and serverless alternative for interactive SQL analytics directly on a data lake.

The core tech stack is:

Backend: Python, FastAPI

Distributed Computing: Ray

Query Engine: DuckDB

SQL Parsing: SQLGlot

The project is open-source and I've tried to make it easy to get started locally with Docker and make. I'm here to answer any questions and would be grateful for any feedback on the architecture, use case, or the code itself.

Thanks for checking it out!

hodgesrm 3 hours ago

Sounds interesting! What kind of query latency do you see with this approach?

Also, have you thought about caching? My team is working on a similar problem and we cache everything from the results of S3 list_objects_v2 calls to Parquet metadata to blocks read from object storage.
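For illustration only, here is a minimal sketch of the simplest of those layers: a time-to-live cache in front of an object listing call. It is not taken from either project. `TTLCache`, `get_or_load`, and the `loader` callback are hypothetical names, and the loader stands in for whatever function actually lists a bucket prefix.

```python
import time

class TTLCache:
    """Cache loader results per key and reload them once they are older than ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: skip the remote call
        value = loader(key)      # hypothetical: e.g. a function that lists a bucket prefix
        self._entries[key] = (now + self._ttl, value)
        return value
```

A short TTL trades a little staleness (newly written files may be missed until the entry expires) for fewer listing round trips per query.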