Ask HN: Where does your AI/ML data pipeline hurt the most? (2025)

1 point by phukon a day ago

I’m gathering quick, honest feedback from people who wrangle data for models, including post-training and ML teams at LLM labs.

Share your thoughts/anecdotes on:

- Biggest recurring bottleneck (collection, cleaning, labeling, drift, compliance, etc.)

- Has RLHF/synthetic data actually cut your need for fresh domain data?

- Hard-to-source domains (finance, healthcare, logs, multi-modal, whatever) and why.

- Tasks you’d automate first if you could.