Interesting talk. He mentions Futhark a few times, but fails to point out that his ideal way of programming is almost 1:1 how it would be done in Futhark.
His example from the talk would be written in Futhark in much the same way.
Also, while not exactly the algorithm Raph is looking for, here is a bracket matching function (from Pareas, which he also mentions in the talk) in Futhark: https://github.com/Snektron/pareas/blob/master/src/compiler/...
I haven't studied it in depth, but it's pretty readable.
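For anyone who hasn't seen the scan formulation before, here is a rough sketch in Python/NumPy of the simplest variant, a single-scan depth check for one bracket kind. This is not the Pareas code (which uses a binary-tree/stack-monoid formulation and does real matching); it's just meant to show the shape of the idea.

```python
# Rough sketch (not the Pareas code): a single-scan depth check for one
# bracket kind. Encode '(' as +1 and ')' as -1, prefix-sum the steps, and
# check the depth never goes negative and ends at zero. On a GPU the
# cumulative sum becomes a parallel scan.
import numpy as np

def brackets_balanced(s: str) -> bool:
    chars = np.frombuffer(s.encode(), dtype=np.uint8)
    steps = np.where(chars == ord("("), 1, -1)
    depths = np.cumsum(steps)
    return bool((depths >= 0).all() and (len(depths) == 0 or depths[-1] == 0))

print(brackets_balanced("(()())"))  # True
print(brackets_balanced("())("))    # False
```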
(author here) check_brackets_bt is actually exactly the algorithm that Raph mentions
Right. This is the binary tree version of the algorithm, and is nice and concise, very readable. What would take it to the next level for me is the version in the stack monoid paper, which chunks things up into workgroups. I haven't done benchmarks against the Pareas version (unfortunately it's not that easy), but I would expect the workgroup optimized version to be quite a bit faster.
I've been playing with one using scans. Too bad that's not really on the map for architectural reasons; it opens up a lot of uses.
Thanks for clarifying! It would indeed be interesting to see a comparison between similar implementations in other languages, both in terms of readability and performance. I feel like the readability can hardly get much better than what you wrote, but I don't know!
SQL.
It is a joke, but an SQL engine can be massively parallel. You just don't know it; it just gives you what you want. And in many ways the operations resemble what you would do, for example, in CUDA.
A CUDA backend for DuckDB or Trino would be one of my go-to projects if I were laid off.
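To make the "you just don't know it" point concrete, here is a tiny illustration using DuckDB's Python API (nothing CUDA-related): the query itself never mentions parallelism, and the engine fans the aggregation out across however many threads you give it.

```python
# Minimal illustration: a plain SQL aggregation that DuckDB parallelizes
# across CPU threads on its own -- the query never mentions threads.
import duckdb

con = duckdb.connect()
con.execute("SET threads = 8")  # the engine decides how to split the work
con.execute("CREATE TABLE t AS SELECT range AS x FROM range(10000000)")
print(con.execute("SELECT sum(x), count(*) FROM t").fetchone())
```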
My issue with SQL is lack of composability and difficulty of debugging intermediate results.
Yes, SQL is poor.
What could be good is a relational + array model. I have some ideas at https://tablam.org, and I think building not just the language but also the optimizer in tandem will be very nice.
Is it a language problem though? It's just a lack of tooling.
The dataframe paradigm (a good example being polars) is another good alternative that's more composable (imo).
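A hedged sketch of what "more composable" means here, using polars with made-up data: every intermediate step is an ordinary value you can print and keep building on, which also speaks to the debugging complaint above.

```python
# Sketch of the composability point with polars (made-up data): each step is
# a value you can name, inspect halfway through, and keep composing, instead
# of nesting subqueries.
import polars as pl

df = pl.DataFrame({"store": ["a", "a", "b", "b"], "sales": [10, 20, 5, 30]})

by_store = df.group_by("store").agg(pl.col("sales").sum())  # intermediate result
print(by_store)                                             # debuggable midway
print(by_store.filter(pl.col("sales") > 20))                # then keep composing
```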
It is true. I still hate it, I think because it always offers 10 different ways to do the same thing, so it is just too much to remember.
More generally, the key here is that the more magic you want in the execution of your code, the more declarative you want the code to be. And SQL is pretty much the poster child of declarative languages out there.
Term rewriting languages probably work better for this than I would expect? It is kind of sad how little experience with that sort of thing I have built up, and I think I'm still ahead of a large percentage of developers out there.
If you want to work in data engineering for massive datasets (many petabytes) pls hit me up!
Raph and I also talked about this subject here: https://www.popovit.ch/interviews/raph-levien-simd
The discussion covers things at a relatively basic level as we wanted it to be accessible to a wide audience. So we explain SIMD vs SIMT, predication, multiversioning, and some more.
Raph is a super nice guy and a pleasure to talk to. I'm glad we have people like him around!
It seems like there are two sides to this problem, both of which are hard and go hand in hand. There is the HCI problem of designing abstractions that are rich enough to handle problems like parsing and scheduling on the GPU. Then there is the sufficiently-smart-compiler problem of lowering those abstractions to the GPU. But of course, there's a limit to how smart a compiler can be, which loops back to your abstraction design.
Overall, it seems to be a really interesting problem!
There were a few languages designed specifically for parallel computing spurred by DARPA's High Productivity Computing Systems project. While Fortress is dead, Chapel is still being developed.
Those languages were not effective in practice. The kind of loop parallelism that most people focus on is the least interesting and effective kind outside of niche domains. The value was low.
Hardware architectures like the Tera MTA were much more capable, but almost no one could write effective code for them even though the language was vanilla C++ with a couple of extra features. Then we learned how to write similar software architectures on standard CPUs. The same problem of people being bad at it remained.
The common thread in all of this is people. Humans as a group are terrible at reasoning about non-trivial parallelism. The tools almost don't matter. Reasoning effectively about parallelism involves manipulating a state space that is quite evidently beyond most humans' cognitive abilities.
Parallelism was never about the language. Most people can't build the necessary mental model in any language.
IIRC those were oriented more towards large HPC clusters than computation on a single node?
Chapel, at least, aims for both. You can write loops that it will try to compile to use SIMD instructions, or even for the GPU: https://chapel-lang.org/docs/technotes/gpu.html
The distinction matters less and less. Inside the GPU there is already plenty of locality to exploit (caches, schedulers, warps). NVLink is a switched memory-access network, so that already gets you some fairly large machines with multiple kinds of locality.
Throwing InfiniBand or IP on top is structurally just more of the same.
Chapel definitely can target a single GPU.
Bend comes to mind as an attempt at this: https://github.com/HigherOrderCO/Bend
Disclaimer: I have not watched the video yet.
What about burla.dev?
Or basically a generic nestable `remote_parallel_map` for python functions over lists of objects.
I haven't had a chance to fully watch the video yet, and I understand it focuses on lower levels of abstraction / GPU programming. But I'd love to know how this fits into what the speaker is looking for and what it's missing (other than obviously not being a way to program GPUs). Also, full disclosure: I am a co-founder.
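I haven't used burla, so this is not its API; just a hypothetical local stand-in to illustrate the shape of "a generic nestable parallel map over lists of Python objects", where a remote version would presumably ship the function and items to other machines instead of local threads.

```python
# Hypothetical local stand-in (NOT burla's API): a nestable parallel map over
# lists of Python objects. A remote version would send fn and items to workers
# on other machines instead of running them on local threads.
from concurrent.futures import ThreadPoolExecutor

def parallel_map(fn, items, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))

def word_lengths(line):
    # inner level: each task can itself fan out with another parallel_map
    return parallel_map(len, line.split())

print(parallel_map(word_lengths, ["a good parallel language", "maps all the way down"]))
```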
Lower-level programming language, which is either object-oriented like python or after compilation a real-time system transposition would assemble the microarchitecture to an x86 chip.
Went in thinking "Have you heard of Go?"... but this turned out to be about GPU computing.
Well, they said "good" :). Go we already have, that is correct.
P.S. I'm joking, I do love Go, even though it's by no means a perfect language to write parallel applications with
VHDL?
I almost mentioned it in the talk, as an example of a language that's deployed very successfully and expresses parallelism at scale. Ultimately I didn't, as the core of what I'm talking about is control over dynamic allocation and scheduling, and that's not the strength of VHDL.
Prolog?
Now you are backtracking (pun intended).
I think a good parallel language will be the one that takes your code written with tasks and channels, understands its logic, and rewrites and compiles it in the most efficient way. I don't feel that I should have to write anything harder than that as a puny human.
Mapping from channels to SIMD seems kind of intractable; it's a kind of lifting that involves looking across the producers and the consumers.
Going the other direction, making channel runtimes run SIMD, is trivial.
The audio is weirdly messed up
Yes, sorry about that. We had tech issues, and did the best we could with the audio that was captured.
Unfortunately his microphone did not cooperate.
So he wants a good parallel language? What's the issue? I haven't had problems with concurrency, multiplexing, and promises. They've solved all the parallelism tasks I've needed to do.