Nvidia adds native Python support to CUDA

https://thenewstack.io/nvidia-finally-adds-native-python-support-to-cuda/

diggan
I'm no GPU programmer, but seems easy to use even for someone like me. I pulled together a quick demo of using the GPU vs the CPU, based on what I could find (https://gist.github.com/victorb/452a55dbcf59b3cbf84efd8c3097...) which gave these results (after downloading 2.6GB of dependencies of course):

    Creating 100 random matrices of size 5000x5000 on CPU...
    Adding matrices using CPU...
    CPU matrix addition completed in 0.6541 seconds
    CPU result matrix shape: (5000, 5000)
    
    Creating 100 random matrices of size 5000x5000 on GPU...
    Adding matrices using GPU...
    GPU matrix addition completed in 0.1480 seconds
    GPU result matrix shape: (5000, 5000)
Definitely worth digging into more, as the API is really simple to use, at least for basic things like these. CUDA programming seems like a big chore without something higher level like this.
ashvardanian
CuPy has been available for years and has always worked great. The article is about the next wave of Python-oriented JIT toolchains, which will allow writing actual GPU kernels in a Pythonic style instead of calling an existing precompiled GEMM implementation in CuPy (like in that snippet) or JIT-ing CUDA C++ kernels from Python source, which has also been available for years: https://docs.cupy.dev/en/stable/user_guide/kernel.html#raw-k...
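
For context, a minimal sketch of the existing raw-kernel route: CUDA C++ source held in a Python string and compiled at runtime by CuPy. It assumes CuPy is installed and a CUDA device is present. The names `KERNEL_SOURCE`, `add_f32` and `gpu_add_f32` are made up for illustration; only `cp.RawKernel` and the array functions are CuPy's.

```python
# CUDA C++ source kept in a Python string; CuPy compiles it at runtime.
KERNEL_SOURCE = r'''
extern "C" __global__
void add_f32(const float* a, const float* b, float* out, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) {
        out[i] = a[i] + b[i];
    }
}
'''


def gpu_add_f32(n=1 << 20, threads=256):
    """Launch the raw kernel on n elements (needs CuPy and a CUDA device)."""
    import cupy as cp  # imported lazily so this file loads without a GPU

    kernel = cp.RawKernel(KERNEL_SOURCE, "add_f32")  # compiled on first launch
    a = cp.arange(n, dtype=cp.float32)
    b = cp.ones(n, dtype=cp.float32)
    out = cp.empty_like(a)
    blocks = (n + threads - 1) // threads  # enough blocks to cover n elements
    # Arguments: grid size, block size, kernel arguments. The int is passed
    # as int32 to match the `int n` parameter in the C++ signature.
    kernel((blocks,), (threads,), (a, b, out, cp.int32(n)))
    return out
```

Calling `gpu_add_f32()` on a machine with a GPU returns a CuPy array; the point is that the kernel body is still C++, which is the part the new toolchains aim to replace with Python.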
almostgotcaught
it's funny - people around here really do not have a clue about the GPU ecosystem even though everyone is always talking about AI:

> The article is about the next wave of Python-oriented JIT toolchains

the article is content marketing (for whatever) but the actual product literally has nothing to do with kernels or jitting or anything

https://github.com/NVIDIA/cuda-python

literally just cython bindings to CUDA runtime and CUB.

for once CUDA is aping ROCm:

https://github.com/ROCm/hip-python

dragonwriter
The mistake you seem to be making is confusing the existing product (which has been available for many years) with the upcoming new features for that product just announced at GTC, which are not addressed at all on the page for the existing product, but are addressed in the article about the GTC announcement.
ladberg
The main release highlighted by the article is cuTile which is certainly about jitting kernels from Python code
wiredfool
Curious what the timing would be if it included the memory transfer time, e.g.

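  import time
  import numpy as np
  import cupy as cp

  count, n = 100, 5000  # from the demo output; shrink to fit your memory
  matrices = [np.random.rand(n, n) for _ in range(count)]
  time_start = time.time()
  cp_matrices = [cp.asarray(m) for m in matrices]  # host-to-device copies, now timed
  result = sum(cp_matrices[1:], cp_matrices[0])    # stand-in for add_(cp_matrices)
  cp.cuda.get_current_stream().synchronize()       # wait for queued GPU work
  time_end = time.time()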
nickysielicki
I don’t mean to call you or your pseudocode out specifically, but I see this sort of thing all the time, and I just want to put it out there:

PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap.

https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_...
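
A minimal sketch of what event-based timing can look like from Python, assuming CuPy is installed and a CUDA device is present. `time_gpu_ms` and `time_wall_ms` are hypothetical helper names; the event calls (`cp.cuda.Event`, `record`, `synchronize`, `cp.cuda.get_elapsed_time`) are CuPy's wrappers over the CUDA event API. The host-side timer is included only for comparison.

```python
import time


def time_gpu_ms(fn, repeats=10):
    """Average GPU time of fn() in milliseconds, measured with CUDA events."""
    import cupy as cp  # imported lazily so this file loads without a GPU

    start, end = cp.cuda.Event(), cp.cuda.Event()
    fn()  # warm-up: the first call may include compilation and allocation
    start.record()  # event is queued on the current stream
    for _ in range(repeats):
        fn()
    end.record()
    end.synchronize()  # wait until the end event has been reached on the GPU
    return cp.cuda.get_elapsed_time(start, end) / repeats


def time_wall_ms(fn, repeats=10):
    """Average host wall-clock time of fn() in milliseconds.

    Only meaningful for GPU work if fn() itself synchronizes before returning.
    """
    fn()  # warm-up
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) * 1000 / repeats
```

With a GPU available, something like `time_gpu_ms(lambda: a + b)` for two CuPy arrays `a` and `b` measures the device-side time without adding a sync between iterations.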

bee_rider
If I have a mostly CPU code and I want to time the scenario: “I have just a couple subroutines that I am willing to offload to the GPU,” what’s wrong with sprinkling my code with normal old python timing calls?

I don’t care what part of the CUDA ecosystem is taking time (from my point of view it is a black box that does GEMMs), so why not measure “time until my normal code is running again”?

hnuser123456
I think it does?: (the comment is in the original source)

    print("Adding matrices using GPU...")
    start_time = time.time()
    gpu_result = add_matrices(gpu_matrices)
    cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
    elapsed_time = time.time() - start_time
I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know?
apbytes
When you call a CUDA method, it is launched asynchronously. That is, the function queues it up for execution on the GPU and returns.

So if you need to wait for an op to finish, you need to `synchronize` as shown above.

`get_current_stream` because the queue mentioned above is actually called stream in cuda.

If you want to run many independent ops concurrently, you can use several streams.

Benchmarking is one use case for synchronize. Another would be if you, say, run two independent ops in different streams and need to combine their results.

Btw, if you work with pytorch, when ops are run on gpu, they are launched in background. If you want to bench torch models on gpu, they also provide a sync api.
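
The queue-and-wait model described above can be mimicked on the CPU. The sketch below is only an analogy: `ToyStream` is a hypothetical class built from a Python thread and a queue, not a CUDA or CuPy API, and `slow_add` merely pretends to be GPU work. It shows a launch returning immediately, two "streams" running independent ops concurrently, and a sync before combining results.

```python
import queue
import threading
import time


class ToyStream:
    """CPU-only stand-in for a CUDA stream: a FIFO queue drained by one worker."""

    def __init__(self):
        self._work = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def _drain(self):
        while True:
            fn, args = self._work.get()
            fn(*args)
            self._work.task_done()

    def launch(self, fn, *args):
        """Queue fn(*args) and return at once, like an asynchronous launch."""
        self._work.put((fn, args))

    def synchronize(self):
        """Block until everything queued so far has finished."""
        self._work.join()


def slow_add(out, key, a, b):
    time.sleep(0.2)  # pretend this is work running on the device
    out[key] = a + b


results = {}
s1, s2 = ToyStream(), ToyStream()

t0 = time.perf_counter()
s1.launch(slow_add, results, "x", 1, 2)  # returns right away
s2.launch(slow_add, results, "y", 3, 4)  # independent op on a second stream
launch_time = time.perf_counter() - t0   # small: only the queuing was timed

s1.synchronize()
s2.synchronize()
total_time = time.perf_counter() - t0    # roughly 0.2 s, since the ops overlap
combined = results["x"] + results["y"]   # safe to read only after both syncs
```

Stopping the clock at `launch_time` is the benchmarking mistake: it measures how long it took to queue the work, not how long the work took.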

moffkalast
Only 4x speed seems rather low for GPU acceleration, does numpy already use AVX2 or anything SIMD?

For comparison, doing something similar with torch on CPU and torch on GPU will get you like 100x speed difference.

diggan
It's a microbenchmark (if even that), take it with a grain of salt. You'd probably see a bigger difference with bigger/more/more complicated tasks.
rahimnathwani
Thank you. I scrolled up and down the article hoping they included a code sample.
diggan
Yeah, I figured I wasn't alone in doing just that :)
rahimnathwani
EDIT: Just realized the code doesn't seem to be using the GPU for the addition.
no_wizard
What makes Python such a target for these kinds of things?

I've noticed a lot of projects add Python support like this. Does the Python codebase allow it to compile down to different targets more easily than others?

aixpert
Thank God PyTorch gained so much momentum before this came out. Now we have a truly platform-independent semi-standard for parallel computations. We are not stuck with NVIDIA specifics.

It's great that the parts of PyTorch which concern the NVIDIA backend can now be implemented in Python directly. The important part is that it doesn't really matter, or shouldn't matter, for end users/developers.

That being said, maybe this new platform will extend the whole concept of on-GPU computation via Python to even more domains, like maybe games.

Imagine running Rust (the game) performantly, mainly on the GPU, via Python.

disgruntledphd2
This just makes it much, much easier for people to build numeric stuff on GPU, which is great.

I'm totally with you that it's better that this took so long, so we have things like PyTorch abstracting most of this away, but I'm looking forward to (in my non-existent free time :/ ) playing with this.

wafngar
Why not use torch.compile()?
ashvardanian
CuTile, in many ways, feels like a successor to OpenAI's Triton... And not only are we getting tile/block-level primitives and TileIR, but also a proper SIMT programming model in CuPy, which I don't think enough people noticed even at this year's GTC. Very cool stuff!

That said, there were almost no announcements or talks related to CPUs, despite the Grace CPUs being announced quite some time ago. It doesn't feel like we're going to see generalizable abstractions that work seamlessly across Nvidia CPUs and GPUs anytime soon. For someone working on parallel algorithms daily, this is an issue: debugging with NSight and CUDA-GDB still isn't the same as raw GDB, and it's much easier to design algorithms on CPUs first and then port them to GPUs.

Of all the teams in the compiler space, Modular seems to be among the few that aren't entirely consumed by the LLM craze, actively building abstractions and languages spanning multiple platforms. Given the landscape, that's increasingly valuable. I'd love to see more people experimenting with Mojo — perhaps it can finally bridge the CPU-GPU gap that many of us face daily!

crazygringo
Very curious how this compares to JAX [1].

JAX lets you write Python code that executes on Nvidia, but also GPUs of other brands (support varies). It similarly has drop-in replacements for NumPy functions.

This only supports Nvidia. But can it do things JAX can't? Is it easier to use? Is it less fixed-size-array-oriented? Is it worth locking yourself into one brand of GPU?

[1] https://github.com/jax-ml/jax

odo1242
Well, the idea is that you’d be writing low level CUDA kernels that implement operations not already implemented by JAX/CUDA and integrate them into existing projects. Numba[1] is probably the closest thing I can think of that currently exists. (In fact, looking at it right now, it seems this effort from Nvidia is actually based on Numba)

[1]: https://numba.readthedocs.io/en/stable/cuda/overview.html

gymbeaux
This is huge. Anyone who was considering AMD + ROCm as an alternative to NVIDIA in the AI space isn’t anymore.

I’m one of those people who can’t (won’t) learn C++ to the extent required to effectively write code for GPU execution…. But to have a direct pipeline to the GPU via Python. Wow.

The efficiency implications are huge, not just for Python libraries like PyTorch, but also anything we write that runs on an NVIDIA GPU.

I love seeing anything that improves efficiency because we are constantly hearing about how many nuclear power plants OpenAI and Google are going to need to power all their GPUs.

ferguess_k
Just curious why can't AMD do the same thing?
bigyabai
It can be argued that they already did. AMD and Apple worked with Khronos to build OpenCL as a general competitor. The industry didn't come together to support it though, and eventually major stakeholders abandoned it altogether. Those ~10 wasted years were spent on Nvidia's side refining their software offerings and redesigning their GPU architecture to prioritize AI performance over raster optimization. Meanwhile Apple and AMD were pulling the rope in the opposite direction, trying to optimize raster performance at all costs.

This means that Nvidia is selling a relatively unique architecture with a fully developed SDK, industry buy-in and relevant market demand. Getting AMD up to the same spot would force them to reevaluate their priorities and demand a clean-slate architecture to boot.

pjmlp
Maybe because Apple got pissed on how Khronos took over OpenCL, AMD and Intel never offered tooling on par with CUDA in terms of IDE integration, graphical debuggers and library ecosystem.

Khronos also never saw the need to support a polyglot ecosystem with C++, Fortran and anything else that the industry could feel like using on a GPU.

When Khronos finally remembered to at least add C++ support and SPIR, again Intel and AMD failed to deliver, and OpenCL 3.0 is basically OpenCL 1.0 rebranded.

Followed by SYCL efforts, which only Intel seems to care about, with their own extensions on top via DPC++, nowadays oneAPI. And only after acquiring Codeplay, which was actually the first company to deliver on SYCL tooling.

However contrary to AMD, at least Intel does get that unless everyone gets to play with their software stack, no one will bother to actually learn it.

ErrorNoBrain
They are, if they can't find an Nvidia card
pjmlp
NVidia cards are everywhere, the biggest difference to AMD is that even my lousy laptop GeForce cards can be used for CUDA.

No need for an RTX for learning and getting into CUDA programming.

dismalaf
> But to have a direct pipeline to the GPU via Python

Have you ever used a GPU API (CUDA, OpenCL, OpenGL, Vulkan, etc...) with a scripting language?

It's cool that Nvidia made a bit of an ecosystem around it but it won't replace C++ or Fortran and you can't simply drop in "normal" Python code and have it run on the GPU. CUDA is still fundamentally its own thing.

There have also been CUDA bindings to scripting languages for at least 15 years... Most people will probably still use Torch or higher level things built on top of it.

Also, here's Nvidia's own advertisement and some instructions for Python on their GPUs:

- https://developer.nvidia.com/cuda-python

- https://developer.nvidia.com/how-to-cuda-python

Reality is kind of boring, and the article posted here is just clickbait.

dragonwriter
> It's cool that Nvidia made a bit of an ecosystem around it but it won't replace C++ or Fortran and you can't simply drop in "normal" Python code and have it run on the GPU.

While it's not exactly normal Python code, there are Python libraries that allow writing GPU kernels in internal DSLs that are normal-ish Python (e.g., Numba for CUDA specifically, via the @cuda.jit decorator; or Taichi, which has multiple backends supporting the same application code: Vulkan, Metal, CUDA, OpenGL, OpenGL ES, and CPU).

Apparently, nVidia is now doing this first party in CUDA Python, including adding a new paradigm for CUDA code (CuTile) that is going to be in Python before C++; possibly trying to get ahead of things like Taichi (which, because it is cross-platform, commoditizes the underlying GPU).

> Also, here's Nvidia's own advertisement for Python on their GPUs

That (and the documentation linked there) does not address the new upcoming native functionality announced at GTC; existing CUDA Python has kernels written in C++ in inline strings.
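
As a rough illustration of the "normal-ish Python" kernel style mentioned above, here is a minimal Numba sketch, assuming `numba` is installed and a CUDA-capable GPU is available. `launch_config`, `build_add_kernel` and `gpu_add` are hypothetical helper names; `cuda.jit` and `cuda.grid` are Numba's. This shows the existing Numba approach, not the new functionality announced at GTC.

```python
import math


def launch_config(n, threads_per_block=256):
    """Return (blocks, threads) so that blocks * threads covers n elements."""
    return math.ceil(n / threads_per_block), threads_per_block


def build_add_kernel():
    """Define the kernel lazily so this file imports on machines without Numba."""
    from numba import cuda

    @cuda.jit
    def add_kernel(a, b, out):
        i = cuda.grid(1)      # global thread index across the 1-D grid
        if i < out.size:      # guard: the last block may overshoot the array
            out[i] = a[i] + b[i]

    return add_kernel


def gpu_add(a, b):
    """Add two 1-D NumPy arrays on the GPU (needs numba and a CUDA device)."""
    import numpy as np

    out = np.empty_like(a)
    blocks, threads = launch_config(a.size)
    # Numba copies host arrays to the device and back around the launch.
    build_add_kernel()[blocks, threads](a, b, out)
    return out
```

The kernel body is a restricted subset of Python that Numba compiles for the GPU; the per-thread indexing and bounds guard are the same ideas as in a CUDA C++ kernel.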

pjmlp
Yes, shading languages which are more productive without the gotchas from those languages, as they were designed from the ground up for compute devices.

The polyglot nature of CUDA is one of the plus points versus the original "we do only C99 dialect around here" from OpenCL, until it was too late.

freeone3000
OpenCL and OpenGL are basically already scripting languages that you happen to type into a C compiler. The CUDA advantage was actually having meaningful types and compilation errors, without the intense boilerplate of Vulkan. But this is 100% a python-for-CUDA-C replacement on the GPU, for people who prefer a slightly different bracketing syntax.
dismalaf
> But this is 100% a python-for-CUDA-C replacement on the GPU

Ish. It's a Python maths library made by Nvidia, an eDSL and a collection of curated libraries. It's not significantly different than stuff like Numpy, Triton, etc..., apart from being made by Nvidia and bundled with their tools.

the__alchemist
Rust support next? RN I am manually [de]serializing my data structures as byte arrays to/from the kernels. It would be nice to have truly shared data structures like CUDA gives you in C++!
KeplerBoy
Isn't Rust still very seldom used in the areas where CUDA shines (e.g. number crunching of any kind, be it simulations or linear algebra)? Imo C++ or even Fortran are perfectly fine choices for those things, since the memory allocation patterns aren't that complicated.
IshKebab
Mainly because number crunching code tends to be very long-lived (hence why FORTRAN is still in use).
nine_k
Not only that. Fortran is very good for writing number-crunching code. Modern Fortran is a pretty ergonomic language, it gives you a really easy way to auto-parallelize things in many ways, and new Fortran code is being produced unironically. Of course it normally uses the treasure trove of existing numerical Fortran code. (Source: a friend who worked at CERN.)
pjmlp
Yes, and the new kid in town, slang has more chances of adoption.
KeplerBoy
sorry, could you link to the project? Seems there are quite a few languages called slang.
chasely
The Rust-CUDA project just recently started up again [0], I've started digging into it a little bit and am hoping to contribute to it since the summers are a little slower for me.

[0] https://github.com/rust-gpu/rust-cuda

the__alchemist
Still broken though! Has been for years. In a recent GH issue regarding desires for the reboot, I asked: "Try it on a few different machines (OS, GPUs, CUDA versions etc), make it work on modern RustC and CUDA versions without errors." The response was "That will be quite some work." Meanwhile, Cudarc works...
chasely
Totally, it's going to take a minute to get it all working. On a positive note, they recently got some sponsorship from Modal [0], who is supplying GPUs for CI/CD so they should be able to expand their hardware coverage.
Micoloth
What do you think of the Burn framework? (Honest question, I have no clue what I’m talking about)
airstrike
I used it to train my own mini-GPT and I liked it quite a bit. I tend to favor a different style of Rust with fewer generics but maybe that just can't be avoided given the goals of that project.

The crate seems to have a lot of momentum, with many new features, releases, active communities on GH and Discord. I expect it to continue to get better.

the__alchemist
Have not heard of it. Looked it up. Seems orthogonal?

I am using Cudarc.

the__alchemist
Not functional.
chrisrodrigue
Python is really shaping up to be the lingua franca of programming languages. Its adoption is soaring in this FOSS renaissance and I think it's the closest thing to a golden hammer that we've ever had.

The PEP model is a good vehicle for self-improvement and standardization. Packaging and deployment will soon be solved problems thanks to projects such as uv and BeeWare, and I'm confident that we're going to see continued performance improvements year over year.

int_19h
AI-generated code is going to be a major influence going forward. Regardless of how you feel about its quality (I'm a pessimist myself), it's happening anyway, and it's going to cement the dominant position of those languages which LLMs understand / can write the best. Which correlates strongly to their amount in the training set, which means that Python and JavaScript in particular are here to stay now, and will likely be increasingly shoved into more and more niches - even those they aren't well-suited to - solely because LLMs can write them.
silisili
> Packaging and deployment will soon be solved problems

I really hope you're right. I love Python as a language, but for any sufficiently large project, those items become an absolute nightmare without something like Docker. And even with it, there seem to be multiple ways people solve it. I wish they'd put something in at the language level or bless an 'official' one. Go has spoiled me there.

horsawlarway
Honestly, I'm still incredibly shocked at just how bad Python is on this front.

I'm plenty familiar with packaging solutions that are painful to work with, but the state of python was shocking when I hopped back in because of the available ML tooling.

UV seems to be at least somewhat better, but damn - watching pip literally download 20+ 800MB torch wheels over and over trying to resolve deps only to waste 25GB of bandwidth and finally completely fail after taking nearly an hour was absolutely staggering.

SJC_Hacker
Python was not taken seriously as something you actually shipped to non-devs. The solution was normally "install the correct version of Python on the host system". In the Linux world, this could be handled through Docker, pyenv. For Windows users, this meant installing a several GB distro and hoping it didn't conflict with what was already on the system.
ergonaught
> Packaging and deployment will soon be solved problems ...

I hope so. Every time I engage in a "Why I began using Go aeons ago" conversation, half of the motivation was this. The reason I stopped engaging in them is because most of the participants apparently cannot see that this is even a problem. Performance was always the second problem (with Python); this was always the first.

pjmlp
It's the new BASIC, Pascal and Lisp.

Now if only CPython also got a world class JIT, V8 style.

jmward01
This will probably lead to what, I think, python has led to in general: A lot more things tried quicker and targeted things that stay in a faster language. All in all this is a great move. I am looking forward to playing with it for sure.
ryao

  CUDA was born from C and C++
It would be nice if they actually implemented a C variant of CUDA instead of extending C++ and calling it CUDA C.
pjmlp
First of all, they extended C, and with CUDA 3.0 initial support was added for C++; afterwards they bought PGI and added Fortran into the mix.

Along the way, they fostered an ecosystem of compiled-language backends targeting CUDA.

Additionally modern CUDA supports standard C++ as well, with frameworks that hide the original extensions.

Most critics don't really get the CUDA ecosystem.

ryao
They replaced C with C++. For example, try passing a function pointer as a void pointer argument without a cast. C says this should work. C++ says it should not. There are plenty of other differences that make it C++ and not C, if you know to look for them. The fact that C++ symbol names are used for one, which means you need to specify extern “C” if you want to reference them from the CUDA driver API. Then there is the fact that it will happily compile C++ classes where a pure C compiler will not. There is no stdc option for the compiler.
swyx
why is that important to you? just trying to understand the problem you couldn't solve without a C-like
ryao
I want to write C code, not C++ code. Even if I try to write C style C++, it is more verbose and less readable, because of various C++isms. For example, having to specify extern “C” to get sane ABI names for the Nvidia CUDA driver API:

https://docs.nvidia.com/cuda/cuda-driver-api/index.html

Not to mention that C++ does not support neat features like variable sized arrays on the stack.

kevmo314
A strict C variant would indeed be quite nice. I've wanted to write CUDA kernels in Go apps before so the Go app can handle the concurrency on the CPU side. Right now, I have to write a C wrapper and more often than not, I end up writing more code in C++ instead.

But then I end up finding myself juggling mutexes and wishing I had some newer language features.