Creating 100 random matrices of size 5000x5000 on CPU...
Adding matrices using CPU...
CPU matrix addition completed in 0.6541 seconds
CPU result matrix shape: (5000, 5000)
Creating 100 random matrices of size 5000x5000 on GPU...
Adding matrices using GPU...
GPU matrix addition completed in 0.1480 seconds
GPU result matrix shape: (5000, 5000)
Definitely worth digging into more, as the API is really simple to use, at least for basic things like these. CUDA programming seems like a big chore without something higher level like this.> The article is about the next wave of Python-oriented JIT toolchains
the article is content marketing (for whatever) but the actual product has literally has nothing to do with kernels or jitting or anything
https://github.com/NVIDIA/cuda-python
literally just cython bindings to CUDA runtime and CUB.
for once CUDA is aping ROCm:
- San Francisco Python meetup in 2023: https://youtu.be/L9ELuU3GeNc?si=TOp8lARr7rP4cYaw
- Yerevan PyData meetup in 2022: https://youtu.be/OxAKSVuW2Yk?si=5s_G0hm7FvFHXx0u
Of the more remarkable results: - 1000x sorting speedup switching from NumPy to CuPy.
- 50x performance improvements switching from Pandas to CuDF on the New York Taxi Rides queries.
- 20x GEMM speedup switching from NumPy to CuPy.
CuGraph is also definitely worth checking out. At that time, Intel wasn't in as bad of a position as they are now and was trying to push Modin, but the difference in performance and quality of implementation was mind-boggling.For comparison, doing something similar with torch on CPU and torch on GPU will get you like 100x speed difference.
matricies = [np.random(...) for _ in range]
time_start = time.time()
cp_matricies = [cp.array(m) for m in matrices]
add_(cp_matricies)
sync
time_end = time.time()
PSA: if you ever see code trying to measure timing and it’s not using the CUDA event APIs, it’s fundamentally wrong and is lying to you. The simplest way to be sure you’re not measuring noise is to just ban the usage of any other timing source. Definitely don’t add unnecessary syncs just so that you can add a timing tap.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART_...
If I don’t care what part of the CUDA ecosystem is taking time (from my point of view it is a black-box that does GEMMs) so why not measure “time until my normal code is running again?”
print("Adding matrices using GPU...")
start_time = time.time()
gpu_result = add_matrices(gpu_matrices)
cp.cuda.get_current_stream().synchronize() # Not 100% sure what this does
elapsed_time = time.time() - start_time
I was going to ask, any CUDA professionals who want to give a crash course on what us python guys will need to know?So if you need to wait for an op to finish, you need to `synchronize` as shown above.
`get_current_stream` because the queue mentioned above is actually called stream in cuda.
If you want to run many independent ops concurrently, you can use several streams.
Benchmarking is one use case for synchronize. Another would be if you let's say run two independent ops in different streams and need to combine their results.
Btw, if you work with pytorch, when ops are run on gpu, they are launched in background. If you want to bench torch models on gpu, they also provide a sync api.
It's great that parts of pie torch which concern the NVIDIA backend can now be implemented in Python directly, The important part that it doesn't really matter or shouldn't matter for end users / Developers
that being said, maybe this new platform will extend the whole concept of on GPU computation via Python to even more domains like maybe games.
Imagine running rust the Game performantly mainly on the GPU via Python
I'm totally with you that it's better that this took so long, so we have things like PyTorch abstracting most of this away, but I'm looking forward to (in my non-existent free time :/ ) playing with this.
That said, there were almost no announcements or talks related to CPUs, despite the Grace CPUs being announced quite some time ago. It doesn't feel like we're going to see generalizable abstractions that work seamlessly across Nvidia CPUs and GPUs anytime soon. For someone working on parallel algorithms daily, this is an issue: debugging with NSight and CUDA-GDB still isn't the same as raw GDB, and it's much easier to design algorithms on CPUs first and then port them to GPUs.
Of all the teams in the compiler space, Modular seems to be among the few that aren't entirely consumed by the LLM craze, actively building abstractions and languages spanning multiple platforms. Given the landscape, that's increasingly valuable. I'd love to see more people experimenting with Mojo — perhaps it can finally bridge the CPU-GPU gap that many of us face daily!
As someone working on graphics programming, it always frustrates me to see so much investment in GPU APIs _for AI_, but almost nothing for GPU APIs for rendering.
Block level primitives would be great for graphics! PyTorch-like JIT kernels programmed from the CPU would be great for graphics! ...But there's no money to be made, so no one works on it.
And for some reason, GPU APIs for AI are treated like an entirely separate thing, rather than having one API used for AI and rendering.
I’m one of those people who can’t (won’t) learn C++ to the extent required to effectively write code for GPU execution…. But to have a direct pipeline to the GPU via Python. Wow.
The efficiency implications are huge, not just for Python libraries like PyTorch, but also anything we write that runs on an NVIDIA GPU.
I love seeing anything that improves efficiency because we are constantly hearing about how many nuclear power plants OpenAI and Google are going to need to power all their GPUs.
This means that Nvidia is selling a relatively unique architecture with a fully-developed SDK, industry buy-in and relevant market demand. Getting AMD up to the same spot would force them to reevaluate their priorities and demand a clean-slate architecture to-boot.
Khronos also never saw the need to support a polyglot ecosystem with C++, Fortran and anything else that the industry could feel like using on a GPU.
When Khronos finally remember to at least add C++ support and SPIR, again Intel and AMD failed to deliver, and OpenCL 3.0 is basically OpenCL 1.0 rebranded.
Followed by SYCL efforts, which only Intel seems to care, with their own extensions on top via DPC++, nowadays openAPI. And only after acquiring Codeplay, which was actually the first company to deliver on SYCL tooling.
However contrary to AMD, at least Intel does get that unless everyone gets to play with their software stack, no one will bother to actually learn it.
Have you ever used a GPU API (CUDA, OpenCL, OpenGL, Vulkan, etc...) with a scripting language?
It's cool that Nvidia made a bit of an ecosystem around it but it won't replace C++ or Fortran and you can't simply drop in "normal" Python code and have it run on the GPU. CUDA is still fundamentally it's own thing.
There's also been CUDA bindings to scripting languages for at least 15 years... Most people will probably still use Torch or higher level things built on top of it.
Also, here's Nvidia's own advertisement and some instructions for Python on their GPUs:
- https://developer.nvidia.com/cuda-python
- https://developer.nvidia.com/how-to-cuda-python
Reality is kind of boring, and the article posted here is just clickbait.
While its not exactly normal Python code, there are Python libraries that allow writing GPU kernels in internal DSLs that are normal-ish Python (e.g., Numba for CUDA specifically via the @cuda.jit decorator; or Taichi which has multiple backends supporting the same application code—Vulkan, Metal, CUDA, OpenGL, OpenGL ES, and CPU.)
Apparently, nVidia is now doing this first party in CUDA Python, including adding a new paradigm for CUDA code (CuTile) that is going to be in Python before C++; possibly trying to get ahead of things like Taichi (which, because it is cross-platform, commoditizes the underlying GPU).
> Also, here's Nvidia's own advertisement for Python on their GPUs
That (and the documentation linked there) does not address the new upcoming native functionality announced at GTC; existing CUDA Python has kernels written in C++ in inline strings.
Ish. It's a Python maths library made by Nvidia, an eDSL and a collection of curated libraries. It's not significantly different than stuff like Numpy, Triton, etc..., apart from being made by Nvidia and bundled with their tools.
The polyglot nature of CUDA is one of the plus points versus the original "we do only C99 dialect around here" from OpenCL, until it was too late.
No need for a RTX for learning and getting into CUDA programming.
So your lousy laptop GTX 750Ti ehhh probably can’t practically be used for CUDA. But your lousy 1050Ti Max-Q? Sure.
JAX lets you write Python code that executes on Nvidia, but also GPUs of other brands (support varies). It similarly has drop-in replacements for NumPy functions.
This only supports Nvidia. But can it do things JAX can't? It is easier to use? Is it less fixed-size-array-oriented? Is it worth locking yourself into one brand of GPU?
[1]: https://numba.readthedocs.io/en/stable/cuda/overview.html
The crate seems to have a lot of momentum, with many new features, releases, active communities on GH and Discord. I expect it to continue to get better.
I am using Cudarc.
The PEP model is a good vehicle for self-improvement and standardization. Packaging and deployment will soon be solved problems thanks to projects such as uv and BeeWare, and I'm confident that we're going to see continued performance improvements year over year.
I really hope you're right. I love Python as a language, but for any sufficiently large project, those items become an absolute nightmare without something like Docker. And even with, there seems to be multiple ways people solve it. I wish they'd put something in at the language level or bless an 'official' one. Go has spoiled me there.
I'm plenty familiar with packaging solutions that are painful to work with, but the state of python was shocking when I hopped back in because of the available ML tooling.
UV seems to be at least somewhat better, but damn - watching pip literally download 20+ 800MB torch wheels over and over trying to resolve deps only to waste 25GB of bandwidth and finally completely fail after taking nearly an hour was absolutely staggering.
I think software engineers with any significant amount of experience recognize you can build an application that does X in just about any language. To me, the largest difference, the greatest factor in which language to choose, is the existing packages. Simple example- there are several packages in Python for extracting text from PDFs (using tesseract or not). C# has maybe one tesseract wrapper? I recall working with PDFs in .NET being a nightmare. I think we had to buy a license to some software because there wasn’t a free offering. Python has several.
This is VERY important because we as software engineers, even if we wanted to reinvent the wheel sometimes, have very limited time. It takes an obscene number of man hours to develop a SalesForce or a Facebook or even something smaller like a Linux distro.
I hope so. Every time I engage in a "Why I began using Go aeons ago" conversation, half of the motivation was this. The reason I stopped engaging in them is because most of the participants apparently cannot see that this is even a problem. Performance was always the second problem (with Python); this was always the first.
AMD is held back by the combination of a lot of things. They have a counterpart to almost everything that exists on the other side. The things on the AMD side are just less mature with worse documentation and not as easily testable on consumer hardware.