Intro
Transcribe podcast audio to text with the Whisper model
- Install whisper.cpp and download a model (on macOS)
- Fetch the podcast MP3s and convert them to WAV
- Run the transcription
Usage
Install Whisper
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
bash ./models/download-ggml-model.sh large-v3-turbo
make large-v3-turbo
Transcribe
- download.py
import requests
from lxml import etree
import os


def download(url, fpath):
    if os.path.exists(fpath):
        print(f"File {fpath} already exists, skipping download.")
        return
    get_response = requests.get(url, stream=True)
    with open(fpath, "wb") as f:
        for chunk in get_response.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)


if __name__ == "__main__":
    URL = "https://feeds.simplecast.com/OB5FkIl8"
    page = requests.get(URL)
    tree = etree.fromstring(page.content)
    nodes = tree.xpath("//item")
    print(len(nodes))
    nodes.reverse()  # reverse so the oldest episode comes first
    for idx, node in enumerate(nodes):
        enclosure = node.xpath("./enclosure")
        if enclosure:
            mp3_url = enclosure[0].get("url")
            title = node.xpath("./title/text()")[0]
            print(title)
            title = title.replace(" ", "-") + ".mp3"
            title = str(idx) + ":" + title
            download(mp3_url, title)
- to-wave.sh
#!/bin/bash
# Convert every MP3 in the current directory to 16 kHz mono WAV (what whisper.cpp expects).
mkdir -p wavs/
for mp3 in *.mp3; do
    PREFIX="${mp3%.mp3}"
    echo "$PREFIX"
    rm -f "wavs/$PREFIX.wav"
    ffmpeg -i "$PREFIX.mp3" -acodec pcm_s16le -ar 16000 -ac 1 "wavs/$PREFIX.wav"
done
- transcribe.sh
#!/bin/bash
WHISPER_SRC=$HOME/Workshop/whisper.cpp
for wav in wavs/*.wav; do
    wav_path=$(realpath "$wav")
    set -x
    # $WHISPER_SRC/build/bin/main -m $WHISPER_SRC/models/ggml-base.en.bin -f $wav_path --output-txt
    "$WHISPER_SRC/build/bin/whisper-cli" -m "$WHISPER_SRC/models/ggml-large-v3-turbo.bin" -f "$wav_path" --output-txt -t 8 -p 1
done
PyTorch Dev Podcast transcripts
EP1 Binding-C++-objects-to-Python
Hello, my name is Edward, and this is episode one of my podcast about PyTorch things.
I’m not really sure how this is going to work out or where I’m going to go with this, but
for now, the idea behind this podcast is just to, you know, be a casual form for me to talk
about, you know, various aspects of the PyTorch project.
No particular organization.
Today, I want to talk a little bit about how we bind Python to PyTorch.
That is to say, you know, the whole point of PyTorch is to provide an object called a tensor
that people can use.
And, you know, to make this tensor object available from Python, we have to do bindings for it.
And these bindings are actually quite intricate in some sense.
And I want to just explain why it’s not as easy as it seems and talk a little bit about
like how we actually solve this in the project, and some of the work that I’ve been working
on recently.
So what are Python bindings?
Well, let’s imagine that you’re trying to design any sort of, you know, high performance computing
library that has bindings available from a dynamically scripted language.
So if you were just writing a data structure in the language itself, you would probably just
define a class for the object in question in the language itself.
And that would give you something very reasonable.
Now, the problem is, you know, when you’re writing in interpreted languages like Python, all of
the objects need to have a very regular layout.
And it means that, you know, when you want to do something that actually needs to be very
efficient, that needs to actually have some sort of packed layout, typically, the language
itself won’t give you enough facilities to actually define the exact data layout you need.
It’s going to be something that, you know, you have to go to a lower level language, like
C or C++ to do.
So the typical situation for anyone who’s writing a language, sorry, a library in this
situation is you’ll have some sort of data structure, in our case, let’s call this data
structure a tensor.
And in and then you want to somehow make it possible for people to access this data structure
from Python.
So you’ve got two objects in hand, right?
You’ve got this concept of an object in C++ land or in C land, a struct that knows nothing
about Python, per se, because maybe you also wanted this library to be usable by other people
who don’t have Python.
And then you also need to somehow give a representation, a Python representation, that regular Python
programs can understand.
And sort of this split, this split where you want it to work both in a Python agnostic context
and a Python context is where some of the complexity of binding objects in this way comes from.
Now wait, Edward, you might be thinking, hey, you know, I can bind objects to Python.
There’s this cool library called pybind11.
And all I need to do is just take my object, you know, and wrap it up in this magic class_ template.
And then pybind11 goes through all the work somehow of, you know, making it possible to
actually, you know, turn this object into a Python object.
And I don’t know what it really does.
Well, but you know, something happens.
And so I want to talk a little bit about what happens in this case.
And actually, when we talk about a type like tensor, we don’t actually use pybind11 to bind
it, because pybind11 does something very interesting, uses a hash map, and we don’t want to pay the
cost for that.
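For reference, the pybind11 route he is describing looks roughly like this, a minimal sketch with a made-up Tensor struct standing in for the real thing (this is not how PyTorch binds at::Tensor):

#include <pybind11/pybind11.h>
#include <cstddef>
#include <memory>
#include <vector>

namespace py = pybind11;

// Made-up Tensor struct standing in for the real C++ tensor type.
struct Tensor {
  std::vector<float> data;
  explicit Tensor(std::size_t n) : data(n, 0.0f) {}
  std::size_t numel() const { return data.size(); }
};

PYBIND11_MODULE(toy, m) {
  // The "magic class_ template": pybind11 generates the Python type,
  // the conversions, and the lifetime management for you.
  py::class_<Tensor, std::shared_ptr<Tensor>>(m, "Tensor")
      .def(py::init<std::size_t>())
      .def("numel", &Tensor::numel);
}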
So let’s talk about what it means to make a type actually available in Python.
So we’ve got some C++ type, we’ve got some C struct, and we want to make it available to
Python.
So when we’re writing some Python bindings, we need to define a Python layout data structure
that represents the Python object in question.
So remember, Python is an interpreted language, all of the objects have a very regular form,
Python is ref counted.
So one of the things that every Python object needs to have is a header saying what kind
of object it is, and what its reference count is.
So if you like go and look up your CPython, you know, API notes about how to define a new object,
it'll tell you, hey, you know, first define this header, then you can
put in your fields.
And then there's a description of the data type you have to fill in to actually say what the object
in question is.
Okay, that’s cool.
So you can like copy paste some code and get this working.
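To make that concrete, the boilerplate the CPython docs walk you through looks roughly like this (a minimal sketch with illustrative names, not PyTorch's actual THPVariable code):

// Minimal sketch of a CPython extension type, following the standard
// "defining new types" recipe. All names here are illustrative.
#define PY_SSIZE_T_CLEAN
#include <Python.h>

struct ToyObject {
  PyObject_HEAD   // the header: type pointer + reference count, required on every Python object
  double payload; // then your own fields go here
};

// The "description of the data type": tells CPython the name, how big the
// object is, and which slots (allocation, deallocation, methods, ...) it has.
static PyTypeObject ToyType = { PyVarObject_HEAD_INIT(nullptr, 0) };

static bool init_toy_type() {
  ToyType.tp_name = "toy.Toy";
  ToyType.tp_basicsize = sizeof(ToyObject);
  ToyType.tp_flags = Py_TPFLAGS_DEFAULT;
  ToyType.tp_new = PyType_GenericNew;
  return PyType_Ready(&ToyType) == 0;
}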
And then you have a problem, which is that you’ve got this Python object, and it’s not
the same thing as your C struct.
So what do you do?
Well, you could do something like, okay, a Python object is simply a object that contains the C++
object in question.
But this usually isn’t really quite what you want.
Because let’s say that you have a pre existing C++ object, and you want to pass it to Python,
right?
like say I allocated a tensor from C++, and I want to return it from my program, and
actually have, you know, someone in Python make use of it.
If you just put the tensor in the Python object struct directly, well, you need to somehow, you
know, move the data over into this new struct layout that’s got this header that, you know,
Python expects your stuff to have, and you probably don’t want to actually move all of
the data in question.
So you know, the obvious thing to do in this situation is do an indirection, right?
So instead of having the entire, you know, contents of the object stored, you’ll just have a pointer,
right, maybe a shared pointer to the representation in question.
Okay, so that, you know, lets you construct a Python object.
But something very strange will happen if you actually try to run the code in this case.
What will happen is, um, you pass your object to Python, um, you construct one of these Python
objects, you wrap it up, uh, you set the pointer to point to the C++ object in question, and you got
this Python object.
Then, the next time you decide you want to return this Python object, well, okay, um, I need to go
wrap up my, uh, pointer into one of these Python objects and return that.
Notice something has happened.
I’ve actually returned a new object in this situation so that, you know, even though both
of these Python objects point to the same underlying C++ object, um, they’re two different Python
objects.
And if I do something like, you know, A is B, you know, the test for object identity, uh,
in Python, uh, Python will just happily tell me, no, they’re not the same thing, even though
the C++ type is actually the same thing.
So usually when we bind, um, objects that have this notion of, you know, object identity, you
know, usually objects you can mutate like tensors, for example.
Um, we want to also preserve this notion of object identity when we bind them to Python.
And so Pybind 11 lets you bind arbitrary objects to Python, and it also preserves object identity.
And the way it does this is it maintains a giant hash map of all the C++ objects you’ve sent
through it so that the next time you send the same C++ pointer through it, it can look it
up in the hash table and say, oh, this is the Python object that I used last time.
So let me just return that again.
And this is how everything bound with Pybind 11 is going to work.
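Conceptually, that identity-preserving scheme is just a global map from C++ pointer to Python wrapper, something like the sketch below (pybind11's real internals are much more involved and also handle holders, inheritance, and lifetime hooks):

// Conceptual sketch of identity-preserving wrapping via a global map.
#include <Python.h>
#include <unordered_map>

struct Tensor;  // some C++ object we want to expose to Python

// One global table: C++ pointer -> the Python wrapper we handed out before.
static std::unordered_map<const Tensor*, PyObject*> wrapper_cache;

PyObject* allocate_new_wrapper(const Tensor* t);  // defined elsewhere: makes a fresh Python object

PyObject* wrap(const Tensor* t) {
  auto it = wrapper_cache.find(t);
  if (it != wrapper_cache.end()) {
    Py_INCREF(it->second);        // same C++ object => hand back the same Python object
    return it->second;
  }
  PyObject* obj = allocate_new_wrapper(t);
  wrapper_cache.emplace(t, obj);  // remember it for next time
  return obj;
}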
Okay.
Is this setting off performance alarm bells for you?
Because it is for me.
And it, this is kind of not actually, you know, this is not that fast.
And if you, um, really care about making things fast, you don’t actually want to bind your objects
this way, you want something cheaper to actually implement on this.
You want, for example, to just be able to dereference a field on your object to get the Python object
in question.
And so this is what we did for tensor.
So for tensors, we don’t maintain a hash map mapping and a given tensor to its Python object.
Instead, we have a field on the tensor object.
And this field simply points to the Python object in question that we want to return.
So if I want to pass a tensor from C++ to Python, I just read out this field.
If it’s not null, then I, there’s a Python object and I’ll just return that directly.
If it is null, that means it’s the first time I’m actually sending this tensor to Python.
So I can just go ahead and allocate one of these Python objects as I would have done before.
And then I actually, you know, get this object in Python in this situation.
So that, you know, works okay.
And remember that even though, you know, allocating a new object and then setting it to the tensor
seems very thread unsafe, all of our Python interactions are protected by the global interpreter lock.
So actually, you know, Python takes care of all the synchronization for us.
So this works decently well.
And it’s what we do.
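Sketched out, the approach is something like the following (simplified; the real logic in PyTorch lives around THPVariable_Wrap and has more to worry about):

// Simplified sketch of caching the Python wrapper on the C++ object itself,
// instead of in a global hash map. Field and function names are illustrative.
#include <Python.h>

struct TensorImpl {
  // ... sizes, strides, data pointer, etc. ...
  PyObject* pyobj = nullptr;  // non-owning (weak) pointer to the Python wrapper, if any
};

PyObject* allocate_new_wrapper(TensorImpl* t);  // defined elsewhere: makes a fresh PyObject

PyObject* wrap(TensorImpl* t) {
  if (t->pyobj != nullptr) {
    // We already handed out a Python object for this tensor: return the same one,
    // which is what preserves `a is b` on the Python side.
    Py_INCREF(t->pyobj);
    return t->pyobj;
  }
  // First time this tensor crosses into Python: make a wrapper and remember it.
  // The GIL makes this read-then-write safe.
  PyObject* obj = allocate_new_wrapper(t);
  t->pyobj = obj;
  return obj;
}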
One thing that you have to be careful about is this pointer that the tensor object has to
the Python object is non-owning.
Because remember, the Python object needs to keep the C++ tensor live, right?
So it has a strong reference from Python to C++.
If the C++ object also had a strong reference to the Python object, you’d have a reference loop.
And that’s bad because when you have a reference cycle in a refcounted language, the objects will
never actually get deallocated.
So strong reference from Python to C++ because, you know, if you’ve got a Python object, you
better have a C++ tensor backing it.
And C++ tensor to Python is a weak reference.
Those of you who are thinking ahead might realize that there is a problem.
And the problem is this.
Because the reference to the Python object is weak, if I only have strong references to
the C++ object and I have no more references to the Python object, then the Python object
will actually be dead and it will get garbage collected by the CPython interpreter.
So that’s not so great.
And, you know, you kind of are wondering, well, what about this stale PyObject pointer in this
case?
Well, fortunately, we can actually define what the destructor for Python tensor object should
be.
So we just say, oh, clear out the PyObject field from the tensor when this happens.
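Continuing the toy sketch from above, the destructor side of the scheme is roughly this (illustrative; a real tp_new/tp_dealloc pair also has to placement-construct and carefully destroy the C++ members):

#include <Python.h>
#include <memory>

struct TensorImpl {            // same idea as the sketch above
  PyObject* pyobj = nullptr;   // weak (non-owning) pointer back to the Python wrapper
};

struct THPTensorLike {
  PyObject_HEAD
  std::shared_ptr<TensorImpl> cdata;  // strong reference: Python keeps the C++ tensor alive
};

static void THPTensorLike_dealloc(PyObject* self) {
  auto* obj = reinterpret_cast<THPTensorLike*>(self);
  if (obj->cdata) {
    obj->cdata->pyobj = nullptr;  // this PyObject is about to die, so the back-pointer is stale
  }
  obj->cdata.reset();             // drop the strong Python -> C++ reference
  Py_TYPE(self)->tp_free(self);   // free the Python object itself
}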
But this does mean that something very strange can happen in this situation.
Namely, if you have a tensor and you send it to Python and then at some point all the Python
references are dead, the next time you send it to Python, you will get a completely distinct
object.
Now, granted, it’s kind of difficult to notice when this has happened because, well, the old
object isn’t around because you promised that you weren’t going to have any references to
it.
But, you know, if you, like, for example, took the ID of the object, the ID would be different
between the two versions.
And more importantly, and one of the reasons why I’ve recently been working on a patch to
change this behavior, if you actually had some Python data stored on the tensor, for example,
all objects in Python, you know, you can add arbitrary attributes to them after the fact
using the __dict__ attribute.
Well, if you went ahead and added a bunch of these things to the tensor and then expected
once you saved it in C++, for example, if you were saving it for backwards, one of the most
common cases when we’ll save a tensor in C++ and it will outlive its Python equivalent, you
won’t get that information when it pops back out into Python.
And we have a bug tracking this issue and people don’t really like it, although it’s, you know,
it’s kind of hard to solve a problem like this.
So next time, I want to talk a little bit about how we are going to solve this.
And it’s actually pretty nifty.
It’s using a trick that Sam Gross, one of the original PyTorch developers, came up with.
And I’m eager to share it with you next time.
See ya.
EP2 History-and-constraints-of-the-dispatcher
Hi, my name is Edward, and welcome to the PyTorch Dev Podcast.
Today, I want to talk a little bit about the history and motivations
behind one of the sort of more intricate pieces of PyTorch Core, the dispatcher.
Now, what exactly is the dispatcher?
Well, the dispatcher is basically the code that when you call a function,
like when you call at::add or you call a method on a tensor,
it figures out where you actually want to call it.
I’ve done a few talks about the dispatcher in the past,
and I also have a blog post talking about how the dispatcher works.
And today, I want to do something a little different.
So if you want to learn more about those aspects of the dispatcher,
I recommend you go check out those posts.
Instead, what I want to do is I want to do a little historical story
about how the dispatcher came to be and what the various constraints and features we needed
played out over time to make it into the system that it is today.
So to talk about the dispatcher, we first need to talk about the time before the dispatcher.
So before the dispatcher existed, and before ATen existed, PyTorch was built off of this library
called TH. And TH itself wasn’t written when PyTorch was written, instead it itself
came from a further back library called LuaTorch, which was basically the torch libraries like TH and THC
bound to the Lua programming language.
So when Adam Paszke and co. wrote the first version of PyTorch, what they did was they just took
all of the old school TH and THC libraries and wrote bindings for them for Python.
And they also wrote an autograd system and data parallel support.
But binding these torch libraries, which previously could only be called from Lua to Python, was sort of the first step on the journey here.
So to understand how these bindings worked back in the day, it’s important to understand a little bit about how TH used to be constructed.
As you know is the case today, a tensor library involves a lot of different operations, and each of these operations needs to be implemented for every D-type you want to support.
So if you talk about an operation like add, it needs to be implemented for floats, and doubles, and integers, and 32-bit integers, and 8-bit integers, and so forth and so forth.
TH was written in C. And if you’ve ever written any C before, you may know that C doesn’t really have any facilities for actually parameterizing over different D-types.
So the way that they solved this problem was they were like, “Okay, we’re going to define a file.
We are not going to talk about a float or a double. We’re just going to talk about some abstract type.”
And then we will just include this file eight times with different settings of various macros to stamp out each version of the file.
So if you talk about a function like add, we would have a TH_float_tensor_add and a TH_double_tensor_add and so forth and so forth.
So there’d be like eight functions, and you know, at the Python binding level, what they did was they wrote some generated code, which basically was like, “Hey, you know, what’s the input tensor?”
“Oh, it’s a floating-point tensor. Okay, I’m going to call TH_float_add in this case.”
So it would just be the switch statement of all the various different dispatch types, and that’s how things were for a pretty long time.
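If you have never seen that "stamp the file out once per dtype" trick, here is a toy reconstruction of the pattern (illustrative only; the real TH sources were organized differently and covered many more dtypes):

/* generic_add.h -- the "abstract type" file: written once, stamped out per dtype. */
#ifdef SCALAR_T
void ADD_FN(SCALAR_T* out, const SCALAR_T* a, const SCALAR_T* b, long n) {
  for (long i = 0; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
#endif

/* th_add.c -- include the generic file once per dtype with different macro settings. */
#define SCALAR_T float
#define ADD_FN TH_float_tensor_add
#include "generic_add.h"
#undef SCALAR_T
#undef ADD_FN

#define SCALAR_T double
#define ADD_FN TH_double_tensor_add
#include "generic_add.h"
#undef SCALAR_T
#undef ADD_FN

/* ...and so on for the remaining dtypes. The generated Python-binding layer
   then picks the right stamped-out function with a switch on the runtime dtype: */
enum DType { DTYPE_FLOAT, DTYPE_DOUBLE /* , ... */ };

void add_dispatch(enum DType dtype, void* out, const void* a, const void* b, long n) {
  switch (dtype) {
    case DTYPE_FLOAT:
      TH_float_tensor_add((float*)out, (const float*)a, (const float*)b, n);
      break;
    case DTYPE_DOUBLE:
      TH_double_tensor_add((double*)out, (const double*)a, (const double*)b, n);
      break;
  }
}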
And about the time I joined Facebook, we were sort of trying to figure out what to do about the internals of PyTorch.
And one of the things that was happening was that, you know, we had just bound the Torch library, and everything else was written in Python.
And it turned out that PyTorch was kind of slow. And Sam Gross did some measurements and found that, you know,
the reason why PyTorch was slow was because too much of it was written in Python. And so what we wanted to do was we wanted to port everything into C.
But not actually C, because writing this TH code with its, you know, macros being stamped out eight times was actually pretty horrible.
So what we actually wanted to do was write some C++. And during this time, Zachary DeVito came up with this idea.
“Oh, all we want is a simple tensor library that gives us a tensor type in C++, just like the tensor type you would have in Python,
with all the stuff you want. And then it’ll be easy to port all this stuff from Python to C++,
because we’ll just use this tensor type and write the stuff we want in this case.”
So Zach sort of, it’s really funny, like the way ATen got written was I think Zach locked himself in a room
for two weeks. And at the end of two weeks, ATen was created. And Zach went through a bunch of different
designs. He actually, I remember we were chatting about this and he was like, you know, I’ve gotten to this
point and I don’t know if I should implement multiple dispatch or not. And we like talked about some of the
pros and cons. And in the end, he didn’t decide to do that. And so what Zach did was in order to figure out
which implementation of a particular D type you wanted to go to, instead of having one of these
if statements, we were going to have a virtual object because this is C++ and C++ is all about objects
and it’s all about virtual methods. So the concept was every tensor had a type object associated with it.
By the way, the term type still shows up in various parts of the code base, even though these type objects
no longer exist. But what the type object was, was it had virtual methods for every single operation
you could imagine doing on a tensor. Adding, subtracting, sigmoid, whatever, you name it, it was there.
And so every tensor would have a pointer to a type object that implemented all of the things you wanted
for the object in question. And so to actually call an add on a tensor, you would instead go and the
implementation of the method on the tensor object would instead go call the add on the type object
attached to the tensor. And that would do a virtual call to actually get to the real implementation in question.
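In code, the shape of that design was roughly the following (a heavily simplified sketch; the real Type class was code-generated and had hundreds of virtual methods):

// Rough sketch of the old "Type" virtual-dispatch design. Illustrative only.
struct Tensor;  // the user-facing handle, forward-declared

struct Type {
  // one virtual method per operation: add, sub, sigmoid, ...
  virtual Tensor add(const Tensor& a, const Tensor& b) const = 0;
  virtual ~Type() = default;
};

struct Tensor {
  Type* type_;  // which backend/dtype implementation this tensor uses
  // ... pointer to the actual storage, sizes, strides ...

  Tensor add(const Tensor& other) const {
    // The method on the tensor just forwards to the type object: a virtual call.
    return type_->add(*this, other);
  }
};

struct CPUFloatType final : Type {   // lives in the CPU library
  Tensor add(const Tensor& a, const Tensor& b) const override;
};

struct CUDAFloatType final : Type {  // lives in the separately loaded CUDA library
  Tensor add(const Tensor& a, const Tensor& b) const override;
};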
Why did Zach do it this way? Because, you know, if you have done any object oriented programming,
a really normal way to design an object hierarchy in a situation is, oh, I got a tensor super class,
and I’m going to inherit a float tensor from it and an int tensor from it, and so forth and so forth.
So there are a few reasons for this. So one is that Zach really wanted tensor to be what we think of as a pointer type.
So let’s think about in, in Python, if I have a tensor and I say y equals, I have a tensor named x,
and they say y equals x, then I want y to actually refer to the same memory and the same tensor really
as x, right? I don’t like make a copy in this situation. We don’t pass objects by value in Python,
they get passed by reference. Although some PL people would take offense to me calling it that way.
But in any case, you know, assignment and passing things to parameters, they preserve object identity.
You don’t create new versions of the object every time you do that. In C++, you have to actually,
you know, say what you want your object to do. So if you just define a tensor class with a bunch of fields
for sizes and strides and so forth, then if you pass that class by value to somewhere else,
you will in fact copy all those fields when you get there. And that’s not at all what the Python
semantics are. So tensor has to be some sort of class, which doesn’t do this. And so we need to not,
we wanted tensor to actually work like the Python semantics. And so you can’t actually just subclass
from tensor directly, because that just doesn’t work at all. Like, that’s not how C++ classes work.
So another reason why Zach wanted a virtual dispatch rather than an if statement was because of the fact
that CUDA support was this like sort of separate thing that was optional, you didn’t have to, you know,
have a version of PyTorch with CUDA, you could instead link against the dynamic library that provided CUDA
support. And then that would actually let you, you know, get all the CUDA functionality, but you could
also not link against that library, and you’d only get the CPU support. So you had these libraries living in
two different dynamic libraries. And if you’ve ever tried to write some code with multiple libraries,
you might know that you can’t actually call a function in another library, unless you depend
on that library. And the way things were set up is the CUDA library depended on the CPU library,
but not vice versa. So if you’re in some CPU code, and you call this function, and actually the tensor
turns out to be a CUDA tensor, you need to figure out how to actually get to the CUDA library. And the only
way you really can do that is via via a virtual call. The types provide the virtual call, they worked
pretty well, it was pretty fast. And we were happy for a while, until the next thing came along. So the
next thing that came along was that, you know, we had this pretty cool ATen concept, there are all these
operators, they all lived on the type object, and someone came up to us and they were like, “Hey, I want to
define my own operator on top of the tensor class.” And I’d like, you know, like, I’d like to define
tons and tons of custom operators, actually, because I’m Facebook, and I’ve got, you know, various very
specialized use cases that I don’t have a general purpose operator for, but I still want to implement.
And this type class, right, with all these virtual methods on it, there’s a problem. You can’t retroactively
add more virtual methods to a class. Okay, sure, you can inherit from the class, but you can’t actually,
um, but like, you have to, like, inherit each time you do it and make sure you inherit from the thing you
inherited from previously. And this clearly is untenable if you’ve got, you know, 20 different
people saying, “Hey, I want to add my own extra operator in this situation.” And it was actually
kind of important to make sure that people register directly inside the type object, because remember,
we also have this feature in PyTorch called autograd. And so actually, when you call a type object,
you’re not necessarily calling into the CPU implementation directly. In some situations,
you might call to the autograd implementation that has something different, and then eventually you’ll
call into the CPU type afterwards. So this need for open registration meant that it wasn’t really tenable
to keep using virtual tables. Virtual tables are a sort of marvel of C++ design, but one of the reasons
why they can be implemented the way they are implemented is because you’re not allowed to
add more methods onto them after the fact. And we wanted to be able to load up extra libraries,
add new methods to them, and do that. And this is when the dispatcher sort of in its modern incarnation
came into being, right? So the idea behind the dispatcher is, okay, we are not going to, um,
we’re not going to let C++ handle the V table layout for us. Instead, we’re going to re-implement
the V table ourselves. And furthermore, instead of having all of the virtual methods for all operations
laid out into a single table, in which case, like, it’s not at all clear, like, um, how to add more
things to the table, we’re just going to maintain separate tables per operator. So that when you call an
operator, you know, you call add, you’re like, okay, um, uh, I’m going to go look at the add dispatch table.
And, uh, it’s going to tell me how to go to CPU or CUDA because we want a lot of operators, open
registration of operators, but for different backends like CPU and CUDA, those get added way less frequently.
And, um, yeah, and that sort of brings the dispatcher into sort of, it’s, um, you know, a relatively modern form.
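A toy model of that per-operator table, folding in the key-set trick for multiple dispatch that comes up a bit later in the episode, might look like the sketch below (this is an illustration of the idea, not the real c10::Dispatcher, which also deals with schemas, boxing, fallbacks, and thread safety):

// Toy model of the dispatcher: one dispatch table per operator, keyed by
// dispatch keys, instead of one giant vtable holding every operator.
#include <array>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

enum class DispatchKey : uint8_t { CPU = 0, CUDA = 1, Autograd = 2, NumKeys = 3 };

struct Tensor {
  uint64_t key_set = 0;  // bit set of DispatchKeys this tensor carries (CPU, CUDA, Autograd, ...)
  // ... data pointer, sizes, strides ...
};

using Kernel = Tensor (*)(const Tensor&, const Tensor&);

struct OperatorEntry {  // e.g. "the add dispatch table"
  std::array<Kernel, static_cast<std::size_t>(DispatchKey::NumKeys)> table{};

  // Open registration: a separately loaded library can drop its kernel in here.
  void register_kernel(DispatchKey k, Kernel fn) {
    table[static_cast<std::size_t>(k)] = fn;
  }

  Tensor call(const Tensor& a, const Tensor& b) const {
    // Multiple dispatch via the bit-set trick described below: OR together the
    // key sets of all tensor arguments, then go to the highest-priority key
    // that is set (here, simply the highest bit index).
    uint64_t keys = a.key_set | b.key_set;
    for (int i = static_cast<int>(DispatchKey::NumKeys) - 1; i >= 0; --i) {
      Kernel fn = table[static_cast<std::size_t>(i)];
      if ((keys & (1ull << i)) && fn) return fn(a, b);
    }
    throw std::runtime_error("no kernel registered for these dispatch keys");
  }
};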
There’s some things we added after the fact. Um, for example, uh, we wanted the ability to do multiple
dispatch. So the, the request for multiple dispatch, um, came from a few places. So one case where sort of,
we’d always known this was a bit of a problem is, um, we have support for sparse tensors in PyTorch.
And so you have this interesting problem, which is that, uh, if you’ve got a dense tensor and a sparse
tensor and you add them together, you want to send this to the sparse kernel because the sparse kernel
is what is going to actually know how to deal with the sparse tensor. But in the initial implementation of
the, um, type objects that did dispatching, um, we always looked at the type of the first object
to figure out where to go. And since the first object’s a dense tensor, we go to the dense implementation
and then you have to do some extra tests to see if things are sparse and route them to the right,
right direction. Multiple dispatch would let you change the behavior of dispatch for, um, uh, to respect
the, um, arguments of multiple, uh, to, to respect the types of multiple tensor arguments. So if you had
a dense and a sparse, okay, actually that means I should go do something else, not just, you know,
blindly look at the first argument. And, uh, Zach and I were talking about, um, how to like implement
multiple dispatch quickly, um, during the fair offsite in Montreal. That’s like a few years ago.
And Zach was like, Hey, you know, here’s how you could do it, right? You could maintain a, a set of
keys, a bit set of keys representing all of the things represented by a tensor under some ordering
saying which one you wanted to go to. And then if you want it to do multiple dispatch, all you needed
to do was bitwise or all of these fields together, and then just pick out what the leftmost bit on the
resulting, uh, key was to like get the, you know, highest priority dispatch key you want to dispatch
in this case. You didn’t have to like do any like, okay, looping over the arguments, looking for the
right one. It’s just do this bitwise or extract out the first bit and, and you’re done. And this
basically served as the basis for the multiple dispatch implementation that is in PyTorch today,
where you have a bunch of dispatch keys. They have a priority and we always dispatch to the highest
priority key. These semantics came out because, you know, we had an idea about how to implement them
efficiently. Uh, similarly, um, the work on automatic boxing came out of this problem, which is that,
okay, uh, you know, we have all this, we have all these operators, we made operators extensible,
and then we suddenly had a problem, which is that we couldn’t easily write code that was generic over
all operators. Previously, the way we did this was we had a code generation phase, which, you know, knew
about all the operators in PyTorch and was able to just write, you know, specialized C++ code for each
one. But once we like open the gates up to let people register whatever operators they wanted,
there were all these operators leaving outside of our repository, which the code generation knew nothing
about and which, you know, we then couldn’t really generically program in any reasonable way.
And so if the code gen doesn’t know about it, well, C++ doesn’t know about the
kernels in question. And so, uh, Sebastian Messmer, um, did this sort of years long project of sort of
making sure that all objects, uh, all functions, even if they were registered outside of the dispatcher
could via templating magic actually be generically programmed over. And so the, the technology of
backend fallback, which sort of only recently went to stable, is based on this. So the dispatcher today
is pretty complicated. There’s a lot of features that it supports, but, um, you know, if you sort of look
through the history, you can see, you know, there were various design constraints that got us where
we were today. The design constraint of letting, you know, CPU and CUDA live in different dynamic
libraries, the design constraint of open registration, and even, you know, the design constraints of allowing
for multiple dispatch or automatic boxing. So these days, you know, the dispatcher has a lot of features.
You can do a lot of things with it. And it’s also a little slow. Unfortunately, we’ve tried to make
it faster, but it’s certainly a lot faster than if you were doing all of this in Python. And, um, I don’t
know, um, the next time you have some project and you’re wondering, oh, why is the dispatcher this way?
Just think about the constraints. It’s a really useful way to reason about things. Thank you all for listening. See you all later.
EP3 Dynamic-library-structure
Hi, my name is Edward, and welcome to today’s episode of the PyTorch Dev Podcast.
Today, I want to talk a little bit about someone’s, or perhaps anyone who is a software architect’s
favorite subject, the library structure in PyTorch. Now, what do I mean by the library
structure in PyTorch? Isn’t PyTorch just one library that everyone uses? Well, that’s true
in one sense, in that, you know, we distribute a single PyTorch wheel that people use and think of
as one unit, but internally in our library, PyTorch is actually split into multiple separate
dynamic libraries, at least in open source, but this is also true inside our internal build system.
It’s split into multiple different libraries, you know, ranging from C10, ATen core, ATen,
Torch, Torch Python, and, you know, each of these libraries is, you know, a proper unit of encapsulation
and means that you can’t, for example, willy-nilly depend on something from Torch
Python from C10. If you’re not very familiar with, you know, what people are using these
libraries for, you might think that this is just a whole waste of time, right? Like, you
try to write some code, you put it in some folder, and then you have to decide which folder you’re
going to put it in, and then it turns out you put it in the wrong folder, and you’ve got to, like,
move some stuff around to make everything work out. It can really feel like a waste of time for no good
reason. And some of the library structure in PyTorch is vestigial, and, you know, really shouldn’t be
there, and we should, you know, reconsider how it’s actually set up. But a lot of the libraries in PyTorch
exist for some good reasons, and in today’s podcast, I want to explain what the reasons behind the library
split in PyTorch are. And hopefully, that will help you also think about how to better structure your code
so that you don’t accidentally, you know, violate one of these abstraction boundaries.
So, principle one that I would say about dynamic library, you know, structuring in general, like just
how you decide to set up libraries, is that for any major dependency you might have,
it’s usually a good idea to give it a separate library. So a good example of this is CUDA. CUDA is a
really honking big dependency, right? Like, you’ve got to actually have NVIDIA’s CUDA runtime libraries,
and then there’s, you know, actually a whole bunch of code in PyTorch that only really makes sense when
you’re running on a system that has a GPU. We offer CPU-only builds of PyTorch, which don’t have any CUDA
bits for people who don’t have GPUs. And the point of this is that, you know, many people don’t want CUDA,
and so there should be a way to use PyTorch without having to actually drag in all of CUDA.
And if you had PyTorch as one single giant library in the situation, that wouldn’t work. You’d, you know,
have to always get in the CUDA dependency.
Well, you might say, hey, Edward, you know, isn’t the normal thing in open source to give you a bunch of
configure flags, and you just ask for which features you want? And the answer is yes, that’s true. Like,
if you’ve ever built Python from source, for example, there’s a whole bunch of flags you can toggle on and off.
But if you’re actually working in, say, a Linux distribution, or you’re working inside FB code,
typically, it’s frowned upon to recompile the same piece of software multiple times with different flag
settings. Because, well, you know, how are you going to distinguish between all these different versions?
So when you’re in a situation where you can only ever build some piece of code once, well, you had
better not, you know, you’d better find some other way besides ifdef-ing to split things out. And so
in PyTorch, we have an ATen CPU library that has all of our CPU kernels, and we have an ATen CUDA library
that contains all of our CUDA kernels. And so if you’re, say, in Buck, and you want to depend on a
library, but you don’t want any of the CUDA functionality, there is actually a dependency
you can depend on, the CPU-only dependency, that will prevent you from bringing in all your CUDA code.
So if you look at another really important library, TorchPython, this one is also split off from
LibTorch. And why is it split off? Well, because LibTorchPython has a dependency on the CPython API.
And there’s plenty of situations when you are, you know, doing a C++-only application,
and you don’t actually want to have the dependency on Python. So that’s principle one. Whenever there is a
major dependency, there is probably a library split lurking nearby.
Principle two is sort of related, but more of an internal concern, which is that you want to split
so that you can use what you need. So what do I mean by that? Well, in many situations, binary size is at a premium,
and you don’t want to actually ship code that you don’t actually use.
So, you know, honestly, principle one is sort of the extreme version of this where the, you know,
thing you’re not using is a giant, you know, honking blob of code that is from someone else.
But, you know, PyTorch is also big in and of itself. And we don’t want to necessarily use code in PyTorch,
if you know, we don’t need it, we don’t want to actually put things in, if you don’t need it.
And so similarly, parts of PyTorch are split in this way, so that we can actually distribute these things
without all of the functionality in question. So one good example of this in PyTorch is the split
between ATen core and ATen. Although this split is a little historical, because mobile is deciding to
ship more and more stuff. In the beginning of the project, there was only a very limited subset of
functionality that needed to be shipped on mobile. And so when you know, we, when we wanted to actually
put PyTorch into production, we wanted to actually merge the Caffe2 and PyTorch code bases, we needed to
find a way to like, put in the code that we wanted on mobile in one place, and all the code that you
know, wasn’t relevant to mobile in some other place. And that’s why ATen is split into ATen core and ATen.
ATen core is the stuff that’s relevant for mobile, and ATen is everything else that you know, you might
not be so interested in. I say the split is a little historical, because as time has gone on, and mobile
has gotten more and more features, it turns out that ATen does provide a bunch of stuff that mobile wants.
But in the beginning, it didn’t. And ATen core is this sort of minimal version that, you know,
is generally applicable and takes up less binary space than all of ATen. Another good example of this
is the torch and ATen split. So ATen is short for “a tensor library”, and originally it was conceived of
as just a way to do PyTorch code. Like, you know, you want to do an add, okay, ATen will tell you how to add
two tensors together. Whereas torch is the lag, the library that actually gives you all of the sort of
neural network functionality. So it knows how to do automatic differentiation, it knows about NN modules,
all that good stuff. And so once again, if you’re in a situation where you don’t actually care about
doing AD, you don’t care about doing neural networks, you just need a way to do some tensor computations,
well, the split between ATen and torch means that you can just use ATen in that situation.
So that’s principle two, which is split on what you need, a more, you know, sort of internal version
of split on dependencies. And principle three is kind of a cop out, but it’s really important, which is
we split our libraries for technical reasons. That is to say, sometimes there is no way to actually ship
PyTorch unless we actually have things split in some particular way. Let me explain one particular
example. So a very, um, a sort of rite of passage for any new developer on PyTorch is writing a new
function and forgetting to slap a TORCH_API macro on it. You’ll get a very obscure linker error saying,
hey, you know, I have no idea what this symbol is, even though, you know, like it compiled fine and the
symbol is there, what the heck’s going on? So why does this macro exist in the first place? Well, this macro exists
because of something very interesting. So I have to take a brief detour to explain. So
when we write dynamic libraries, we have to specify what symbols we actually expose as opposed to private
symbols, which aren’t available to external users. That kind of makes sense. And if you’re writing a,
you know, standard Linux library, you usually just expose everything. Like you don’t really care about
a very much hygiene in this case. But on Windows, there’s actually a problem, which is the Windows DLL
format only allows for about 65,000 exported public symbols. Now 65,000 would be a lot of cookies to eat,
but as far as symbols go, it’s nothing. And any, you know, self-respecting project is going to quickly
hit this limit. So on Windows, because of this limitation, people tend to be a lot more careful
about what actual symbols they put in their libraries. So you have to actually say, you know,
what symbols you want. And if you, you know, if there’s a symbol that you don’t want, you just
don’t make it public. So on Windows, we have hidden visibility by default, and you must explicitly export
a symbol you want to. And guess what macro does that? Well, that’s the TORCH_API macro.
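The macro itself is essentially the classic export/import dance. Here is a simplified sketch of the pattern (the macro and define names are illustrative; PyTorch's real versions live in the c10 macro headers and cover more configurations):

// Simplified sketch of a TORCH_API-style visibility macro. Names are illustrative.
#if defined(_WIN32)
  #if defined(BUILDING_MY_TORCH)       // illustrative define, set while compiling the library itself
    #define MY_TORCH_API __declspec(dllexport)  // put the symbol in the DLL export table
  #else
    #define MY_TORCH_API __declspec(dllimport)  // consumers of the DLL import it
  #endif
#else
  // On ELF platforms, if the library is built with -fvisibility=hidden,
  // public symbols must be explicitly marked default-visibility.
  #define MY_TORCH_API __attribute__((visibility("default")))
#endif

// Forgetting the macro on a function you call from another library is what
// produces the obscure "unresolved symbol" linker error mentioned above.
MY_TORCH_API int my_public_function(int x);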
Okay, that’s cool. But what does that mean? Well, remember, the symbol limit still applies. Just using
the TORCH_API macro doesn’t mean that, you know, you’re not continuously adding more and more symbols.
And it turns out that the consolidated PyTorch, ATen, and ATen CUDA libraries go over the Windows
symbol limit if you put them together. So no, we cannot ship PyTorch unless these libraries are separate,
so that we are under the public symbol limit. Another example of a technical reason requiring
us to actually keep the libraries split is for mobile. So mobile, mobile started off, you know,
just having a small dependency on ATen. But eventually, they actually needed operators. But there’s a problem,
right, which is that ATen has a ton of operators, and mobile doesn’t really want most of them. Like,
there’s only a few operators that are actually used by models in practice, and they’d much rather prefer
to only ship those operators. So mobile has some very complicated system for recompiling PyTorch so
that, you know, only the operators they care about are compiled for any given library.
Okay, that’s cool. What do I recompile in this case? Well, library split comes to the rescue.
Because we have all of our CPU kernels in a separate library, ATen CPU, that’s the only library that needs
to get recompiled on a per-app basis for mobile. ATen itself, which just contains, you know, common code
that’s used everywhere, doesn’t need to get recompiled in this situation. So, you know, having the library
split in this way made it easier for mobile to do selective build. And if you ever propose merging these
things together, well, you’d better have an answer for what you’re going to do on the mobile side.
So, what are the principles behind PyTorch’s library split? Well, you know, whenever there’s a major
dependency, that usually means there’s going to be a library split. We split because that lets us,
you know, let people use code that, you know, use what you need. You know, we don’t go to the,
you know, extreme with this because it’s very hard to deal with lots and lots of itty-bitty libraries.
But like for major partitions of functionality, there will be a library split usually in that
situation. And finally, there are a bunch of weird-ass technical reasons like, you know,
Windows and mobile that also require us to split things in this way.
Okay, so that’s why we have so many libraries in PyTorch. Some of the libraries probably can get
merged together, like ATen core and ATen probably can be merged together. C10 probably could be moved
into ATen, except there’s this funny business with our AMD ROCm support where HIPification works
differently in one case or another. Yeah, it’s complicated. There’s a lot of things that sort of
have created over time. But, you know, usually if you’re running into a library problem,
the best fix is not to actually like rage against the library structure in PyTorch. It’s just to do
a few simple things to, you know, sort of unblock yourself. So what are those things? So one thing you
can always do is sometimes some code is put in the wrong place. And so you just need to put the code
in the right place, right? Just move a file around. Yeah, I know it’s annoying. You can always put a little
stub in the old location so that you don’t have to update all the includes. But, you know,
oftentimes just moving a file to the appropriate place because, you know, whoever put it there
originally didn’t think too hard about it. That often will solve a problem you have.
Of course, sometimes you do need to break layering, right? Like sometimes you need to be able to call
into some code in, say, Torch when you’re inside C10. And there no amount of moving files around will save
you. And so there’s another trick that’s, you know, sort of used very commonly in the code base, namely making a virtual
interface so that you can call into the, you know, higher level library layer from a lower level library layer.
So one really good example of this is device guard. Device guard works by having a device
guard interface for every implementation of the device guard. And so if you’re in a situation where you
don’t necessarily know if you have access, direct access to the library in question, you can use device guard
and it will do a virtual jump to the actual implementation, which might be CUDA, to actually
get the functionality that you want. Of course, if you’re actually in the CUDA library, you don’t have
to do this virtual jump. And so there’s actually a specialized version of device guard called CUDA guard,
which lets you do exactly this when you don’t need to violate the layering.
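The general shape of that trick, a low-level library declaring an interface that a higher-level library registers an implementation for at load time, looks roughly like the sketch below (illustrative names throughout; the real mechanism is along the lines of c10's DeviceGuardImplInterface and its registry):

// Sketch of the "virtual interface + registration" trick for calling from a
// low-level library into functionality that lives in a higher-level one.
#include <cstdio>

// --- lives in the low-level library (e.g. c10): ---
struct AcceleratorHooks {
  virtual void set_device(int index) const = 0;
  virtual ~AcceleratorHooks() = default;
};

inline const AcceleratorHooks*& registered_hooks() {
  static const AcceleratorHooks* hooks = nullptr;  // filled in only if the CUDA library is loaded
  return hooks;
}

inline void set_device_somehow(int index) {
  if (const AcceleratorHooks* h = registered_hooks()) {
    h->set_device(index);  // virtual jump into whichever library registered itself
  }
}

// --- lives in the high-level (CUDA) library, which depends on the low-level one: ---
struct CudaHooks final : AcceleratorHooks {
  void set_device(int index) const override { std::printf("cudaSetDevice(%d)\n", index); }
};

static const CudaHooks cuda_hooks;
static const bool registered = (registered_hooks() = &cuda_hooks, true);  // runs when the library loads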
So that’s all I wanted to say about library structure today. Thanks for listening. See you next time.
EP4 Vectorization
Hello, and welcome to the PyTorch Dev Podcast.
My name is Edward, and today I want to talk about vectorization.
Vectorization is a very important component of any self-respecting deep learning,
or really any numeric computing library that lives on CPU.
But sometimes it has a bit of a reputation for being this very mysterious, very magical thing.
You know, numerical codes go into compiler, vectorized instructions come out,
and, you know, you’re not really meant to know how exactly the sausage is made.
Well, actually, you know, vectorization isn’t that magic,
and today I want to talk a little bit about how we make use of vector instructions in PyTorch,
on what vectorization is, and some of the sort of tips and pitfalls
associated with vectorization in the code base.
So what is vectorization?
Well, imagine that you’re doing some computation on your CPU.
Normally, the way a CPU works, and what you learned in your architecture class,
is you have a bunch of instructions.
You feed the instructions into the CPU, and the CPU goes ahead and does the things that you ask it to do.
So, for example, if you, you know, want to do an add, you tell the CPU,
hey, I want to add this number and this number together from these two registers,
and the CPU will go ahead and do that for that single instruction.
Now, as you might imagine, when we’re doing numeric computing,
we don’t have just one number.
We have a lot of numbers, and we want to do the same thing to all of these numbers.
And that’s where vector instructions come in.
Vector instructions are a form of what we call SIMD parallelism.
That’s SIMD, single instruction, multiple data, where instead of giving your CPU an instruction
to do an operation on a single piece of data, you can give your CPU an instruction to work on multiple pieces of data.
That’s why it’s called vectorization, because you’re working on a vector of numbers rather than one number.
So, when you want to write some vectorized code, you have a bunch of these vector registers,
which are larger registers than you’d normally be able to use to do various computations,
the idea being you, like, fit in multiple numbers into these registers,
and then you have a whole pile of new instructions to do things like add,
but not just add one number, but add all of the numbers in your vector registers.
And the vector instructions are actually pretty simple,
And so if you wanted to, you know, go and learn how to, like, you know, write some vectorized code by hand,
all you’d have to do is really pull up the Intel manual or, you know, whatever, you know, manual for whatever processor you wanted to do,
and, like, just look and find which instructions you wanted to do.
Or you could use a library like Sleaf, which already provides pre-vectorized instructions for you.
Or you could even just, you know, write some code and hope that your compiler’s auto-vectorizer handles it for you.
You just, you need to pass a flag, like -mavx, and it will try its best to vectorize your code for you.
So on Intel CPUs, which are the CPUs that most people are using,
the vector instructions are called AVX, which stands for Advanced Vector Extensions.
And there’s a bunch of different versions of AVX, basically, because over the years, Intel was like,
Ah, you know, we only really want to do vector operations on two pieces of data.
So here, have an extension that does that.
Actually, that was called SSE.
And then over time, they gave more instructions, bigger vector registers, and more and more features.
And so as time went on, you know, they released AVX, then AVX2, then AVX512.
And so just, you know, over time, there’s more and more functionality.
But remember, and this is going to be very important later in this podcast, that, you know,
you need a CPU that actually has the silicon for doing whatever it is you want to do.
So if you’ve got, like, a, you know, CPU from, like, 2015, chances are it doesn’t actually have AVX512, it only has AVX2.
You can actually find out what vector extensions are supported by your CPU on Linux by catting out the contents of /proc/cpuinfo.
That’s a magic file that the Linux kernel provides that tells you all about your CPUs,
and tell you the model, and it’ll also tell you all the extensions that it supports.
And then you can look and see, you know, which AVX is on there.
Okay, so AVX is a bunch of vector instructions.
I’m not really here to teach you, like, how to write AVX code.
I actually have no idea how to write AVX code by hand.
Instead, in PyTorch, we have a bunch of abstractions to make it easy for us to manually vectorize our code.
Because often, we don’t really trust the compiler to do a good job in vectorization.
So, we just want to, you know, actually tell, hey, here are the exact instructions I want you to use, so that there’s no possibility for the compiler to mess it up.
And then the set of header files which help us do vectorization in PyTorch are called the VEC, aptly named VEC headers.
And so, currently in PyTorch, we don’t have support for AVX 512.
We just have support for AVX 2, a.k.a. AVX 256, so-called because the registers are 256 bits wide.
And so we have a class called Vec256, which just represents a bunch of vector data stored in the vector registers and then has a bunch of operations like add, sub, you know, sign, and so forth for doing vector operations on this vector piece of data.
So, if you want to write some vectorized code, chances are, you know, you might just be able to, like, get Vec256 and then get your data into Vec256.
And we actually have a bunch of wrapper functions like cpu_kernel, which help, you know, handle all the fiddly, you know, edge conditions.
Because remember, vector instructions always work on, you know, four pieces of data.
So, what if you’ve got seven pieces of data?
Well, you have to do the vectorized instruction on the four, but then you need a manual loop to finish the last three.
So, you, like, get your vectorized thing and then you just tell, you know, exactly what vector instructions you want to do by just calling these methods on Vec256.
And if you want to, like, actually implement some new and interesting functionality using the raw intrinsics, the intrinsics being various special functions your compiler provides that lets you just directly call various vector instructions, you can do that, too.
And typically, you just go into the Vec256 class and you write in exactly what instructions you want it to use in this situation.
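For a flavor of what the hand-vectorized code boils down to underneath the wrapper, here is a small standalone AVX2 example using raw intrinsics (not PyTorch's Vec256 class), including the scalar tail loop for the leftover elements:

// Standalone AVX2 example: c[i] = a[i] + b[i]. Compile with -mavx2 (or /arch:AVX2).
// This mirrors what a Vec256-style wrapper handles for you, tail loop included.
#include <immintrin.h>
#include <cstddef>

void add_floats_avx2(const float* a, const float* b, float* c, std::size_t n) {
  std::size_t i = 0;
  // Vectorized body: 8 floats (256 bits) per instruction.
  for (; i + 8 <= n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);
    __m256 vb = _mm256_loadu_ps(b + i);
    _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));
  }
  // Scalar tail: handle the last n % 8 elements one by one.
  for (; i < n; ++i) {
    c[i] = a[i] + b[i];
  }
}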
So, it’s a pretty fun exercise to, you know, add vectorization support for something.
And, like, if you’re sort of in the mood for just, like, you know, cracking open the Intel manual and, like, reading some papers and trying to figure out how to vectorize something, you know, a pretty fun task is, you know, hey, I need to do something fast.
And right now, we only have these crappy, you know, single instruction implementation for it in PyTorch.
Maybe I can vectorize it.
Some things are easy to do, like, if you’re just doing some point-wise operation, you just need to figure out the right sequence of vector instructions to get the computation you want to do.
Some things are harder to do.
I remember a U-man wizard way back in the day actually implemented a vectorized sort for PyTorch.
We never merged it because it was too complicated.
But, you know, like, that’s the kind of thing, like, there’s a ton of things you can accelerate using vector instructions.
And actually, they will run a lot faster on CPU if you do that.
So it’s often worth doing it this way.
So that’s it for what is vectorization and how people do vectorization in PyTorch.
And that’s nearly it.
But I want to tell you a little bit more about some of the weird things that we do in the codebase to actually make this all tick.
So remember this thing that I said, right?
I said that not all CPUs support all vector instructions.
Depending on if your CPU is from 2010 or 2015 or from 2020, you know, you’re going to have different support for vectorized instructions.
And no one really wants to, you know, try to run their PyTorch program and get a SIGILL (illegal instruction) because, you know, you tried to feed the CPU some instruction it didn’t understand.
And this is actually a bit of a problem for us because when you compile your code, that’s when the compiler makes the decision to make use the various vector instructions that it has available.
But the compiler doesn’t know where you’re actually going to run the code later.
It’s not like, you know, you’re compiling some code and you’re trying to test if, you know, you have LibXML on your system.
And if you do have it, then you compile the support for LibXML.
Otherwise, you don’t compile with support for it.
It’s not like that because you actually have no idea where your end user is going to run your code.
And so, you know, you have no idea what vector instructions are going to be available.
And so, you know, if you don’t do anything special, you really can only ship your software for the lowest common denominator of CPU you want to support.
And typically that’s just, you know, no vector instructions at all because, you know, old CPUs have been around for a really long time.
So the way we work around this problem is, you know, we just say, OK, fine, some CPUs support vector instructions, some don’t.
So let’s just compile our instructions multiple times for each level of CPU support we want to support.
And then just, you know, query the CPU processor at runtime and use that to pick the particular compiled version of our code that actually does the vector instructions.
So we have a system that does this.
It’s called Dispatch Stub.
Dispatch Stub sounds very complicated.
And in fact, you can also use it to dispatch to CUDA versus CPU.
But really, it has one goal in life.
And its goal in life is to let you get to the appropriately vectorized version of your code depending on what CPU capability you have.
So there’s a bunch of macros and if you like sort of cargo-cult the code, you can, you know, usually figure out how to make this work.
But the basic concept is in the native slash CPU folder, any file you put in there will get compiled multiple times, once per vectorization level that PyTorch supports.
And then each of these compilation units will register its kernel to Dispatch Stub saying, hey, I’m the AVX 256 version.
Hey, I’m the AVX version.
And hey, I’m the non-vectorized version.
And then Dispatch Stub will just, you know, query what CPU capabilities you have and then dispatch to the correct one.
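Stripped of all the macros, the core idea is a function pointer chosen at runtime. A toy version of the mechanism might look like the following (this is not the real DispatchStub; __builtin_cpu_supports is a GCC/Clang builtin, and the two kernels are assumed to be defined in separately compiled files, which is the whole point):

// Toy runtime CPU-capability dispatch, in the spirit of DispatchStub.
#include <cstddef>

using AddKernel = void (*)(const float*, const float*, float*, std::size_t);

// Each of these would live in a translation unit compiled at a different
// vectorization level (no vector flags vs. -mavx2).
void add_scalar(const float* a, const float* b, float* c, std::size_t n);
void add_avx2(const float* a, const float* b, float* c, std::size_t n);

AddKernel choose_add_kernel() {
#if defined(__GNUC__)
  if (__builtin_cpu_supports("avx2")) {  // query the CPU once, at runtime
    return add_avx2;
  }
#endif
  return add_scalar;  // lowest-common-denominator fallback
}

void add(const float* a, const float* b, float* c, std::size_t n) {
  static AddKernel kernel = choose_add_kernel();  // cached after the first call
  kernel(a, b, c, n);
}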
And there’s a bunch of sort of magic that has to happen to actually make this all work out.
For example, when you actually compile this code multiple times, you have to be really, really careful not to accidentally compile any other code that you don’t actually want.
And this is important because when you compile C++, normally you would imagine you just compile the functions that you define in your C++ file.
But that’s not entirely true.
When you do, for example, template specializations, C++ will blat out another bunch of code and then sort of rely on the linker to duplicate this code later.
And so if you happen to blat out some code that in fact uses some vector instructions and then that copy of the code overrides the regular version of the code that you compiled with no vector instructions.
Because remember, we don’t want to assume that everyone supports vector instructions.
Then you can end up with normal code like vector resize using AVX2 instructions and then your binary packagers will be very unhappy because they’ll like package the binaries.
It’ll work all fine because all of our test machines are AVX2 and then like some user is going to report to us that, hey, when I import torch, I get a SIGILL (illegal instruction).
What’s up with that?
Actually, we do have a test for this now in CI, so you don’t have to worry about silently breaking this.
There’s two more things I want to say.
One is that if you want to, you know, sort of, if you’ve got a very featureful CPU, you can actually manually change what vector instruction you want to do.
There’s an environment variable that lets you do this.
It escapes my mind at the moment, but you can look it up.
It’s got capability in its name, in all caps, and that you can just use it to, you know, switch between versions.
And it’s actually a pretty nice way to see like how much extra benefit you’re getting at a higher level of vectorization.
One last thing.
So very, very recently, okay, not that recently at this point, but fairly recently, Intel’s released support for the new AVX512 extension.
And so we’ve sort of been using it on and off, but we actually don’t support it in the library proper.
And the reason we don’t support it is because of this funny thing that happens to Intel CPUs when you start running AVX512 instructions.
They downclock.
Somehow, for some reason, when they designed the CPUs, they put too much silicon on them.
And if you actually use the AVX512 silicon, it heats the chip up so much that they can’t run all of it flat out at the same time.
So they downclock the processor to make sure the heat output isn’t too big.
And that means that if you are switching in and out of AVX512 instructions and regular instructions, the downclocking will actually kill your overall performance.
So we’ve kind of been, like, kind of loath to actually add support for AVX512.
But there’s some very enthusiastic open source contributors who have been trying to add support for this at the framework level.
So good on them.
They’re working on it.
If you’re really interested, check out their PR, which I’m going to post as a link along with this podcast.
So that’s everything I have to say about vectorization.
Vectorization.
It’s not magic.
Well, okay, when we recompile your code multiple times, that’s maybe a little magic.
Hopefully this explains some reasons why you have to put some code in the CPU folder and some code not in the CPU folder.
Some of it is vectorized, some of it isn’t.
And hopefully it also tells you why you can’t just, you know, use random templates inside the CPU folder because of symbol problems.
So that’s all for today.
See you all next time.
EP5 Inference-mode
Hi, my name is Edward, and welcome to today’s edition of the PyTorch Dev Podcast.
Today, I want to talk about a new feature that recently landed in head in Master PyTorch
called Inference Mode that was spearheaded by Ailing Zhang, but also had a lot of contributions
from the rest of the folks in composability.
What is Inference Mode?
Well, Inference Mode is a thing that you can do when you are writing some PyTorch code
and you are guaranteed that you’re only going to run inference on it.
And Inference Mode basically makes your code run faster in this situation.
It’s fast enough to get something like 5% to 10% wins when we have used it inside production at Facebook.
And today, I just want to talk a little bit about where this feature comes from,
why it’s necessary, and a little bit about how we implemented it.
Okay, so first off, why does Inference Mode exist in the first place?
And, you know, you might be thinking, hey, Edward, you know, if I just have some code in PyTorch
and I don’t, you know, require grad on any of my inputs, so there’s no parameters, I’m not training,
I don’t call backwards on it, shouldn’t this code, you know, just be as good as, you know,
running some plain old tensor operations without, you know, having any support for autograd?
Like, that seems like it should be just as fast.
And, you know, if I’m a little worried about accidentally setting some requires grad equals true,
well, there’s this no grad mode, this no grad context manager, which I can already use in PyTorch
to just say, hey, whatever the requires grad fields on my tensors are, ignore that and just don’t require gradients.
So why is there an opportunity to make things go faster?
And so it turns out that there are two things that we do in PyTorch
to support automatic differentiation that can’t be turned off.
They must be done because it may be possible at some point in the future
that you will attempt to use these tensors for AD,
and if we don’t do these things ahead of time, we’re just screwed.
Whether or not this is the right trade-off,
this is historically where PyTorch has been,
where, you know, you can always write your code
and then try to use it with autograd later, and this will work out.
And so inference mode changes some of these assumptions.
It says, hey, no, actually, I guarantee that I’m not going to use these tensors
to do autograd later, and as a result, we can do things a little faster.
So there are two things that slow, like, ostensibly inference-only code down
in PyTorch that inference mode targets.
So the first thing that happens is whenever you do any sort of mutation
to a tensor in PyTorch, and really, whenever you, like,
just allocate any tensor at all, we have some safety tracking for mutation
called a version counter.
So what is a version counter in PyTorch?
Well, a version counter solves the problem that is pretty common,
which is, let’s say you have a tensor, and you need to save its value for later.
Well, tensors are large, and so we don’t want to make copies of them.
So we just save that tensor directly.
What if, somewhere between the time when you saved it for, say, backwards,
that’s the most common case version counters are used for,
and the time when you actually use it, when you do the backwards computation,
someone goes ahead and modifies the tensor under you?
Well, that’s great.
It turns out all of your, you know, automatic differentiation isn’t going to work.
You’re just going to get wrong gradients in this situation because someone, you know,
monkeyed about with this value, and you were expecting the old value prior to the mutation
to be the one that you were going to use for your backwards formula.
So because this can, you know, basically result in silently incorrect results,
like you have no idea that things have gone wrong, but things have gone wrong,
we have a mechanism called version counters, which help us detect when mutations have happened.
The mechanism is pretty simple.
Basically, we associate every tensor with a version.
When you mutate the tensor, we update the version.
And whenever we save a tensor for backwards, we look at what the current version was and say,
okay, whatever this version is, when we look at it again later in the backwards,
you have to, you know, have the same version that you had when you saved it.
So if there was a different version, we want to just raise an error and say,
hey, someone mutated this saved tensor for backwards.
Uh-oh.
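To make the mechanism concrete, here is a minimal sketch. Note that ._version is an internal attribute used purely for illustration here, and the exact error text varies between versions.

import torch

x = torch.randn(3)
print(x._version)        # 0 -- the version counter (internal attribute, illustration only)
x.add_(1)                # an in-place op bumps the counter
print(x._version)        # 1

# The safety check in action: mul saves x because it is needed for w's gradient.
w = torch.randn(3, requires_grad=True)
y = x * w
x.add_(1)                # mutate the saved tensor...
try:
    y.sum().backward()   # ...and backward detects the version mismatch
except RuntimeError as e:
    print(e)             # roughly: "one of the variables needed for gradient
                         # computation has been modified by an inplace operation"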
All right.
So that means that we have to do a bunch of, you know, work, right?
So we have to allocate these version counters.
We can’t actually store them directly on the tensor because remember,
mutating a tensor or mutating a view of a tensor,
hey, these, you know, are the same thing.
So we need to make sure the counter gets updated in both of these cases.
So it’s not something you can store on the tensor directly.
And it also isn’t something you can store in the storage,
if you know what that is, um, for very complicated reasons involving detach.
So these are actually like separate heap allocated counters that we keep around
and you have to allocate them.
And you also have to do the reference count bumps on them.
And then there are the version counter bumps themselves, not to be confused with the reference count bumps.
And we have to do these bumps atomically because there might be a mutation from a separate thread.
So that also leads to cost, right?
It leads to having to do all these extra operations.
So can we get rid of this when there’s no requires grad true anywhere in your program?
The answer is no, because you don’t know if in the future,
someone is going to use this tensor to actually save it for backwards,
because it’s going to be used with some other requires grad true thing.
So we need to know ahead of time that, you know, hey, this is going to be a tensor
that is never, ever going to alias with a tensor that is going to be saved for backwards.
The second thing that we have to do ahead of time is something called view tracking.
So what is view tracking?
Well, let’s just think about how views work in PyTorch.
So if you’ve read my blog post about, you know, basic concepts in PyTorch,
you may know that PyTorch tensors are strided.
And so if I want to take a view on a tensor,
I can just, you know, allocate another tensor, share the data and just, you know, record, you know,
what the offset should be and, you know, whether or not I’m going to like, you know,
inflate my strides or anything like that.
And this is pretty cool.
And ordinarily, you would think that when I do a view operation,
that’s the only thing I need to do.
Well, unfortunately, in the presence of automatic differentiation, that’s not enough.
And the case that causes problems is what if you take a view from a tensor
and then you mutate the view with another tensor that requires gradients.
Let me say that again, because it’s a little bit of a complicated example.
You have a tensor, take a view of it.
You mutate the view with a requires grad true tensor.
So something very interesting happens in this situation,
which is that if you then go back to the base tensor and you use it as part of some computation,
that base tensor now requires grad equals true.
The requires grad trueness of the, you know, tensor you mutated the view with infects the base tensor.
And if you think about why this might be the case, it makes sense because,
hey, you know, I have this thing and I need to keep track of all uses of it because, you know,
I want to differentiate on it.
And, you know, if I mutate it into the view, it is going to like implicitly show up in the base.
And so if I make uses of the base that end up contributing to my loss,
well, those also count as, you know, uses that I have to, you know, count towards, you know,
when I do automatic differentiation in this case.
And so just recording, you know, the storage and the strides and the offset in the tensor when we do views isn’t enough.
We actually need to record some extra view metadata so that we can make this situation work.
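Here is a small sketch of that infection in ordinary (non-inference) PyTorch, just to see the behavior being described; the exact grad_fn name is an implementation detail.

import torch

base = torch.zeros(4)             # plain tensor, requires_grad=False
view = base[:2]                   # a view sharing storage with base

t = torch.ones(2, requires_grad=True)
view.add_(t)                      # mutate the view with a requires-grad tensor

print(base.requires_grad)         # True -- the base got infected
print(base.grad_fn)               # autograd recorded how base was produced (a CopySlices-style node)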
So I’ve covered the two situations where we need to do this extra work.
So one is in-place updates to do version counter bumps.
And the second is view metadata tracking.
And if you were thinking back to the original motivation for inference mode,
well, hey, you know, these are very obscure situations.
And if I’m just running inference on my tensors, you know,
I don’t expect any of these things to actually matter.
So inference mode is the way for the user to tell,
hey, I am going to guarantee you that I am not going to do any of these naughty things.
And then I can just skip doing version counters.
So I just won’t allocate the version counters at all.
I won’t do version counter bumps on my tensors.
And I’m just not going to do any of the view metadata tracking.
I’m just going to, you know, leave it all alone.
And then, you know, my code will run faster as long as I’m not using it for AD.
So that doesn’t sound too hard, right?
Just put in a bunch of if statements, or, you know, like, because we’ve talked about the dispatcher, right?
Oh, do some fancy dispatcher stuff.
Just make these things not get run in those cases.
But there’s a problem.
The problem is we don’t actually want to have our users pinky promise us that they’re going to handle everything correctly.
Because we don’t actually trust our users to do things correctly.
You shouldn’t either.
I wouldn’t trust myself to get these things right.
I’m worried that I’m going to accidentally use one of these tensors in Autograd later and everything’s going to blow up and, like, I’m going to be sad.
So the sort of magic sauce and what sort of took us a long time to sort of get inference mode working was how do we do this safely?
Let us say, how can we let the user say, I promise not to use these things for Autograd, and then actually hold the user to this promise, so that if they actually do use an inference mode tensor in automatic differentiation later, we give a proper error message in this case.
And so I’m just going to describe a little bit about how we do this.
And, you know, if you want to actually see the details, we’ve got a very nice RFC co-authored by Ailing and me.
And you can read that for all the sort of nitty-gritty details of how everything works.
But there’s two basic things that we need to do.
So the first thing that we need to do is we want to get rid of version counters, right?
We want to get rid of the need to track when mutations happen.
And so in order to verify that, you know, this never actually causes problems for automatic differentiation, we need to enforce some sort of invariant that says, oh, yeah, you know, one of these tensors that doesn’t record version counters, you’re not allowed to ever actually try to use the version counter to enforce safety.
Because that’s a place where the system could go wrong.
So in other words, we have a no aliasing requirement.
The no aliasing requirement says that any tensor that doesn’t have the version counter, and we’re actually going to just refer to these as inference tensors, because they’re just tensors that happen when you do inference mode, right?
You just don’t allocate version counters for them.
Any inference tensor must not alias with any tensor that is saved for backwards.
So how do we actually do this?
Well, you know, we take an inference tensor, we say, okay, there’s no version counter on it.
Whenever we make aliases to this tensor, we also need to make sure these are also inference tensors, because, you know, hey, it’s an aliasing requirement, right?
Like, you know, just because you take a view of a tensor doesn’t mean you can save that, because if you mutate that, well, you know, it still affects the view of the tensor.
And then we just say, okay, any inference tensor is not allowed to be saved for backwards.
And so there’s one place we have to write this check, which is namely when we save variables for backwards.
So the no aliasing invariant involves basically setting up this dynamic alias analysis that just says, hey, this is a class of tensors, these inference tensors, which are guaranteed not to alias with AD.
And we only have to check one place to make sure this actually happens.
And so that’s very nice and not too hard to implement.
Second part is view tracking, right?
So what do we do if we, you know, don’t track the view metadata in a situation?
And this one’s actually not so hard.
We basically just say, okay, we don’t record the view metadata for these tensors.
And now we need to, sorry, I said this one’s not so hard, but this one’s also tricky in its own way.
So naively, what you’d expect to be able to do in this situation is say, okay, I’m just not going to record the view metadata.
And then if I ever do something to a tensor that, you know, might require the view metadata, I just raise an error.
Does that work?
Almost, but there’s one problem.
And the problem is, if you have a base tensor, and you mutate it with something that requires grad equals true, ordinarily, your views also become requires grad equals true, right?
The flow goes both ways, right?
Like, if I put in some data that I need to track gradients for, then all the views also need to track gradients as well.
And in the case of the base tensor, I don’t actually know if I’ve recorded the view metadata or not in the situation.
So what we do is we just say, okay, well, these inference tensor things, you know, the tensors that were allocated in inference mode, you’re not allowed to mutate them outside of inference mode.
And that just sort of, you know, with a very heavy hammer prevents this sort of situation from causing a problem.
So that’s what inference mode does in a nutshell.
It says, okay, when you’re inside inference mode, you know, we allocate these inference tensors, these inference tensors do less work.
They don’t track versions, and they don’t track view metadata.
And once you have this situation, you just have a bunch of extra checks, a bunch of, like, sort of restrictions on how you can use these tensors outside of inference mode that sort of guarantees that you can’t actually observe that you fail to record all this information.
You’ll just error in those cases.
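The Python API mentioned at the end of this episode has since shipped as torch.inference_mode; here is a minimal sketch of both the fast path and the promise being enforced. The exact error wording is version-dependent.

import torch

w = torch.randn(3, requires_grad=True)

with torch.inference_mode():
    x = torch.randn(3)        # an inference tensor: no version counter, no view metadata
    y = x * 2                 # fine: plain compute, no autograd bookkeeping

# Outside inference mode, involving an inference tensor in autograd is caught:
try:
    loss = (x * w).sum()      # mul would need to save x for backward
    loss.backward()
except RuntimeError as e:
    print(e)                  # roughly: "Inference tensors cannot be saved for backward"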
So we’ve been deploying this to a bunch of places.
There’s this old RAII guard called AutoNonVariableTypeMode.
It didn’t make any sense.
It just happened to make people’s code run faster, but it didn’t do any error checking.
And we’ve been moving people over to use inference mode in this situation.
Actually, that’s all Ailing’s work.
She’s been a real trooper, moving all of our mobile stuff over.
It’s been quite an adventure because there’s a ton of places that only do inference.
Like, ever try to debug a PyTorch problem on Oculus?
Yeah, me neither.
Good work.
So that’s everything I had to say about inference mode today.
Right now, it’s only available from C++, but we’ll be adding a Python API for it very soon.
So that’s all I wanted to say for today.
See you next time.
EP6 Just-enough-CUDA-to-be-dangerous
Hi, my name is Edward, and welcome to today’s edition of the PyTorch Developer Podcast.
Today, I want to do a very whirlwind intro to CUDA programming.
Now, disclaimer, I am by no means a CUDA programming expert.
I’ve written a CUDA kernel or two in my time, but most of the time I defer to such experts
as, say, Natalia Gimelshein to actually do the heavy CUDA lifting.
But having worked on PyTorch a while, I have picked up a thing or two about CUDA.
And so today’s episode, I just want to, like, talk about really, really, really fast, you
know, here’s just a big pile of stuff that is important to know about CUDA programming,
about programming GPUs in general, just enough so that you can be a little dangerous, even
if you’re not, like, actually writing CUDA kernels.
Because it’s really helpful to know a little bit about the programming model, what happens
on GPU, because, well, you know, PyTorch is a GPU-accelerated deep learning library.
And so if you add some functionality to PyTorch, we expect you to be able to also run it with
GPU acceleration.
All right.
So where to get started?
Part one.
What is CUDA?
So to answer what is CUDA, we actually have to answer a different question first, which is,
what is a GPU?
So the GPU is a piece of hardware in your computer that sort of made the deep learning
revolution possible.
And its name is short for Graphics Processing Unit, because historically, that’s what we actually
use them for.
We use them to actually, you know, render graphics scenes on your computers if you’re playing a
video game or, you know, doing some sort of photo or video application.
And it just turns out that the types of things GPUs are good at doing are also what you need for
deep learning models.
Why is that the case?
Well, to explain how a GPU works: remember when we talked about
vectorization and I said with a CPU, you feed it a bunch of instructions and it does the instructions
one by one.
And, you know, basically that’s it.
And you can, like, put a bunch of cores in your CPU.
And if you have a really beefy machine, maybe you have 32 or 64 cores.
But, you know, there’s only so many cores you actually put in your CPU.
And that’s basically it.
Like, that’s the level of parallelism you’re going to get.
You have to, you know, spawn threads and, you know, use them to actually make your CPUs
go.
Well, a GPU has tons and tons of really, really simple cores.
And the way they operate is they just say, okay, well, I’m going to run the same computation
on every core so I don’t have to worry about, you know, all the cores doing different things.
And I’m just going to have so many cores doing the same operations that if I have a ton of data,
like, say, in an image or in a deep learning tensor, because I have such massive parallelism,
I can actually just, you know, do things very quickly.
Because even if there’s a million things to do, well, you know, I have a lot of cores, and so
they can make quick work of it.
So the basic idea behind GPUs is instead of having, you know, these big, beefy CPUs, but
not that many of them, we have these many, many cores, and we massively parallelize our algorithms.
And that’s how we’re actually going to do things really fast.
And so CUDA is the programming language slash compiler stack slash software ecosystem that
NVIDIA developed for programming their GPUs for, you know, sort of general purpose programming.
Because back in the day, like, when you had a GPU, you used it to do graphics processing.
So you’d write shaders, you’d write, you know, those sorts of things.
And no one was really thinking about doing, you know, actual mathematical general purpose
computation, except for, you know, a weird branch of researchers who were looking into so-called
GPGPU, general purpose GPU computing.
And they would, like, go through lots of tricks to try to, you know, get the shaders
to do exactly the thing that they wanted them to do for whatever computation they wanted
to do.
And NVIDIA built this software stack called CUDA, and so we can use CUDA to do general
purpose programming on GPUs.
And in PyTorch, what we do this for is so that we can do deep learning neural network computations
on them.
So what is the CUDA programming model?
So, you know, the GPU is not your CPU, right?
Like, on the CPU, if you want to do some stuff, you just send some instructions to the processor,
and, you know, it just does the stuff.
You don’t have to think about it, right?
That’s normal operating.
But your GPU is typically living on a separate, you know, device from your CPU, and, like, it’s
got its own memory, and it’s, you know, not, like, anything at all like your CPU.
So there’s actually a little bit of a difference when you want to program a GPU in this situation.
And so the sort of very, very short version of, like, what you should think of as a CUDA
programming model is there is CUDA memory.
That memory is memory that lives on the GPU.
That’s the memory that, you know, programs on the GPU can actually access.
If you want to compute some data on the GPU, you have to first move it to the GPU so that
it’s accessible.
Then you can write various kernels, and these kernels are, you know, written in CUDA, because
CUDA is a programming language built on top of C++.
They’re written in C++, but they’re different than normal C++ because, you know, unlike a
regular CPU where, you know, you have a single processor and you just feed it instructions,
these programs need to run on, you know, all the little itty-bitty
processors that are on your GPU.
And so these special kernels, you know, you want to go write them in a subset
of C++ that, you know, your CUDA compiler actually understands.
And in general, like when you write a program or like in PyTorch, there are going to be like,
you know, dozens or really hundreds of CUDA kernels each for some particular task that you
want to do. I’m not going to talk about how you actually write a CUDA kernel today, but say you
have a bunch of these kernels. What you need to do is after you put the data on the GPU, you need to
ask the CUDA driver, hey, can you please run this kernel? And the CUDA driver will
go ahead and say, okay, um, I’m going to go tell that GPU, the actual device to go ahead and run this
computation to do the thing that I want it to do. And here is sort of one of the most important things
about the CUDA programming model, the most important thing if you’ve never written
CUDA before, the one takeaway I want you to get from this podcast: this process is
asynchronous. I’ll repeat it again. This process is asynchronous. So you tell the driver, Hey,
please do this computation. The driver’s like, okay, I’m going to go do this computation. And
the kernel call you made is going to immediately return, even though the GPU is off, you know,
chundering away on the data that you asked it to process. This is a good thing because it means that
your CPU host program that’s responsible for figuring out what kernel calls to do can run ahead while the
GPU computation is happening and figure out what the next thing you want it to do is. And so you can say,
Hey, after you’re done doing this previous computation, please do this next computation.
And you can queue it ahead and the GPU can be ready to go right when the previous computation finishes.
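Here is a minimal sketch that makes the asynchrony visible, assuming a CUDA-capable GPU is available; the exact numbers are only illustrative.

import time
import torch

assert torch.cuda.is_available()   # this sketch assumes a GPU is present
x = torch.randn(4096, 4096, device="cuda")

torch.cuda.synchronize()
t0 = time.perf_counter()
y = x @ x                          # the launch is queued and returns almost immediately
t1 = time.perf_counter()
torch.cuda.synchronize()           # wait for the queued work to actually finish
t2 = time.perf_counter()

print(f"launch returned after {t1 - t0:.6f}s; compute finished after {t2 - t0:.6f}s")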
By the way, how does it know that it wants that kernel to run after the previous kernel you ran? Because
if it’s asynchronous, couldn’t these just like run in any order? Couldn’t it just start running it
when you ask it for it? Well, there’s this thing called streams. Streams imply sequential execution.
So you put CUDA kernels on streams, and every kernel on a stream is guaranteed to finish before the next
kernel on that stream happens. Normally, when people just write GPU-accelerated programs, there’s just one
stream. It’s the default stream. Everything goes on that. Everything is sequentialized. But if you’re
doing like fancy tricks, you might have multiple streams. And one of the things PyTorch needs to do
is although most people don’t use streams, we do want it to be possible to use streams with our
software. So we have to write all of our code in a stream generic way. One last thing that’s useful
to know about the CUDA programming model is it has a notion of a current device. So you know, when you do
a kernel launch, well, you might have multiple GPUs in your machine, right? And each of these GPUs has its own
memory. And so you can’t just say, oh, well, you know, GPU two, please operate on the memory on GPU
zero. Technically, this will work if you have, you know, device to device transfer, but it’ll be kind
of slow, right? So most of the time, we don’t allow it. And you know, you have to actually make sure the
memory is in the same place. And so the current CUDA device, which is a CUDA concept, is something
that you have to say, okay, I am now setting my current device to be GPU two, so that all my kernels
actually operate on GPU two, because the kernels don’t actually take in what device they want to run
on explicitly. PyTorch also has a notion of a current stream. This is not a CUDA concept. This is something
that PyTorch built on top of CUDA. And this is so that we don’t have to also constantly say which stream
we want to run on. CUDA kernels explicitly take which stream you want, or zero for the default
stream. Okay, so that’s the basics of the CUDA programming model. So what are the implications
of this model when we are doing PyTorch programming? So remember, I said the most important thing
about CUDA programming is it is asynchronous. So what happens if something bad happens in your CUDA
kernel? Because bad things can happen in your CUDA kernel. They’re basically C++, right? You can do an
out-of-bounds pointer dereference. You can have an assert failure. You can, you know, trigger a
compiler bug. Lots of things can go wrong, right? So what happens when something goes wrong? Well,
first off, when you launch the kernel that actually is going to do something bad, it’s not going to raise
an error, right? It’s just going to return and say, hey, everything’s okay. But that’s not actually
necessarily the case, right? Because at some later point in time, when the driver finally gets around
to figuring out, hey, you know, there’s something wrong with that kernel I just ran,
you might be somewhere way else later in your CPU host-side program, at which point
you’ll be doing some random call into the CUDA API, like trying to malloc something or trying to
launch a different kernel, and the CUDA driver will say, oh, no, no, no, no, no. Something bad has happened. An internal assert
failed. I don’t know. And well, crap, because, you know, you’ve got this code and it has nothing to do
with the error that just got raised because the error was actually caused by some kernel launch,
you know, miles and miles away in your code. So this is like the most, like, you know,
anyone who like just sort of like signs up for PyTorch and doesn’t know any CUDA and like has to
debug a GPU problem. This is probably the first thing you’re going to run into. And you’re like,
oh my God, what the heck is going on? And the answer is, remember, it’s async. You’re getting the
results way later after you actually queue the kernel. What can you do in this situation? Well,
there’s a bunch of things you can do, but the easiest and, you know, simplest way to like solve
a problem like this is to use this environment variable called CUDA launch blocking, which says,
hey, you know, wait until the previous kernels have all finished before actually executing my kernel.
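The variable being described is CUDA_LAUNCH_BLOCKING. A minimal sketch of using it looks like this; the out-of-bounds indexing below is just one convenient way to provoke a device-side assert, and the exact error text is driver- and version-dependent.

import os

# Must be set before the CUDA context is created, i.e. before the first CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

x = torch.randn(8, device="cuda")
idx = torch.tensor([42], device="cuda")   # deliberately out of bounds
y = x[idx]                                # with blocking launches, the device-side assert
                                          # surfaces here rather than at some later CUDA call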
And in this case, because we’re waiting, we can actually make sure that we, you know, have gotten all
the errors before we move on and try to do the next operation. So that will cause the errors to
move to the right place. Your programs will run really slow because remember asynchronous execution
is a good thing. It lets us make sure we keep the pipeline of GPU computation going. Whereas, you know,
with blocking, you’re going to wait and then the GPU is going to idle while your, you know, very slow CPU
host tries to figure out what the next thing to execute is until you get to the next thing. And then it’s
going to run again, right? So your utilization is going to be crap, but at least you know where the
errors are going. Let’s talk about, um, this asynchronous thing again, right? So we said that,
you know, CUDA programming has to, you know, run ahead so that we can make
up for the costs of, you know, launch overhead and of waiting for the CPU to figure out what
the next thing to do is. Well, there’s another consequence to this, right? Which is anytime you ask for some
memory that’s in CUDA in your GPU, and you want to actually like look at it on the CPU, like you want
to say, Oh, is it a two? Is it a three? Can I do something with this? You have to wait, right? You
have to wait for all of the asynchronously queued kernels to finish executing so that you can actually
see what the data in that memory is. And then you have to copy it back to CPU and then you can actually
go look at it. So syncs are really, really, really expensive. And whenever we write code in PyTorch,
we really want to try to avoid doing synchronizations that are unnecessary. And sometimes this is not so
easy to do because there are a lot of innocuous sounding methods that can cause synchronizations.
For example, if you ask for a torch dot non-zero on a CUDA tensor, that will cause a sync. Why does that
cause a sync? Well, it causes a sync because non-zero gives you a tensor whose size is the number of
non-zero entries in the original tensor. How do you know what the non-zero entries are? Well,
you have to look at the data, so that’s a sync. Another example is dot item, which, you know, takes some element
somewhere in a tensor and then gives you what its value is. And you look at this and you’re like, Oh,
well, I got this thing from CUDA memory. So that means I had to wait for all the computation to finish
to get that thing from CUDA memory. So try really, really, really hard not to do syncs. Sometimes this is
impossible, right? Like maybe you’re doing some iterative algorithm and you’re like, you know,
repeatedly running some kernel and waiting for some value to converge before you do thing before you
stop and go do something else. Well, yeah, you’re kind of out of luck, right? You’re going to have to
actually sync when you do that. But there’s often some way to set things up so that you don’t need to
do the sync. Or maybe there’s like a different version, right? Like there’s a fast version that
doesn’t sync and then the slower version that does sync. And you want to think about actually
providing both of these things. Speaking of asserts and syncs, remember what I said about,
you know, like your errors showing up in way random places, right? So in PyTorch, we actually have this
philosophy, which is that we are willing to pay a performance cost in our CUDA kernels so that we get
good error reporting. Let me give an example of this. Say you’re writing some sort of embedding.
And so what is an embedding? It’s just a glorified hash table lookup, right? So you got some index,
you want to go look at the element at that index, right? What if the index is out of bounds? Well,
we could say, oh, you know, we really care about performance. So we don’t want to
bounds check. We’re just going to do the dereference. And if it’s out of bounds,
well, too bad for the user, right? Like you asked for it, it’s up to you to make sure things are in
bounds. We do not make this assumption. We will bounds check these accesses. For one, it’s not that
expensive to do because, you know, you’re this massively parallel CUDA, you know, GPU device,
and you know, you’re going to be spending lots of time usually being memory bound. So like, you know,
extra computation usually isn’t that expensive. But two is that if you do a, you know, invalid memory
access, you’re just going to get an invalid memory access, and you have no idea what could have caused
this problem. If you do a bounds check, and you do an assert, you will get that assert when things
fail later. And so you can, for example, grep the PyTorch code base, and that’ll tell you, hey, this is
what caused the assert. And then you can have some clue, oh, it was this operator, without having to run
with CUDA launch blocking. So I have told you a little bit about what GPUs are. I’ve told you about the CUDA
programming model. And then I started harping over and over and over about syncs, async, all that stuff,
because really, the asynchronous nature of CUDA is what really, really trips people up. In fact,
like even in advanced usages, like this, these streams, we have multiple streams, like making
sure all of the, you know, synchronizations between streams happen correctly, and happen correctly with,
say, our CUDA caching allocator. Oh, yeah, we have a caching allocator, because CUDA malloc is really
slow. So, you know, we get a bunch of memory from CUDA, and then we, you know, reuse it for our own
stuff. But making sure this all gets synced up so that, like, the async stuff doesn’t mess things up. Yeah,
that’s like probably the hardest thing about working in CUDA. So if you can remember that async is cool,
but it is very complicated, and keep that in mind when you’re working on CUDA,
you’ll go a long way, even if you don’t know anything about how to write CUDA algorithms, like me.
All right, so that’s all I wanted to say today. Thanks for listening. See you next time.
EP7 Functionalization
Hi, my name is Edward, and welcome to today’s edition of the PyTorch Dev Podcast.
Today, I want to talk about a process called functionalization,
which is used in multiple parts in the PyTorch codebase.
What do I mean by functionalization?
Well, I don’t necessarily mean the conversion of things into functions,
but what I actually mean is the removal of mutation
from operations that you do in PyTorch.
And it turns out that being able to remove mutation,
being able to transform an otherwise mutable program or trace
into a purely functional form is a very useful transformation
and one that we use in several places in PyTorch.
So I just want to talk a little bit about why this is useful
and then tell you about how we do it.
Okay, so why is functionalization important?
Well, a long, long time ago, in our pre-PyTorch 0.4 days,
we didn’t actually support doing autograd and mutation at the same time.
And there was a reason why this was the case.
It’s because, you know, when you have a program
and you just, you know, write a bunch of pure function calls,
you can easily just, you know, create a autograd graph
that represents the calls you just made
and then replay that graph when you go backwards in time.
But if mutation is allowed in the mix,
if you’re allowed to sort of modify something in place
when you are working on the forward path of your function,
not only do you have to somehow deal with a mutation,
but you also have to somehow modify all of the other aliases,
all the other views on the object in that situation.
And this is kind of complicated and difficult to think about how to actually do this.
And so the way that, you know, we actually implement this in PyTorch
is morally inside PyTorch’s internals,
we convert your program into a functional form,
one where the mutations are removed.
And so the autograd trace is not, you know, recording,
hey, you know, this is the mutation that happened,
but actually here is the purely functional version of the program
that would actually give you the, you know,
same computation that you would have gotten
if you had done the mutation in question.
So why is functionalization important?
It’s important because we can use it to implement automatic differentiation
in the presence of mutation.
You know, you don’t have to do this,
but one of the things people really like about using PyTorch
is you can just sort of do all the thing that Python normally lets you do.
And one of those things is mutate tensors.
So it’s kind of nice that, you know, autograd works with this.
Another thing that this is useful for and got repurposed after the fact
is PyTorch has an integration with XLA.
XLA is the backend for TensorFlow.
You know, it’s a very nice backend, generates good code.
And there’s something very important about it,
which is it is purely functional.
It doesn’t support mutation.
And so when we have a PyTorch program
that has a bunch of mutations in it,
when we translate it into XLA’s HLO IR,
we need to figure out a way to get rid of all of those mutations.
And so, in fact, the Torch XLA extension
developed by Davide Libenzi and co
actually does, you know, the same kind of functionalization
that our autograd pass does
when mutations happen in our program.
So it’s useful in a bunch of places.
In the past, we’ve sort of, like, you know,
re-implemented this trick as needed.
But we’re going to eventually work on adding functionalization
as a proper pass to PyTorch so anyone can opt into it
if it’s something you need for your backend.
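A version of that pass has since been exposed in newer releases as torch.func.functionalize; here is a minimal sketch of using it, together with make_fx to look at the mutation-free trace (both APIs are availability-dependent, so treat this as a sketch to verify against your version).

import torch
from torch.func import functionalize
from torch.fx.experimental.proxy_tensor import make_fx

def f(a):
    b = a.clone()
    v = b.view(-1)
    v.add_(1)          # mutate a view; the base b has to see the change
    return b

x = torch.randn(2, 2)

# Same numerics, but with the mutation removed under the hood.
assert torch.allclose(f(x), functionalize(f)(x))

# Tracing the functionalized version shows a graph with no in-place ops
# (the view/add_ pair becomes out-of-place ops plus the aliasing bookkeeping).
gm = make_fx(functionalize(f))(x)
print(gm.graph)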
Okay, so I’ve talked a little bit about functionalization
and why it’s important.
But, like, you know, why is this a hard thing to do?
Because if you, you know, ask a, you know,
diehard functional programmer,
well, they’ll just tell you,
hey, you know, getting rid of mutation is not too hard.
You know, instead of, like, adding two to a variable,
mutating that variable directly,
you just say, okay, well, you know,
let Y equal X plus two.
And then anywhere you previously referred to X,
you just refer to Y instead.
So what’s the big deal?
And the big problem is aliases.
So let’s say that I have a tensor
and I, you know, take out a bunch of views on it
and then I fill the tensor with ones.
The modification that I did is not just, you know,
take this tensor and replace it
with an entire tensor full of ones.
It’s also all the views that I’ve taken of this tensor,
all those views also need to get filled with ones.
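Concretely, in plain PyTorch:

import torch

a = torch.zeros(4)
v = a[1:3]            # a view sharing a's storage
a.fill_(1)            # mutate the base...
print(v)              # tensor([1., 1.]) -- ...and the view sees it too

# A functionalized rewrite can't just replace a with a fresh all-ones tensor;
# it also has to account for every view like v that aliases the old a.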
And this poses a very hard implementation challenge for us
because when I am, you know,
writing a runtime system for PyTorch,
we’re doing a reference counted implementation.
We want things to get promptly disposed of.
And so this object that I’m filling all with ones
doesn’t actually know where all the views are.
So imagine that, like, for any given tensor,
you knew all the aliases to that tensor.
Then you could still functionalize in, you know,
a little bit complicated, but not too bad of a way.
And the way you would do it is you would say,
hey, here’s my tensor.
Here are all the aliases.
I do some mutation to the tensor
and then I look up all of the aliases
and then I replay that same mutation on each of the aliases.
Well, okay.
I had to, like, narrow the scope of the mutation, right?
Because the view is only looking at a part of the tensor.
And so I just only need to apply the mutation from that part.
But then I just go ahead and apply the mutation to each of them.
And in the same way, you know, let y equal x plus 2
and then, you know, all previous references to x are now y.
I can just update all of these one by one
and then I have a new updated functional graph
that doesn’t have any reference to mutation.
But I can’t do this.
I can’t actually maintain this list of aliases
because if I did maintain that list of aliases,
well, one is we’re ref counted.
So if they were strong references,
then you’d keep all of your views live
even if, you know, no one was actually using them, right?
Like if someone takes out a view to your tensor
and then you mutate that tensor
and that view never gets used,
you need that view to go dead in that situation.
And if you made these all weak references,
well, that still causes problems
because you have to, you know,
do a pile of bookkeeping on the tensor
in order to keep track of all these views.
You no longer have a fixed size representation for a tensor.
The set of aliases to it may grow unboundedly.
I actually remember a long time ago
when Sam Gross was initially implementing
our C++ Autograd system
and he was trying to get mutation to work in this situation.
He came to my desk and he asked me,
hey, Edward, so I’m trying to figure out
how to, you know, deal with these mutable aliases.
And, you know, I was thinking, you know,
could I just store the aliases
for all the tensors and update them?
And I was like, Sam, that’s not a good idea.
Don’t do it that way.
And so Sam went away and he thought about the problem
and he came up with a better solution.
And I want to tell you how that solution works today.
So just to recap the situation, right?
So we want to do a mutation to a tensor
and we want to somehow get all of the aliases,
all of the views on that tensor
to see the change in question.
But we don’t know what those views are.
So how can we make sure we actually get this mutation,
get the knowledge of this change to all of those sites?
Well, the answer is, you know,
if you can’t do it now, do it later.
So when we do a mutation to some base tensor,
we say, okay, here’s the mutation that has happened.
And we also flip some bit, flip some version saying,
hey, everyone else, all of y’all, y’all views.
If someone else tries to ask you what the functional computation
graph corresponding to you are.
And it turns out the base has changed under you.
In the meantime, from the last time you came and looked at us,
you need to stop and recompute what your new value is
subject to the mutations that happened.
So let’s just go through an example to see what happens here.
So let’s say I’ve got my tensor A.
I have a view V on the tensor A and I add two to every element in A.
So I go ahead and do that.
I update the version on A.
So it says, hey, I’m out of date.
V, you know, when it was taken out from A,
recorded the old version.
So I got version zero, version zero recorded in V.
I update the version on A to go to one.
And so A records, hey, this is the mutation
that I made in the situation.
And so the next time I access V,
the first thing that I need to do is I say, hey, V,
are you up to date?
And V goes and looks at its base, which is A.
And it says, hmm, last time I looked at A, I was version zero.
But now A is version one.
So I need to do an update.
So V then goes and looks at what the changes that were made to A
in the meantime were, reapplies them to V,
and then says, OK, here’s the up to date,
representation in purely functional form of the contents of V.
Now, in Autograd, we don’t quite replay things, because when we have
the computation represented by A,
we actually don’t have to, like, you know, replay the mutation on V.
We can just say, OK, just take whatever the current state of A is,
take whatever the actual contents of A are, you know,
and the functional trace that created it,
and then just reapply that V operation to actually get the contents of V, right?
So, if you imagine, like, two tracks running,
and V is one track and A is one track,
it’s not like A is making changes and those changes get merged into V one by one,
but actually, every time you make a change to A and then re-look at V,
a new branch branches off A and you just sort of forget all about the old branch that you had before.
In fact, Autograd even does a further optimization,
which is we don’t even have to remember what the views are
because every view is related to its base tensor
simply by the strides that are recorded in the view.
If you have read my blog post, which is an introduction to PyTorch,
I explain what strides are,
and, you know, any view operation boils down to a re-striding at some offset on the tensor.
So we just have this as-strided backwards that just gets applied in this situation.
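For example, a strided slice is literally just a re-striding at an offset, which is why a single as_strided relationship (and its backward) is enough to relate a view to its base:

import torch

a = torch.arange(12.0).reshape(3, 4)
v = a[1:, ::2]                                # an ordinary view

# The same view expressed directly as a re-striding of a at an offset.
w = a.as_strided(size=(2, 2), stride=(4, 2), storage_offset=4)

print(torch.equal(v, w))                      # True -- same elements, same underlying storage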
Of course, XLA doesn’t actually support striding,
so for XLA, we actually just replay the view operations,
and that’s how it goes about doing these updates.
So that’s basically how functionalization works.
So we don’t eagerly update all the aliases when we do a mutation.
Instead, we lazily update them when they get accessed.
This preserves the ref counting properties we want,
where we only ever have references from, you know,
subsidiary things to the computation graphs that preceded them,
and we don’t need to maintain lists of tensors
that tell us what the aliases of a base tensor are.
So another pretty interesting property about this scheme
is it’s actually quite a bit better than static analysis.
So, like, let’s imagine that you’re LLVM or some sort of compiler
or, like, the TorchScript compiler,
and you have a program,
and it’s got some mutation in it,
and you want to remove that mutation
because maybe you’ve got some functional optimizations
that work better in this situation.
Well, when you’re in the compiler setting,
it’s actually kind of difficult to remove all of the mutations
because you just don’t know what the aliasing properties of your inputs are.
This is why, actually, like, when you’re writing functions,
you sometimes put a restrict qualifier on a pointer argument,
which says, hey, this pointer input is guaranteed not to alias
with this other pointer input.
The restrict qualifier is so important
because the fact that you can prove that they don’t alias,
because you told the compiler that,
then enables a bunch of optimizations that the compiler can do.
But in general, the compiler has to be very conservative,
and it has to, like, sort of, you know,
you know, if it doesn’t know,
it has to assume, oh, this could alias with something else,
and that just impedes a huge number of optimizations you might do.
Whereas PyTorch, which is sort of just running this functionalization
as we run our program eagerly,
always has absolutely precise alias information
about what exactly aliases with what else.
And so we can absolutely perfectly remove mutation in this situation
without any loss of fidelity.
Of course, you know, like, this is only for a single trace,
whereas, you know, your optimizer might be working
under, you know, very different situations
where some things may alias sometimes and some things don’t, right?
So it’s the price of generality.
When we specialize, specialize, specialize to the specific case,
we can do something really good in this situation.
So that’s it for functionalization in PyTorch.
It is how we, you know, sometimes I like to tell people,
hey, you know, PyTorch kind of wore a hair shirt
where we were like, hey, we care about mutation,
we care about supporting in-place operations,
and then we had to take on a whole bunch of, you know, complexity.
Like, we actually have to work pretty hard
to make sure mutation works for our users.
But at the end of the day, like, how do we do this?
We map the mutable operations into the functional universe,
and then we do the things that, you know,
automatic differentiation, all that good stuff.
So it’s actually pretty nicely factored in this way.
And this is one of the, like, really joyful things
about working on PyTorch.
All right, that’s all I have to say today.
Talk to you all next time.
EP8 The-road-to-structured-kernels
Hi, my name is Edward, and welcome to today’s edition of the PyTorch Dev Podcast.
Today, I want to talk a little bit about structured kernels and metatensors,
a project that I’ve been working on for the better part of a year, maybe more than that at this point.
Structured kernels are basically a new way of writing kernels in PyTorch,
where you can, instead of writing a kernel from whole cloth that does all of the computation,
all of the determining whether or not the inputs are right, and all of the output shape size computation,
for example, it allows you to factor your kernels into a structured form, where you write a meta
function, which says, you know, what the input checks need to be, and what the output sizes are
going to be, and then an actual implementation function, for which you can then do separate
implementations for CPU and CUDA, and they reuse the meta function to do all the, you know, shape
checking, but then the actual implementation bits can be different in both cases. And then metatensors
are a sort of easy extension on top of this, which is that, well, once you have this meta function,
that all it does is check the input d types and figures out what the output shape needs to be,
you can actually then do a third tensor type, not CPU or CUDA, but meta, which simply says, okay,
that’s cool, you’ve figured out what the output shape needs to be, I’m done, I’m just going to give you
back that tensor without actually having done any of the computation at all. So metatensors are just
tensors that don’t have any storage associated with themselves. They just, you know, like, they’re just
sort of like an abstract interpretation of the tensor, just without the data in question.
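Here is what that looks like from Python today via the meta device, which is how metatensors surface to users (availability depends on your PyTorch version):

import torch

x = torch.randn(128, 64, device="meta")    # no storage is allocated
w = torch.randn(64, 32, device="meta")

y = x @ w                                  # only the "meta" part runs: shape and dtype inference
print(y.shape, y.dtype, y.device)          # torch.Size([128, 32]) torch.float32 meta
# y has no data; anything that needs values (e.g. y[0, 0].item()) would fail.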
So these are two new sort of features slash endeavors slash projects that have been going on in PyTorch.
Not every kernel is structured. There’s a bunch of kernels that you can port to structured if you want.
And I’ve got a very detailed RFC on the topic in the PyTorch RFC’s repository.
And that’s not really what I want to talk about today. I’m not going to tell you really about how structured kernels work.
So I just want to talk a little bit about the history behind structured kernels. And in particular,
and the reason why I’m doing this episode, Anjali Chordia asked me, hey, Edward, you know,
why did it take so long for us to do structured kernels? They seem like a pretty simple idea.
This is not her words, but I’m elaborating. They seem like a simple idea. Like, you know,
of course you don’t want to write the shape checks multiple times in your CPU and CUDA kernels.
How come, you know, it wasn’t done this way from the beginning? How come we didn’t do it earlier?
And this is actually a pretty good question because for me, I was, you know,
originally when I decided that I was going to work on this, I thought to myself, oh, you know,
I’ll be able to wrap this up in a half. I’ll be able to port, you know, 80% of all operators.
Life will be great. You know, what could possibly go wrong? Well, a lot of things. So let’s talk about
that. Before I dive into when we started working on structured kernels, it’s useful to think about
sort of what problems were showing up for us in PyTorch development that sort of led to the idea
that we actually need to invest some time on this. And there are two like very distant causes that sort
have caused us some consternation and we didn’t really act on them. And then a more immediate
cause. And I want to talk about the distant causes first. So distant cause one was, um, we were
writing compiler passes for the JIT, and they needed to do shape propagation. And there was
a problem, right? Which is that like, Hey, uh, you know, you’ve got some input shapes and, uh, you know,
you’re running an add on them and you don’t know what the output shape is. How do you actually
compute it? And so remember like, you know, PyTorch as it is written mostly today and historically the
way it’s written, um, all the shape checks, all the output computation, they’re all sort of interleaved
with the actual kernel computation that does the honest to goodness work. So if someone came to you
and they said, hey, you know, I want to know what the output shape of this add on these two tensors
of these sizes are, but like, I don’t want you to actually do any compute. I’d actually not have a good
answer for you because there wouldn’t actually be any way to call this code in the situation.
So what did people do? Well, you know, we could have done something like structured kernels,
but we sort of routed around the problem by just being like, okay, we’re just going to build a,
we’re just going to write the formulas ourselves. Cause like a lot of these operators,
the shape calculations are really simple and you know, what could possibly go wrong? So we wrote a bunch
of, um, shape, you know, transfer functions that like, you know, said abstractly what various
operators did. And these promptly fell out of date and no one uses them because like the coverage is
really bad. And a lot of them are wrong. And they’re wrong for really interesting reasons,
because it turns out that computing the output size of, like, an add is actually really complicated
in PyTorch. There’s a lot of things that go into it because it’s not just, Oh yeah, if the two sizes
are the same, then I give you an output that’s the same size. Because, hey, there is, you know,
broadcasting to worry about, there is type promotion to worry about if you care about dtypes,
which you often do in compiler passes, and there’s strides to care about. If you’re, like, doing memory
layouts, actually the stride handling for, like, you know, pointwise operations is really,
really complicated, because we need to answer questions like, if I add an NCHW and an
NHWC tensor together, what is the output layout? And like, these are questions that are all resolved
in the actual kernel today. And if you’re just, like, someone who, you know, doesn’t
really care about these shape functions, you’re just trying to do some other work, right, that
happens to use these shape functions, you’re not going to spend the time thinking about all of the
exhaustive error cases that go into this problem. So, okay. So we needed some sort of, um, shape pass
for JIT, and we wrote a kind of crappy one, and now no one uses it. Actually, like, when people really need
accurate shape information, what typically happens is they just trace through an
honest-to-goodness real execution of the PyTorch kernels, running through the actual kernels in
question. And then that gives you super accurate, you know, sizes and shapes and dtypes and layouts
for everything that happened. And then you can like, just use that information directly. Right.
So like, you just worked around the fact that you didn’t actually have a function that you could
have just called to find out what the shapes computed to be. So this is like, kind of like,
you know, ah, this kind of sucks, but sounds like refactoring everything in PyTorch to like put the
shape computation separately. Seems like a lot of work. So, you know, I’m just a compiler developer.
I’m not going to work on it. And so things stay like that for a while. The second inkling we had
that there would be a need for structured kernels was this like very old proposal called async CPU.
So what is async CPU? Well, you know, when we look at normal PyTorch programs, there’s two devices that
everyone uses, CPU and CUDA, right? CPU is synchronous. You, like, say, okay, I want to do an
add and it goes ahead and does the add. And then once the add’s done, you get a new CPU tensor with the
result of having done the add. CUDA is asynchronous. I talked a little bit about this in my previous
podcast about, you know, just enough to be dangerous in CUDA, right? When you run a CUDA kernel, we
actually run ahead and return to you immediately while the CUDA kernel is still processing. And
eventually we, um, uh, we can keep queuing more kernels. And only when we do a synchronized, we
actually observe the result. Well, there’s nothing special about being asynchronous that requires it to
only happen on CUDA. And so if we are CPU, we can also just do, um, a version of CPU that’s
asynchronous, right? So you like cue some work onto some thread pool and then the thread pool goes off
and starts doing the CPU work. And then, you know, you actually return immediately. And so if your CPU
computations are very beefy, uh, then, you know, you might actually profitably reduce latency this way
because you can keep, you know, running your control thread along while, you know, you’re chugging out the
actual, uh, CPU computations. So this was kind of cool. And, you know, we were talking about this,
um, during the time and there was a problem. And the problem was like, we really wanted to reuse all
the existing CPU kernels. We didn’t want to write an entirely new backend for async CPU. That would be
silly, right? Because we got these perfectly good regular CPU kernels. We just need to make them async.
But there was a problem. If you want to return immediately after queuing up the work on the pool,
you need to return a tensor. And that tensor you return needs to actually have all of the metadata,
the sizes, the dtypes, the layouts, all that stuff, because we have a ton of code that assumes it
can access this information without inducing a sync. And in CUDA, this isn't really a problem because we
already did the copy paste, uh, from CPU kernels to CUDA kernels. So like the CUDA kernels knew
how to compute all the shapes while also asynchronously firing off the kernels, because
that’s what the CUDA runtime dealt with. But like, if we were going to do this entirely new async CPU
backend, it would be really silly if we like copy pasted every single CPU kernel and then like
async-ified it. That would just be a terrible maintenance problem. And so we couldn't implement
async CPU because, once again, there was no way to get at the shape computations without doing a
huge refactor of PyTorch. And there weren't really that many compelling use cases for async CPU at
the time. So we just let that lie. It was just like, okay, well, we can't do this, but maybe it
doesn't really matter. And there was always other stuff to work on at the time.
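As a toy sketch of the idea (made-up names, nothing like the real proposal, and using today's meta
device as a stand-in for the shape function that didn't exist at the time): produce the output
metadata eagerly, queue the real work on a thread pool, and only block when somebody needs the data.
- async_cpu_sketch.py
import torch
from concurrent.futures import ThreadPoolExecutor

# Toy model of "async CPU": metadata up front, computation deferred to a pool.
_pool = ThreadPoolExecutor(max_workers=4)

class AsyncCPUTensor:
    def __init__(self, meta, future):
        self.shape, self.dtype = meta.shape, meta.dtype   # available immediately
        self._future = future                             # the real add, still running

    def wait(self):
        return self._future.result()                      # the synchronization point

def async_add(a, b):
    meta = torch.add(a.to("meta"), b.to("meta"))          # metadata only, no data touched
    return AsyncCPUTensor(meta, _pool.submit(torch.add, a, b))

r = async_add(torch.randn(2000, 2000), torch.randn(2000, 2000))
print(r.shape, r.dtype)     # the control thread keeps going right away
print(r.wait().sum())       # blocks until the actual computation finishes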
The thing that actually convinced me that we needed to actually spend some time doing this
refactoring work was, um, when I was working with Bram Wasti on this project called lazy tensor,
um, lazy tensors are this concept that like keeps coming back again and again. Um, and it’s just,
you know, instead of, uh, eagerly executing computations, when you ask for them in your eager
mode API, we wait, we say, okay, we’re not going to actually run these computations because
maybe we will notice, uh, that, uh, there’s a sequence of operations that happen and they can
be fused together. And then now I can actually, um, you know, use some fuse kernel in this case and
run a lot faster in this situation. Lazy is different from tracing, because with tracing you run the
entire computation through once, you capture whatever the control flow was at that time, and then you
compile the entire trace. Laziness is trying to be this more controlled situation where you can run
your code repeatedly, and we'll keep lazily evaluating and doing the optimizations every time. So in
theory, anything you could run in eager mode you could also run with a lazy tensor, except you can
pass it to some graph backend that does optimization. It's very similar to tracing, but the
difference is you do expect to run the eager code every time; if the trace is the same, you reuse
it, and otherwise you recompile. XLA, by the way, is an example of a lazy tensor backend in PyTorch.
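Here's a toy sketch of the lazy tensor idea, just to fix intuition (the names are made up; the real
thing is the torch/XLA machinery): record what the user asked for instead of running it, and only
execute, and potentially fuse or optimize, when the value is actually needed.
- lazy_tensor_sketch.py
import torch

class LazyTensor:
    def __init__(self, op, inputs):
        self.op, self.inputs = op, inputs            # a tiny IR node

    def __add__(self, other):
        return LazyTensor("add", [self, other])

    def __mul__(self, other):
        return LazyTensor("mul", [self, other])

    def materialize(self):
        # A real backend would hand the whole recorded graph to a compiler here.
        args = [x.materialize() if isinstance(x, LazyTensor) else x for x in self.inputs]
        return getattr(torch, self.op)(*args)

def lazy(t):
    return LazyTensor("clone", [t])

a, b = torch.randn(3), torch.randn(3)
c = lazy(a) + lazy(b) * lazy(b)     # nothing has executed yet
print(c.materialize())              # the graph is walked and run only here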
Okay, so Bram and I were working on this prototype. Well, really, Bram was doing all the work and I
was advising as someone working in core PyTorch. And besides all the design problems that lazy
tensors entail, which would be a great story for another day on this podcast, something became
clear, which is that, hey, when you do a lazy tensor, you need to return a tensor, and that tensor
needs to have valid sizes and strides and dtypes, but you didn't actually run your computation. That
was like, oh my God, this is terrible, this is exactly the same problem we've run into before. You
know, third time's a charm. Let's actually do this. And so
I pitched structured kernels as this project and thus embarked on this year long journey to like
actually bring structured kernels into PyTorch. Why did it take so long to do structured kernels?
Well, there’s, you know, a really difficult problem whenever you want to do any development in PyTorch,
which is we have too many goddamn operators. So one of the things that I did before embarking on the
structured kernels project was to try to taxonomize every operator in PyTorch. I actually have a
spreadsheet of all our operators; I went through them one by one and tried to classify what kind of
thing they did, what kind of shape computation they were. And it was something like 1700 operators.
This is slightly inflated because when there was an in-place, an out-of-place, and an out variant, I
counted them all separately, but still: 1700 operators. That's a lot of operators that you actually
have to deal with. And we keep adding new operators every release, so this number just keeps going
up. So, oh my God, how the heck are we going to
actually refactor all of this code? And it’s even worse because, uh, remember like PyTorch came from
LuaTorch, which came from Torch 7. And so there’s like this legacy C TH implementation. And actually
like we had already started a project for porting these crufty TH kernels written in C, written in
this bastardized macro system and getting them into a more shiny modern C++. And even to this day, we are
still not done getting rid of all the TH kernels. So like that’s a lot of work and structured kernels,
like refactoring kernels in this way would have been a lot more work. So like the first thing that
I like had to grapple with was like, how the heck am I actually going to like stage this change in a
reasonable way so that we can like start partially migrating things while not having problems.
The second big problem that I ran into was tensor iterator. So for those of you who don’t know,
tensor iterator is the class in PyTorch, which was responsible for implementing all of our unary,
binary, and basically all of our like, you know, kernels that like, you know, basically know how
to operate on strided tensors. Tensor iterator is pretty cool. It does a lot of interesting stuff.
It's also really, really, really, really complicated. Remember when I asked, how do you do add?
Well, there's type promotion, and there's layout propagation, and there's all that stuff, and a lot
of it lives in TensorIterator. It's this big ball of code that no one really knows how to refactor.
And I needed to somehow not duplicate this code, because it's really complicated code and I don't
want two copies of it, while at the same time making it possible to use it without running the
computation, even though it's embedded into this giant monolithic TensorIterator class. I had no
idea how to do that. It took, I don't know, I think it took two months to figure out a reasonable
design for structured kernels that could actually deal with this. Basically, I added a virtual
method to TensorIterator that gets invoked once it has figured out what the sizes and the shapes and
the dtypes are, and then overrode it to call into the structured kernel machinery. The technical
details are important, but basically it's a big blob of legacy code. And originally I was like, I'm
just not going to solve this problem, because TensorIterator is too complicated, someone should just
rewrite it. But add and sub and all these really important operators are TensorIterator kernels, so
I needed to, in fact, figure out some way to actually solve this problem. So yeah, that all took a while.
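To give a feel for what the design buys you, here's a toy sketch of the meta/impl split (this is not
the real codegen or the real TensorIterator hook, just the shape of the idea): one function that only
computes output metadata, one that fills a preallocated output, and everything else derived from
those two pieces.
- structured_sketch.py
import torch

def add_meta(a, b):
    # Metadata only: broadcasting and type promotion, no compute.
    shape = torch.broadcast_shapes(a.shape, b.shape)
    dtype = torch.promote_types(a.dtype, b.dtype)
    return shape, dtype

def add_impl(out, a, b):
    torch.add(a, b, out=out)     # stand-in for the real kernel

def add_functional(a, b):
    shape, dtype = add_meta(a, b)
    out = torch.empty(shape, dtype=dtype)
    add_impl(out, a, b)
    return out

def add_shape_only(a, b):
    # What a lazy tensor or async backend needs: metadata with no compute at all.
    return add_meta(a, b)

print(add_functional(torch.randn(3, 1), torch.randn(1, 4)).shape)
print(add_shape_only(torch.randn(3, 1), torch.randn(1, 4)))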
And we’re still not done. There’s still a lot of kernels that need to be ported to structured,
but we’re in a much better spot right now. There’s a lot of work going on porting kernels to structured
in PyTorch. We’re getting better and better coverage. We’re hoping to hit covering all the operators
that XLA supports. That’s a really decent chunk of operators. And I don’t know, I’m pretty optimistic
about the project, even though, you know, it’s like sort of sucked up my time and energy for a year at this
point. That's all I really wanted to say about structured kernels and meta tensors. Meta tensors, by
the way, are really simple, right? But how are you going to test them? Getting testing to work on
them was also a project, but I'm out of time. I'm going to leave you all here. Thanks all for
listening. See y'all next time.
EP9 Backend-extensibility
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to take an ambling
journey through what we like to call, in PyTorch, backend extensibility. What do I mean by
backend extensibility? Well, PyTorch, you know, as a project has a number of things that it’s
supposed to do. And one of the things that it’s supposed to give is GPU acceleration, right?
Because if we didn’t have GPU acceleration, you could just use NumPy and do a lot of things you
wanted to do there. But GPU acceleration means that you can take your same PyTorch program and run it
not only on CPU, but also on CUDA at the same time. And so we call these things backends. At the very
beginning of the PyTorch project, CPU and CUDA were the only backends that were available. If you go
back to the LuaTorch days, or also further back to the Torch 7 days, right, there was the TH library,
and there’s the THC library. And those were really the only backends in town. So everyone sort of
used one or the other. And when PyTorch was initially released, that was still the same
deal. We had a CPU, and we had a CUDA, and that’s all that happened. Then over time, people, you know,
came to us and they're like, oh, I've got some hardware, or I've got some other backend, and I'd
also like to use it with PyTorch as the framework. And we started working on making it possible to add
more and more backends to PyTorch. So that’s what I mean by backend extensibility. And so where is PyTorch
when it comes to backend extensibility? Well, let’s dig into it. So the first thing to really know
about PyTorch is that from the beginning, it was designed for CPU and CUDA. So if you have something
that looks a lot like CUDA as the backend you want to do, things are going to work out okay for you.
And so a really good example of something that's really like CUDA, in fact so like CUDA that it's
literally transpiling CUDA kernels into their own kernel language, is the AMD HIP/ROCm project. So,
you know, CUDA is an NVIDIA invention, and it's only targeted at NVIDIA GPUs. AMD also produces
GPUs, and for the longest time they didn't have any general purpose programming capabilities. Well,
ROCm is AMD's stack for doing general purpose programming on their GPUs. And the way they set things
up was, okay, CUDA has a decade-long head start in building a software stack, really one of the key
advantages of being on NVIDIA hardware, so let's just try to make use of as much of it as possible.
The way ROCm works is that they have a language for kernels that is basically the same as CUDA; they
copy-pasted as much of the language as possible. So when you want to write a kernel in PyTorch, you
write a CUDA kernel, and then in our AMD HIP/ROCm build we actually transpile this kernel by doing a
bunch of regular expression replaces, really. We call it hipification, but it really is just a bunch
of search-and-replace on strings. We hipify it into a HIP kernel, we just directly go ahead and
compile that, and that's what you actually run if you're running your PyTorch on an AMD GPU.
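As a sketch of how "just a bunch of search-and-replace" looks (the real hipify script in the PyTorch
build is much longer and table-driven; this only shows the flavor):
- hipify_sketch.py
import re

# A vastly simplified CUDA-to-HIP mapping table.
CUDA_TO_HIP = {
    r"\bcudaMalloc\b": "hipMalloc",
    r"\bcudaMemcpy\b": "hipMemcpy",
    r"\bcudaStream_t\b": "hipStream_t",
}

def hipify(source: str) -> str:
    for pattern, replacement in CUDA_TO_HIP.items():
        source = re.sub(pattern, replacement, source)
    return source

cuda_src = "cudaStream_t s; cudaMalloc(&ptr, n); cudaMemcpy(dst, src, n, kind);"
print(hipify(cuda_src))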
This actually works, by the way. There aren't that many GPUs that AMD releases that support ROCm,
but if you have one of those GPUs, you can run your PyTorch programs on it. We did make some weird
choices with ROCm, because this was one of the first external backends that we added to PyTorch. One
of the weird choices we made was, well, remember PyTorch is only CPU and CUDA, and while ROCm is this
thing, no one is writing code so that it works with ROCm. So what they did was, okay, we're just
going to not rename any of the user-facing interface. So if you want to put your tensor into ROCm
memory, just call .cuda() on it with our special AMD build of PyTorch, which is mutually exclusive
from the regular NVIDIA builds of PyTorch, and that'll put it on the ROCm GPU. Kind of goofy, but it
worked out okay for them. It was easy for them to reuse all of our tests, right? Because all our
tests were just written assuming that CUDA was the thing, so not really a big deal. Kind of annoying
from my perspective; I really hate that it's HIP masquerading as CUDA, and I'd really like them to
fix it someday, but I haven't been able to get them to, so that's how they do it. And because ROCm
is so, so similar to CUDA, literally most things that CUDA provides, like streams, the current
device, cuDNN, they all translate into the HIP world. So it wasn't too hard. It's not too hard,
right, when someone is
doing the API exactly the same way. The sort of next, um, example of device extensibility that sort
of lives in PyTorch’s history is our XLA integration. So XLA, if you’re not aware is
Google’s underlying, um, compiler for TensorFlow. So like TensorFlow is the front end that you can
write your neural networks in, but then XLA is this compiler that can take in, um, a high level,
IR and then compile it. And so, for example, JAX, the new darling of Google's researchers, also
targets XLA and, you know, it doesn't have to share any
TensorFlow code, but it just, you know, uses the underlying compiler. So XLA is pretty cool. If you
want to run your code on TPUs, Google’s hardware for deep learning, XLA is the only game in town.
And so we wanted to also make it possible to run PyTorch programs on TPUs. And as a result,
we invested in an XLA integration. Now, XLA is a lot different from CUDA and ROCm, right? CUDA, if
you remember from my podcast about knowing just enough about CUDA to be dangerous, is an
asynchronous programming model: you have a bunch of kernels, you call into CUDA, and CUDA goes and
says, okay, these are the things I'm going to run as you told me to. Well, XLA is nothing like that.
XLA is a graph mode compiler. It expects to be given a graph of high-level IR, and then it will
actually optimize it and run it for you. So we had to do something completely different for XLA. And
this is what we did: we added a new XLA type because, unlike HIP, XLA has a bunch of integration that
needs to work with the XLA code base, and so we wanted to let that live out of tree. So what we did
was we put in a bunch of hooks in PyTorch core to make this all work. One of the things we did is
there's an XLA device type, similar to how there's a CPU and a CUDA device type. So you have to go
and send a PR to PyTorch core and say, hey, you know, can I have
this device type? So we put in this device type, and then we have this dispatcher thing, which I've
also talked about in a previous podcast. The dispatcher is an entry point where, for any of the
operators that are defined in native_functions.yaml, you can define your own implementation and
register it at runtime. So literally, you have a dynamic library that has a bunch of static
initializers, and those static initializers are registrations. There's a very user-friendly API,
modeled off of pybind11, that you can use to do this. So you register those things, and then
whenever there's a tensor that is an XLA tensor, when you call any code in PyTorch it will hit the
dispatcher, see that, oh, this is an XLA tensor, and route to the dynamically loaded XLA
implementation that does whatever you want.
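To make the routing idea concrete, here's a toy model in Python. The real mechanism is the C++
dispatcher and the TORCH_LIBRARY registration macros; the backend name below is made up.
- dispatch_sketch.py
import torch

# A per-operator table keyed by "dispatch key", filled in by whoever provides a backend.
DISPATCH_TABLE = {"add": {}}

def register(op, key):
    def wrap(fn):
        DISPATCH_TABLE[op][key] = fn
        return fn
    return wrap

@register("add", "CPU")
def add_cpu(a, b):
    return torch.add(a, b)

@register("add", "MyAccelerator")                           # hypothetical out-of-tree backend
def add_my_accelerator(a, b):
    print("recording a graph node instead of computing")    # XLA-style laziness
    return torch.add(a, b)                                   # toy stand-in

def dispatch(op, key, *args):
    return DISPATCH_TABLE[op][key](*args)    # route based on the dispatch key

print(dispatch("add", "CPU", torch.ones(2), torch.ones(2)))
print(dispatch("add", "MyAccelerator", torch.ones(2), torch.ones(2)))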
And then XLA itself, because it's a graph mode thing, doesn't actually do any computation; it just
goes ahead and builds a computation graph. So at the front end there's some stuff you have to do
differently, like there's a special set of optimizers which know how to deal with the fact that XLA
computation is lazy and not eager. But XLA was sort of the first actually usable external backend
that we developed for, and we did only a so-so job in supporting them in
their endeavor. So actually, uh, you know, it turns out there’s a lot of boilerplate you have to write
when you want to add support for an external backend. And also XLA doesn’t support all operators in PyTorch,
right? PyTorch has a lot of operators and, um, XLA, you know, well, XLA is cool. And there’s a lot of
operators that it just doesn’t support or, you know, like they just didn’t have time to add support
for. So XLA also has this mechanism for allowing a fallback to CPU, where it's like, okay, you're
running your PyTorch program on XLA, and then you get to some operator that XLA doesn't know how to
do. XLA will go ahead and compute the actual output for you, given the XLA graph that was at hand,
fall back to calling the regular PyTorch CPU kernel on CPU tensors, and then move the result back
to XLA. It's kind of a question whether or not this is a good idea to do by default, because these
are terrible performance cliffs, but it's really useful if you just want to figure out whether your
program is going to run at all, right, just being able to move between these worlds. So because
there's kind of a lot of infrastructure you have to write, XLA actually went ahead and built a mini
code generator. It's this Python script that gets run during the build process of XLA that actually
generates all the code that registers with PyTorch. And XLA predates our nice registration API; we
had a not-so-nice registration
API before, and it was very hard to use. And so XLA has this code and it does all this stuff and it’s
not so nice. So actually Brian Hirsch, one of the members of the composability team, has recently
been working on revamping XLA's code generation and letting it live in PyTorch as a thing that
external backends can use when they want to plug into our system and get all the niceties, like
falling back to CPU in this situation.
So, rewinding a sec: what does adding a new backend to PyTorch look like in the XLA universe? Well,
first you need to send a PR to the PyTorch main repository saying, hey, I've got this new device
called XLA, I need PyTorch to know about it. You also have to tell PyTorch about this dispatch key
thing. The dispatch key is the thing we actually do the virtual call based on. It's different from
the device type, because not all device types have dispatch keys, and we also have a bunch of
concepts that aren't device types, like vmap and meta tensors (actually, meta tensors count as a
device), and autograd, and these things aren't really devices in their own right. So the dispatch
key is this generalized idea of all the things you might dispatch to. You have to go ask for a
dispatch key to be added. And then there's a little bit of Python binding code, which we never wrote
in a generic way, so you've got to go edit those parts. But once you do all those things, you
basically don't need to do anything else in PyTorch core, right? Because there's this virtual table
that you can manually program using the TORCH_LIBRARY macros. This is what XLA does. It's a little
not-so-nice to do this directly, and so people have resorted to code generation to actually manage
these things. So this is basically the current state
of the art in external backends. Um, I guess something that’s a little good to talk about is like,
what are some of the challenges of doing an external backend in this way? Because we’ve
actually had a bunch of people try to get onto this treadmill, and you'll see why I call it a
treadmill in a moment: for example, ML Compute from Apple, and Intel as well. So, some of the
reasons why this is a little difficult. One is that PyTorch cares a lot about backwards
compatibility, but only for our end users, not our backend extenders. Here's an example of a
backwards compatible change you can make: say there's some function and we want to add some new
functionality to it. We could add a new argument to that function and give it a default value. From
the perspective of someone using PyTorch from Python, this is perfectly backwards compatible,
because, hey, if I'm not passing this argument, it'll get defaulted, and then I will ostensibly get
whatever old behavior I had.
Well, that's not the case when you're writing a backend, because when you have to register one of
these functions that knows how to process this operator, you have a problem, right? This operator is
now trying to give you extra arguments, and you're like, oh, well, my old operator implementation
only knows how to handle three arguments, not four arguments. What do you do?
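Here's a made-up illustration of the mismatch (the operator and backend names are hypothetical):
- bc_example.py
# To Python callers this schema change is backwards compatible, because the
# new argument has a default:
old_schema = "my_op(Tensor self, Tensor other) -> Tensor"
new_schema = "my_op(Tensor self, Tensor other, *, bool fancy=False) -> Tensor"

# But a backend registered a kernel against the old schema, taking exactly two
# tensors. After the change, the dispatcher wants to hand that kernel a third
# argument, and the old registration no longer lines up:
def my_backend_kernel(self, other):       # written against old_schema
    ...
# the new schema now effectively calls: my_backend_kernel(self, other, fancy)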
In principle, this isn't actually a BC-breaking change, right? We could somehow detect that the user
gave us the defaulted argument and then call your old function without that argument in that case,
and if they give a non-default value for the new argument, then we error in that case. But it's kind
of hard to do this in C++ only, so this is not something that we've worked on. And the upshot is
that, if you want to
do a backend, you're actually going to have to do a lot of work keeping all of your registered
functions up to date with the changes to all the operators, because we keep adding new operators and
we keep adding new knobs on operators. So it's kind of a treadmill keeping up to date. By the way,
XLA can keep up to date because we actually have it included as part of the build system in PyTorch.
So whenever you're working on a new operator in PyTorch, XLA will tell you if you broke XLA, and
then only through the heroic efforts of Jack, Sal, and the rest of the XLA team does this actually
work okay. Because if you
accidentally break the XLA test as an average PyTorch developer who knows nothing about XLA, you can
just sort of send up the bat signal and they will make the compatibility patch to get XLA going.
What I've heard some other people have done when they're extending the backend in this way is they
just don't bother, and then every release they try to catch up, and there's a ton of stuff, and this
is not so great. I'm hopeful that Brian Hirsch's work on a code gen for external backends can make
this easier, because there are some things that we technically just can't do in pure C++ but are
easier for us to do in Python. But this code is still in the process of being landed for XLA. It's
really close; we actually tried landing it a few days ago and it got reverted because it broke
something, but not for a good reason. It's been
passing half of the CIs for a good while now. So I’m out of time. Um, there’s probably more things
about backend extensibility that I should talk about, but, um, I’ll save them for another podcast.
See you next time.
EP10 The-life-and-death-of-Variable
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about a topic that
I’ve received two requests for when soliciting topics to talk about in the podcast, and this
topic is variable. Actually, it’s a kind of strange topic to be talking about, because if you look at
PyTorch today in the Python frontend, there actually is no variable anymore, and that’s
because we got rid of it. It was a banner feature in PyTorch 0.4, and then a bit later, we actually
got rid of it in the C++ code, although there’s still a bunch of places where we still talk about
variable. That’s just because we’ve been too lazy to rename all the type names in the code base.
But it’s still really useful to know the history behind variable, because there are a lot of strange
APIs that still exist because of the fact that tensors were structured in a different way. And
it’s also kind of informative just to look at how the format of tensors has evolved over time,
and also where they might be going in the future, because I would definitely not be the person who
would say that we are in a perfect state. So where does our story begin? Our story begins a long,
long time ago, even before the existence of C++ Autograd and PyTorch in LuaTorch. So in LuaTorch,
tensor was represented as a C struct. And remember this thing, right, how the TH library in LuaTorch
has a bunch of C code that’s munged about with a preprocessor? Well, that’s true for the data type
as well. So when we wrote TH code, we had a C struct, and there was a separate C struct for every
dtype we supported. So there was a THFloatTensor, a THDoubleTensor, a THIntTensor, and so forth.
This made life really hard if you wanted to write code polymorphically over different tensor types,
but it didn’t matter because we were just, you know, rewriting all of our code every time when you,
like, wrote code in TH. We just, you know, redefined the macros and then stamped out different versions
of the code. So along comes PyTorch, and we’re still using the good old-fashioned TH tensors.
And Zach comes along, right, and he wants to build this ATen C++ library. And one of the things
that he needs in the C++ library is he wants to be able to write code polymorphically over device types
without templating them. Because you see, in C++, if you write templated code, you don’t actually get
to type check the contents of your template, right? Like, the way C++ works, well, until, you know,
C++ 20, whatever, concepts come along, the way C++ works is that when you write a templated function,
C++ only checks the stuff that isn’t related to the template. Anything related to the template
is deferred until you actually instantiate the template in question. So, you know, C++ templates
are famously a source of really bad error messages. And so, you know, we had a bunch of people who were
previously writing all of our operations in Python, and we were going to try to write them in C++.
And so forcing them to template all their code on dtype would have been a really, really bad idea.
So if there was one good idea in the ATen library, it was this: don't parametrize your tensor type
on dtype. Okay, so we had a single tensor type, and we put it all together, and we said, okay,
there's going to be a single TensorImpl that represents all the dtypes in question. And that's going
to be pretty cool. But remember that the TH library and Zach's ATen library
didn’t know anything about automatic differentiation. And at the time, AD was implemented entirely in
Python. So there was, like, no concept of this in C++. And this was true in LuaTorch as well. AD was a
thing that was implemented in Lua, not inside the libraries themselves. And so when Sam came along,
and he was like, oh, my God, you know, autograd is too slow, we need to make it faster. And we’re
going to do it by porting it to C++. He was in the position of needing to write an implementation of
autograd in C++ rather than in Python. And so the most obvious way to do this was to preserve the
abstraction barrier that was enforced upon us when autograd was written in Python, namely that the
tensor subsystem knows nothing about automatic differentiation. So let’s think about it, right?
Like, say you have some library that gives you a tensor object and lets you do various basic operations
on them. Well, what if you want to augment this with some notion of history and a notion of an autograd
tape that you record graph operations to do later when you want to autograd on them? Well, if you
have this strong abstraction barrier between the tensor and the AD system, you can't actually modify
your tensors to add the new metadata you need. So what are you going to do? You're going to wrap
them in a Variable. So Variables were just this wrapper around tensors that gave all the extra
metadata that you needed to get autograd working. And so it started off as a requirement, right,
because autograd was written in Python. And then when we moved everything to C++, well, the easiest
thing to do was to preserve this abstraction barrier. So we had everything in C++, but it was still
implemented as: there is a Variable wrapper, and it sits on top of the ATen library. In fact, they
even lived in separate dynamic libraries, if you remember the dynamic libraries podcast.
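As a rough sketch of what that wrapper looked like conceptually (this is a toy, not the real
autograd code; the names are illustrative):
- variable_sketch.py
import torch

class Variable:
    def __init__(self, data, requires_grad=False):
        self.data = data                   # the plain tensor inside
        self.requires_grad = requires_grad
        self.grad_fn = None                # autograd history lives on the wrapper

    def __add__(self, other):
        # Record history on the wrappers, then unwrap and call the "real"
        # tensor operation on the plain tensors inside.
        out = Variable(self.data + other.data,
                       requires_grad=self.requires_grad or other.requires_grad)
        if out.requires_grad:
            out.grad_fn = ("AddBackward", (self, other))
        return out

x = Variable(torch.ones(3), requires_grad=True)
y = Variable(torch.zeros(3))
z = x + y
print(z.data, z.requires_grad, z.grad_fn)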
So, okay, so we've got this Variable concept, and it's like the PyTorch 0.3 days. We've got tons of
people using PyTorch and they love it. And we keep getting all these
questions about when should I wrap my tensors and variables? What’s the difference between a variable
and a tensor? When do I use dot data to get a tensor out? And what we discovered is that it was actually
really, really confusing for people to have to manage both variables and tensors. Now, it is a really like
easy way to organize the code when we were implementing it. But the problem from the user experience
perspective is there’s too much expressivity, there’s too much freedom in this representation. Namely,
you can have a tensor, you can have a variable that doesn’t require grad, and you can have another
variable that does require grad. And the problem is that, you know, each of these three states,
the tensor state and the variable doesn’t require grad state, these states are basically the same.
Like, semantically, they do exactly the same thing. The only problem is, well, you know,
while you’ve got this variable thing, you’ve got this tensor thing. So people have to, you know,
worry about, you know, switching between these two modes, even though like, you know,
if they’re just thinking about like, what is it they want to do, right? Like, what they really want to do
is they want some tensors to record gradients and some to not. And having to deal with this extra
distinction that doesn't do anything useful, well, that's pretty confusing,
and they don’t like that. So we were like, okay, in 0.4, we want to get rid of variable,
right? And we want to just make it so that when you’re writing PyTorch code, you don’t have to deal
with, you know, remembering if you wrapped something in a variable or not. So we got rid of variable.
How did we do it? Well, we cheated. The way we cheated was we just said, okay, well,
we’ve got this big C++ implementation with variable to tensor. And like, oh, you know,
it’s a ton of code to refactor, we don’t really feel like refactoring it. Also,
we didn’t actually know how to do this refactor. So here’s what we’re going to do. In PyTorch,
we’re only going to provide you variables. So like this thing that we call tensor, secretly,
it's a variable. And that means we've eliminated this illegal state; the illegal state is now a bare
tensor, right, because all you have are variables, or variables with requires_grad. And that worked
pretty well for a while. So we had this problem, though, which is that like, in the Python API,
there’s only tensor. But if you like, dive down to C++, and you’re like a C++ writer,
there’s actually still this variable concept. And so one of the things that like, we really wanted to do
was, you know, hey, like, maybe we want the Python and the C++ APIs to look the same. Like,
maybe that’s a good idea. And we can do it. But there’s a problem. And here’s the problem. The
problem is that the way we implemented autograd is via this unwrapping operation on variables. So
the idea is that like, you have a bunch of variables floating around, you do some operation on them.
And when you do the operation, well, you know, you’ve got a variable, so you go over to the variable
implementation. And let’s say you’re doing the implementation of add. So we’re going to set up some
autograd graph, right to like, you know, record, and then we want to actually run the original,
the original code that actually implements the add kernel. So how do we do that? Well,
inside every variable, it’s there’s a tensor that you can unwrap from it. So we just unwrap the tensors
from the variables, and then we call add on those. And those are just tensors, not variables. And so
we can actually get to the actual kernel in question. So how do we do this if there's no separation
between variables and tensors, if every tensor is a variable? And you think
to yourself, Oh, yeah, you know, Ed, what you should just do in this situation is you should like make a
super call, right? You've got your autograd code, and then you just want to call super::add,
and that’ll bounce you over to, you know, whatever the, you know, the parent implementation is ostensibly
doing the actual addition. But we have a lot of operators in PyTorch. And many of these operators
actually call other operators in their implementations. And when they call those other operators,
you don't actually want them to hit autograd in that situation; you want them to go straight to the
non-autograd, actual kernel computation, right? Because it's sort of like, once you do an autograd
call, you're done. There's no internal autograd bookkeeping left to do; it's a single atomic unit in
this situation, so you want to bypass everything underneath. Those of you who have read my
dispatcher talk know how we solved this problem. So Will Feng implemented the C++ tensor-variable
merge, and the way we solved the problem was we introduced some thread-local state. What we said
was, okay, we're going to have these variables, and we're going to do our autograd stuff on them,
and then we're going to set some thread-local state that says, don't do any more autograd stuff.
That's actually what AutoNonVariableTypeMode used to do. We've killed that now; check out the
inference mode podcast for
more details on that. So we set this TLS, and now whenever we do function calls, we just check: is
the autograd-skip TLS bit set? And if it is set, then we go to the actual kernel instead. The actual
implementation is more complicated than that, but if you're just thinking about autograd, this is
all you need to know. And so in that way, we didn't actually have to do any unwrapping step to make
it so that we stopped running the autograd code and started running the tensor code.
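Here's a toy model of the thread-local trick, just to show the control flow (the real mechanism is
the dispatcher's TLS include/exclude sets, not this):
- tls_sketch.py
import threading
import torch

_tls = threading.local()

def autograd_disabled():
    return getattr(_tls, "skip_autograd", False)

def add(a, b):
    if not autograd_disabled():
        # "Autograd layer": record history, then re-enter with the bit set.
        print("recording autograd graph for add")
        _tls.skip_autograd = True
        try:
            return add(a, b)               # same entry point, no unwrapping needed
        finally:
            _tls.skip_autograd = False
    # "Kernel layer": the actual computation.
    return torch.add(a, b)

print(add(torch.ones(2), torch.ones(2)))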
Now, there are a few other complications. One of the things that was supported in the Variable API
is this .data attribute. So what does the .data attribute look like? Well, if I have a tensor x,
then I can say x.data, and I'll get out, well, who knows what it does
today. But in the old days, right, if you had a variable, well, you know, x was the variable,
and then x dot data was the tensor on the inside. And so if x was a thing that requires gradient,
well, x dot data is a plain old tensor, obviously doesn’t require gradient. So we had to like figure
out like what exactly these things should do in the new world order, because we’re not wrapping variables
anymore. So there aren’t any, there’s no tensor inside waiting to, you know, burst out. Sorry,
the tensor was not inside you all along. So what are you going to do, right? Well, we just looked at
those semantics, and we're like, okay, well, what is this x.data? Well, it aliases the same storage
as the original tensor, so it's kind of like an alias call. But it doesn't require gradient, even if
the variable required gradient, so it's kind of like a detach call. And what about the version
counters? Well, version counters were a concept on Variable originally, and then we put them on
tensor. And what are version counters? Well, that's a long story for another time. But if you know
what version counters are: we stored version counters on variables before we put them on tensors,
and if you took out the data, the inside tensor of the variable, you would actually disconnect from
the original version counter. So we also simulated that behavior. Basically, we looked at what all
the observable behavior was when you did a .data, and then tried to figure out what that would look
like in a universe where there are no variables, right? Everything's just a tensor.
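Those semantics are still observable from Python today; the version-counter line at the end reflects
the disconnection just described, to the best of my understanding:
- data_attribute.py
import torch

x = torch.ones(3, requires_grad=True)
d = x.data

print(d.requires_grad)                    # False, like detach()
print(d.data_ptr() == x.data_ptr())       # True: the storage is aliased

d.add_(1)                                 # mutate through .data
print(x)                                  # x sees the change (shared storage)...
print(x._version, d._version)             # ...but x's version counter did not bump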
So that was done. And it sort of worked for a while, we were in this weird nether state where
we had collapsed the representation. So there was only one, there was only one tensor representation,
rather than a variable wrapping a tensor, but we hadn’t actually expunged all the variable classes
from the code base. And then later, I actually went and finished off the job and got rid of all those
wrappers. And then that’s sort of where we are today, right? So we have tensors, it’s a single struct,
but the struct has a few fields, really one field, dedicated to letting you slot in autograd
metadata if you actually want it in the future. This data is not actually defined in tensor; it
still lives in a separate dynamic library, in the autograd folder in csrc, and it contains a bunch
of extra data. So if you don't actually require autograd, we don't bother allocating all this data,
and you can save a bunch of time. That, by the way, is one of the reasons why inference mode and
no-grad mode are faster than recording autograd. And so that's basically
the state of tensor today. So where could we be going in the future with this? Well, one of the
things that people have been looking into recently is how to make it so that you can nest automatic
differentiation repeatedly in a style that is not the same old style that we normally support double
backwards in PyTorch, namely, you retain the graph and then you backprop through the graph again. So
it's more like a JAX style: repeatedly differentiate a piece of code ahead of time. How can we do
that? Well, we've got a prototype that knows how to do this, and actually it's done by, who would
have guessed, wrapping the tensor into multiple levels of gradient tracking to make it work out. So,
I don't know, revenge of the wrappers, I suppose. That's all I wanted to say about Variable today.
See you all next time.
EP11 How-new-operators-are-authored
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I’d like to give a short intro slash primer
about the general developer experience that happens
when you want to add a new operator to PyTorch.
Despite, you know, PyTorch being a library
for doing numeric computing, and so, you know,
hey, you know, what are we all about?
Well, we’re all about a big pile of operator implementations
for all the things you might want to do.
Actually, it’s not that common that we go about
and add a new operator to PyTorch.
It’s actually pretty rare
because we kind of have a lot of operators in PyTorch,
and, you know, most of the time
when you want to do something interesting and new,
usually you just, you know, put a bunch of operators together
to do whatever it is that you are interested in doing.
And that’s like, you know, that’s basically what people are doing
when they write deep learning models, right?
They’re just putting operators together
into bigger and better operations.
So you only really need to write a custom operator
when there is something that you need to do
that, like, sort of can’t be done efficiently
by putting everything separately.
So, like, kind of classic example,
which applies to PyTorch and is sort of ameliorated
if you’ve got a fusing compiler,
is if you, say, want to write a new pointwise op
that consists of a bunch of pointwise operations,
and you don’t actually want to, you know,
run them separately one by one, loop by loop, right?
Because that takes up a lot of memory traffic.
Well, then writing an operator for that case
is quite a big benefit
because once you fuse them together,
things can run substantially faster.
But, okay, so let’s say that, you know,
you’re actually doing some sort of really fancy linear algebra
or you need a new pointwise fused op
or, you know, any sort of situation
where you, you know, need the performance
that you can only get from writing a kernel.
What does it look like
when you want to add a new operator?
Well, there’s sort of two main modes
that people write new operators in PyTorch.
One is adding a new operator to the library proper.
So this is, you know, this is core PyTorch.
The next release of PyTorch,
the operator is going to show up
and, you know, you can make use of it.
It, you know, is something that you put
in native_functions.yaml,
a file we’ll be talking about more later in this podcast.
And it’s just something that we consider in core.
But there’s another way to write a new operator in PyTorch,
and that’s using the operator extension mechanism.
So using the Torch library header and macro,
you can actually define operators
completely externally from PyTorch.
And then you can just, you know,
you register them via a pybind11-like registration system.
And then these operators become available
for you to use via the Torch.ops namespace.
So, you know, there’s a difference
between these two things, right?
If you add a new operator to core PyTorch,
the thing you need to do is
you need to make sure everything works, right?
So you need a CPU implementation,
you need a CUDA implementation,
you need working derivatives for it,
you need, you know, comprehensive tests
like autograd checking, all that stuff.
And sometimes, oh, and not only that,
but, you know, your operator needs to handle
all of the different kinds of tensors
that, you know, a PyTorch programmer
might throw at you,
including tensors with strange strides
or different layouts
or very different D types.
Now, if you’re just someone, you know,
who like just needs a little code
that works on floating points
just for this particular case on CPU,
often you don’t actually want to go
through all that rigmarole.
And also maybe your operator
is like not very well defined, right?
It’s not, doesn’t mathematically make sense.
It’s not really something of general use.
It’s just something very specific
for your problem.
Well, writing a custom operator
is great for this use case, right?
Because you just write out your operator
and you do the thing
for exactly the use case you need.
And no one else is really bothered
by the fact that you wrote
a custom operator like this.
So the use cases in these two situations
are kind of different.
But let’s talk a little bit about
what happens when you add
a core operator to PyTorch.
So what exactly does this entail?
So the first thing you need to do
is you need to define
what the API for this operation
is going to be.
And the reason for this
is PyTorch is not just a Python library.
It is actually, you know,
also a C++ library
that you can use directly from C++.
And it’s also a compiler and interpreter
that, you know,
you can interpret PyTorch programs on.
And so, you know,
you don’t just write an operator
by writing a new Python signature.
We need to write a API declaration
for the operator,
which is generic across
all of the different modes of use.
Interpreter, C++, Python,
and other situations
that can work in all those cases.
And what we call this,
you know,
specification is a JIT schema string.
So if you’re in PyTorch core itself,
there’s this file called
native_functions.yaml.
And what it has is
it has all of the JIT schema strings
for all of the operations
that you might be interested in.
And JIT schema strings
are like some sort of mashup
of the Python type system
and the C++ type system.
So, you know, you can say,
OK, well, my first argument
is a tensor.
My second argument
is maybe a integer list
because I need to like
provide what the padding is.
The schema also knows
about aliasing.
So like what if I have a function,
does the input alias
with the output?
And it also knows
about like mutation,
like is my function
purely functional,
which is most functions
in PyTorch,
or does it, you know,
mutate one of its inputs?
And you have to tell it that too.
You don’t need this information
if you’re just writing Python code,
but you do need
this information to say
if you’re a compiler
and you’re trying to understand
whether or not it’s safe
to, you know,
do a code movement
optimization or not.
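For a feel of what these schema strings look like (paraphrased from memory, so treat the exact
spellings as approximate rather than copied out of native_functions.yaml):
- schema_examples.py
schemas = [
    # Purely functional: takes tensors, returns a fresh tensor.
    "add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor",
    # In-place variant: (a!) says `self` is mutated and the result aliases it.
    "add_.Tensor(Tensor(a!) self, Tensor other, *, Scalar alpha=1) -> Tensor(a!)",
    # Out variant: the caller supplies the output tensor to write into.
    "add.out(Tensor self, Tensor other, *, Scalar alpha=1, Tensor(a!) out) -> Tensor(a!)",
    # An int[] argument, e.g. for padding.
    "constant_pad_nd(Tensor self, int[] pad, Scalar value=0) -> Tensor",
]
for s in schemas:
    print(s)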
OK, so that’s cool.
So you write this entry
in native_functions.yaml
and what this does
is it triggers off
a very long sequence
of code generation pipeline,
which actually goes ahead
and generates
the Python bindings
for your program,
the C++ bindings
for your code,
et cetera, et cetera.
And so all you need to do
after you define
one of these native_functions.yaml entries
is you just need to provide
an actual CPU and CUDA kernel.
And so, you know,
you write this in the YAML file.
I'm not really here
to tell you
exactly how to do this.
If you want to like
look at the actual code,
you should look at
some of the further reading links
after this podcast.
But what, you know,
what you’re going to do, right,
is you’re going to write
a CPU implementation.
You can say,
OK, my CPU implementation
is going to be, say,
softmax_cpu.
And one of the things
the code gen does
is it generates
a header stub.
And what this header stub says
is, hey,
here is the C++ function
I expect you to have written.
And once you write it,
then I will, you know,
do all the necessary plumbing
to make it possible
to, you know,
call into your kernel.
Now, the way you do this
is a little different
depending on if the kernel
is structured or not.
See my previous podcast
about structured kernels.
But the same,
the general concept
is the same.
It’s just we generate
slightly different stubs
in the two situations.
So there’s different code
you have to write.
You have to write more code
if you’re doing
the old-fashioned way
because you have to also
define the out variant
and the in-place variant
directly.
And the structured kernel version
takes care of all of that
for you.
But as a result,
you have to, like,
structure your code
a little differently.
But it’s very, very similar.
OK, so you’ve gotten
to this point
and you’ve got
all the scaffolding
that you need
to actually call
your operator.
How do you actually
implement your operator?
Well, as I said,
in PyTorch,
we expect you
whenever you add
a new operator
to give a CPU
and CUDA implementation.
So what does a CPU
implementation
typically look like?
Well, normally,
if you’re doing
some CPU code,
it depends on
how complicated it is.
But pretty common
situations are,
for example,
there’s some external library
that’s already written
efficient CPU kernels
and you just go ahead
and use those directly, right?
So in that case,
all you’re doing
in the kernel is,
you know,
you got some tensors,
you figure out,
you know,
what their data pointers are,
you make sure that,
you know,
all the invariants
that the library expects
are upheld,
like that the inputs
are contiguous.
Most libraries
don’t handle
discontiguous inputs.
It’s pretty uncommon.
And then,
you know,
you just call
another function in question
and maybe you have
to go and allocate
the output tensor
for it to write into.
But if you’re actually
writing a operation yourself,
well,
there’s a few facilities
for writing
very common styles
of operations.
In particular,
if you’re doing
a point-wise operation
or a reduction,
we have this really useful
class called
tensor iterator,
which takes care
of all the sort
of gnarly details
of,
you know,
like,
if I have a tensor
in a different layout,
how do I,
you know,
restride it
so that I iterate
over all the different strides
without,
you know,
processing memory
that’s not necessary,
blah,
blah,
blah,
blah,
blah,
do all of those things
and then all you have
to write
is a little lambda
that says
how to actually do
the point-wise operation
in question.
So,
you know,
and then all the other
infrastructure
is taken care of for you.
This only works
if you’re doing
one of these very simple,
you know,
like pointwise-style operations
or there’s a few other cases
like,
you know,
tensor iterator
can also handle reductions
in some sense.
If you don’t have
a CPU kernel
that falls into
these categories,
then you might actually
have to,
you know,
oh,
goodness me,
write some efficient CPU code
that actually does
the thing you want.
Sometimes,
you know,
it’s simple,
it’s easy enough
to just write
a plain old for loop
in C++
because maybe
you don’t need it
to be that fast.
It’s just that
doing a for loop
in Python
is too slow.
And there’s also
like lots of other libraries
that try to build off
of this,
right?
Like Numba
and Cython.
All these ideas
are like,
oh,
yeah,
you know,
like Python’s really slow,
but like maybe you want
to write numeric loops
in Python
and then they compile
the C++.
Well,
it’s not too hard
to write these loops
in C++ as well.
And so in PyTorch,
people usually do that.
We provide a bunch
of facilities
for, you know,
doing common optimizations.
For example,
if your algorithm
is parallelizable,
you can use
the parallel for loop
construct that we provide
to, you know,
farm out your computation
onto different threads
so that, you know,
you can take advantage
of multi-threading.
and, you know,
if your kernel
is running slow,
well,
typically kernels
are pretty short.
So you can like easily
run it under perf
and then take a look
and see,
okay, you know,
am I missing cache a lot?
Am I spending a lot
of time on instructions?
You just do normal techniques
for optimizing performance
in this case.
Optimizing numeric code
is different, right?
Like people always
like to say,
oh, yeah,
you know,
matrix multiply,
how do you implement
that on CPU?
Well, you know,
you really need to know
about cache
and so it’s very different
from optimizing
other types of code.
But there’s also a sense
in which optimizing CPU code
is very easy
because,
sorry,
optimizing CPU kernels
is very easy
because there’s just
not that much code.
So you can actually
come up with a pretty good
mental model
of what you’re supposed
to do in this situation.
Okay, so that’s it for CPU.
What about a CUDA kernel?
Well, CUDA kernels
are pretty similar, right?
Like we need to do
all the same things
except instead of writing
CPU code,
there’s this CUDA
programming model
and we need to know things
about how the device
actually works
but then you’re still
writing CUDA
and many of the things
that, you know,
you like expect to see
in CPU,
you know,
the CUDA ecosystem
is well developed enough
so that alternatives
to these things
also exist in those situations.
So for example,
if your CUDA kernel
is crashing
and you need to debug it,
well, there’s a tool
called cuda-memcheck,
which will tell you
about, you know,
what is causing the crash.
You can also,
in a pinch,
use cuda-gdb,
which actually lets you
step through the problem.
you can also add asserts
to your kernels,
good old-fashioned
printf debugging
and, you know,
if all else fails,
well, you know,
once again,
your CUDA kernels
are usually pretty small
so you can like
maybe bisect your way
to figure out
what the error is.
Really,
the hard part
about writing a CUDA kernel
is actually understanding
the device model enough
so that you can actually
write concurrent code
and so if you ever
like look at a presentation
about how to write
CUDA programming,
like what they’re
actually going to do
is they’re going to
spend a lot of time
talking about
how these processors
actually work,
you know,
what the like
actual physical details
of the hardware are
because this actually
really matters
if you want to
write efficient code.
Of course,
if you’re doing
something simple
like a pointwise op,
well,
it turns out
tensor iterator
also works in that situation
so you can just,
you know,
use our,
you know,
scaffolding in that case
but it’s actually
kind of challenging
to write a good CUDA kernel.
an example that I’m
thinking of recently
is we were working
on some linear algebra code
and the algorithm
that like,
so a lot of the times,
right,
there will be a well-known
CPU implementation
and we want to add it
to PyTorch
and we need to somehow
figure out how to GPU
accelerate it
and so this CPU
implementation in question
had a problem
which is that
it needed to do
a little bit of
computation at first
to figure out
how many iterations
of approximation
it was going to do.
Well,
basically we were doing
like these Taylor expansions
for the computation
in question
and we needed to like
look at the conditioning
of the matrix
to figure out like
how many Taylor expansions
we needed to do
and so I remember
reviewing this,
the CUDA implementation
for this PR
and us arguing about
well,
you know,
we can’t actually
on CUDA
make a decision
based on the data
what to compute on
without doing
a synchronization
because remember
CUDA is async
and so if we need
to like look
at the data in question
we have to wait
for whatever prior kernels
we’re running
to finish running
to give us the data
and then we need
to run our actual
operation
and then get it to CPU
so we like talked over
and like,
you know,
benchmarked a bunch
of different options
and it turned out
it was still better
to synchronize
so that we could
pick a good Taylor
approximation
for this case
but like there’s going
to be a lot of problems
like this
where like,
you know,
it’s not easy
to program a GPU
and so you’re going
to have to actually
understand like
there’s actually
like non-trivial
technical content
and like recasting
an algorithm
so it works on CUDA
but let’s say
you do that,
right?
So you’ve got
your CPU kernel
and then you’ve got
your CUDA kernel
and you’ve plugged
it all in via
the native_functions.yaml
system,
well,
then you’re basically
done.
That’s it.
You’ve got some
more stuff to do,
right?
You’ve got to test
your operator
and we have a bunch
of facilities
for testing
in PyTorch
but they all involve
you know,
just like running
the kernel in question
and you know,
well,
you’ve already got
the bindings
provided for you
so it’s pretty easy
to get that hooked up.
We have a bunch
of stuff like
for example,
Mike Ruberry’s been
working on a new
op info abstraction
which lets you
describe some
properties about
an operator
and then we can
automatically run
tests based on
what kind of
properties the
operators uphold.
Unfortunately,
these kind of
mostly are for
like unary
and binary ops
so very simple
types that are
very regular
and there are
simple things
we can check
but you know,
there are also
some like very
generalizable checks
we do.
For example,
there’s a check
in our test suite
called GradCheck.
What does GradCheck
do?
Well,
remember that
when you’re
writing an operator
in PyTorch
you also have to
say how to
differentiate it
so we typically
have symbolic
derivatives
for all of our
operations
usually cast in
terms of other
functions that you
might have to
implement.
Well,
what GradCheck
will do is
GradCheck
will use your
analytic derivative
formula and it’ll
also numerically
compute what the
derivative is based
on your forward
implementation and
then it’ll just
compare the two
and figure out
whether or not
they agree or not
and if they don’t
agree,
GradCheck will fail
and this will work
for any
differentiable function
you have.
You don’t have to
write a separate
test for each of
them.
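Here's what that looks like from the user side, using the public gradcheck entry point
(double precision and requires_grad=True are needed for the numerical comparison):
- gradcheck_example.py
import torch

def f(x):
    return (x * torch.sigmoid(x)).sum()

x = torch.randn(5, dtype=torch.double, requires_grad=True)
print(torch.autograd.gradcheck(f, (x,)))    # True if the two derivatives agree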
But yeah, so you add some tests, and you have to write your docs for the new operator, and you've got your kernels, and then that's great, and usually you submit the PR and you give some benchmarks. It's very easy to benchmark kernels, once again, because they're very regular and you can just try them at a bunch of different input sizes. And then you're off to the races. Really the hardest part is convincing PyTorch that we actually do want to take your operator. But that's a story for another time. That's all I wanted to say today. Talk to you next time.
EP12 History-and-constraints-of-Tensor
Hi, my name is Edward, and welcome to today’s edition of the PyTorch Dev Podcast.
Today I want to talk about a topic which was requested also multiple times by several people,
namely the history behind tensor, tensor impl, storage, storage impl, and basically like how is
the tensor data structure in PyTorch put together? This is a topic that I have written about in the
past. For example, on my blog, I have a blog post about, you know, basics about PyTorch internals,
and some of the things it talks about are how tensors are put together. So like there are these
things called strides, you know, we have a concept called storage. So if you want to know more about
these topics, go check out my blog post, then come back to this podcast. Today, I want to talk a
little bit more about some of the historical and design constraints that have led us to where the
tensor data structure is today. So basically, given all these design constraints, if you, you know,
spend enough time, hopefully, you would end up in the same situation that PyTorch is in. So, I sort of, there are a lot of things in tensor, right, because it's a very heavily trafficked data structure. A lot of people
have added things over the years. And sometimes it can be a bit bewildering, like why the heck
are there like eight bit fields for like various, you know, variations of, you know, memory layout on the
tensor? Well, you know, hopefully, knowing a little bit about the background and the constraints will help
you understand, oh, yeah, I see why that’s there. It might not be ideal. But there is a constraint that causes
us to get there. So let’s get to it. So the first and foremost constraint that fed into PyTorch design of
tensor is the fact that PyTorch descends from TH. I've said this before, I'll say it again. Remember, PyTorch was originally just Python bindings to the pre-existing C libraries that shipped with LuaTorch, which in turn came from the
Torch 7 libraries. And why is this important? Well, we inherited a lot of the basic architecture for
tensors from these libraries. And in particular, the split between tensor and storage is the sort of most
prominent thing that, you know, PyTorch carries in its DNA today. I didn’t ever get a chance to talk to
original Lush or Torch 7 authors. So I don’t really know why they set things up this way. But when I sort of like
retroactively look at the past and like come up with my own explanations, one thing that I can say is that PyTorch’s
concept of a storage was very important for, you know, enabling something that’s very core to PyTorch’s DNA,
namely the ability to alias tensors together and do mutations on them. This is like very unusual.
Strides especially are very unusual. Many, many other systems, TensorFlow being one prominent one, only support operations on contiguous tensors. And sort of what makes PyTorch a little spicy here is that, you know, you can actually refer to multiple tensors backed by the same memory, possibly with different layouts, simply by adjusting the striding. So it's something that's very unique to PyTorch. And we got that from the libraries that we descended from.
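As a quick concrete illustration of that aliasing (a tiny sketch using only the public tensor API):

import torch

x = torch.arange(6).reshape(2, 3)
y = x.t()                             # transpose: same storage, swapped strides
print(x.stride(), y.stride())         # (3, 1) (1, 3)
print(x.data_ptr() == y.data_ptr())   # True: both tensors view the same memory
y[0, 0] = 100                         # mutating through the view...
print(x[0, 0])                        # ...is visible through the base tensor
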
There's other things that we inherited from the TH days as well. For example, when tensor was just a C struct in TH, they needed some way to do a reference count. So they just put the reference count on the tensor itself. It turns out that intrusively ref counting in this way is very convenient. For example, when you're writing bindings, because if you have a raw pointer to an object, you don't have to do any work with, say, enable_shared_from_this to get out an owning pointer to it, right? You can just transmute the
raw pointer into an owning pointer and, you know, the owning pointer will just take care of incrementing and
decrementing the ref count. So when, you know, we brought PyTorch into, you know, the C++ land and
re-implemented the classes, we also preserved intrusive ref counting because all of our binding code was way
simpler when we had it that way. Also, we didn't want pointers to tensors to be two words, which is, you know, what shared_ptr does in C++. The second constraint, which is useful to know about for tensor, is the fact that it
actually is the result of merging the Caffe2 and PyTorch libraries together. So if you're a regular PyTorch user, you might not, you know, think very much about Caffe2, right? It's this other library that, you know, is graph mode only. But in fact, the same tensor representation in PyTorch is used verbatim with Caffe2. There's actually two separate user facing classes. There is a tensor class, sorry, an at::Tensor class, which you use from PyTorch, and a Caffe2 tensor class that you use from Caffe2. And they actually have different public APIs for backwards compatibility reasons. But both of these are what we call pimpl classes, pointer-to-implementation classes. So they don't actually, you know, represent the data in the object. Instead, they just contain an owning pointer to the TensorImpl object, which is the actual
object that contains all the data in question. By the way, why is there the split between tensor and tensor impl? Well, it's because, you know, we are a Python project. And a lot of people, when writing code involving tensors in C++, expect Python-style reference semantics to work. So, like, if I have a tensor y, and then I say tensor x equals y, I expect x to, you know, point to the same tensor as y; I don't expect a copy to happen in this case. And, you know, with C++ value semantics, if you have a value type, like TensorImpl is, doing this copy construction would actually copy all the metadata in question, and then it depends on the semantics of the smart pointers inside what the other data does. So by splitting this into two types, and having tensor be an actual pointer type, in the same way shared_ptr is, you just write tensor, and you, you know, can assign things around. And it looks just like how things are in Python. So, you know, constraint three, I would say, is that, you know, we're a Python project. So a lot of our C++ design comes out of trying to model off of Python.
There’s a great essay about this, by the way, which is on the wiki, basically, a manifesto about writing
Python in C++. We, as time has gone on, for efficiency reasons, we have had to walk back some
of the things we've done here. For example, you know, passing around tensors as a pointer type is not so great, because it forces ref count bumps, right? In Python, this is not a big deal, because Python has a GIL, so the ref counts are non-atomic, but atomics are kind of expensive. So, you know, we've actually spent some time in the recent past trying to, you know, remove as many ref counts as possible. But generally speaking, if you can write Python code, you can write PyTorch code, and the tensor class APIs are designed to make these look as similar as possible.
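Concretely, the Python behavior being imitated is just this; the goal is for the equivalent C++ code using at::Tensor to read and behave the same way:

import torch

y = torch.zeros(3)
x = y           # no copy: x and y refer to the same tensor object
x.add_(1)       # an in-place update through x...
print(y)        # ...is visible through y: tensor([1., 1., 1.])
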
Okay, point four. So I’m done with the historical things. But point four is, we don’t really want there to be
virtual calls on tensor. And this actually has some pretty major implications. Now, I should preface this by saying,
if you actually go and look at the tensor class, tensor impl class, and look at all the methods on it,
actually, a ton of them are virtual. And there’s a reason for this. It’s a historical reason. But the reason
why we don't want to virtualize most methods on TensorImpl is because virtual methods thwart the inliner. So, you know, most operations on tensor, like, for example, querying the sizes, should compile into a direct, you know, memory access to the field that contains the sizes in question, right? It should be super fast; we should be able to get rid of all the function call goop. But, you know, if it's a virtual method, well, some subclass could have overridden the behavior in this case. And so we can't inline in this situation, we have to actually do the virtual call jump. And virtual call jumps are not that expensive, but, you know, we call size everywhere in PyTorch. So it really does add up. Why is size
actually virtual then? Well, you know, this is a sort of argument between, you know, history and design in the PyTorch codebase. The history of PyTorch is that size was virtual because when Zach originally wrote the class, it was virtual. And why was it virtual? Well, it was virtual because we had this variable thing, see my previous podcast about the life and death of Variable. We had Variable, a wrapper on a tensor. And they made this very reasonable at the time design decision that they didn't want to duplicate the size information between the variable and the tensor that it wrapped, because, you know, if you duplicate the information, it can get out of sync. For example, if you resize the underlying tensor without, you know, telling the variable about it. So if you don't want to keep them in sync, you need to change the behavior: on a tensor, you can just access the field directly, but on a variable, you have to jump to the base class and then actually query the size there. So okay, size is virtual. Now we've gotten rid of variable, right, the variable-tensor merge. And so this constraint no longer applies. And now we have a design where we can actually just force everyone to accurately record what the size of their tensor is inside the class itself. But in the meantime, a bunch of people went ahead and overrode size for their own needs. And so we have to unwind that situation and solve the problem, most notably XLA, cough, cough, cough. Okay, but you know, in general, we don't want methods on tensor to be virtual. And what that
means is that actually, when you look at the TensorImpl class, it basically has all of the fields that you could conceivably want to describe, you know, what a tensor should be. So for example, we have sizes on tensors. Yes, hypothetically, you know, strange extensions to tensors, like ragged tensors or nested tensors, might not have size in the traditional sense. But, you know, because size is such an intrinsic operation that we use everywhere in PyTorch, we really do want you to have some, you know, conventional notion of size for anything you model in this way. And if you can't model it in this way, well, maybe, you know, TensorImpl is not for you. Another consequence of this is exactly those bit fields about memory layout, right: we don't want to actually have to compute the memory layout every time. So, you know, given that we know what the sizes and strides of a tensor are, that actually tells us what the memory layout is. And so we precompute a lot of information in these bit fields so that, you know, we can have fast accesses that don't involve doing some compute; they just check what the bit is.
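From Python you can see the derived quantities that these cached bits back, for example contiguity, which is fully determined by sizes and strides (just an illustration of the invariant, not a claim about the exact bit layout):

import torch

x = torch.empty(2, 3)
print(x.size(), x.stride())     # torch.Size([2, 3]) (3, 1)
print(x.is_contiguous())        # True: row-major sizes/strides
print(x.t().is_contiguous())    # False: transposed strides are not row-major
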
Okay, point five. Point five is extensibility. So, you know, this is sort of the flip side of the previous point, right, which is that the devirtualization constraint is in tension with the extensibility constraint, right? By devirtualizing the TensorImpl class, it's less extensible, but operations on it are faster. By virtualizing it, you can override more behavior, but then the TensorImpl class is less efficient. So we kind of need to play this game. And so the cut we have, right, is that we want the basic operations, like the basic data model for tensor, to be non-virtual, but then anything else on top, especially operators, that can all be virtual. And in fact, it is, via the dispatcher. Okay, last constraint, size and memory. I have a really funny story, which is when we were merging the Caffe2 and PyTorch libraries, I added a bunch of fields sort of randomly, because, like, I was once again unioning the behavior of Caffe2 and PyTorch. And then I broke some internal workflows. And what those internal workflows were doing was they were allocating 4 million tensors. And so every word I added to PyTorch actually ballooned their memory usage by several gigabytes. So that was not very nice. And it induced us to spend a bunch of time trying to optimize the actual memory size of the TensorImpl struct itself, because it's really overhead,
right? Like in PyTorch, you really want to just be, you know, storing memory for all of the, you know,
actual data that you’re doing your deep learning on. And you don’t want to waste time or space
on the metadata in tensor itself. And so we’ve done a bunch of optimizations, some very recently,
for example, done by Scott Wolchok. For example, we used to store sizes and strides as out-of-line
vectors on tensor, that’s really inefficient, because a standard vector in C++ takes three words in the
structure itself, right? It takes a size, it takes a pointer to the beginning, and it takes the pointer
to the end of the reserved data. So because you know, vectors can have a size that is less than the actual
data that’s allocated for it. So all that needs to be stored. And it’s not really necessary. And also you
don’t need to store the size for both sizes and strides, because the dimensionality of a tensor is fixed.
So you know, we actually pack these fields, and we also put the sizes and strides directly in the tensor
impl itself, assuming that most tensors are five dimensional or smaller. And that saves us having
to do dynamic allocations when we allocate tensors. Okay, so that’s it for, you know, why tensor is the
way it is. So the next time you go and look at the tensor impl class, hey, think about, you know, well,
we wanted this to look like Python, so that's why there's a pimpl class. We wanted, you know, to support all the stuff we could support from the good old Torch days, so that's why there's storage and tensor. We merged Caffe2 and PyTorch together, so that's why there's a bunch of really random features in TensorImpl that don't make that much sense. Well, that's because some of them came from Caffe2. Another example of that is TypeMeta; you know, there are two ways of representing dtypes in C++: ScalarType, which is just an enum, and TypeMeta, which is a pointer type that is open and extensible. And that's because Caffe2 supported registering custom types for tensors, like std::string; you could have a tensor full of std::strings. Don't ask me why you'd want it. Actually, it's pretty useful
in some situations. And then fourth, there’s a bunch of, you know, constraints about like, you know,
efficiency, right? Like making sure that our methods can inline, making sure that the memory size of
tensor impl isn’t too big, but also at the same time supporting extensibility for, you know,
all of the weird and wacky other tensor types like sparse tensors and nested tensors and,
you know, functorch tensors that people want to support.
Okay, that's everything I wanted to say for today. Talk to you next time.
EP13 Conjugate-views
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about a new feature
that is going to be landing to master soon for complex numbers, namely conjugate views.
To explain what this feature is, I have to backtrack a bit and talk a little bit about
complex numbers first. So what are complex numbers? Complex numbers are a form of numbers where
instead of only having a single real number representing a quantity in question, you have
both a real quantity and an imaginary quantity. And the invariant, right, is that the imaginary
quantity, you know, when squared is negative. And no positive or negative real number when squared
gives you a negative number. So that’s what makes imaginary numbers different. This sounds kind of
strange. And, you know, for the longest time, neural networks don’t really use complex numbers. But in
lots of, say, signal processing applications, you know, complex numbers have a lot of interesting
properties that make it actually really good for, you know, doing certain types of computations.
So if you’re doing some like, for example, fast Fourier transforms, complex numbers arise very
naturally. And there’s also a line of research looking into how to use complex numbers for useful
things. Actually, when the complex numbers project started, it was a physicist, Roger Luo, who sort of
came and was like, hey, you know, I think this would actually be really useful. Took us a while to
actually listen to him. But you know, we got there in the end. So when you’re doing complex numbers,
there’s an operation that is really, really common, and it’s called conjugation. So what is
conjugation? Conjugation says, okay, if I have a complex number a plus bi, where i is the, you know,
constant that when squared gives you negative one, conjugation is taking this number and giving you
back a minus bi. Now, the reasons why conjugation and complex numbers are very common is sort of beyond
the, you know, scope of this podcast. But one way to think about it is, if you like, think about like,
your linear algebra class that you took in undergrad, okay, maybe if you took the theory-based one, because I don't know that they really go into complex numbers in the more practical linear algebra classes.
One of the things you do is you, you know, talk about fields on real numbers. And you know,
you do a bunch of stuff on them, and you learn some properties about linear algebra. And then you’re
like, okay, now you can generalize to complex numbers. And you know, you have to like change all
your definitions to make things work. And one of the things that happens is, you know, everywhere you
were doing transposes in your, you know, old theorems, suddenly you’re doing Hermitians,
you're doing adjoints, you're basically taking both the transpose and the conjugation of the matrix in
question, whereas, you know, in the real universe, you were just transposing. And you know, you just you
just need to do this to make all the theorems work out. And you know, if you’re really, really curious why
this is the case, I recommend, you know, like, taking a theoretical linear algebra class and just sort of
spending some time stewing with the theorems. Okay, so conjugation is a really common operation.
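In PyTorch terms, that's just torch.conj (a tiny example; the view machinery it uses under the hood is what the rest of this episode is about):

import torch

z = torch.tensor(3 + 4j)
print(torch.conj(z))    # tensor(3.-4.j): a + bi becomes a - bi
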
And, you know, it’s really simple, right? Like you just are doing a negation on one part of the complex
number. And so typically, right, you’re doing the conjugation because you’re about to do some other
operations. So if you are doing matrix multiply, you know, a common thing to do in standard linear algebra
is, you know, matrix multiply a with b transpose. Well, you know, in the complex universe, that’s going to
look like something like a matrix multiply with b transpose and conjugated. And here’s something very
curious happens. So if you think of conjugation as just an operation where, to conjugate a tensor, you take your tensor and you produce a new tensor where everywhere there was an a plus bi there is now an a minus bi, then this matrix multiplication operation is actually a bit less efficient than it could be. When you did a matmul of a with b transpose, we didn't actually ever do the transpose. Because remember, PyTorch supports strides on tensors. So if we want to take an operation like transpose and do it without actually doing the computation in question, it's actually an O(1) operation: you just take your tensor and you swap the strides. So instead of saying, okay, when you move in the y position, say that you're indexing x, y, you only move by one, now changing the y position means moving an entire row, and moving in the x position means you only move by one. And by simply switching the strides, so that instead of moving by one you move by a lot in the y case, you are representing a transposed tensor. And actually, so if you've got a backend implementation
of matrix multiply that knows how to implicitly do transposes, for example, BLAS's, you know, matrix multiply has a flag that, you know, lets you specify whether the, you know, argument is transposed
or not, then you actually can just avoid having done the transposition at all, because you just,
you know, say, okay, well, I want to do a transposed matrix multiply, where the right argument is
transposed. And you can just call the kernel directly, and you’re all good. And we never actually do the
transposition. And transposition is kind of expensive, you got to allocate memory for it, blah, blah, blah,
blah. So you don’t really want to do that. Okay, but conjugation, right? Well, conjugation is weird,
because, you know, conjugation actually involves negating, you know, half of the numbers in your
tensor. And so strides don’t really work for this. And so you’re in this weird situation where, oh,
well, sucks to be me, I have to conjugate the tensor, and actually, you know, create a new tensor. And then
I can, I guess I can do the transposed tricks, and then call my, you know, a complex BLAS matrix
multiply implementation. But this is a waste, because actually, BLAS provides a fused matrix
multiply with a transposed and conjugate on the second argument. And so like, yeah, that’s faster,
because, you know, it’s just faster to have the fused operation. It’s why people like using the JIT
fuser, right? Like, you’re often memory bound in these situations. And, you know, being able to do this
fusion is very profitable. So what’s a poor person to do? So we hemmed and hawed a bit. And, you know,
we talked to some of the experts on, you know, basically doing complex numbers with neural networks,
namely Bodecker. And, you know, we talked about a few options, right? Like, one option was, okay,
well, we’re just gonna, you know, provide a new matrix multiply that, you know, explicitly takes a
little keyword argument that says, okay, do you want to conjugate the output? That looks really
ugly, right? Like, if you’re just writing some math down in PyTorch, you want to just say x, you know,
at sign y.h. And you want that to work, right? You want that you want to be able to write code that
looks like math. Like, yes, in principle, we can, you know, write lots and lots of fused operations
and tell people to, you know, look up, you know, some fused operation for whatever operation they want to
do. But they don’t want to do that, right? They just want to write math. And then hopefully,
you know, some compiler or something, some smarts in your program are good enough to actually, you
know, run that efficiently in that situation. So we really want to be able to write, like this
operation, and actually have it be fused in the situation. And so the next thing you tend to think
about in the situation is, okay, maybe we can do some sort of lazy tensor, right? So I’ve talked about
lazy tensors a little bit in the past, in this podcast. But once again, what’s a lazy tensor? A lazy
tensor is like, you don’t do the operation immediately, right? You just wait and see if
you run some other operation. And then if it’s profitable to like fuse in that situation, well,
good for you, you were lazy, you didn’t do the original operation. And so now you can do the fused
operation. But lazy tensors are a little difficult to implement. And one of the things that makes them
difficult to implement is that laziness means that operations which are ordinarily reads can
turn into writes. What do I mean by that? Well, lazy evaluation, you know, as popularized by say
Haskell, the functional programming language, means that you guarantee that you only do the operation
once. So say you have a tensor, and you request it, you lazily conjugate it, and then you request the
value of the conjugation, and no fusion is possible. Under a lazy scheme, you're obligated, at the point in time you do the read, to actually materialize the conjugate tensor, and then
go ahead and do the stuff you want to do. And this makes things a little complicated if you want to, you
know, be in a multi threaded environment, because okay, well, you’re doing a write on a read. So that
means that, you know, you actually have to start synchronizing your reads. And that's actually kind of complicated, blah, blah, blah. Okay, so. And also, it's kind of different from this
transposition, right? Transposition was really elegant, you just allocated new tensor with different strides.
And then it just implicitly fused once you called the function in question. Namely, you weren't doing lazy evaluation, you were doing call-by-name evaluation, where you were willing to, you know, do the transposition
at every use site of the transposed tensor, if necessary. But like in practice, you know, most
things get to be fused in this situation. That's not entirely true. Like, a lot of operators in PyTorch don't support non-contiguous tensors, and a transposed tensor doesn't count as contiguous. So they'll make it contiguous, right? They'll transpose it on the spot when they need it. But this is a good
trade off for us, because most of the time, you know, a fusion is actually possible in the situation,
or, you know, it just doesn’t really matter. You know, because, you know, you’re only using the
transposed tensor once. So whatever, like, you know, delaying it for later, with possibility of
duplication is fine. So we want something that works kind of like transpose, but for conjugation. And so
conjugate views are a way to make this work. Okay, so how does it work? Well, you've got your tensor, and you want to make a new tensor in O(1). So you want to share storage, you can't copy storage, because then it wouldn't be constant time anymore. And you want it to somehow represent having done the conjugation. So I'm going to cheat, and I'm just going to say, okay, we're going to define another bit field on tensor that says whether or not you should interpret the storage as needing a conjugation or not. So if you have a normal tensor, where in memory you have three and then four i, and the tensor doesn't have the conjugate bit, then this entry represents three plus four i. But if you do have the conjugate bit set, even though the physical memory says it's three plus four i, you actually interpret it as three minus four i. So okay, so we've got our O(1) tensor allocation, right: you just allocate a new tensor, share the storage, and set the conjugate bit to one. Now what? Well, you're done. That's it. Okay, it's not as easy as that. So, assuming that every operator knows how to respect the conjugate bit, right, like, basically, if you look at the tensor, you need to look at the conjugate bit, which says, oh, you need to, you know, interpret the data differently, assuming that you have all the operators working this way, then you have, you know, an O(1) Hermitian operation, right? You just allocate a new tensor sharing storage, you swap the strides, and you set the conjugate bit. Easy peasy. And as long as all the kernels know how to deal with this conjugate bit, everything's great.
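A minimal sketch of how this surfaces on the Python side, using the methods as they exist in current PyTorch (conj, is_conj, resolve_conj):

import torch

z = torch.tensor([1 + 2j, 3 - 4j])
zc = z.conj()                          # O(1): shares storage, just sets the conjugate bit
print(zc.is_conj())                    # True
print(z.data_ptr() == zc.data_ptr())   # True: no copy was made
z[0] = 5 + 6j                          # mutate the base tensor...
print(zc[0])                           # ...and the conjugate view sees 5 - 6j
print(zc.resolve_conj().is_conj())     # False: this materializes the conjugation
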
Well, making everything actually understand the conjugate bit is kind of difficult, right? Because we have a lot of operators, you know, 1700 plus, and, you know, we don't really want to be editing all of our operators to, like, you know, check: okay, if the input is conjugated, then please unconjugate it, like, actually materialize the conjugation, and do the operation in question, blah, blah, blah. Okay, so that's kind of difficult. So what do we do? Well, we have this nifty feature called a backend fallback. And what a backend fallback does is it lets us say, okay,
whenever you see a tensor that has the conjugate bit set, run this special piece of code unless you’ve told
me otherwise. So it’s a fallback, because, you know, you can override the behavior of this. But if
there’s no override, if there’s no actual implementation, we call the fallback in this
implementation. And we can use the fallback to implement the okay, well, you know, I’ve got a
kernel, it doesn’t understand how to respect the conjugate bit. So I just have to get rid of all
the conjugate bits before I call the kernel in question. And the conjugate fallback will make
sure we apply this universally to all functions, even custom registered functions. So like, what does this do,
right? Like, so if I’ve got a functional operation, and I want to run a operation that doesn’t
understand conjugation on it? Well, let’s see. So you know, I’ve got some arguments, some of them have
the conjugate bit set, I need to get rid of the conjugate bit. So I just go ahead and conjugate
them, producing new inputs, that you know, whose physical memory represents the conjugation. So there’s
no extra interpretation that needs to be done. And then I just go ahead and call the original kernel.
Very easy. The logic is a little more complicated in the in place case, because you know, you can’t just,
you know, change the conjugate bit on the tensor, there’s other tensors that may be aliasing with
that storage. You know, the conjugation status of it is related to the storage, not the
tensor. So you can’t just conjugate a tensor in place by flipping the conjugate bit on the tensor,
you need to do something to the storage, namely actually conjugate the storage, but you can you can
make it all work out. And it’s a pretty fun exercise to see how to do it. And then what do you have? Well,
you’ve got conjugate views, right? You’ve got these views of tensors, you know, views in the sense that if
you mutate the view, or you mutate the base tensor, the views, all other views to the tensor get updated.
So you've got a view, but it's not a view in the traditional sense; it's not a view that's just striding or just, you know, swizzling around the data, it's actually a view in terms of some transformation on
the data. And this is okay, in this case, because there’s an inverse to the conjugate operation. In
fact, the conjugate is a self-inverse, right: a plus bi to a minus bi to a plus bi. So because it's a self-inverse, it's really easy to work through these things. It's really easy to set up,
you know, the bi directional lens. If you’re familiar with the functional programming literature,
the bi directional lens that says, you know, when you make an update to some view, how to propagate
the update back to the original thing, inverses just make this easy. And then we’ve got something that
like is a view and you know, share storage. It has aliasing semantics, which is one of the reasons why
conjugate views are backwards compatibility breaking. So they’re kind of an experiment, right? Like, maybe
people are actually mutating their tensors after conjugating them and expecting the conjugates
to stay the same. I don’t know. So that’s one of the things we need to work out by putting this in
master. But like, you know, if this all works out, you know, we have an actually interesting new tool
that we can use in other situations that, you know, allow us to do fusion without having to worry about
the, you know, concurrency problems that lazy evaluation gives us. So conjugate views, they're not in master
yet, I think, but Anjali Chourdia has been working hard on actually landing it. She's done most of the work
on actually, you know, pushing this to the finish line. And yeah, I hope it is a cool feature and one
that will pay off for us in the future. That's all I have to say today. Talk to you next time.
EP14 Automatic-mixed-precision
Hello, everyone, and welcome to today’s edition of the PyTorch Dev Podcast.
Today, I want to talk about how automatic mixed precision is implemented in PyTorch
on the request of one of our listeners. Thank you very much.
So what is automatic mixed precision?
AMP, or automatic mixed precision, or internally referred to as autocasting,
is a feature by which when you write your models in PyTorch,
we will automatically downcast some of your parameters to lower precisions
so that your models can run faster. So what do I mean by that?
So imagine that you’ve got a bunch of parameters, right?
Your parameters are probably floating point numbers,
which is the normal thing to do in this situation.
And you want to like, you know, do a matrix multiply with the parameters and your input.
Ordinarily, you would just do, say, a float, float, matrix multiply,
and you know, that would go however so much fast. But you know, NVIDIA being the tricky people they
are, they actually have a faster implementation of matrix multiply that happens if you give it a
half precision input, and a floating point precision input, half being, you know, a representation of
floating point numbers that uses only half the number of bits. And you know, because there’s less
bits, there’s less compute to do. And so if you actually have silicon for it, which NVIDIA GPUs do,
it can run faster. So if you pass it in, in this half precision way, your stuff magically gets
faster. So that’s mixed precision operations. But the automatic part of automatic mixed precision,
you actually don’t have to do anything to your models to get the benefits. Automatic mixed
precision’s API is this context manager, you say, okay, turn on AMP, and then magically your modules
use mixed precision when it’s appropriate. What exactly does AMP do? Well, the heuristic that’s
applied here is actually pretty simple. Basically, AMP says, okay, when it comes to operations involving
parameters, this is the situation where the extra resolution on the parameters tends to not be so
useful, right? Like we use floating point 32 bit floating point numbers to represent parameters,
because we need to be able to do updates on them. But as far as the computation for the neural network
is concerned, most of that precision is not actually used. And so it turns out and you know,
this is not obvious, you had to run experiments and show okay, actually, this is profitable. It turns out
that you can just cast your floating point parameters into half precision, run your network this way, and
it will use less memory, it'll run faster, and it will train just about as well. So Michael Carilli and co at NVIDIA actually did an implementation of AMP as part of their Apex toolkit, you know, advanced PyTorch extensions. And at some point, you know, mcarilli was talking to me at the PyTorch DevCon. And he was like,
hey, you know, I want to put this in core, like, how can we do it? And at the time, we had been working
on this new dispatcher thing. Yes, I talked about the dispatcher a lot, because my team composability
works on dispatcher features, like that’s kind of what one of the big things we do. So I was like,
oh, you know, there’s this interesting new thing called the dispatcher. And I think it gives you enough
rope to actually implement automatic mix precision. And you know, we went back and forth a bunch of times
with a few different proposals. But in the end, we have this implementation of AMP.
It works transparently, it has the same API that apex had, namely, context manager, you don’t have to know
anything about it when you’re writing operators, it’s a complete extension on top of operator writing. So like,
if you’re just a plain old fashioned operator, then some normal behavior will happen in that situation. Like, you know,
you don’t have to worry about it. And that’s, that’s important too, right? Because not all algorithms
have faster mixed precision implementations, like matrix multiply and convolutions, those actually have
tensor core algorithms, and they can go faster at half precision, but a lot of things don't. And so,
you know, there’s no need to deal with them in that case. And then, furthermore, it’s extensible in
the sense that if you have external libraries, like, say, TorchVision, which doesn't live in PyTorch,
they can also be extended to use AMP. And it’s all extensible, right? Like sort of AMP is this like
capability layered on top of PyTorch. Operators are extra pieces of functionality that are layered on top
of PyTorch as well. And the dispatcher lets us, you know, put the square together, we don’t have the
expression problem, we can actually do the extension in both ways, and then fill in the last corner of the
square. Okay, so how does it work? Well, let’s remember what AMP wants to do, right? So what AMP
wants to do is, when you turn on this mode, when you turn on this context manager, we need to change
the behavior of all our operations that know about, you know, AMP, and this will be a fixed set of
operations that, you know, heuristically, we know are useful to do AMP things on. And we need to change
the behavior to instead of taking parameters directly. We say, Okay, well, I don’t want to
take this parameter directly, I want to cast it to a half precision, and then run the operation on it.
So algorithmically, that’s what we want to do. Like, sometimes, when I get an operator, I want to cast
things, and then you know, use the cast. And furthermore, like, you know, if this parameter
is being used a bunch, I want to cache the cast in this situation. So I’m not repeatedly converting it
unnecessarily. So how do we go about doing this? So step one is how to actually intercept operations
when you want to, you know, when a context manager is being set. And this is actually like the textbook
use case of what we call mode dispatch keys. So what is a mode dispatch key, a mode dispatch key
is a dispatch key that typically isn’t put directly on a tensor itself. But instead is something that
gets put into our thread local state that you know, basically, in the dispatcher, we have a thread local
state that lets us include dispatch keys and lets us exclude dispatch keys globally, no matter what the
tensor inputs are. So to, you know, enable this context manager, the AMP context manager, when you turn it
on, says, okay, put the autocast key into the local TLS, so that whenever I do operations, this key gets included. And then if, you know, the autocast key is not in the local TLS, well, then we just bypass all these
kernels. The second recipe here that we need to know about is what are we going to do about all the
operators that you know, don’t know anything about AMP. In this situation, we want to just sort of fall
through to the default behavior, we just want to run the normal operation in the situation. So there’s
another tool in our toolkit in dispatcher. And this is called a fall through kernel. So fall through kernels
are kernels that we put in the dispatcher that say, hey, don't do anything here. Instead, just fall
through to the next valid implementation for the dispatch key in question. And you know,
why is there a next valid implementation? Well, all the dispatch keys are ordered in a sense,
right? So there’s a priority, you do autograd first, then you do the CPU key. And in this ordering,
autocast needs to live somewhere. And so, you know, when we have a kernel and we hit autocast, because, hey, you know, autocast mode is on, if that kernel doesn't do anything special for autocast, fall through just says, okay, go to the next key in that case. And
most typical autocast kernels are going to go ahead and do some operations, and then also do a
redispatch, they’re going to say, okay, forget about doing any more autocast stuff, I’m done with
autocast, go ahead and do whatever the next thing you’re going to do was. Cool. And actually fall through is
implemented very efficiently, because the way we determine what dispatch key to, you know,
call into in the dispatcher, is we actually look at a bit set of all the dispatch keys, and we just do a
find the first set bit. So when you have a fall through installed for a kernel, we actually just
don't set the bit inside this bit field. And you don't actually have to, you know, go ahead and do the dispatch and then realize, oh, there's nothing to do here, fall to the next one; it's completely free.
So you can always add these fall through keys without paying any cost. Okay, so we’ve got a
way to intercept all operations when a mode is set using the TLS key. And we have a way of making
sure operators don’t actually call the AMP kernel if we don’t know anything about them. Namely,
we have a fall through kernel, and we register it as this fallback, right? So anything that, you know, doesn't explicitly have an autocast kernel registered just does the fallback. What about the actual implementations of operators that do have autocast kernels? Well, it's not too hard, right? So intuitively, you know,
we've gotten all our inputs, and we need to decide, you know, whether or not we're going to cast some of them to half precision. And then eventually, we need to call into the actual operation that is underlying the autocast implementation. So what are the steps to this? Well, you know, naively, the meat of the algorithm, right, is looking at an input and deciding if you're going to cast it to half precision. Unfortunately, there's no, like, cut-and-dried rule for how to actually decide
like, you know, matrix multiplies and convolutions are likely to be profitable with half precision. If you’re
doing reductions, you probably want them at a higher precision because, you know, catastrophic
cancellation is more of a problem. But you know, really, really, it’s, you know, testing things
out and seeing what works well on actual models that you want to run things on.
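For reference, the user-facing API looks like this (a minimal sketch using the torch.autocast spelling from recent PyTorch releases, assuming a CUDA device is available; matmul is on the autocast list, so it runs in half precision inside the region and in float32 outside it):

import torch

a = torch.randn(8, 8, device="cuda")
b = torch.randn(8, 8, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.float16):
    c = a @ b            # matmul is autocast-eligible: runs in float16
print(c.dtype)           # torch.float16
print((a @ b).dtype)     # torch.float32 outside the autocast region
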
Okay, so let's say that you decided that, okay, this parameter should get cast to half precision, if it is a parameter. So we have a helper function that attempts the cached cast. And what it does is it says, okay, you know, is this a parameter? Namely, you know, is it a leaf variable, and make sure it's not a view. We actually forgot to put the view check in, and this really resulted in some hilarious bugs, where people were taking views of parameters in loops, and we were continually adding things to the cache. Parameters are good, because there's a fixed number of them, you don't have to worry about there being too many of them. And they stay live for the entirety of the computation. So there, it's usually safe to cache them because the lifetimes line up. Okay, so you look and see if it's a leaf, if it's not a view, and then all you need to do is go ahead and cast it and, you know, put it in the cache. And the cache
is just a good old fashioned hash map. And it gets cleared at the end of every training loop, namely when
you, you know, exit auto casting. And that’s pretty convenient, right? Because at the end of the training
loop is when your parameters are likely to update, and therefore when all of the cast entries are likely
to be invalid. Okay, so how’s that actually implemented in PyTorch? Well, there’s a bunch
of operators that, you know, do need autocasting support. And actually, you know, the code you need to write in this case is very regular. And so at the time that mcarilli was working on autocasting, we still had a lot of bugs in our boxed fallback, the mechanism I talked about in the previous podcast, which we use to implement conjugate views. So that didn't sort of work out. And it was okay,
because there’s only a fixed number of operators that he really needed. So instead, he just wrote
a little template thing, right? So he has this template meta program that takes in the name of
the operator, takes them what the type signature of the operator is, and then, you know, constructs
a new wrapper function that, you know, does the operations based on some policy, right? Because
some functions, we want to cast a half precision, some functions want to stay as float,
some functions, you know, if there's an explicit dtype, we want to use it. This is just a template
that picks apart the arguments, you know, looks through them, checks for parameters,
cast them to half precision, then sets a dispatch key guard that says, okay, don’t ever go to auto
cast anymore, and then redispatches. By the way, on the redispatch, typically the redispatch is going
to autograd. And the reason we want redispatch to go to autograd is because autograd is going to save
some inputs for backwards. And we would much appreciate it if it saved the half precision
inputs, because that’s half the memory you’re spending saving things for backward.
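To make that wrapper's job concrete, here is a schematic Python rendering of the policy just described. This is purely illustrative pseudocode of the idea, not PyTorch's actual C++ template: the cache keying, the eligibility checks (Tensor._is_view is a private helper used here only for illustration), and the redispatch are all simplified.

import torch

_cast_cache = {}  # conceptually cleared when the autocast region is exited

def cached_cast(t):
    # Only cache casts for "parameters": float32 leaf tensors that are not views.
    if t.dtype == torch.float32 and t.is_leaf and not t._is_view():
        if t not in _cast_cache:          # tensors hash by identity
            _cast_cache[t] = t.half()
        return _cast_cache[t]
    return t.half() if t.dtype == torch.float32 else t

def autocast_matmul(a, b):
    # Cast eligible inputs, then call the underlying matmul; in the real
    # implementation this is a redispatch, so autograd sees (and saves)
    # the half-precision inputs for backward.
    return torch.matmul(cached_cast(a), cached_cast(b))
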
Okay, so you know, we’ve got our dispatcher, which lets us, you know, set up this autocast key,
that’s a mode that only gets, you know, turned on when we need them. We talked about what to do about
operators that don’t need autocasting. And we talked about what ought to do about operators that need
autocasting. And actually, that’s it. Like, autocast is a really, really short implementation. There is
not very much at all to it at all. And, you know, it’s a single file in our code base called autocast.cpp.
You can read through it, it’s got all the interesting details. Really, the hardest thing is just, you know,
figuring out what the policy you should apply on the operations should be. And shortly after we added
Autocast to PyTorch Core, you know, Francisco Massa, for example, added support for AMP in TorchVision.
So it’s actually fairly well supported even throughout the library ecosystem.
AMP was so influential that actually Intel is working on a CPU version of AMP, not for half precision,
because there isn't really good silicon for doing half precision on CPUs. But bfloat16 does pretty well on CPUs, especially when you're vectorizing. So they want a version of automatic mixed precision that does bfloat16 on CPUs. And they're just, you know, modifying the existing CUDA autocasting code
to work in this case. So that’s how autocasting works. Take your parameters, cast them to half precision,
cache that cast, and then, you know, use it throughout. And once again, the way that it is integrated into
PyTorch in an orthogonal way is by using the dispatcher, which lets us, you know, layer on extra pieces of
functionality that you don’t have to care about unless, you know, you actually do want to care
about it. And then you can write implementations for it. That's all I wanted to say for today. Talk to you next time.
EP15 Shared-memory
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about a kind of niche topic,
which actually you’re probably using,
even though you don’t know about it,
namely shared memory and PyTorch.
What is shared memory?
Well, let’s think about what happens on your computer
when you want to run multiple processes.
Each process ordinarily has a separate memory address space
that is isolated from every other process on your system.
And, you know, if you remember how your operating systems class explained it,
there’s, you know, a very fancy virtual memory system
that your operating system implements along with your processor
to actually make this possible.
So having your processes have separate memory is a really good idea
because, you know, you really don’t want one process
stomping over the memory of another process accidentally.
For example, if you have a buggy, you know, Firefox instance,
you don’t want that to, you know, go into your bank account application.
That being said, sometimes it is useful to share some memory
between two processes.
And your operating system also has a facility for that,
and it’s called shared memory.
Normally, shared memory gets used when you do shared libraries.
So what’s the idea behind a shared library?
Well, the idea behind a shared library is that you have a bunch of libraries
on your system that might be used by multiple processes.
And it’s a waste to actually, you know, have separate copies of exactly the same binary
in each of the processes that you want.
So, you know, a shared library is designed in a way that, one, it can be put anywhere
in your address space, aka it is so-called relocatable,
or it has been compiled with position-independent code, -fPIC, as it's called.
And then, you know, using the virtual address tables, your operating system only needs to hold one copy of the shared library in physical memory, and then will just, you know, map it into the virtual address spaces of all the processes that are actually using the shared
library.
So that’s a really common use case of shared libraries in Unix-like systems.
How about in PyTorch?
Well, in PyTorch, shared memory can come in handy when you have a tensor,
and you want to share the contents between multiple processes.
Now, this is actually, you know, a little bit tricky to do, right?
Because if you’re wanting to write into the tensor, normally, if you have multiple concurrent,
you know, processes or threads working on writing something, you have to do some sort
of synchronization.
But sort of, you know, one of the glorious things about machine learning is it doesn’t
really matter if you synchronize or not.
So-called hog-wild training methods actually work pretty well, and they just work by sort
of, you know, YOLOing the updates without any synchronization, and things sort of just
work out in the end by the magic of gradient descent.
So PyTorch has support for shared memory so that you can take a contents of a tensor and
share it between multiple processes on a single machine.
And this is most useful usually because, you know, Python is silly, it’s got the global
interpreter lock.
So if you actually want to do, you know, parallel processing on a single node, you usually need
to have multiple processes to like actually max out your CPU, because otherwise, you’re only
going to be writing Python code on one core.
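As a tiny example of what that looks like with the documented torch.multiprocessing API (the hogwild pattern in real training code is the same idea, just with a model's parameters):

import torch
import torch.multiprocessing as mp

def worker(t):
    t += 1   # in-place write lands directly in the shared buffer

if __name__ == "__main__":
    t = torch.zeros(4)
    t.share_memory_()                     # move the storage into shared memory
    p = mp.Process(target=worker, args=(t,))
    p.start()
    p.join()
    print(t)                              # tensor([1., 1., 1., 1.]): visible in the parent
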
Okay, so what does this look like in PyTorch itself?
Well, there’s a few things that, you know, you have to know about shared memory that like lead
to a bunch of things that PyTorch does to sort of make this a seamless experience.
So one thing is that shared memory on your operating system is not reference counted.
In fact, once you create some shared memory, it will stay there indefinitely until someone
explicitly decides they’re going to get rid of it.
And this kind of makes sense because, you know, shared memory is often represented as a file in a special shm mount point on your operating system, like /dev/shm.
And, you know, of course, files, files don’t go away unless you actually RM them.
And so this leads to a problem, which is that, you know, let’s say that, you know, I allocate
some shared memory.
Well, I need to get rid of it when I’m done with it.
Otherwise, it’s going to hang around until the end of my, you know, operating systems,
until it reboots or something like that.
So you could imagine setting up your process so that, you know, if the process, you know,
is shutting down, then it can deallocate all the shared memory.
But this works out poorly if your process, for example, crashes for whatever reason, and none
of the destructors run in that case.
So actually PyTorch solves this problem by providing a sort of watchdog process.
This is the shm manager, the shared memory manager.
And what the shared memory manager does is, you know, when we start using shared memory
inside PyTorch, we spawn off a daemonized version of this watchdog process, whose only job in
life is to watch the relevant processes that, you know, are associated with this PyTorch instance.
And when all of them are dead, clear all the shared memory in question that it has been told
about.
So in this particular case, the shared memory watchdog process is much smaller.
It’s not running custom user code.
It’s just getting signals from the processes when shared memory is being allocated, and
when it’s being deallocated.
So it’s much less likely to accidentally crash due to a bug.
And, you know, it's a way we can make sure shared memory actually, you know, gets cleaned up in this way.
Okay, what are some other things that we need to do to make shared memory work out?
Well, another thing we need to do is we need to actually, you know, back our tensors with
the shared memory in question.
So how does that work?
Well, you know, we have a representation for tensor and, you know, inside the tensor is
a data pointer that points to some data.
And we represent this internally via a data pointer class, which sort of says, hey, here’s
where the data is.
And also here’s where to deallocate.
Here’s how to deallocate it.
And so the fact that the deallocator for memory stored by tensors is actually, you know,
user programmable means that you can actually override, you know, where things come from.
So if you’re just doing a normal tensor allocation, you just say, okay, I want the stock CPU allocator.
And that gives me a data pointer that says, okay, to free this memory, just free it in the
normal way.
But if you’re doing shared memory, and you want to like pass it around with another process,
then you can use a different allocator, which says, okay, please allocate this shared memory
for me.
And when it’s done, deallocate it by, you know, both deallocating the shared memory in whatever
special way it needs to be.
And also sending a message to the shared memory manager to say, okay, well, I’m done with this
shared memory, you don’t have to worry about it anymore.
And so in fact, the way we implement shared memory in PyTorch is there’s actually a few allocators.
So there's a THMapAllocator, which says, okay, I'm just going to give you some shared
memory, and then I’m going to get rid of it, you know, unmap it the normal way when you’re
done with it.
There’s also a ref counted shared memory allocator, which says, okay, well, you can give me this
shared memory, and I’ll actually keep track of it via a ref count that is distributed over
all PyTorch processes.
So, you know, if I have multiple PyTorch processes that are referring to this shared
memory, I won’t deallocate it until the distributed ref count goes to zero.
And so once again, you know, what does the deallocator in this case do?
Well, it just says, okay, well, when you’re done, you know, decrement the distributed ref
count, and then also check if the distributed ref count has gone to zero.
If so, free the shared memory.
By the way, how the ref counts are stored, also shared memory.
And you know, it’s just the easiest way to implement this sort of thing.
And of course, the the managed shared memory allocator is the one that knows about the shared
memory manager.
And that one does the stock behavior, but also talks to the shared memory manager to get things
done.
Okay, so that’s it about shared memory on CPU.
But it turns out that we also support shared memory on CUDA.
And the way we do that is sort of very similar: the CUDA API provides a way of taking some arbitrary CUDA memory, and then saying, okay, create an opaque handle, some byte string that, when passed to another process, can be used to get another CUDA handle to the memory in question.
And so this way, you can also share CUDA memory across multiple processes.
How convenient.
However, CUDA shared memory works a little differently than CPU shared memory.
Unlike CPU shared memory, where, you know, once you allocate it, it just stays live until you explicitly delete it, CUDA memory only stays live as long as the host process actually keeps the CUDA memory alive.
And so for the longest time in PyTorch, we had this restriction that, you know, when you
have some CUDA shared memory, you must make sure that the CUDA shared memory stays live in
the originating process long enough for all the consumer processes to be done using it.
Otherwise, very strange things will happen.
And, you know, these strange things include, you know, like it being overridden with total
garbage, because remember, we have a caching allocator.
So we don’t actually CUDA malloc and CUDA free every time you allocate CUDA memory.
We, you know, allocate a big chunk of CUDA memory, and then maybe sometime in the future,
you know, we reuse it for something else.
So if someone else is still referring to, you know, some CUDA IPC memory that, you know, we decided was unused on the host side, and then reused it for something else, they'll see it actually get overwritten with some random data from the next allocation.
So that was a kind of foot gun.
And when Vitaly Fedyunin joined the PyTorch project, one of the first things that he implemented was
distributed ref counting for CUDA IPC tensors as well.
And it works kind of similar to how CPU ref counting works, right?
So there is a, you know, shared memory file, hey, you know, shared memory once again, that
maintains the distributed ref count.
And then there’s just a sort of polling mechanism on the host side, which just looks and sees,
has the, you know, ref count gone to zero?
Has the ref count gone to zero?
Oh, the ref count’s gone to zero.
Now I can release the tensor.
There were a bunch of different possibilities we had for how to go about doing this.
But polling was the sort of simplest to implement.
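Here is a rough sketch of what CUDA IPC sharing looks like from Python, assuming a machine with a CUDA GPU; torch.multiprocessing takes care of turning the tensor into an IPC handle, and the distributed ref counting described above is what keeps the memory alive until the consumer is done.

import torch
import torch.multiprocessing as mp

def consumer(q):
    # The tensor arrives via a CUDA IPC handle; no copy of the data is made.
    t = q.get()
    print(t.device, t.sum().item())

if __name__ == "__main__":
    # CUDA tensors have to be shared using the "spawn" (or "forkserver") start method.
    ctx = mp.get_context("spawn")
    q = ctx.Queue()
    p = ctx.Process(target=consumer, args=(q,))
    p.start()
    q.put(torch.ones(1024, device="cuda"))
    p.join()  # the producer-side reference is only released once the consumer is done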
Okay, so shared memory is a way to share memory between multiple processes in your system.
It’s not so useful if you’re doing multi-node training, but because Python has a GIL, it’s pretty useful if you’re using a single node and you just need multiple processes to parallelize.
You probably are using shared memory if you’re using Torch multi-processing.
And there’s just a few things that PyTorch does to make this work out nicely.
But, you know, mostly we’re just relying on, you know, mmap support for shared memory files.
So that’s all I wanted to say today.
Talk to you next time.
EP16 Stacked-diffs-and-ghstack
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about something a
little different, namely instead of talking about PyTorch itself, I want to talk about one of the
tools that we use to help develop PyTorch, namely GHStack. GHStack solves a problem that goes
something like this. Imagine that you’re working on some code in your project and you know you go
hack, hack, hack, type, type, type, and you’ve got a working implementation and you send it up for
review as a pull request on GitHub. And while you’re waiting for people to come and actually
review your beautiful code, you’re like, okay, well, I’d kind of like to start working on the
next feature, which is going to build on top of this patch that I did before. Maybe this patch was
some refactoring or, you know, a little bit of infrastructure that you needed for the next thing
you were going to work on. So, okay, back to your local development copy, you hack, hack, hack, work
on your next, you know, piece of the, piece of the puzzle, and maybe you’re now done with that piece
as well. Now what do you do? Well, imagine that, you know, your first PR still hasn’t been reviewed or
maybe it has been reviewed, but it still hasn’t landed to PyTorch because lands on PyTorch take a really
long time. Don’t ask. It just, it takes a really long time. So, you know, you’ve got the second pull
request and, um, sorry, you’ve got the second patch on top of your first patch. It’s logically
independent. So like, you know, it can be reviewed in isolation from the first patch, but you’d kind
of like to put it out there and let people take a look at it. How can you do this? Well, GitHub doesn’t
really make it easy for you to handle a situation like this because, you know, pull requests are,
hey, here’s a branch and here’s the master I’m going to compare it against. And then that’s what
the diff is going to be. So there’s no easy, like built-in workflow for submitting extra patches on
top of each other for review without, you know, necessarily forcing someone to review all the code
from your previous patch as well. So GH stack implements what we call stack diff development,
and it solves this problem, namely by allowing you to create pull requests that, um, are stacked on each
other so that, you know, you can submit your first pull request, then you can submit your next pull
request, which depends on the first pull request. And if someone wants to, you know, uh, review, they can
review each of these pull requests separately, but the second pull request still has all the changes
from the first pull request. So you can build on top of your work. Stack diff development is not
really an invention of the PyTorch project. Um, you know, a lot of people have used it before in
particular, if you use the code review tool called Phabricator, um, developed by Facebook originally,
um, that also implements the stack diff model. And really GitHub is a little behind the times and
still not supporting stack diffs. I heard that they’ve got some feature development in the works
for supporting this workflow, but right now it doesn’t work. And so, um, you know, we have tools
like GH stack to make this easier for us. Okay. So how do you use GH stack? Well, let’s imagine that,
you know, you’re doing this story that I just told you before, which is you hacked on some feature A,
and then you hacked on another feature B, which depended on feature A. So normally, uh, you know,
well, everyone uses Git a little differently. And so like one workflow that you might do is you say,
okay, well, I’m just going to just keep, you know, committing stuff like edit. Okay. Rework,
you know, foo bar until, you know, you get to the end. Right. And then you like do a bunch of extra
commits on top. And then you push all those commits to the pull request to be, um, you know, reviewed.
And then, you know, adding new updates, isn’t so hard, right? You just make a new change,
you commit it, and then you push it to your pull request. So GH stack requires you to work a little
differently. Instead of maintaining a blow by blow commit history of every change you made,
instead, GH stack wants you to create a single commit per, um, per logical change that you want
to submit to Phabricator. So let’s say, sorry, submit to GitHub. So let’s say that, you know,
you’ve got three changes, right? Two like refactors that are independent of each other. And then a
feature implementation, you structure these so that you have a commit one, which is refactor one,
a commit two, which is refactor two, and then a commit three, that is, you know, the actual feature
in question. And then once you have these three commits, and they’re all ready to go,
you run GH stack. Well, if you need to install it, you pip install GH stack, and then you run GH stack.
And what GH stack will do is it’ll look for every commit that, you know, is off of your branch from
master, and it’ll create a pull request for each one. So if you had three commits, it’ll create three
pull requests. Then when you want to make changes to the pull request later, you just go ahead and
amend or interactively rebase them. By the way, about interactive rebasing, interactive rebasing
might sound, you know, tricky and complicated if you’ve never done this sort of thing in Git before,
but it’s actually very easy to use. And the way an interactive rebase works is you write
git rebase -i, and Git will give you a list of all the commits that you’ve made on top of master.
And then all you need to do is say, okay, well, this is the commit that I want to edit. And this
is the commit that I don’t care about. And so, you know, you say edit, and then Git rebase will drop
you into a working tree with only, you know, the commit that you want to edit. And so you can go
ahead and edit it, amend the commit, and then continue your rebase further on. So the way I tend to do
interactive rebases is that I, you know, first I work on my patches, like, you know, okay, patch one,
I’m done, commit it, work on patch two, I’m done, commit it, and so forth. And once I get to the end,
usually what I do is, if I notice that I need to fix up on commit one, and it’s a small one, I’ll just
make a little edit at the very top, I’ll commit it, so that I have a separate fix up commit, you know,
standing on its own, I don’t run gh stack yet, instead, I, you know, run my build, make sure
everything works. And then I do an interactive rebase to move my fix up commit to the commit
that it actually logically belongs to, and then amend it in using the so-called fixup option in
the interactive rebase. And so this, you know, makes it easy for me to keep track of all the changes that I
want to do, you have to make sure not to like overly merge conflict with yourself when you’re doing
this kind of thing. But it gets easier with practice. So anyway, so that’s it, right? So
you’ve got these three commits, and then Git gives you some tools for modifying, you know, commits in the
middle of the stack. And I mostly try not to like make modifications. And you know, it’s mostly a way
of me letting myself get ahead of myself when I’m working on code. By the way, in Phabricator and
Mercurial land, there’s actually support for actually going backwards and forwards in history
using the hg prev and hg next commands. So this is actually a much better user interface than Git.
Haha, sorry, Git. Well, Git’s user interface is famously bad. So it’s not surprising if I’m bad
mouthing it. So like if you wanted to amend a previous commit, one that
wasn’t at the top of your stack, instead of having to do the Git thing of setting up an interactive
rebase, or, like, making a fixup commit and then moving it to the right place, all you have to do in
Mercurial is say hg prev, and that’ll put you in the previous commit; you can go ahead and modify it.
And then Mercurial, if you have enough extensions installed, will automatically restack all of your
later commits on top of this one. This is very convenient. And I like it better. If you want to
try to, like, replicate this developer experience in Git, there are some quilt tools, apparently,
I’ve never used any of them. But I think they’re trying to do something similar. So anyway,
so you’ve got the stack of diffs, right? So you’ve got a stack of commits, you run gh stack on them,
they all get posted to GitHub. And that’s it: edit them, and then gh stack again to, you know,
put some more things on. And for example, if you need to update to the latest version of master,
you just need to use a non-interactive rebase in this case. So, you know, you got your commits,
and you say git rebase, you know, origin/master, if you’re, you know, just git fetching,
like I normally do. And then it’ll just move all your commits over. And of course, you might have to
resolve some merge conflicts. But you know, it’s, it’s, it’s pretty straightforward. And you know,
not much more difficult than merging. One downside to rebasing in this way, is you have to resolve merge
conflicts for each commit individually. Unlike a merge commit, where you just do everything all at
once. This makes sense, because, you know, when we actually land a stack of diffs from gh stack,
we will land each commit separately. So they will show up durably in the final GitHub history. That’s
good for us, because, you know, you went through all the trouble of making sure CI was passing on every
commit. So we will go through the trouble of making sure we preserve history in the situation.
Okay, so that’s basically how gh stack works. You can get it once again, by pip installing gh stack.
There is one caveat, though, which is that in order for gh stack to work, you need push permissions
to the pytorch/pytorch repository. So most people just, you know, fork pytorch and push their stuff
there. And unfortunately, gh stack doesn’t work because the way it works is that we create a bunch
of branches representing what you’re trying to merge into, and then what your actual commits are,
right. And the what-you’re-going-to-merge-into branch has to actually live on pytorch/pytorch,
because if it doesn’t, when you open a pull request, you’ll open the pull request in your fork and not
in pytorch itself. Okay, well, can I get write permissions? Well, if you’re working on some feature
that, you know, might be useful for stacking, and you, you know, have talked to someone on the pytorch
team about it, like say, on an issue, you know, you can just ask for write access. And we basically give
write access to anyone who asks for it, because you can’t actually write to master
directly. There’s some complicated process by which commits are sucked into our internal build system
in fbcode, and then spit back out via this piece of software called ShipIt. So you can’t touch the
master branch, you can just create temporary branches. And so if you need to use gh stack to organize one of
your PRs, just ask someone and we’ll add you to the project. Okay, so what are some things to know
about when using gh stack? Well, one thing is that when you rebase your gh stack, or you make a modification
to a commit that’s very early in the gh stack, we will push an update to every subsequent PR in your
stack. So please use this with care, right? Like each PR you push will trigger off a full CI run for
everything in PyTorch. Our CI runs are not that cheap. So you know, like try to be nice and don’t,
you know, repeatedly repush your stacks when, you know, oh yeah, I just need a little bit
of modification here; maybe defer that till later, once you’ve finished all of your modifications, and
then push the gh stack all at once. Another common thing to do is you’ve got your gh stack, and you’re just
working on the latest commit, as long as you don’t rebase it onto master, you can safely gh stack in
this case, and it’ll only push the, you know, latest commit that you modified in that situation.
What are some other things to know about gh stack? Well, gh stack is also an open source project. It
lives on GitHub at ezyang/ghstack. Yeah, I sort of wrote this tool, because I was so mad at having
to deal with not being able to do stack diffs on GitHub. So I wrote it just to solve my own problem.
And it’s, you know, it’s not very much code. And you know, you can also check that out and use it on
your own projects. gh stack supports other repositories, you just have to use a special
command to land gh stack diffs, because the normal GitHub merge button doesn’t work. For pytorch/pytorch,
this doesn’t matter, because the normal GitHub merge button doesn’t work anyway, for completely
unrelated reasons related to the ShipIt situation. That’s something you have to know about there.
Okay, so gh stack, it lets you do stacked diff development. Stacked diff development is really good. It lets you
move ahead on what you’re working on without blocking on code review, or your code actually
landing to master. It makes me a lot more productive. And maybe it will make you a lot more productive
if you’re not just working on one off patches in PyTorch. So give it a try. That’s all I wanted to say.
Talk to you next time.
EP17 Continuous-integration
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about the
continuous integration service that runs on all your pull requests in PyTorch. This service has
sort of been built over many years and has gone through various different versions and is probably
going to change some more in the future as well. So what’s up with that? It’s also really complicated
because we test in a lot of configurations. So what’s up with that? And how can I understand
how the CI works? Well, it’s not too bad because there’s a few very important constraints that went
into the building of the CI. And if you understand those, you’ll kind of understand why things are set
up the way they are. Okay, so let’s talk a little history first. So what did PyTorch’s CI look like
at the very beginning? Well, at the very beginning, there were only really like four developers working
on PyTorch. And we were running all of our CI on Travis, you know, for example, because that was
what everyone used at the time. And we had a problem. And the problem was that PyTorch is a GPU
accelerated library, we needed a way to actually run our code on GPUs. And none of the CI services
actually made it possible to do this. Okay, so what do we do in this case? Well, we did what any, you know,
enterprising hacker would do. Soumith set up a desktop box in his apartment with a GPU in it. And we set up a
Jenkins instance to, like, go ahead and run our GPU tests on that single box. And when you only have
four developers on a project, this kind of works okay. And we added a few more developers. And you
know, I took a box home to my apartment. And you know, we had two GPU boxes. But this clearly wasn’t
sustainable, right? Like the PyTorch project, even back then, it was growing, we were getting more and
more pull requests. And our backlog for the GPU runners was getting more and
more backlogged. So Pieter Noordhuis was sort of looking around for something to do at the time.
And he was like, Okay, I want to build a new CI system for PyTorch. And so he was like, Okay, well,
we need to be able to run GPUs. And we need to be able to scale so that it’s not just, you know,
two GPUs in people’s basements in their apartments. Imagine having a basement in your apartment.
And so how are we going to do this? Well, once again, because none of the CI providers provided
this, we needed to just build it on top of AWS. So we did, we built an auto scaling Jenkins,
you know, fleet of machines that, you know, could run GPU and CPU jobs. Fortunately for us, AWS
would sell us GPU machines. In fact, it would even sell us Windows GPU machines. The only thing it
wouldn’t sell us were OSX machines, because Apple is a thing. So we actually just bought a bunch of
fixed runners from MacStadium to get that going. All right, so you know, that’s sort of the first
iteration of the CI system. And you know, we’re going through a bunch more iterations. So at some
point in the past, we migrated to CircleCI, when you know, that was about the time when, you know,
CI providers who you could pay money to actually started supporting GPUs. And so you know, we helped
CircleCI get their GPU support up and going. And now we’re kind of looking at moving again to GitHub
actions, because GitHub actions is just really well integrated with GitHub. And we like that a lot.
Okay, so that’s enough history of the CI. So once we, you know, upgraded from just randomly running
things on people’s machines, to actually running things in CI, we also made some key design
choices that have sort of stayed with the CI system today, even though we’ve migrated from one system to
another and probably are going to be migrating again. So the first big decision we made was, hey, GPU
machines are really expensive. So we don’t really want to spend time running code on GPU machines when
it’s not necessary. And in particular, the most time consuming thing that is totally useless to run on a GPU
is building PyTorch. Of course, you know, normally, you wouldn’t build a GPU enabled version of PyTorch on a
computer that doesn’t have a GPU, because it would be kind of pointless normally, but you can do it. And if
you’re, for example, building binaries, you know, you can always set up the binary build, and then send it off
to another machine to actually run on, which does have a GPU. And so that’s how we set things up in the CI as
well. When you run a GPU job in our CI, we don’t build it on a GPU machine, we build it on a CPU machine that has CUDA
installed but can’t actually run anything. Once we’re done building, we actually go ahead and send
it over the wire to the GPU executor through various mechanisms. Right now we send them via an ECR
registry that’s in AWS, but there are a bunch of ways you could do it. And then only then we run the tests
which do require GPUs. And that’s how we actually do the testing in this case, right? So GPUs are
really, really expensive. They’re like 10x more expensive on AWS. Is it 10x? It’s like an order
of magnitude more expensive. And they’re also more expensive on CircleCI as well. So it just makes
sense to reduce the amount of time you’re running on them. Another major constraint that we had is,
you know, hey, PyTorch is a really popular project. And people want to run their PyTorch programs
in a ton of different situations, right? Like they don’t just want to run them on Linux,
they want to run it on Windows, and they want to run it on OSX, and they want to run it on, you know,
various different Linux distributions, and, you know, various different versions of Python. And so,
you know, we offer to support all of these configurations. And this is kind of trouble for
a CI setup, because, you know, these configurations are actually really, really, you know, complicated.
Sometimes there are a lot of different moving parts. Did you know that we actually test PyTorch
under different parallelization primitives? So normally, we use OpenMP, but we also support TBB,
which is Intel’s Threading Building Blocks library. And so that’s another configuration separate from
OpenMP that we test under. And so making sure all the like prerequisite software is installed for all
these cases can be a bit of a chore, and, you know, is wasted time, once again, if you’re doing it
at CI time. So what we did was instead, we said, okay, we’re going to make a Docker image for every
environment we actually want to run our CI in. And then, you know, these Docker images will just,
you know, basically have all of the software you need pre-installed at the correct versions for the
particular run of the CI. And so, you know, for example, when we needed to, like, figure out how to
move things from a CPU machine to a GPU machine to actually run it, we actually just, you know, move the
entire Docker container because that was convenient. Okay, so, you know, we use Docker to actually, you
know, maintain each of the environments. And this is really convenient. And it works pretty well on
Linux. Yeah, Windows and OS X, we don’t really use Docker on, but we also only really test in one
configuration in these situations, because it’s kind of too hard to do it. Okay, so we use Docker for this
purpose. By the way, because we use Docker for this purpose, if you’re, like, trying to debug a
particular Linux failure in our CI, hypothetically, you can download the Docker image that we ran the CI in
and run your code in exactly that environment. And I used to do this a bunch when I was testing very
strange bugs. But it’s a little inconvenient to do. You actually need some credentials to access
the ECR, because Amazon doesn’t support, you know, passwordless ECR authentication. If you feel like you
need it, just ping someone on Slack, and they’ll be happy to give you the credentials to
access the images there. So, so what have I said so far? So GPUs are really expensive. So don’t run
things on GPU if you can. And, you know, we also need to run under a lot of different configurations. So we use
Docker to manage these different configurations. What else? So the last constraint that I want to talk
about is more of a, like, anti-constraint, in the sense that we didn’t, like, explicitly go in to
engineering the system with this constraint. But it sort of just naturally happens if you don’t do
anything else. And what this constraint is, is our CI doesn’t rely on any external servers. Okay, so what
do I mean by this? Well, let me talk about one particular feature that we built into our CI. So one of
the things that sometimes happens is someone breaks a test. And when the test is broken, you either have to
revert the pull request, or you have to, you know, put a patch in. And both of these remediations can be
somewhat slow, because landing diffs to PyTorch is actually kind
of slow, since we have to run all of Facebook’s internal CI before it’s all okay to go. So we
wanted a faster escape valve to turn off test running if we knew that something was wrong,
but we didn’t want to revert in this case. So Zachary DeVito wrote a little thing to make this
happen. And so how do we do it? Well, one way you can imagine doing it is you set up a server that
just says, okay, here are the tests that are known to be okay, here are the tests that are known to be
bad. And then just make sure the CI service pings the server whenever you want to know, you know,
which test should I skip? Because, you know, we know there’s a problem on master. Okay, we didn’t do
that, right? Because to do that, we would actually have to like design a service and bring it up
available to the public internet. And, you know, do all the things necessary to actually run the
service. So you can see why this is an anti constraint, right? Which is that, you know, if
people don’t want to run servers, then they will try very hard not to run servers. And so the way it’s
actually implemented is using, you know, Facebook’s internal cron job infrastructure, because,
hey, you know, Facebook has a bunch of, you know, services, once again, that are not publicly
internet accessible, because, you know, that would be a security risk. We piggyback off of the cron job
service to publish a file to S3, which once again, is a service that we don’t run, right? It’s a service
run by Amazon. And that file gets downloaded when you do testing. And that tells you whether or not,
whether or not a test should be skipped or not, right? So this is a sort of like Rube Goldberg
contraption, whereby you don’t do the obvious thing. Instead, you do the thing that, you know,
reduces the requirement for needing us to actually run a service to get things going. Another example
is the CI status HUD. So if you don’t know about the HUD, it’s a little react app that basically reads out
the information about CI signal for all of our configurations, and displays it in a very compact
form. So it’s easy to see if any particular job has failed. So once again, this job was set up
without needing, so like normally, you’re like, okay, well, I should set up some sort of service,
the service will have a database, the database contains all the statuses. And you know, I’ll just
render it from that database. Well, that’s not how this app works. Instead, the app is just a pure
React app, there is no back end service associated with it at all. Instead, what it does is it queries
Jenkins to get the list of recent jobs via an RPC call, you know, with CORS,
sorry, CORS enabled so that we can actually read the Jenkins data. And also, you know, it reads out a bunch of
GitHub statuses that we actually just stashed once again in S3, and then it renders that. So there’s no server,
there’s no database involved here. We’re just piggybacking a bunch off of other infrastructure.
So recently, we’ve been adding more support for actually putting services behind things.
It’s slow going, right, because we have to make sure it’s all secure and, you know, actually make sure we
administrate the systems. But you know, we’re getting there. But a lot of things that the CI works on,
you know, are sort of done in the circuitous way to make things work out.
Okay, so enough about constraints on the CI. What does the CI actually look like today? Well,
as I said, we run a lot of stuff in a lot of different configurations. And, you know, actually,
it’s sort of infeasible for us to test every combinatorial many configuration that we want to do.
So what we do is, like, there’s usually something weird about some particular job. Sorry, there’s usually
something weird about some particular configuration we want to test, whether it’s, oh, it’s ROCm,
or, oh, it’s, you know, with ASAN turned on, and we pick one particular config to, like, put that weirdness
onto. And the hope is that, you know, the errors won’t be correlated.
So if something fails on ASAN, it’ll fail regardless of what your Python version is or what your Linux
distribution is. So we have a bunch of builds, but like we sort of like packed each of the configurations
we want to test into them. What do these configurations look like? Well, I’ve already told
you we support Linux, OSX, and Windows. Some other things that we need to test, we test with CUDA. We also
test without CUDA. And we also test a CUDA build of PyTorch, but run on a machine that doesn’t have
any GPUs. This is something that we used to break all the time because, you know, it’s subtly different.
And if you make assumptions in the CUDA build of PyTorch that a GPU is always available, then this binary
won’t be usable on CPU. So that’s why we have that build. We build for various different versions of
Python, because our support window for Python is the most three recent versions of Python. And yes,
there are relevant backwards incompatibilities in Python that we need to test for, especially in
Python surface syntax, because, you know, like, for example, F strings, we couldn’t use F strings until
we dropped support for like Python 3.6, I think. So, you know, we needed to make sure people didn’t
actually add in features that were too new. What other things do you test? We have an ASAN
configuration. ASAN only gets run on one build because it’s really, really slow to run ASAN code.
And we also have some other configurations like ROCm GPUs. Actually, the ROCm GPU configuration
still lives on Jenkins, because CircleCI doesn’t actually have any machines with AMD GPUs on them. So we
have to run it ourselves. Actually, AMD has a data center full of servers with AMD GPUs, and they’ve
graciously loaned it to us to run our CI there. Another weird CI configuration is XLA. So what makes
XLA weird is that it is actually two repositories we’re doing CI on: the PyTorch repository and also
the XLA repository. And so whenever you run the XLA build, we always take the latest version of the XLA
repository and do that. This is kind of like bad practice, right? Like what you’re supposed to do in
CIs, you’re supposed to pin versions. But XLA, you know, is constantly adding fixes, and they don’t want
to have to coordinate with the main PyTorch repository. And so we worked out this compromise,
whereby XLA is very responsive. If you make a change to PyTorch, and it needs an XLA change,
they’ll set up a PR that fixes it for you. And then once you land your diff, they’ll just go ahead
and land it straight to XLA. So the breakage on XLA is very small. And this is kind of worked out okay,
because most diffs don’t break XLA. And so you know, like you don’t have to worry. But oftentimes, yeah,
if you see XLA is failing, that’s probably because you know, something got landed in master and XLA just
needs to catch up. So that’s it about open source configurations. We also run CI inside Facebook,
and Facebook CI, you know, sort of mostly tracks open source here, like if something fails in open
source, sorry, if something fails in Facebook CI, usually it means something failed in open source
CI. But there’s a few cases where this is not true. One case it’s not true is if you’re making build
system changes, like you add a new file, you add a new directory, you made CMake changes, Facebook has an
internal different build system based on buck. So usually someone is going to have to go and fix
that change for you. Another thing that is pretty unusual is the internal builds much more aggressively
build on mobile platforms, we have some mobile open source builds, which are also kind of weird,
and you know, worth knowing about. But Facebook’s mobile builds are also kind of weird and interesting.
And so that’s another situation where you might have an error that, you know, doesn’t show up on open
source. But we try very, very hard to make sure that you can get all the signal in open source,
because otherwise, you’re gonna have to go through, you know, very long round trips with a Facebook
employee to like figure out what the problems are. Okay, so that’s everything I wanted to say about our
CI, right? Like, so what is our CI? It, you know, tries to make sure we don’t build things on GPUs,
it makes sure that it is scalable, because we want to, you know, scale with the team. We use Docker to
manage all of our build configurations. And historically, we don’t really run any extra
services, although this is changing over time, especially with the work that, say, Taylor Robie
is doing to do better performance tracking. So that’s all I have to say for today. Talk to you next time.
EP18 Serialization
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about a somewhat
dry but still very important topic to PyTorch, namely serialization. Serialization is the
mechanism by which when you have a PyTorch program and you have some tensors floating around or
god forbid a more complicated program such as a PyTorch module or a TorchScript module,
it allows you to serialize this data to disk so that you can load it up again when you do another
run. So in any sort of you know usual PyTorch program you are probably making heavy use of
serialization because you’re doing things like you know doing your training loop and then saving
your trained weights to disk, because, like, you actually want to use them later for
something, for example. Okay so how does serialization work? Well it’s a long story. So I think the
easiest way to understand where PyTorch is with serialization today is we’re going to first
talk about how serialization works in general in Python and then we’ll talk about how historically
PyTorch did serialization based off of this and then we’ll talk about the new developments namely
JIT zip file based serialization which is what more recent versions of PyTorch are using by default
when you do torch.save. All right ready? Let’s dive in. So instead of answering the question how does
PyTorch do serialization? Let’s ask an easier question which is how do you do object
serialization in Python? And the answer to this is well there are a bunch of ways to do things but
there’s one that is very popular and a lot of people use namely Pickle. Pickle is a protocol and
file format for doing arbitrary Python object serialization in Python. So like if you have some
class or you’ve got some object you’ve got some numbers you’ve got a list whatever you can run it
through Pickle and Pickle will give you a byte stream that you can put on disk and then you can
unpickle things later. How does Pickle work? Well, pickling is defined on a per-object basis, and the way
you define how an object pickles is you define what’s called a reduce magic method, __reduce__,
or, you know, if you’re cool, it’s actually __reduce_ex__, the “ex” meaning that you also get to know
what the pickle protocol version is. So, you know, for any given class, if you want to be able to serialize
something using Pickle, you just define what the reduce function for
things should be, and the way you write one of these reduce functions is actually sort of recursive.
You just define the serialization you want in terms of smaller more primitive objects that you want to
serialize. So let’s imagine that we’re serializing a tensor. So the way we serialize a tensor is that
we actually return a tuple from our reduce function and you can actually go look at this code inside
torch/_tensor.py. It’s all implemented in Python. At least one version of it is. So what do we
do? Well you know we get our tensor and we’re going to construct a tuple containing all of the important
pieces that we need to rebuild the tensor. So it’s going to contain our storage, it’s going to store
our sizes, it’s going to store our strides, and, very importantly, it’s also going to store a function
that says how to take all of these particular pieces, namely, you know, the size and
the stride and the storage, and reconstitute them into an actual tensor, because when we actually want to
you know load our tensor from pickle like pickle needs to know how to actually take one of these you know
tuples and turn it into the actual object in question. So you do that by providing a function,
a rebuild function as we call it internally, that takes the various pieces that you serialized one by one
and reassembles them back into the whole. So that’s how pickling works in general
and pickle itself is a pretty simple file format. There are plenty of tools you can use to look at
pickle so if you’ve like ever thought oh you know pickles are just these opaque blobs that I can’t
actually look into well okay they are binary objects but like the format is actually very simple
it’s like a little stack machine that you use to just build up various data structures inside pickle.
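To make the reduce mechanism concrete, here is a minimal sketch with a made-up class; the Box class and its rebuild helper are hypothetical, but the shape of the code mirrors what torch/_tensor.py does for tensors.

import pickle

def _rebuild_box(contents, label):
    # The rebuild function: takes the primitive pieces and reconstitutes the object.
    box = Box(contents)
    box.label = label
    return box

class Box:
    def __init__(self, contents):
        self.contents = contents
        self.label = "unnamed"

    def __reduce_ex__(self, protocol):
        # Return (callable, args): pickle recursively pickles the args,
        # then calls the callable with them at unpickling time.
        return (_rebuild_box, (self.contents, self.label))

b = Box([1, 2, 3])
b.label = "weights"
b2 = pickle.loads(pickle.dumps(b))
print(b2.contents, b2.label)  # [1, 2, 3] weights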
By the way, if people have ever told you that, like, pickling and unpickling arbitrary objects is unsafe,
that’s because, you know, unpickling can induce object construction, and so if you’re not careful about
what objects you construct via unpickling, then you could accidentally trigger, you know, remote code execution,
like, say, you unpickle an object that actually goes ahead and, you know, runs a shell command from
whatever you pass to the constructor. So that’s something to keep in mind as we get later
in this podcast. Okay, so Python serialization is usually done using pickle, because pickle is built into the
standard library, everyone sort of knows about it, and it’s got a protocol for defining this that most people
do. It’s actually a little tricky to, like, pickle things correctly. For example, imagine that you are
pickling an object, so you have a class and you are pickling it, and then in a new version of
your software you add a new attribute to your class. Right, so adding a new attribute ordinarily is a backwards
compatible change, because, well, you know, all the old users of your code weren’t using that attribute, so what
skin off their back is it if there’s a new attribute? But with pickling in the mix, this is actually
usually a BC-breaking change, because any old pickles from older versions of your class don’t actually have this attribute set.
So when you actually write the unpickling code for your code, you’ll unpickle an object that is missing this attribute,
and so if you assume that the attribute is set, which is a very reasonable thing when you’re writing a class,
it’ll break when you unpickle this old thing. Fortunately, Python has another mechanism for overriding behavior in this situation:
there’s a magic method called __setstate__, which gets called whenever you’re actually populating the state, quote unquote, for an object that’s being unpickled and that doesn’t have a full-on reduce implementation. And so in that situation, the way to make things BC is usually just to look for missing attributes and fill them in before you load the object.
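A minimal sketch of that trick, again with a made-up class: an old pickle simply won’t have the new attribute in its state dictionary, so __setstate__ fills in a default before installing the state.

class Config:
    def __init__(self):
        self.learning_rate = 0.1
        self.momentum = 0.9  # attribute added in a newer version of the class

    def __setstate__(self, state):
        # Old pickles of Config predate the momentum attribute; default it in
        # so code that assumes it exists keeps working.
        state.setdefault("momentum", 0.9)
        self.__dict__.update(state)

# Simulate unpickling an old Config that only knew about learning_rate.
cfg = Config.__new__(Config)
cfg.__setstate__({"learning_rate": 0.1})
print(cfg.momentum)  # 0.9, instead of an AttributeError later on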
Okay, long tangent aside. Okay, so how does tensor pickling work, historically, how was it implemented? Well, we did the same thing that everyone else does, and we use pickle to do it. So what do you expect to see? Well, you expect to see on tensor there’s a reduce implementation, and indeed there is a reduce implementation in torch/_tensor.py, and it has a bunch of functions.
For example, it has a function that knows how to rebuild the tensor. Actually, this function is called v2, because in 0.4, when Sam Gross merged Tensor and Variable, he actually, you know, also changed the serialization format in a backwards and forwards incompatible way, so, you know, we had to make a new rebuild implementation.
Quick digression about forwards and backwards compatibility. So backwards compatibility typically is this idea that if you serialize a tensor to, you know, some saved format, backwards compatible software means that when there’s a new version of the software, you can load that old version of the pickle in the new version.
Backwards compatibility is a good idea; we try very hard not to break backwards compatibility ever, especially with a serialization format. However, there’s a very similar and also important notion called forwards compatibility. So forwards compatibility means: if I serialize an object from a newer version of PyTorch,
can I load it from old versions of PyTorch?
And, you know, it should be clear to see that, like, maintaining indefinite forwards compatibility means you can never ever change the serialization format, but it is useful to be able to, like, load new tensors from older versions.
So whenever we’ve broken forwards compatibility, we’ve usually had some mechanism by which you can get back the old format, in case, in a pinch, you need to send something to an older version of PyTorch.
Okay, so digression over. So we’re on v2 of the tensor serialization, and, you know, v2 is obviously not forwards compatible with v1, but you’d have to be running PyTorch, like, 0.4, so that’s, like, ancient history and no one really cares. But for the zip file format, this is going to be relevant in a moment.
Okay, so what do we do for tensor? We do exactly this: so, you know, we’ve got a function to rebuild tensors based on the data.
What is the data? It consists of storage, sizes, etc. Storage itself also has an implementation of how serialization works, namely, there’s a reduce implementation, but this reduce implementation does something very interesting, which is it actually calls into torch.save to do the implementation.
And so now here is sort of the first interesting thing that’s going on, which is that actually pickle, i.e. the interface that Python gives you for pickling objects, is not the same thing as torch.save and torch.load, which is the other, sort of, very publicized mechanism for doing serialization in PyTorch. Like, usually, when you look at the tutorial, you don’t directly instantiate a pickler and pickle Python objects;
you actually use torch.save to save these things.
So what does torch.save do differently?
Well, if you go look at the implementation of this, it’s a file also all in Python, easy to look at.
What you’ll find is that we actually do most of the things you’d expect, right, which is that we are going to create a pickler, and then we’re going to feed it the data in question, and then out is going to pop a byte string.
What we do differently is that we want to deduplicate storages that are shared between multiple tensors. So let’s imagine you’re serializing a list of tensors, and, you know, the list of tensors is actually a list of views onto a single tensor.
So if you, you know, serialize this the naive way, you would, you know, stamp out a copy of the same data for every single occurrence, and once you deserialized it, these would all be different tensors, and if you mutated one of them, the other tensors wouldn’t get mutated.
So that’s bad; we don’t want that. We want the sharing to be preserved during pickling.
And so the way this is done is we use this other mechanism in Python object serialization called persistent IDs, where basically, for any given object that’s being pickled, you can override the behavior for what happens in that situation.
And so when we see a storage, we actually record a persistent ID that records what that storage is, and then for subsequent, you know, occurrences of that storage,
we make sure they all get memoized into the same version of the storage.
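A small sketch of what this buys you: a tensor and a view of it share one storage, and torch.save/torch.load preserves that sharing instead of stamping out two copies.

import io
import torch

base = torch.zeros(4)
view = base[:2]          # a view: same storage, different sizes/strides

buf = io.BytesIO()
torch.save([base, view], buf)
buf.seek(0)
base2, view2 = torch.load(buf)

view2.fill_(1.0)
print(base2)             # tensor([1., 1., 0., 0.]): the sharing survived the round trip
# The storage was written out once (keyed by its persistent ID) and referenced twice.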
So okay so that’s basically in a nutshell how serialization used to work in the old days.
And so what happened?
Well, serialization in the old days was just targeted at eager mode, right? The only things people were really serializing were tensors and maybe modules, right, modules with parameters.
Because you know those were just Python objects but they also had tensors on them.
We actually discourage people from pickling modules directly but people do it anyway.
What you’re supposed to do is you’re supposed to get the state dict for the module and serialize that that’s because you know serializing arbitrary Python objects is kind of error prone.
But anyway that was what people were normally using serialization for.
So in comes TorchScript.
So what does TorchScript need?
So TorchScript is a bunch of things, but one of the things it is, is a distribution mechanism for arbitrary PyTorch programs, namely TorchScript programs that are understood by the TorchScript compiler.
And this is important because if you want to sort of ship your model to production it’s important to have a self contained file format that contains all the information you need to run the model.
So the Python code as well as the tensors. And so people were looking and they were like, okay, we need some serialization format for TorchScript.
And, oh, you know, there’s this interesting thing, which is that PyTorch is using pickle, but actually pickle is kind of a bad idea for serializing tensors, because tensors are really big.
And if you’ve got your data living on disk, like, you know, a bunch of parameters, you want to just mmap them into memory; you don’t want to actually have to parse them into memory, which is what, you know, traditional PyTorch serialization used to do.
Okay, so what did they decide? Well, they decided that, one, they wanted to use standard file formats; we really didn’t want to be in the business of making up a new file format, because then you don’t have any tools that can work with the file format.
And two was you know we kind of wanted you know our code to like be in the Python style right like you know there’s all this existing infrastructure for pickling and unpickling Python objects.
And you know if we define a totally different serialization format rather than pickling.
Well we’d have to redo all of that and we’d have to keep these in sync indefinitely.
So what did they do.
So they did two things.
So one is that they decided we were going to use zip files for our serialization format.
Don’t laugh zip files are actually really cool.
It’s a really well designed file format and one of the reasons it’s really well designed is well you don’t actually have to compress things in a zip file.
So you have an uncompressed zip file.
What it turns out is that zip files have a bunch of really good properties.
One is that you don’t actually have to read through the entire zip file to figure out where things are.
There’s a central directory (it actually lives at the end of the file) that lets you efficiently seek to any particular entry.
So if you’ve got a bunch of big tensors you don’t have to scan to all of them to actually find out where your tensor you’re interested in is.
Another really good property of the zip file format is that, you know, the tensors are laid out exactly as they are in memory.
You can easily mmap them into memory if you want to load them in your process.
And finally, like, everyone knows zip files, right? Like, zip files are, you know, the darling compression format on Windows, and, like, you know, there are tons and tons of tools that can work with zip files efficiently.
So if you have a serialized, you know, thing that PyTorch gave you from torch.save in a recent enough version of PyTorch, you can just unzip it, literally, like, unzip it: rename its file extension to .zip and then, like, double-click it, and it’ll give you all the internal bits.
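You don’t even need an unzip tool; a few lines with the standard zipfile module will show you the layout. The exact entry names are an implementation detail, but you will typically see a data pickle plus one file per storage.

import io
import zipfile
import torch

buf = io.BytesIO()
torch.save({"weight": torch.randn(3, 3)}, buf)
buf.seek(0)

with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        print(name)  # e.g. something like archive/data.pkl, archive/data/0, archive/version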
The second choice they made was they were going to keep using pickle.
So what does that mean.
Well, remember, I said pickle is a very simple serialization format, you know, like most of the complexity involved with it is because like you can call arbitrary code to actually reconstitute these objects that are saved.
But other than that, you know, you’re just saving these tuples of various other things that themselves, you know, might be tuples of other things.
So what did we do.
We just implemented a pickler and unpickler from C++.
So inside jit/serialization, there is a pickler and an unpickler, and it is feature-for-feature compatible with our Python implementation, and it understands the pickle format.
And this implementation is more secure, because unlike the stock Python pickler, which, you know, will just attempt to unpickle anything that you throw at it,
our pickler only supports a limited set of, you know, types, and all those types don’t actually do remote code execution.
So, you know, you’re safe there.
So, hey, so then that’s basically where we are today.
Okay, so the non-zip file format does still get used occasionally; for example, if you serialize a storage just by itself directly using pickle, we don’t use the zip file format.
Just a fun fact.
But if you are using torch.save and torch.load, we give you the zip file.
This zip file contains a data pickle that represents, you know, metadata about the tensor in question.
And then it contains, you know, a bunch of files representing the actual data in the tensor.
And this works pretty well.
It’s a little slower than the old school pickler, but not that much slower.
And people have been pretty happy about this new serialization format.
Okay, so that’s been a whirlwind tour of serialization in PyTorch, starting from our humble beginnings as a Python pickle extension, and then to our not so humble endings of also a pickle extension, but you know, also with a zip file around it.
So I hope this explains a little bit about why our serialization code is kind of complicated.
And also why whenever you want to make a change to the serialization format, it’s really complicated to do so because of BC and FC and also because you have to edit Python and C++.
But hopefully, if that’s something you ever actually need to do, you’ll know where to look to figure it out.
That’s all I have to say for today.
Talk to you next time.
EP19 native_functions.yaml
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about native_functions.yaml,
but I don’t actually want to talk about native_functions.yaml.
What I really want to talk about is enough about our just-in-time compiler
for people who are not compiler engineers
and working on the eager portion of PyTorch.
You’ll see what this has to do with native_functions.yaml in a moment.
Okay, so what is native_functions.yaml?
Well, native_functions.yaml is this YAML file named native_functions,
which basically describes every operator supported by PyTorch.
So imagine that you’re thinking about sum or add or sub.
Each of these operators that PyTorch supports has an entry in this YAML file.
And so this YAML file basically is a sort of canonical source of information
about these operators, except for a few exceptions, which we’ll get to later.
Okay, so why is there this YAML file, right?
Like, if you were just writing a Python library, you’d expect, well, you know,
if there’s a bunch of functions that my library supports,
I’ll just write Python definitions for them.
Or even, you know, if you’re writing a library and you’re doing C++ bindings,
you’d expect, oh, well, you know, I’ll have a bunch of C functions
that implement the functions that I need, and I’ll just register using pybind11.
So, like, why do I need this separate representation?
And as is always the case, when there is an abstract representation about what operators you have,
there’s probably code generation lurking underneath.
And in particular, native_functions.yaml gets fed into a variety of different code generation pipelines,
which basically stamp out all of the boilerplate necessary to support all the things that,
you know, you want an operator to do.
And this is where, like, the JIT-for-non-compiler-engineers bit is important,
because, yes, native_functions.yaml plays a very important role in generating our eager PyTorch bindings,
that is to say, you know, the actual functions you call when you’re just running PyTorch from Python.
But what it also does is it also generates bindings to TorchScript,
our compiler and interpreter stack in PyTorch.
And so whenever you’re, like, working on a new operator,
when you’re trying to define a new operator,
whatever it is you do also needs to work okay with compiler stack.
And here, it’s helpful to know a little bit about what the compiler is trying to do with this information
to, you know, figure out why, you know, there are certain constraints
about what kinds of things you can do in native_functions.yaml.
So let’s take one example to start.
So in native_functions.yaml, one of the things you do is you write down a so-called schema string
for any operator you want to define.
So what does this schema string look like?
Well, let’s take our example of addition.
So what is an addition?
Well, it takes two tensors, and it produces an output tensor.
And so the schema string for add basically is, like, you know,
Tensor add(Tensor self, Tensor other), right?
What it says is, hey, you know, here are the types of the arguments,
here are the types of the outputs, pretty standard stuff.
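If you want to see real schema strings without digging through the YAML, you can ask the runtime for them; this goes through a private helper, so treat it as a debugging trick rather than a stable API, and the example output in the comment is from memory.

import torch

# Internal helper: returns all registered JIT schemas for an operator name.
for schema in torch._C._jit_get_schemas_for_operator("aten::add"):
    print(schema)
# One of the overloads printed should look roughly like:
#   aten::add.Tensor(Tensor self, Tensor other, *, Scalar alpha=1) -> Tensor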
But if we look a little deeper into this type system, you know,
the fact that we have this schema string, the fact that we have this JIT schema format
actually says something about what we are planning to target.
Because in particular, the schema is not Python types.
It’s not C++ types.
It’s JIT schema types.
And what JIT schema types represent is sort of the intersection of all language features
that are supported by Python, as well as language features that are supported by C++,
and most importantly, language features that are supported by the TorchScript compiler.
So let’s just take an example, right?
So let’s say that I wanted to write a function in PyTorch that takes a void star pointer as its input.
Well, you can’t do that.
And the reason you can’t do that is while void star works as a, you know, type in C++,
there’s no such type as void star in Python.
Well, unless you count, you know, one of the ctypes types, but we don’t. Most of the Python types that the PyTorch bindings support are stock types,
like normal types, like integers, floats, booleans, tensors, for example.
So you can’t write a function like that, right?
And if you wanted to write a function that took void star, you would first have to fix both the eager code generation code
to understand a void star pointer, as well as the C++ code.
That would be easy because void star is very simple, as well as the TorchScript code to know how to represent a void star pointer
in what we call our boxed format, our IValue format, which is basically a universal container for any type of, you know,
object that you might actually pass into one of these functions in question.
So yeah, there is a limited set of types available to native_functions.yaml.
And this limitation makes it easy to actually, you know, write code that works for all of the platforms that we care about.
Of course, this can be kind of annoying sometimes.
For example, we don’t have support for enums in native_functions.yaml because how enums are defined in C++ and in Python
are fairly different and it’s pretty involved.
There’s no reason in principle we couldn’t solve this, but, you know, you have to actually pre-declare an enum in C++
and you have to pre-declare an enum in Python, except in Python, that’s not the Pythonic way to do enums.
You just, you know, provide a string saying what option you want.
So actually most enums are implemented sort of crappily using strings.
And I say it’s crappily because, like, you don’t actually want to be passing around strings and doing string comparisons.
In Python, it’s okay because string interning happens.
And so if you’re lucky, it’s just a pointer equality.
But in C++, that doesn’t happen.
And so you actually do want an enum type.
But we haven’t implemented it yet, right?
Because it’s a little complicated to, like, work out a representation for enums that works in all the situations.
By the way, if you’re interested in doing this, well, talk to us because it is something that we’ve been wanting to fix for a really long time.
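For a feel of the string-as-enum pattern from the Python side, think of something like the reduction argument on the loss functions, which is spelled as a string even though it is conceptually an enum (this is existing public API, shown here just for flavor):

import torch
import torch.nn.functional as F

x = torch.randn(8)
y = torch.randn(8)

# "mean" / "sum" / "none" is conceptually an enum, but the Pythonic convention
# is to pass a string; internally it gets mapped to a non-string representation.
print(F.mse_loss(x, y, reduction="mean"))
print(F.mse_loss(x, y, reduction="sum"))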
Okay, so that’s it about types in native_functions.yaml.
What’s another example of something that, you know, you need to worry about in native_functions.yaml,
not because it matters in eager mode, but because it matters in the compiler?
Well, a great example of this is mutation and alias info in native_functions.yaml.
Okay, what’s that?
Well, if you ever look through the YAML file,
you might notice that there are some operations that have some little weird, like, extra, like, fluff in their type signatures, right?
So they don’t just take a tensor as an argument.
They take a Tensor(a!).
What the heck does that mean?
Well, what that means is that this argument isn’t just being read in as a pure argument.
That is to say, we’re just taking it as an input.
We’re also going to write to the argument in this situation.
So, okay, you might say, that’s really useful for documentation purposes in eager mode.
But, like, why does it matter if I specify this correctly or not?
Well, it matters because, once again, we’ve got a compiler.
And our compiler wants to do certain optimizations, and some optimizations might not be valid if you don’t know if an operator is mutating its arguments or not.
For example, dead code elimination says that if I call a function on some operands and then I don’t use the result, I can just get rid of that operation entirely, right?
Because it’s dead code.
Well, I can’t get rid of this function call if the function is actually in the business of mutating the tensor.
Because, you know, like, we might just be calling this function for the purpose of doing the side effect in question.
So it’s actually really important to put down correct mutation information on your functions.
Because if you don't, and then your function goes into the TorchScript compiler, which it will, because the whole point of putting something in native_functions.yaml is so that you get all of the support, right?
Eager, C++, TorchScript.
If you don’t do it right, then your compiler may just miscompile your code.
It may, you know, throw away your opcalls.
It may reorder them with other mutating opcalls.
Bad business all around.
Of course, what you really should do is just write your operator without having any mutation at all.
But, you know, sometimes that’s not possible.
This is a really common mistake people make when they’re defining custom operators.
Because you’re, you just like, you just write a type signature down and you think, oh, this looks fine.
And the, you know, PyTorch accepted it.
What’s wrong with it?
Well, what’s wrong with it is, you know, this downstream thing about the compiler.
So if you're thinking about, like, what kinds of info the compiler needs, that'll help you understand, like, what kinds of stuff native_functions.yaml actually needs.
There's one more thing that really, really affects people when they're making changes to native_functions.yaml.
And this is backwards and forwards compatibility with serialization formats in JIT.
In the previous podcast, I talked about serialization sort of in a general sense.
And I talked about this forward compatibility and backwards compatibility concept.
Well, this concept also applies to operator definitions.
So stepping back a moment, when we think about forwards and backwards compatibility in PyTorch, we usually only really care about backwards compatibility because you just write some Python program and you just want this Python program to keep working when you upgrade to the next version of PyTorch.
And there are a lot of changes that we can make to functions which are actually backwards compatible.
For example, if we add a new keyword argument to an operator, but we give it a default, from the perspective of Python, that’s totally backwards compatible because, well, if I had a call to the function before that didn’t pass the argument, well, it’ll just get defaulted.
And, you know, if I’m doing my job correctly, the default behavior will be compatible with whatever the old behavior was in that situation.
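As a toy Python sketch of that point (my_matmul and transpose_b are made up for illustration):
import torch
# Hypothetical v2 of an operator whose v1 signature was just my_matmul(a, b).
# The new keyword argument gets a default that preserves the old behavior.
def my_matmul(a, b, transpose_b=False):
    return a @ (b.t() if transpose_b else b)
a, b = torch.randn(2, 3), torch.randn(3, 4)
old_style = my_matmul(a, b)                        # a v1-era call site still works
new_style = my_matmul(a, b.t(), transpose_b=True)  # same result via the new kwarg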
Well, well, well, but remember, native functions.yaml is being used in different situations.
And in particular, there are two particular situations where this is not exactly backwards compatible.
And by the way, these might be just mistakes and we should fix these mistakes, but sort of it’s just how PyTorch works today.
So situation one is, for the longest time, when we serialize PyTorch, so stepping back a moment, so one of the things that TorchScript does is you have a model that has a bunch of function calls, and we can serialize these function calls back into Python code.
And so something very interesting happens as a result of something that a compiler wants to do, which is whenever you serialize some functions, we actually write out all the defaults to the serialized model.
So let’s just imagine, like, I’m doing a matrix multiply, and I added an optional flag that says whether or not I should transpose the second argument or not.
So this flag doesn't actually exist in PyTorch, but there are plenty of real examples like it, I just can't think of them right now.
So in this situation, if I write, say, matmul(a, b), what will actually get serialized is matmul(a, b, False), where False says don't transpose the second argument.
That’s kind of weird. Why does the JIT do that?
Well, one of the reasons the JIT does this is, you know, one of the things that it does when compiling your program is it tries to translate it into an intermediate representation that’s easier for the compiler to deal with.
And one of the things that makes IRs easy to deal with is when they are very regular.
So what do I mean by regularity? Well, it means that I don’t have to, like, you know, go ahead and canonicalize stuff every time I look at it, I can just assume that things are in canonical form.
And an example of something in canonical form is a function call, which has all the defaults actually explicitly written out, as opposed to, like, implicit, because if they’re implicit, you have to go figure out, you know, what the behavior, what the defaults are and fill them in if you wanted to, like, actually write code that operated on the semantics of this function.
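Here is a small stand-in for that canonicalization step, using Python's inspect module rather than the actual JIT machinery (the matmul signature with transpose_b is the made-up example from above):
import inspect

def matmul(a, b, transpose_b=False):   # stand-in for the new operator schema
    ...

# What the compiler effectively does when it canonicalizes a call for its IR:
# bind the arguments and write every default out explicitly.
bound = inspect.signature(matmul).bind("a", "b")
bound.apply_defaults()
print(dict(bound.arguments))   # {'a': 'a', 'b': 'b', 'transpose_b': False}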
Okay, so because this IR representation transformation happens, well, by accident, when we re-serialize things out, we've actually just lost the information about whether or not, you know, something was explicitly defaulted or not.
And so we just always serialized it out.
Why is this problematic?
Well, it’s problematic for forwards compatibility.
Recall from the previous podcast, forwards compatibility refers to if I serialize a model from a newer version of PyTorch, and let’s say that it doesn’t actually use any of the new features of PyTorch, which, you know, would necessitate using the new version of PyTorch.
Can I run it on an old version of PyTorch?
And so if you add this defaulted new parameter, and, well, it’s getting serialized out, uh-oh, when you, you know, try to load this model in an old version, there will be this extra parameter that your old version of PyTorch doesn’t understand.
And, well, sucks to be you, the model can’t be loaded anymore.
So there is a way to solve this problem in PyTorch master, and I don't exactly remember how we resolved it.
It's either some sort of forwards compatibility mechanism, well, we don't really offer forwards compatibility, but I think there's some, like, surgery you can do to fix the problem.
Or it might just be that we fix this problem to begin with.
But, like, the meta point here is that this was a problem for a while, and the reason it was a problem is because, you know, JIT is using this representation in a way that is different than how you normally might conceptualize it in just eager mode.
And so to just understand the consequences of various changes you might make, you have to also understand, you know, what’s going on in JIT.
Is this bad? Like, what if we, like, just wrote our format really, really nicely and explained all of the invariants in question, and, like, you could just read up about them and know everything?
Well, yes, ideally, this would be the case.
Ideally, we would have a really good backwards compatibility and forwards compatibility story, and we wouldn’t have problems like this.
Great! If you want to work on this, you know, come talk to us.
Like, you know, this is a really important project for PyTorch, and we've just, you know, been very slow in actually getting to it, because who wants to work on backwards compatibility, honestly?
I do, actually.
But I’m always working on other stuff, unfortunately.
So, yeah.
So what did I talk about today, right?
So I talked about native_functions.yaml.
I didn't really tell you, you know, how to write things in native_functions.yaml,
and I don't really want to in this podcast, because there's pretty nice documentation that you can look at.
What I wanted to go over today was more, you know, why does native_functions.yaml have all of this stuff?
And the reason it has all this stuff is because, well, there’s this compiler stack attached to it,
and, you know, there are a bunch of constraints that, you know, we need to solve simultaneously in both cases.
So if you ever find yourself wondering, you know, why is something this way?
Well, maybe there’s something in the compiler that needs it to be that way.
And also, I also want to emphasize that compilers are not that magical.
Like, there’s not that much they’re doing.
So you can understand it, even if, you know, you don’t work on the compiler on day-to-day.
And, like, once you understand it, you might actually be able to look at the situation and say,
hey, actually, there is no reason for it to be this way, and we can fix it.
And then, you know, you can just fix it.
And that's pretty powerful, and it's a generalizable lesson that applies to all of software engineering.
That’s everything I wanted to say today.
Talk to you next time.
EP20 TensorIterator
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about Tensor
Iterator, but I’m going to go about it in a sort of unusual way, which is, imagine you’re walking
into a software engineering interview, and you’re wondering what question you’re going to be asked
today, and your interviewer sits down and says, okay, Edward, please add two vectors together.
And I say, come again? Yes, given two vectors, A and B, add them together, giving a new vector,
which contains the pairwise sum of each element in each of the vectors. And I think to myself,
oh, this, how complicated could it be, right? Write a for loop, iterating over the size of the
vectors, and, you know, just look at the two entries, add them together, and then, you know,
set it in my output and return the result. You know, easy. Are we done? And then the interviewer
gives you an evil smile, and they say, okay, now, what if you want to make it really fast,
and you want to make it work in a lot of situations? And so this situation is sort of
exactly the situation that tensor iterator is in, right? On its face, the job that tensor iterator
is trying to solve is very simple. Namely, given two tensors, you know, do some point-wise operation
on them. So, you know, given two tensors, you know, like, look at the first element in the first one,
look at the first element on the second one, add them together, that’s the first element of your new
tensor and keep doing it step by step by step by step. And you might think, hey, this should be
really simple. And it is, sort of. But it turns out that, you know, when you’re in a library like
PyTorch, there’s actually a lot of different conditions, and also a lot of different performance
optimizations you might want to apply in this situation that, like, actually end up making
tensor iterator very, very complicated. So the goal of today’s podcast is to talk a little bit about
all of these things that, you know, go into making a tensor iterator work. So where to start?
So let’s start from the very beginning. So I gave you these two vectors, and I wanted you to add them
together. And one of the first things you should ask me is, well, okay, what are their sizes? Are
their sizes the same? Because, you know, a tensor is not just a one-dimensional array, it’s actually a
multi-dimensional array, right? Possibly with arbitrary dimensionality. And so, you know, when
you want to add two things together, it turns out that, you know, adding two things together doesn’t
require the two input tensors to be the same shape. In fact, PyTorch implements something called
broadcasting. We got broadcasting from NumPy, which also does it. What broadcasting says is it
basically simplifies the situation when you have a tensor, and you want to, like, add a single scalar
to it. So if I have a tensor, and I want to add two to it, that just means, hey, add two to every single
element in the tensor. And this is a special case of broadcasting. Broadcasting actually, you know, sort of
generalizes to arbitrary dimensions. So let’s say that I have a five by four by three tensor, and I’ve got a
sort of sample tensor of size three that I want to add for every single element in my five by four by
three. Namely, I want to do it five by four times, right? 20 times, stamp in this extra three, and
replicate it in all of the situations that it shows up. And broadcasting also supports that. The way you
figure out how two tensors broadcast, by the way, is you line up their sizes from the right, rather than from the left, and then pad them out so that they go all the way to the end. So their sizes have to match up from the right, and everything else gets replicated. All right. So one of the first things you have to do in tensor iterator, when you want to add things together, is accept inputs that don't have the same sizes and do something reasonable in that case, namely broadcast.
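A quick sketch of that rule in action:
import torch
sample = torch.randn(3)        # the size-3 "sample" tensor
big = torch.randn(5, 4, 3)     # the 5 x 4 x 3 tensor
print((big + sample).shape)    # torch.Size([5, 4, 3]): sample is replicated 5 * 4 = 20 times
# Sizes are aligned from the right; a dimension broadcasts when the sizes
# match or one of them is 1 (or missing entirely).
x = torch.randn(5, 1, 3)
y = torch.randn(4, 1)
print((x + y).shape)           # torch.Size([5, 4, 3])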
Okay, sure. So let's say that you know how to do broadcasting, and you've written the algorithm to figure out what the output
shape should be. You know, what else is there? Well, I didn’t tell you what the, you know, types
of your inputs were, right? And, you know, normally, if I give you two vectors in an interview, you just
assume they have the same type. But what if they don’t have the same type? Well, in PyTorch, we have
this thing called type promotion, also taken from NumPy, which says, hey, under some situations,
we are willing to add together tensors, which have different types. Once again, why is this desirable?
Well, sometimes, you know, you have a floating tensor, and you’ve got an int tensor, and you just want to
add them together, right? You want to treat the integers as if they were floating point numbers,
and then do this addition. So there’s a table, and I’m not going to tell you this table in the podcast,
but there's a table: imagine that, you know, you have all the different dtypes in PyTorch, like int32, int64, int16, float, double, etc., on one axis, and then you've got another axis on the table, which is all the dtypes as well. And the type promotion table tells you, given two inputs of these two dtypes, what the output dtype is. And so this is something else tensor iterator has to, you know, compute, right? Which is that, hey, you know, what is the actual output dtype, because maybe the input types of my values are not the same. Oh, but it gets better than that. Because, hey, you can also give an addition operation an explicit out tensor that you want to write the results to. Does the out tensor have to match the computed dtype in this situation? The answer is no, it doesn't, it can be different.
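For instance (torch.result_type exposes the promotion table directly):
import torch
i = torch.arange(4, dtype=torch.int32)
f = torch.ones(4, dtype=torch.float32)
print(torch.result_type(i, f))     # torch.float32: the table's answer for int32 + float32
print((i + f).dtype)               # torch.float32
# An explicit out= tensor does not have to match the computed dtype;
# the result is cast to fit the out tensor's dtype (when that cast is allowed).
out = torch.empty(4, dtype=torch.float64)
torch.add(i, f, out=out)
print(out.dtype)                   # torch.float64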
And yeah, we’re also going to promote, as necessary to like fit the output into the output d type you
give us. So hey, all right, like, yeah, so you know, you asked about what the types could be, we said they
could be anything. We asked about what the shapes could be, they could be anything. Does it get worse?
Yes, it does get worse. Okay, so I mentioned that addition can take an optional out tensor, right?
And so what does this mean? It just means that, hey, when I add these two tensors together,
don’t allocate a new output buffer for the situation, just write it in place into this pre existing
buffer. What happens if that output buffer aliases with one of the inputs? And this is like, actually,
kind of tricky to deal with. And in general, like the sort of aliasing situations, make, you know,
otherwise straightforward algorithms a bit more complicated. So in some situations, it’s okay for
this aliasing to happen, right? So like, let’s imagine that I am adding a tensor in place, right?
So I’ve got this tensor, I want to add two to every element in it. What actually happens in the situation
is I put in the tensor as an input. And I also put the tensor in as an output. And because addition is
sort of atomic, right? Like I just read out from the memory, and then I do my operation. And then I write
out back to the memory without ever like looking at any of the other memory locations. This is fine. And like,
nothing bad will happen if the inputs and outputs aliased with each other. But let’s imagine that my
output tensor actually is strided in a funny way. For example, what can happen with strides in PyTorch
is that multiple logical locations on the tensor can refer to the same physical memory, right? We talked
about broadcasting, well, broadcasting is exactly a situation like that. So what happens if, you know,
you’ve got multiple logical positions pointing to the same physical location? Well, let’s say you’re
processing your inputs one by one. And so okay, I want to add two to some location. So you go ahead and
you read out the physical location, do the addition and write it back out. Oh, the next time you read out
from that physical location again, because once again, this is one of those tensors where multiple
logical positions map to the same physical position. Well, you've already clobbered the old value
there. And well, sucks to be you, you just are going to get total garbage in this situation.
So something else tensor iterator has to do is it needs to make sure that there aren’t any sort of
illegal overlaps between the inputs and the outputs. And also, sometimes the output itself needs to be checked to make sure that it doesn't overlap with itself, which can also cause
problems where you know, you write to the output location, and then you write to that output location,
again, clobbering the original value. Oh, man, by the way, the problem of determining whether or not
there is an overlap is actually, like, equivalent to solving Diophantine equations. So PyTorch just
does an approximation, it would be really, really difficult to do this properly. Oh, one more thing,
this aliasing thing where the destination and the source could overlap. This is one of the reasons why, like, there's a difference between memcpy and memmove in the C API, right? One of them allows for aliasing and the other one doesn't. And so you have to be careful when you're writing code
to figure out whether or not aliasing can occur or not. And since PyTorch is a library, and it can be
called by anyone, we have to basically assume that arbitrary aliasing can happen to anyone.
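Here is a small illustration of the internal-overlap check (the exact error text may differ between versions):
import torch
base = torch.zeros(3)
overlapping = base.expand(4, 3)    # stride (0, 1): all 4 logical rows alias one physical row
try:
    overlapping.add_(1)            # in-place write into a self-overlapping tensor
except RuntimeError as e:
    print(e)  # complains that multiple elements of the written-to tensor share one memory location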
All right, so we talked about shapes, we talked about dtypes, we talked about memory overlap,
are we done? No. So I mentioned about strides, right? So we talked about how strides can be used to
like implement broadcasting. And so what do I mean by that? Well, you know, in PyTorch, we support this
operation called expand. And so what expand does is it takes a tensor of some size, and then expands it
to be some bigger size with the same element, you know, repeated, repeatedly, but we don’t actually
materialize all of this in memory, what actually happens is, you know, it just gets stamped out in
duplicate copies. And the mechanism by which this happens is a stride. The stride says, you know,
once you advance in some dimension, your index in some dimension, how should the, you know, physical
location change, right? And so normally, in a like contiguous dense tensor, if I advance in a dimension,
that means I should skip ahead however many, you know, 100 bytes, 400 bytes, whatever, to get to the next
sort of chunk of data in this case. But when I broadcast, I just say, oh, that number is zero. So I’m not going
to advance at all. So broadcasting is a degenerate case of striding. But in fact, striding has a lot of other
possibilities, right? And you know, what it comes down to is that when I have this like flat contiguous
piece of memory, there are multiple ways I can interpret it as a multidimensional tensor. And the
like very like simplest example of this is when I think about how 2d matrices are represented,
right? There’s this concept of row major and column major matrices, right? When you read out the numbers,
you know, 1, 2, 3, 4, 5 in physical memory, does that correspond to reading out a column or a row? And PyTorch
supports both of these simply by just specifying what the strides of a tensor are. Okay, so you can’t
assume that like the layout for both of your tensors is the same. And so oh, another thing tensor iterator
has to do is given the two inputs, what should my output layout be, right? Because, you know, if I give
you a column major tensor and a row major tensor, well, I had to make some decision about what the output
should be. There's a very complicated resolution algorithm for it. But, like, one of the
properties that it wants is if like I add a column major tensor to a column major tensor, I still get a
column major tensor. And similarly, if I add a row major tensor to a row major tensor, I also get a row
major tensor. Is that it for strides? Not quite. Okay, so there’s something else that happens. So we’re getting
out of the realm of correctness, right? Where like, we just need to like deal with all these things like
shapes and dtypes and layout, because they're part of the public-facing specification. And we're now
getting into the how we actually make the algorithm run fast. So one of the things that is with strides
that like makes them kind of slow is like if you have a really big dimensional tensor, well, you have a lot
of strides. And if you want to index, you know, the indexing formula with striding is, you know, take the
first index, multiply it by the first stride and add it to the second index, multiply it by the second
stride, and so forth and so forth and so forth and so forth. So you can imagine with a really high
dimensional tensor, indexing computation actually takes a lot of time. And in fact, we try very, very hard
not to do arbitrary-dimensional indexing. And most of our helpers for doing indexing require us to know exactly what dimensionality a tensor is. By the way, that's also the reason why, say, Eigen is actually templated on dimension size, because it's way easier to generate efficient code in that situation.
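A small sketch of the stride machinery being described, including the stride-0 trick behind expand/broadcasting from a moment ago:
import torch
t = torch.arange(24).reshape(2, 3, 4)
print(t.stride())                          # (12, 4, 1) for a contiguous tensor
# The indexing formula: a dot product of the indices with the strides.
def flat_offset(index, stride):
    return sum(i * s for i, s in zip(index, stride))
idx = (1, 2, 3)
print(flat_offset(idx, t.stride()), t[idx].item())   # 23 and 23
# Broadcasting via expand() is the degenerate case: the expanded dimension
# gets stride 0, so advancing along it never moves in physical memory.
print(torch.arange(4).expand(5, 4).stride())         # (0, 1)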
But tensor iterator is supposed to work on tensors of arbitrary dimensionality. So like, how do we do this
efficiently? Well, another important optimization that we do is sometimes we have multiple dimensions,
but they’re actually all contiguous, right? Like, let’s imagine that I have a contiguous tensor,
a contiguous, you know, five dimensional tensor, and it’s just laid out in memory, you know, exactly
1, 2, 3, 4, 5, 6, 7, 8, 9, 10. And I'm adding it to another 5D tensor. Well, I don't actually care about the dimensionality
in this case, right? Like, if the dimensions are exactly the same, there’s no broadcasting, there’s
nothing like that, I could just treat these both as one dimensional tensors and add them together. And
that would give me exactly the same result. Well, okay, I get a one dimensional result, and I have to
reshape it into a five dimensional tensor. But like the computation between these two cases are
the same. So another optimization that tensor iterator needs to do is it needs to say, Okay, well, you know,
I’ve got this n dimensional tensor, it’s got all these strides, but what I’m going to do is I’m going to
coalesce these dimensions, where when I have contiguous stripes, I’m just going to treat them as one mega
dimension. And so I don’t actually have to spend time doing indexing computation inside
them. Oh, man. So that’s a bunch of stuff that tensor iterator does. Okay, so like, you know, we’re like
looking at sizes, we're looking at strides, we're looking at dtypes, we're looking at overlap. Is that
it? Well, not quite. So remember the interview question, right? Like, so it’s like, okay, how do you add
two vectors together? Oh, I will just write a loop, and I will add the elements together, and I’ll be
done. And then your interviewer says to you, Okay, how do you make it faster? Well, there are a lot of
things you can do to make it faster. So one thing you can do to make it faster is you can parallelize it
when there are lots of data, right? So you know, what does that mean? Well, you know, I’ve got this giant
tensor, I’ll just split it up into chunks into grains. And I’ll ship each chunk to some thread. And
the threads will, you know, do the addition in parallel on each of them. And you know, like, if I
am not trying to run in, like, a multi-threaded environment already, I get all the cores to myself, you
know, that’s a big speed up, because, well, one, you know, the data can be shipped out in this way,
without like too much interference. And two, because I’ve got all these cores, and they all have ALUs,
and they can actually be easily doing computation in this case. So there’s, you know, when you run a
CPU kernel off of tensor iterator, parallelization is something you get for free in that situation.
But wait, there is more. So you’re doing your addition on a single thread, right? And it’s like,
hey, you know, please add the first element, please add the second element, please add the third element.
Well, we can do better than that, right? Because there’s this little thing called vectorization.
See my earlier podcast: vectorization means I can actually do chunks of, you know, multiple numbers at a time, and take advantage of that AVX silicon in my CPU. So, you know, not only am I going to parallelize, but I'm also going to vectorize when I'm actually processing the side-by-side elements. That's also something tensor iterator takes care of. Okay, so I parallelize my code, I vectorize my code,
can it go any faster? Well, yeah, it can, right? Because, you know, the whole point of running
things in PyTorch is GPU acceleration, right? GPUs are these massively, massively parallel processors.
And you know, they have way more parallelism than just the poor vector units on our multi-core CPU can provide. So another thing tensor iterator does is it lets you write kernels that work both on CPU and on CUDA, while sort of sharing all of the, you know, other stuff like shape and dtype and, you know, layout, that common stuff, letting you just do that once for the two implementations.
And that in a nutshell is, you know, most of the things tensor iterator does, there’s more stuff
that I sort of haven’t really talked about and glossed over. But at a high level, tensor iterator
is, you know, sort of pulling its weight in two ways. One is that it is doing all the complicated
semantics for point wise operations that you just don’t think about, but like, are these things that
like people rely on working uniformly for all binary operations. And two is it makes it possible to write reasonably efficient code, you know, when you're writing things in PyTorch. And it does so without, like, needing a just-in-time compiler or anything like that, right? It gets compiled once, it doesn't take up too much binary size, and you get decently fast kernels that work in a huge variety of situations.
Not everything is great with tensor iterator. Tensor iterator is kind of slow, it does a lot of,
you know, bookkeeping and that bookkeeping adds up, we’d like to make it faster. But this is one of the
reasons why it’s been so hard to replace because it really is doing a lot. And, you know,
ask what you can do for tensor iterator, I say. That's all I want to say for today. Talk to you next time.
EP21 torch_function
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about Torch Function,
a magic method that you can put on any class that makes it possible to override the behavior of
everything that happens in the PyTorch library. Torch Function was developed in collaboration with
many, many people over many years. For example, at the very beginning, Dylan Biswalco made a request
for subclass preservation in PyTorch and he wrote an implementation for it. We didn’t end up using
this implementation, but the prototype was enough to convince us that we should fund this project.
And a number of folks at QuantSight, namely Ralph Gammers, Prasun Anand, and most importantly,
Hamir Abasi, actually took it and did an implementation that we actually landed in PyTorch
based off of the numpy implementation called array function. So Torch Function operates very
similarly to array function. So if you know how that works in numpy, you know how it works in PyTorch.
So why would you want to use Torch Function? Well, let’s imagine that you are writing some code and
you want to reuse the functionality in PyTorch. So for example, we’ve got all these functions,
right? We’ve got torch.add, torch.fft, tons and tons of API surface like that. And you might have code
that, you know, you’ve had written against the PyTorch API and you like it, but you want to do a little
bit more, right? Like, so the normal tensor behavior works okay, but you want to extend a little. Maybe
you want to like keep track of some extra metadata on your tensors, or maybe you want to, you know,
run some extra code like logging every time you do an operation. And so you just like to, you know,
subclass tensor and, you know, customize the behavior a little bit. And at the same time, still be able to,
you know, run all the good old fashioned operations on PyTorch. Or maybe you want to
completely override the meaning of tensor, do everything on your own, but still be able to use
all of the preexisting API that PyTorch provides. And a good example of when you might want to do that
is say tracing. Well, you kind of run into trouble if you want to do this. So if you are just thinking
about object oriented programs in Python, ordinarily, you can just change what the methods on an object
are. And because Python is duck typed, things will mostly work out, right? So, like, if I had a
tensor, and it supports an add method, I can just write another object that has a different add method.
And then you know, if I call add on the object in question, I will just go to my, you know, whatever my
other implementation is in my new version of the object. But if I have a function, and I pass in an
object to that function, ordinarily, I can’t overload the behavior of the function in that way. Because
well, it’s a function. And you know, we don’t actually do a dynamic dispatch in that situation,
we always call a single implementation in that case. And sure, maybe the function might call a method
underneath, but maybe not, right? Like and a lot of functions in PyTorch don’t, they go straight into
the C++ bindings. And so you know, there’s no opportunity for overriding behavior in Python.
So that's what torch function is for: __torch_function__ is a magic method that lets you override what the functions in the torch namespace mean, no matter what, you know, the object in question is.
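Here is a minimal sketch of the protocol on a Tensor subclass, following the pattern from the "Extending PyTorch" docs (exactly which func objects you see intercepted can vary by version):
import torch

class LoggingTensor(torch.Tensor):
    # Log every intercepted torch function/method call, then defer to the
    # default behavior, which also preserves the subclass on the result.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        print("called:", getattr(func, "__name__", func))
        return super().__torch_function__(func, types, args, kwargs)

x = torch.randn(3).as_subclass(LoggingTensor)
y = x + 1            # logs the intercepted call and returns a LoggingTensor
z = torch.sin(x)     # works for functions in the torch namespace too
print(type(y), type(z))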
So all you do is you write a class, you put a magic method called torch function on it. And then
whenever you call a function in torch, instead of doing the normal behavior, it’ll bounce to your torch
function implementation. And then you can override the behavior however you like. And in fact, it does more
than that. You know, originally, the only thing we wanted to do was make it possible to override functions in this
way. But it also turned out that it was really helpful to have a generic protocol for, you know, overriding the
behavior of all operations on tensors, not just functions, like sort of analogous to you know, like,
if you want to do logging, you want to write some code that works polymorphically over every function and
method on tensor; you don't want to have to, you know, write a separate override for every single method and function you care about. So what torch function actually does today,
is it lets you override all method behavior, all function behavior, and you know, write your own
custom functionality. And then, like, you know, have your code that’s written against the PyTorch APIs
actually use it in this situation. Torch function is pretty useful, and it’s already been used a number
of different situations. The original request that led us to implementing torch function was someone was
writing some code using tensors, and they had sort of units of measure associated with the tensors.
So the tensors represented physical quantities, and they wanted to, like, you know, classify tensors based on,
you know, what was what. And they had a problem, which is that whenever they did an operation on, say, a voltage,
like, say you had a tensor representing voltages, and they added two voltage tensors together,
even if they were subclassed in the beginning, when they added those two tensors together,
well, the subclass wouldn’t be preserved. So originally, the like, the pitch for this was,
hey, we want to be able to subclass tensor, and we want the subclassing to be preserved whenever we
do operations on classes, because that’s pretty useful, right? Like logging sort of works the same way,
right? If you have a tensor, and it’s a logging tensor with extra metadata on it, well, you need to,
you know, get it back another logging tensor after you run an operation on it. Otherwise,
your logging will just stop. So in fact, tensors have a default torch function implementation
that says whenever you have a call onto a tensor that is a subclass of tensor, we will automatically
preserve the subclass for it if all the arguments are that subclass. Otherwise, we’ll just say it’s not
implemented. And you’ll have to figure something out in that situation. Another situation that torch
function has been used for is this tracing use case. Actually, it’s called torch.fx. So what is
torch.fx? Torch.fx is a manipulation toolkit for PyTorch programs, what it does is it says, okay,
you write your PyTorch program using Python, you can use torch.fx to trace it into some representation,
you can do some transformations on it. And then you can reinterpret it, recompile it back into Python
code that you know, you might send a torch script or something like that, right? So it’s a lightweight,
easy to prototype mechanism that, you know, lets you do all the syntax manipulation in Python.
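A quick sketch of what that looks like from the user's side:
import torch
import torch.fx

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 4)
    def forward(self, x):
        return torch.relu(self.linear(x)) + 1

# symbolic_trace runs the module with proxy objects flowing through it and
# records every call into a graph you can inspect, transform, and recompile.
gm = torch.fx.symbolic_trace(MyModule())
print(gm.graph)   # the recorded call_module / call_function nodes
print(gm.code)    # the Python code the graph gets compiled back into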
And how is torch.fx implemented? Well, it’s also implemented using torch function. So what torch.fx does
is it has a tracer class, the tracer class implements torch function. And instead of, you know,
doing all the normal operations, when you call into torch functions or methods, what the tracer object
does instead is it just writes down what happened, and then gives you a new object that is just, you know,
another tracer, and then you know, you keep track of things this way. And then, you know, once you have
one of these traces, you can do whatever you want to it. But the point is, you didn’t have to modify
your PyTorch program at all to run it under torch.fx, you can still call regular torch functions on the
tracer object, and it all works. Okay, so that’s, you know, some of the use cases behind torch function,
how does it actually work? And why is it actually so effective? So let’s first talk about how it works,
because it, the inner workings of torch function explain a little bit why it’s so effective. So the way
torch function is implemented is it’s a purely Python binding concept. What do I mean by that?
Well, remember, in the very first episode of this podcast series, I talked about how PyTorch
Python bindings work. And so in general, you know, we have this interface where a lot of code is written
in Python, and eventually you cross over into C++, we translate all the Python arguments into C++ arguments,
and we pass them on below. So, you know, between there, there’s like another level of indirection
until you get to the dispatcher, another topic that we’ve talked about in a different version of the
podcast. And so what happens is that torch function is implemented directly on the Python binding layer.
So all of this extra business that, you know, gets you to the dispatcher or the dispatch keys or any of the
various subsystems in question, torch function bypasses all of that, right? Like, it happens exactly when you have the Python
binding layer. There’s a very pragmatic reason this is the case. And that’s because when we want to call into torch function,
well, torch function is an honest to goodness Python function, right? So we need to pass on all the arguments
that we were given. And so we need to actually like keep the Python representation around. So if you go any
lower, you know, past the Python binding layer, you’ve lost all the Python objects, right, you just have C++
objects, and then you’d have to like reconstruct them into Python objects. And that’s annoying. So it happens
at the Python binding level. But there’s a second implication to this as well, which is that we can actually
also override the behavior of functions in Python itself. So what happens is we have a number of
functions which are implemented in Python. So the way we implemented torch function was we wrote some generated code to insert into all the Python binding sites that basically said, hey, if you see an argument that doesn't look like a normal tensor but looks like some object with a __torch_function__, go call that. Well, we have a version of that that lives in Python. So whenever you
have code that's written directly in Python, you can write a little preamble at the
top that says, well, if any of my tensor like arguments contain something that looks like it has a torch
function, then call the torch function instead of the regular function. And so this way, not only can we
bind at the Python binding layer, which is sometimes kind of low level, right? Like, you know, the Python binding level is not public per se, right? Many of the functions that you see there are in fact the public API, because they coincide. But many functions are not; they're just, like, sort of internal things, the way that we get into the C++ binding. Well, you can also override the higher
level Python operation that actually explains what’s the stuff you actually want to do in this situation.
And this fact about the torch function implementation that it operates at the Python level,
and it can operate both at the, you know, level of the Python bindings, but also any higher level
abstractions you write in Python, is actually one of the reasons why torch function is so powerful
and so popular for doing applications like tracing. And that’s because it preserves the high level semantic
structure of your program. Actually, you know, one of the questions that I often get about torch.fx
is, you know, hey, torch.fx is just tracing, but don’t we already have a tracer in PyTorch? And indeed,
that’s true. We have what’s called the autograd tracer. This tracer lives in the C++ level,
it lives in the dispatcher, and it also does sort of the same thing as fx, which is that it traces
things. So why then is there like another tracer that’s fx that’s built on this torch function thing?
And the answer is fx gets to trace at a much higher level than the autograd tracer because it gets to
interpose on actual Python functions. In fact, you know, one of the things that fx is all about is it’s all
about tracing nn.modules. And because it lives entirely in the Python world, you know, it can actually,
you know, record directly what the nn.module you were operating on was when this sort of thing happens,
right? That would be totally impossible to do in C++ because C++ has no conception of an nn module, right?
Everything has been translated into just plain old function calls at that point in time.
Another implication of this is that because it happens at the Python binding layer,
you have an opportunity to actually, you know, look at the Python call stack or like, you know,
override the meaning of things that are not even tensors. So, for example, when you call size() on one of these fx tracer objects, we don't have to give you an integer. In C++, we would have to give you an integer, because in C++, if you say you return an integer, it has to be an integer. But in Python, everything is duck typed. So we can actually just return you another tracer object and, like, do the
right thing when it shows up in a trace. Which brings me to my second reason why torch function
has been so popular. And that is because it is in Python. It turns out that people really,
really like to write code in Python. This was actually, it’s a little surprising that I didn’t
learn this lesson given like PyTorch’s entire shtick is that like, hey, you just write normal Python
and your programs work, but hear me out here. So, you know, we knew that PyTorch, you know,
from a machine learning practitioner’s perspective, you know, it was really useful to write things in
Python. Like, that was an essential part of the DNA of PyTorch. But when we were, like, writing the first
version of the compiler, we were like, Oh, no, Python doesn’t have strong static types. And we’re
in the business of writing a compiler. And you know, we don't want to write a compiler without having
static types, because compilers are complicated, you really want as much help as you can get enforcing
all the invariants that you have. So you know, we decided, okay, we have to write the compiler in C++.
I don’t think this is the wrong decision. Like, you know, having the compiler in C++ is really useful.
But what we underestimated was the appetite for, you know, like, sort of short, easy transformations
that people might want to do, you know, like, democratizing compilers, right? If
like, if you had to like, learn about type systems, and programming language theory, and, you know,
lower level intermediate representations, just to like, make a little manipulation to your code,
you know, that’s going to gatekeep a lot of people out of doing compilery things, when actually, you
know, that’s how they should be solving their problems. And so it turned out that like, giving
these tools to people and letting them do them in Python, well, so one is a lot of people needed to
do stuff like this. And previously, the only way they could do it was by writing C++. And that was
terrible. And the second thing is that things were simple enough that like doing everything in Python
was actually tractable. And you know, people could keep track of everything that was going on.
So like, hey, you know, like, if you can prototype your entire thing in Python,
without having to recompile PyTorch, hey, that's a huge win. And so that's one
of the reasons why people like this a lot. And like torch function, it being a Python level extension
mechanism means you don’t have to actually, you know, talk to us PyTorch core or have to rebuild
PyTorch to play around with it, you can just write your Python function in your research code,
with, like, you know, just a stock dependency on PyTorch, no funny business going on with C++ extensions.
And you can do whatever you want, like sort of crazy interesting stuff. And that’s pretty powerful.
That being said, there are some downsides to being a purely Python level mechanism. And the biggest
downside and one that we’ve been working on recently, is that you can’t take advantage of any of the
machinery that lives below the Python binding layer. And the most important piece of machinery here is
autograd. So hey, if you override things with torch function, you don’t get autograd anymore. Like if
you want autograd, you’re going to have to figure out how to do it yourself. That being said, we are
trying to figure out how to solve this problem. And the way we are thinking about how to solve this
problem is a concept called dispatch to Python. The way dispatch to Python works is that, you know,
we still have this torch function binding layer that works in Python, but you can choose to go into the
C++ layer. And in the C++ layer, there’s a lot of things we can’t preserve the Python, you know,
status of like, you know, if you have an integer argument, that’s going to turn into a C++ integer.
Sorry, we’re just going to completely forget about the original Python object in that case. But for tensors,
we do record what the py object for the tensors are. So all we need to do is make sure that we preserve
the idea that Oh, this is a tensor that has some extra Python behavior on it, we blast it through
our C++ dispatcher layers doing autograd doing batching, everything like that. And when we eventually
get to the final implementation, instead of dispatching to our CPU or CUDA implementation,
we just dispatch back to Python, translate all the arguments back into the Python and call into there.
And that way, you can actually also take advantage of autograd while still prototyping everything in Python.
We’re still in the early days of working on this. Functorch, which is being worked on by Horace and
Richard, is a sort of experimental, you know, repository working off of this to give functional
transformations to PyTorch. It’s pretty cool. But you know, like, I’m hoping that this can be another
really cool tool, complementary to torch function to let people further extend the behavior of PyTorch on the
inside. That's everything I wanted to say for today. Talk to you next time.
EP22 Why-is-autograd-so-complicated
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk a little bit about
the constraints slash motivations slash things we are trying to do in the Autograd engine in
PyTorch. The Autograd engine in PyTorch is the part of PyTorch which is responsible for
implementing automatic differentiation. This is very important for a deep learning library.
If you think about the tagline of PyTorch in the past, it’s a numeric computing library
with GPU acceleration with automatic differentiation. The automatic differentiation
is how you can write models in PyTorch, run them, and then differentiate through them,
finding out what the gradients are so you can go that way when you’re optimizing your models.
It’s very, very important and it was built very early in the history of PyTorch and is still
something that people use all the time today. Unfortunately, the Autograd engine is also very,
very complex and that makes it difficult for people to understand how it works and it has a lot of
features and a lot of, you know, sort of peculiarities and this makes it also difficult to
understand. So difficult that I don’t think I could actually like technically explain what is going on
with the Autograd engine in just a podcast. I’d have to actually write a blog post about it.
Um, I’ve been promising to write a blog post about it for a while. Um, ever since my internals talk,
but it’s just, it’s just a really, really complicated subject. So today what I’m going to try to do is do
something a little simpler, which is, I’m just going to talk about a bunch of the things, a bunch of the
important properties that we wanted out of the Autograd engine and some of the implications of those
properties. For example, uh, one thing that, you know, we needed for our Autograd engine was it for
it to be fast. Like, um, you know, we had a version of the Autograd engine that was written in Python
and it was pretty slow and we weren’t saturating GPUs when we wanted to, um, run networks on it and that
prompted us to port it all to C++. And so, you know, the Autograd engine lives in C++ and it uses
multi-threading simply because, you know, at the time it was designed, we needed it to be fast enough to
saturate GPUs on common, you know, data parallel training regimes. So, you know,
that was the only way we could get there. Another thing that, um, the Autograd engine needed to be was, um, it
needed a very concise way of writing derivatives for operations. As I’ve mentioned before in many other, uh,
episodes of this podcast, PyTorch has a lot of operators and, you know, one of the things that,
you know, we sort of ensure is the case for every operator someone adds is that it actually has a
derivative definition. And so if you had to write, you know, like multiple pages of boilerplate just
to add a new operator, because that was how derivatives were going to be generated, you’d be
in big trouble because like we just have way too much code in PyTorch for anyone to maintain in a
reasonable way. And so to get around this problem, we actually built a code generation system for
Autograd engine. This code generation system existed from the very beginning of the C++ implementation
for Autograd. And one of the very famous files in our code base, one that you will probably touch if you ever add a new operator to PyTorch, is the so-called derivatives.yaml. It's a YAML file in which, for every operator we know how to differentiate, you write down what the derivative of the operation is with respect to each of its inputs. And so most derivatives can be written in a single line.
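For a flavor of what an entry looks like, here is a schematic version of the sin and mul entries (the real file has accumulated extra details over time, such as conjugation for complex support, so treat these as illustrative):
- name: sin(Tensor self) -> Tensor
  self: grad * self.cos()
- name: mul.Tensor(Tensor self, Tensor other) -> Tensor
  self: grad * other
  other: grad * self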
And this just makes it really easy to like, you know, write new derivatives when they are mathematically
obvious. A topic that I should talk about sometime is about the code generation pipeline in PyTorch.
And, you know, one of the reasons why we have a code generation pipeline, which is, you know,
not the easiest thing to, to understand any sort of metaprogramming at this scale is not so easy,
but in the case of Autograd, and I think in the case of most of the uses of code generation in PyTorch,
it is well worth it because without it, um, C++ just doesn’t have a strong enough metaprogramming
mechanisms, we would have had to have written a lot of code to just implement one of these things.
Like, if we think about like, when you write something in derivatives.yaml, what’s going on
here? Well, there’s a lot of things going on. For example, when you write one of these derivatives,
you can refer to inputs that were given to you inside the, um, that you can refer to inputs that
were given to you, um, as inputs to the forward implementation. What does that actually mean?
Well, what that means is that when we’re running the forward of a, um, model in PyTorch, when you
refer to an input in the backwards formula, that means we have to save that input so that it’s still
available when you, um, actually, you know, refer to it in the backwards pass. So, you know,
we have to save it. We have to like write a struct. We have to put a place where we can save the thing.
We need to actually save it in the forwards thing. And we need to get it out again and plug it into
your formula. So that’s a lot of moving parts and the code generation handles that all for you.
So, you know, it looks like you're just closing over, you know,
the input at that time. Like, you know, one way to think about derivatives is they’re like just
higher order functions, but you know, in C++, that’s not so easy to do. So we have, um,
a lot of things to make this simpler. Another thing that PyTorch needed to support when doing
automatic differentiation was views and mutation, right? So like one of the really big things,
part of PyTorch’s DNA is that you can take out views from tensors. So these views, you know,
don’t allocate new data. They share storage with the original tensor in question, and you can also
mutate them. So, you know, like if you want to fill in just a single row on a tensor, you could view out that
row and then just run fill on it. And our automatic differentiation system actually needed to work
correctly, even when people were doing views and mutation. There’s a few ways senses in which I mean,
it needs to work correctly in the situation. One sense it needs to work correctly in the situation
is just sort of basic correctness, which is just to say that, you know, you have a, um, tensor that you
want to save for backwards so that you can use it later. And then if someone goes ahead and, you know,
scribbles all over it with garbage sometime later in the forward pass, well, you’re just going to get
garbage out in the backwards pass if you try to reuse that buffer exactly as is. And no, we don’t want
to copy out variables when we save them because that would be expensive. And remember, we want automatic
differentiation to be fast. We don’t want to like impose, uh, you know, that kind of overhead on users.
And also you'd probably run out of memory if we were doing that. So to make sure this doesn't happen, we have this mechanism called version counters, which record how many mutations have happened to a tensor, so that when we save it we can say, oh, three mutations have happened. And then when we come back, we check: is it still only three mutations? If it's five mutations, that means someone's mutated it in the meantime, and we can give a good error message in that situation.
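Concretely, that error looks something like this (the exact wording varies across versions):
import torch
x = torch.randn(3, requires_grad=True)
y = x.sigmoid()     # sigmoid saves its output for the backward pass
y.mul_(2)           # ...and we clobber that output in place, bumping its version counter
try:
    y.sum().backward()
except RuntimeError as e:
    print(e)  # "one of the variables needed for gradient computation has been
              # modified by an inplace operation", plus the expected/actual versions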
But there's another, more important thing that we need to do to support views and
mutation with automatic differentiation, which is that we can actually support differentiating through mutations in some situations. For example, if I have a tensor and I, you know, take out a view and then write into that view with a tensor which requires gradients, the result is that my, you know, base tensor, which I wrote into, now also requires gradients, right? Because if I use it as part of
my loss computation, that bit of the tensor that I wrote in using that view now contains data that,
you know, tracks its provenance back to that tensor that I originally, um, requires grad from. And so
there's actually a pretty complicated apparatus in autograd for making sure we can keep track of what automatic differentiation needs to happen in the situation when you do a mutation on a view
with something that requires grad. And this is, um, if you remember the podcast about inference mode,
this was some extra metadata that you actually don’t need an inference mode and inference mode lets you
dispense with doing that. But, you know, when you’re doing normal automatic differentiation,
you need this information. And so we track it so that you can, you know, do all the things you
expect to be able to do in Python. There’s some other performance stuff that we do to sort of,
um, make reverse mode automatic differentiation work in a predictable way, because at the end of the day,
what our reverse AD engine is, is it’s this multi-threaded C++, you know, opaque engine that
like runs your code and you don’t really know like what is going on with it because it’s not written in
Python, you can’t debug it. And furthermore, there’s no like direct sequence of calls you make,
right? You just call into backward and then a whole lot of stuff happens in that time. So one of the
things is it needs to be possible to debug problems in your autograd graph in a reasonable way, right?
Because, um, yes, we say PyTorch is this eager mode framework and, um, you know, like you can just
write code and write debug statements, but that doesn’t really hold true when you do, um,
a reverse mode AD, because all this stuff is happening without any corresponding source code. By the way, Tangent is a research project at Google for doing source-to-source automatic differentiation.
One of their pitches is like, Oh, you know, we’ll take your Python program and turn it into a
differentiated Python program that you can just debug directly if you need to debug problems.
So PyTorch doesn’t do that. So what do we do instead? Well, we have a bunch of extra mechanisms built
into AD, such as anomaly mode, which normally you use to debug why NaNs are showing
up in your tensors. But another thing that it does is it, you know, keeps track of what backward
operations correspond to what forward operations. So when something fails in a backwards operation,
it’ll tell you, and by the way, this was the forward operation, the back trace that actually caused that
situation. Another thing that we do is we have a pretty sophisticated hooks mechanism whereby you
can insert arbitrary pieces of Python code at any point when you’re running your, um, backwards,
uh, you know, computation and say, Hey, you know, give me what the gradient is at this point in time.
And let me take a look at it, you know, maybe modify it if I’m doing some weird gradient scaling
or something like that. But really, you know, I can just take a look at it and figure out if,
you know, it’s what I expect or not. It’s the way of inserting say debug print statements. And so,
you know, these things are not conceptually complicated, but a lot of, you know, effort
is spent inside, um, the Autograd engine. So if you’re like reading the code and you’re like, Oh,
what is all this hooks business and this anomaly mode business? Well, it’s not important to the core
algorithm, but it is important for making sure users get a good experience when using the Autograd engine.
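As a concrete illustration of those two debugging aids (a minimal sketch using the public autograd APIs, not the engine internals):

import torch

# Anomaly mode: when something goes wrong in backward (for example a NaN appears),
# report the forward-pass stack trace that created the failing node.
with torch.autograd.detect_anomaly():
    x = torch.randn(3, requires_grad=True)
    y = x * 2

    # A tensor hook runs during backward: it sees (and may replace) the gradient
    # flowing into this tensor, which works as a "debug print" for autograd.
    y.register_hook(lambda grad: print("grad flowing into y:", grad))

    y.sum().backward()

print(x.grad)  # the hook fired while this gradient was being computed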
There’s also some really unusual features that our Autograd engine supports, which also add to the
complexity of the implementation. So one of these things is so-called re-entrant execution. What does
re-entrant mean? Re-entrant means you’re inside some sort of procedure and you want to call back into the
procedure again while you’re inside it. So you’re re-entering while you’re already inside. So re-entrant
execution in the context of automatic differentiation, the Autograd engine is you’re in the Autograd engine,
you’re executing, you know, your, um, backwards functions one by one. And then inside one of
those backward functions, you actually execute, um, Autograd again. Why would anyone want to do that?
Well, one, one answer to that is, you know, like Autograd is just this operation, right? Like it computes
the derivatives of a function. And so like that just is a normal mathematical computation that, you know,
you should be able to do anywhere. In other words, grad should be composable,
but there’s another like sense in which re-entrant execution is really useful. And that’s for
checkpointing in PyTorch. Checkpointing is this trick for reducing the memory usage of your models
that says, Hey, I’m not going to record the, um, saved variables. Remember that, right? Um, I’m not
going to record the saved variables for everything in my network. Instead, I’m going to force the network
to re-compute, um, the, um, variables when I actually get to them. I’m trading away, uh, compute
so that I can reduce the amount of memory I use. So how do we
implement checkpointing in PyTorch? Well, we do it with re-entrant execution. What we do is we, um,
run our PyTorch program, we run the forwards, and we just don’t save anything. And then when we come
back in the backwards and we need to figure out how to, uh, you know, execute the, um, backwards
formula, well, we’ve failed to save anything. So what we do is we just rerun the forwards again,
and then re-entrantly call backwards on it to get the, um, actual backwards computation
computed in this case. Um, this was implemented by Priya Goyal back in the day and people use it.
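A rough sketch of what this looks like from the user side (just the public torch.utils.checkpoint API; the use_reentrant flag only exists on newer releases and selects the re-entrant implementation described here):

import torch
from torch.utils.checkpoint import checkpoint

def block(x):
    # An expensive segment whose intermediates we do not want to keep around.
    return torch.relu(x @ x.t()).sum(dim=1)

x = torch.randn(64, 64, requires_grad=True)

# Forward runs without saving intermediates for `block`; during backward,
# `block` is re-run and then differentiated re-entrantly to recover them.
y = checkpoint(block, x, use_reentrant=True)
y.sum().backward()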
And so it’s, you know, one of the most important use cases for re-entrant execution
in PyTorch. There’s a bunch of like complicated stuff where, uh, you can actually
get into this bad behavior where, um, you keep re-entering, uh, over and over again,
and then you blow your stack space. And there’s also some logic in the Autograd engine to deal
with that. One last thing that the Autograd engine supports, which is that, um, normally
Autograd is this thing you think of as running on a single process on a single machine, right?
Like you just run Autograd, you’ve got your entire graph. Well, in the distributed setting,
we actually have an implementation of distributed Autograd, which allows you to distribute Autograd
across multiple processes, uh, across multiple nodes in case, you know, your program in question
is too big to run on a single processor. And so there’s a sort of like specialized version
of Autograd, um, called distributed Autograd, which uses many of the same implementations,
but overrides some important stuff that makes it possible to just run, um, Autograd in this
distributed fashion. So that’s pretty cool. Also complicated in its own right. You can read
more about it if you’re interested. So why is Autograd so complicated? Well, one is that there
are a lot of features. There’s a lot of performance requirements. And, you know, when you put it all
together, there’s just, you know, you have to work pretty hard to do something like this. Um,
so that’s one of the reasons why, um, for example, in my previous podcast, it was really, you know,
interesting for people to be able to reuse our Autograd engine directly because, Hey, um, you know,
we’ve already done all this stuff, so you’d like to reuse it in that situation. But, you know,
there’s also like something to be said about a simple implementation of Autograd that, you know,
is hackable, maybe doesn’t have all the efficiency, doesn’t have all the features, but, you know,
just has the core, um, algorithms for Autograd. That’s a good idea too. And we have a bug report
that’s tracking this issue. So, um, hopefully you’ve come away from this with a little more
appreciation of, you know, why Autograd is more complicated. And so if you’re ever looking at this
code and you’re like, Oh, what is this business with hooks? What is this business with, um, you know,
this view metadata? What is this business with this multi-threading nonsense? Well, hopefully,
um, this podcast has given you some clues about why those things might actually be there.
That’s everything I wanted to say today. Talk to you next time.
Why-is-autograd-so-complicated
EP23 Code-generation
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about code generation in PyTorch.
Code generation refers to the practice of writing other scripts that generate code for you.
In the case of PyTorch, these are our Python scripts, which get run as part of the build
process and produce a lot of C++ files that actually make it possible to build PyTorch
as a whole.
Code generation is kind of the heavy guns, right?
Because when you start code generating your code base, a lot of things stop working.
For example, if you’ve got an IDE and you want to jump to definition, well, it, whoops,
looks like your, you know, method you’re looking for is actually in a code generated file, which
means it’s not in your working directory, and you have to go build PyTorch first before you
can actually go look at it.
So it makes things kind of confusing.
And, you know, most of the time, if you’re just writing a C++ project, you try very hard
not to do code generation.
But in the case of PyTorch, we’ve used code gen actually from the very beginning of the
project, even back in the days when ATen wasn’t even a thing.
And it’s ended up being a pretty good tradeoff for us in terms of what it allows us to do.
The high level of why, you know, code generation tends to be a good idea is that it lets you
greatly reduce the amount of code you have to write manually in a project.
If you had, you know, hundreds of classes that you would have had to have
written, you know, one by one.
Well, if you have a code generation pass, you can just generate them all from, you know, a
few lines of YAML.
And you don’t have to worry about it.
So that’s what it’s doing in PyTorch.
Code generation is being used to generate a lot of code that we would otherwise have to write by hand.
It makes the framework more maintainable.
But, you know, it is kind of complicated.
And so I just want to talk a little bit more about what kind of stuff we’re using code generation for.
Also, what are some of the pros and cons of using code generation and some other counterpoints in the design space?
Because code generation isn’t the only way to skin the cat necessarily in some situations.
Okay, so what are we using code generation for in PyTorch?
There’s a lot of things that we’re using it for.
And at a high level, the biggest way to think about, you know, why we’re using code gen for any given thing
is because usually it was something that we needed that you can’t do with plain old-fashioned C++ metaprogramming with templates.
So a really simple example of, you know, C++ just doesn’t support enough language features to do this
is a generation of APIs like functions or methods on classes, you know, based on a small amount of data.
So for example, we have a type named tensor, and it supports a lot of methods on it.
And those methods essentially call into another class.
It’s really a dispatch mechanism that, you know, is very uniform.
So like for every method, what it does is it just, you know, takes its arguments and calls into another function
that like actually does the method processing for us.
And in one of the, you know, philosophies in the C++ API in PyTorch is that, you know,
we want it to be possible to just write the same code you would have written in Python.
So if you wrote x.add in Python, you can write x.add in C++.
But C++ doesn’t have operator dot overloading.
So we have to actually manually write out every method by hand whenever we want to write a class like tensor,
which supports a method like this.
But we don’t want to write out these methods by hand, because we have hundreds of methods on the tensor class.
So this is one of the places where we use code generation.
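As a toy sketch of the idea (hypothetical spec and names, nothing like the real codegen, which is driven by native_functions.yaml and is far richer), a small Python script can stamp out the method declarations that C++ gives us no way to generate on its own:

# Hypothetical, heavily simplified: stamp out Tensor method declarations from a spec.
SPEC = [
    ("add", "const Tensor & other, const Scalar & alpha"),
    ("mul", "const Tensor & other"),
    ("relu", ""),
]

def emit_method(name: str, args: str) -> str:
    # Each generated method is a thin wrapper that forwards into the dispatch machinery.
    arglist = f"({args})" if args else "()"
    return f"  Tensor {name}{arglist} const;"

print("class Tensor {")
print(" public:")
for name, args in SPEC:
    print(emit_method(name, args))
print("};")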
Another example of us using code generation is when we do automatic differentiation, see my previous podcast,
we need to generate a class representing the set of saved data for any given piece of autograd information.
And we actually generate one such class per operator, because autograd might save different things depending on the operator in question, right?
Because there might be different mathematical values from the inputs that you need to compute the derivative in these cases.
We don’t do a boxed representation for autograd because that would be less efficient.
Instead, we just have a specialized class for each operator that only contains fields for exactly what we need.
And oh no, once again, there’s no way to, in C++, conveniently generate a ton of classes with slightly different fields based on some simple specification of what the things are.
So instead of having to write them all out by hand, we also use code generation to generate these. Code generation is also used in some cases to deal with things that don’t live in C++ at all.
For example, we have a bunch of Python bindings.
We code-generate the arg parsing logic for parsing the arguments for them.
But we also need to generate .pyi stubs, type stubs, that make the type information available for all the C bindings in question.
Well, how do we do that?
Well, there’s thousands of operators.
So once again, we code-generate the .pyi.
We didn’t use to have this capability.
We didn’t have any type stubs at all.
And all someone had to do was just go and write an extra Python script that knew how to generate these Python type stubs.
And that was it.
They didn’t have to like painstakingly go through every operator in PyTorch and figure out what their type signature would be.
And then saddle us with the burden of having to continuously maintain this extra set of stubs.
Instead, it just gets generated by code in this situation.
Some of the time, what we do is we say, OK, you want to implement an operator and you need to implement a CPU and CUDA version of this operator.
And usually there’s a fixed prototype that we expect a user to implement in the situation.
So we also use code generation to generate the prototypes for these functions so that, you know, you know what you need to implement downstream.
OK, so those are some of the main uses for code generation inside PyTorch.
So what are the benefits of using code generation?
As I said, I’ve harped on repeatedly about, you know, often we use code generation when there’s no other choice.
We just can’t do what the code generation does using just C++ templates or other mechanisms.
But there’s also other reasons why code generation is something that, you know, we reach for.
For one, when we build a code generation system in Python, we can actually do much more complicated things with surface syntax.
For example, we have native_functions.yaml.
Inside it, we have this miniature domain-specific language for specifying JIT schema, which is like something that we have to write a parser for.
And, you know, we also have derivatives.yaml, which is this compact representation for writing derivatives for functions.
And, yes, in principle, you can write a templated piece of code that is a parser for some arbitrary syntax.
And people have done this just to show that it can be done in C++.
But in general, C++ is much better at, like, modeling metaprogramming based on, like, C++ types, right?
Like, that’s how, you know, partial specialization and tricks like that work.
So C++ gives you really compact code when you, like, want to look at the type structure of your C++ programs and metaprogram off of that.
Really bad, horrible, awful, no-good-looking code when you want to, like, implement a parser that happens entirely at compile time.
And, yes, constexpr makes things better.
And the, you know, newer your C++ version is, that also makes it better.
But, unfortunately, PyTorch is still stuck on C++ 14.
Hopefully, we’ll get to C++ 17 soon.
But, you know, we need to work in a lot of different platforms.
And that sort of puts a limit on how futuristic our C++ code can be.
Another reason that we like using Python code generation is it makes it easier to write better error messages.
Template error messages in C++ are famously horrible, right?
Maybe if we get C++ concepts in the future, things will get better.
But, like, you know, a lot of people don’t really know how to debug C++ template errors, but are perfectly fine if, you know, it’s just a Python script,
albeit a complicated Python script, that’s, you know, raising an exception somewhere.
Because then you can add print statements, you can, like, look at, you know, you can tweak things around, you can print extra things out.
And it’s just easier to, you know, deal with than C++.
Yes, you can figure out how to do all of these things in C++.
But C++ metaprogramming debugging is a skill, and most people don’t have this skill.
Whereas most people do, and when I say most people, I mean, like, you know, most developers on PyTorch.
Most developers on PyTorch do know how to write Python code, do know how to debug Python code.
So that makes things a lot easier.
A sort of similar thing related to this is that in C++ templates, you often have to do very complicated encoding mechanisms
to, like, represent complicated data structures, because, like, as I said, C++ is all about, like, operating on types.
And if you actually want to do data, well, you have to work pretty hard.
And in Python, well, you can just write a data class and, you know, use that to represent whatever data you need to pass around.
In fact, our code generation is very strongly typed Python.
We use data classes everywhere, frozen data classes, and it’s fully type annotated with MyPy.
And that makes it easy to also do refactors, where you just, you know, make a change to the data type,
and then you just look for all the places you need to update in the situation.
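A minimal sketch of that style (hypothetical names, much simpler than the actual codegen model): frozen, fully annotated dataclasses mean mypy points at every place that needs updating when a field changes:

from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Arg:
    name: str
    cpp_type: str

@dataclass(frozen=True)
class OpSchema:
    name: str
    args: Tuple[Arg, ...]
    returns: str

def declaration(op: OpSchema) -> str:
    # Turn the structured schema back into a C++ declaration string.
    params = ", ".join(f"{a.cpp_type} {a.name}" for a in op.args)
    return f"{op.returns} {op.name}({params});"

op = OpSchema("add", (Arg("self", "const Tensor &"), Arg("other", "const Tensor &")), "Tensor")
print(declaration(op))  # Tensor add(const Tensor & self, const Tensor & other);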
One last thing.
With a code generation framework, we generate C++, which then is compiled by the C++ compiler,
which means that if something isn’t working, you can look at the generated code and be like,
hmm, is this the code that would have been written by hand?
And so it’s just generally easier to reason about the performance characteristics of Python-based code generation,
because you’re often trying to generate code that looks like code that you would have written by hand.
And with templates, it can be obscured, because there’s this level of indirection.
You’re never actually looking at the code that actually gets generated,
and it’s easy to accidentally put in inefficiencies when you write things that way.
I spent this whole time, like, saying what the pros of doing code generation are,
but, like, there are also some very big cons, right?
So I’ve talked about a few of them already, such as that code generation is complicated.
A lot of people don’t really want to, like, deal with this random Python script that is generating code.
If you do a bad job at maintaining your Python code that generates C++,
it can be really, really hard to maintain.
In fact, that was the state of the old code generation before we rewrote it with strong types.
But there’s some less obvious cons to code generation as well.
One is that code generation is not portable.
What do I mean by that?
What I mean is that, let’s say that, you know, you have some stuff that generates code for you,
and then you have some external user of PyTorch that also wants to make use of this code generation pipeline.
If I had a C++ template, I could just say, oh, instantiate the C++ template in your project,
and then you can get whatever functionality the C++ template gave you.
And they don’t have to do anything extra in their situation.
Whereas if I have a Python code gen script, well, now I have to, like,
actually design the code gen script to be runnable outside of PyTorch on some, you know,
extra data that the user in question provides.
And it’s just, there’s a lot more work you have to do to make sure something is publicly available.
We are doing some of this work, actually.
So for external backends, we spent a long time giving only a C++ template-based API for registering extensions.
But it eventually became clear to us that that just wasn’t enough.
We didn’t have enough features to do it.
And Brian Hirsch has been working on out-of-tree code gen for backend extenders.
It’s pretty cool.
I’ll post a link to it in the podcast description.
But, like, you know, we spent a long time not doing this because, well,
there’s a lot of work you have to do to actually make external code gen work.
And I just want to talk a little bit more about, you know, I said previously that C++ templates are pretty good for doing metaprogramming based on the C++ type system, right?
And it makes sense because it’s built into the C++ compiler, which knows all the vagaries of how C++ types work.
And it has turned out that when we write Python code generation framework, we actually need a, like, you know, model of the C++ type system, because sometimes we just need to do administrative stuff, like conversion from one type to another.
And, well, you know, the best way to do that is to actually know something about C++ types so that you can, like, you know, basically rerun whatever implicit conversions or type matching C++ would have done in this situation.
So, we had to implement that.
We have a crappy version of the C++ type system in our code gen.
It would have been easier to do this in C++ itself sometimes, perhaps.
Because sometimes it’s very easy, but, you know, when you add a little extra feature, then it becomes difficult to do something with templates.
So, I spent most of this podcast being like, hey, you know, you can either do code generation or you can do C++ templates, and these are two points in the design space for doing this kind of thing.
And one of the reasons why I put these as the two, like, possibilities is because both of these have the same efficiency characteristics, assuming you’ve done it correctly, right?
C++ templates get instantiated every time you give them some parameters so they can generate code that’s just as efficient as if you had written it by hand, which is what, you know, a code generation would do.
But there’s actually a third point in the design space, namely boxed fallbacks.
So, what are boxed fallbacks?
Boxed fallbacks are basically a way of writing polymorphic code that runs at runtime rather than at compile time.
And the way this is done is by making sure all of the inputs to an operation in question are boxed.
They’re stored in a uniform representation called an IValue, and then you can actually write C++ code that’s polymorphic in a sense.
By the way, if you’re used to being able to do generic programming, say, in Python or in Java, where you just, you know, like, use a reflection API or something like that to write code that works no matter what the types of the inputs are, then you’re also taking advantage of the fact that in those languages, the internal data representations are all boxed.
They’re all uniform.
So, you can just write runtime code that does this.
C++ doesn’t have that.
So, we have to actually turn things into their box representations before we can write this uniform code.
Boxed fallback code is often way simpler to write.
Recently, Brian, once again, has been working in this space.
So, he’s the expert.
Brian has been, you know, taking some code that we used to do in CodeGen and writing it using a boxed fallback, namely some CPU fallback code.
So, what does this do?
It just says, hey, I want to run an operation, but I don’t have it implemented for XLA.
So, I’m going to cast it to CPU and then run the operation on CPU and then put it back in XLA.
And it’s really, like, easy to do the boxed fallback version.
You just do the obvious thing.
You, you know, iterate over the arguments.
You look for ones that are XLA tensors, convert them to CPU, call the actual thing, and then, you know, iterate over the results and turn them back into XLA.
Very, very simple.
You’d have to do quite a lot of work to, like, write the code generation version of it.
And you’d probably have to do less work, although still some amount of work, to write the C++ template version.
The boxed fallback is very simple.
It’s easy to debug as well because you can add print statements in the normal way.
There’s no templates involved.
The problem is it’s less efficient, right?
Because you’re boxing things up and you’ve got this little interpreter that, you know, has to go and look at what the types of everything are.
So, boxed fallbacks, simple.
And, you know, they work at runtime.
So, they, like, work even when you can’t see, ahead of time, the code in question that you might need.
But it’s less efficient.
So, you probably only want to use them in cases where efficiency isn’t important.
And CPU fallback is definitely one of those cases because, well, you’re falling back to CPU.
So, like, you don’t expect it to be fast.
You’re just trying to make it work at all in the first place.
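The real thing is boxed C++ operating on IValues inside the dispatcher (and targets XLA, not CUDA), but the shape of the algorithm is easy to sketch in plain Python; the helper below is purely illustrative:

import torch

def cpu_fallback(op, *args, src_device="cuda"):
    # Illustrative only: move tensor arguments living on src_device to CPU,
    # run the operation there, then move the results back to src_device.
    def to_cpu(a):
        return a.cpu() if isinstance(a, torch.Tensor) and a.device.type == src_device else a

    def to_src(a):
        return a.to(src_device) if isinstance(a, torch.Tensor) else a

    out = op(*[to_cpu(a) for a in args])
    if isinstance(out, tuple):
        return tuple(to_src(o) for o in out)
    return to_src(out)

# e.g. cpu_fallback(torch.add, x, y) on two CUDA tensors computes on CPU
# and hands you back a CUDA tensor.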
So, that’s most of everything that I wanted to say about code generation.
One of the open questions that I have as a programming languages person is, is there a way for us to have the best of both worlds, right?
So, I had this picture of, oh, I can metaprogram things ahead of time and it’s kind of complicated, but it’s really efficient.
Or I can write this interpreter that does everything at runtime.
It’s simple to write, but less efficient.
Can I have the best of both worlds, for example, by writing an interpreter and then partially evaluating it so that I can get the fast compile time version?
Well, I can’t easily do this if I write my interpreter in C++, but maybe if I write it in a different language, it’ll be easier to do.
That’s something that I’ve kind of been thinking about, although we don’t really have any concrete projects for dealing with this.
That’s everything I wanted to say for today.
Talk to you next time.
Code-generation
EP24 torch.nn
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about Torch.nn,
PyTorch’s public API for actually building neural networks. Of course, if you are a user of PyTorch,
Torch.nn is one of the very first things you actually learn how to use, and there are lots
and lots of documentation about all sorts of ways to use modules in PyTorch. And as this is a dev
podcast, I’m not going to really talk about how to use Torch.nn so much as if you are a maintainer
or a potential contributor to the library, and you want to make modifications to Torch.nn,
well, what are the kinds of things you’re going to have to worry about? What are some of the
philosophies behind the design of this component, etc? So let’s dig in. So Torch.nn, as I said
previously, provides the NN module abstraction, most importantly, which is how most people put
together their deep learning modules. Why does Torch.nn exist? Well, it exists because when you
are setting up your model, you have a lot of computations that you want to do, you have a lot
of parameters. And you need a convenient way to keep track of all your parameters. Because for example,
when you are doing optimization, you need to iterate through all your parameters, and you know,
apply the gradient you computed for each of them to the result. And so if you’re a purely functional
person, like in JAX, actually having to, like, manually keep track of all your parameters
in, you know, like a global spot in your application gets kind of annoying when your model
gets very big. And so what Torch.nn does is it gives you a convenient object-oriented like interface
that automatically can collect up all the parameters for you, so that you don’t actually have to keep
track of it yourself. You can just ask, hey, what are the parameters of this model? And it’ll tell
you all of them. Pretty cool, right? Another thing that is really important about Torch.nn is unlike
many of the other pieces of PyTorch, which we’ve moved to C++, because, well, you know, C++ is faster,
we’ve tried very hard not to actually move Torch.nn to C++. And so if you crack open the Python files in
PyTorch itself because, hey, you know, you want to see how convolution is implemented? Well, it’s still a, you
know, plain old Python class that you can, for example, copy paste into your own project and tweak
however you need. And so another reason why Torch.nn is in Python is it’s more hackable, right? Like a lot
of times you are, you know, doing something that someone has done before, but maybe with some tweaks.
And there’s nothing wrong with copy pasting code and research code. It’s probably the fastest way to
get going. And, you know, long term maintainability isn’t as much of a concern. And so we wanted to make
sure this was still something that people could do when they wanted to do those things. Of course,
getting all these features to work ends up being pretty complicated. So if you’ve ever cracked open
module.py, the file that actually implements module for real, it’s actually really,
really long, and there’s tons and tons of stuff going on. So let’s just talk about the most important
things that it’s doing. So one, I said that modules are able to collect all parameters. How do we know
if something is a parameter or not? Well, in PyTorch, there’s a parameter subclass of the tensor, which is
how you make this distinction, right? So anything that is a parameter, and you put it into a module, we will
keep track of it. Anything that’s not a parameter, just a plain old tensor, we won’t keep track of that.
In order to keep track of all the parameters you put on the module, we need to override the behavior
of what happens when you modify fields on your modules. So modules override the behavior of __setattr__
and __getattr__ to basically say, hey, when you set an attribute on my module, is it a parameter? And
if it is a parameter, then we actually just go ahead and, you know, put it in our record of all the
parameters that are on the module. So that’s another piece of like complication inside the
implementation of module. Some other thing modules need to support is being transitioned
from one device to another. Traditionally, the way that you, like, allocated a module on CUDA is you first
allocate it, and then you run .cuda() on it. So another thing that modules need to know how to do
is, you know, find all of the things in the module, all the tensors, and not just the parameters, but also
other buffers and also any recursive sub modules that are also part of this module, and also make sure things
get called on them. And so there’s a, you know, little helper function called _apply, which knows how to
iterate over what essentially is every tensor in the module and apply an operation to each occurrence of it.
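So, for instance, a single .to() call walks parameters, buffers, and submodules via that machinery (the buffer and submodule below are just illustrative):

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(8, 2)                              # a submodule with its own parameters
        self.register_buffer("running_mean", torch.zeros(8))   # a buffer: moved too, though not a parameter

    def forward(self, x):
        return self.fc(x - self.running_mean)

net = Net()
# One call; internally the module's _apply machinery visits every tensor hanging
# off the module tree (parameters, buffers, and recursively, submodules).
net.to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))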
Another thing that modules implement are hooks, hooks are ways of just interposing in on the behavior of
modules without having to manually write in code in every location. And to implement this, well, you know, when you
define a module, you write a function called forward. But when you want to actually invoke a module, you don’t call the
forward function directly, you call the operator call, like __call__, like just a
plain old function call on the module directly. And that call does a bunch of work, it like processes hooks and
figures out all the sort of administrative stuff before actually calling the forward implementation to do the
actual thing you want to do. So there’s a lot of goop in module.py. But you know, if you just keep these
three things in mind, right, like we need to keep track of the parameters. So there’s overriding behavior of
__setattr__ and __getattr__, there’s implementations of these functorial operations, which operate over all the
tensors on the module. And then there’s a bunch of hooks and interposition that, you know, let people tweak the
behavior of modules without having to edit them manually, you’ll actually, you know, be able to understand a good
majority of the lines of code in module.py. There’s really only two other things you have to worry about. One is
serialization, right? Like a really important thing to be able to do is once you have your module, and you have
trained it, you want to dump all the parameters to disk so you can use them again later. Well, similar
to how we keep track of all the parameters, there’s also a notion of sets of things that actually get
persisted when you serialize a module. The recommended API for doing this is state_dict, which just gives you
a dictionary mapping from key names to tensors containing all of the parameters in question. You can also
technically pickle the module directly, although this is a lot more fragile, because pickling requires you to
actually maintain exactly the same name of the module, and exactly the same module that the class is
defined in (module in the Python module sense). One last complication, when writing modules in PyTorch itself, is most
modules in PyTorch are what we call torch scriptable. What’s torch script? Well, torch script is our compiler for PyTorch
models. And essentially, what it lets you do is if you have a torch scriptable model, you can translate it into torch
script’s intermediate representation. And then you can, for example, ship it in a, like, Python agnostic form, or you can
also run some optimizations on it. And because torch script is a compiler, but Python is really complicated, there’s some
restrictions that apply when you want to write modules, because you need to make sure they’re actually torch scriptable. The most obvious
restrictions are that there’s a limited set of types you’re allowed to use, because the interpreter in
torch script doesn’t support arbitrary types. And you also have to make sure that the set of Python you use inside your
forward function is the set of Python that is actually understood by torch script. Although torch script actually does support a lot of
Python features. So chances are normal things you do are going to be understood. One of the more unconventional
things about how torch script compiles modules is that it’s actually a staged computation. So you could imagine
compiling an NN module, including the constructor and the forward implementation. But that’s not actually how torch script works. What torch script does is it first instantiates the
module as a normal Python object, so you actually construct the module. And only once you’ve constructed the module, do you actually then attempt to
compile the forward implementation on it. There are some benefits to doing this. In particular, because the initialization of the module happens in ordinary
Python, you can go wild with anything you want in this case. And you know, there’s no restrictions on the initialization code for the
modules, you can do anything you want. And furthermore, once you’ve actually initialized all the attributes on the class in question, torch script has a
much more accurate picture about what the actual parameters on your class are. So if you have some weird situation where, you know, if you
pass in a parameter, and it’s true, you allocate a parameter. And if it’s false, you don’t allocate the parameter. Well, torch script can handle this
fine, even though torch script is statically typed, and you need to know exactly what all the fields on your module are. So that’s some of the things you have to be
aware about when you’re working on modules in NN module. What else? Well, there’s been some new developments in NN module. Shocking, I
know, because everyone and their dog subclasses from modules. So when we make changes to the class, we have to be very careful, because
there’s a lot of people who will be very unhappy with us if we ever break backwards compatibility on modules. That being said, we’ve been able to
come up with some new things that like make modules easier to use. One of the coolest new additions is the concept of lazy modules,
authored by Emilio Castillo from Preferred Networks. What lazy modules do is solve a common problem that you have when you’re trying to
construct a model, which is that you don’t know how big the parameters should be. Because what’s going on is you’re
passing in some input of some known input size, and it’s going through your model. And at some point, you’re like in the middle
of the model, and you need to provide an FC layer. And that FC layer needs to know how big the input is, because the parameter in
question is going to be, you know, the size of the input times the size of the output. But you have no idea what the input size is
going to be like, you know, you’ve run a pile of convolutions, who knows what the result is going to be. And you don’t want to have to
manually, you know, compute what the sizes are at that point in time. So prior to lazy modules, you had to suck it up and like add some print
statements to figure out what it was. With the lazy module, you just say, okay, well, lazy FC, with what the output size is
supposed to be, and the input size is just left unspecified. And then the first time you run the forward on the module in
question, it says, hey, this input is size x. Okay, now I’m going to allocate
the parameter, because I know what the size of the input is. Another really interesting recent development
is for the longest time, you couldn’t actually allocate a module directly on CUDA. And so we forced
everyone to like allocate on CPU first and then move it to CUDA. This wasn’t too bad when models were small,
but people are really excited about really big models. And sometimes the models are so big, you can’t even
fit them on a single machine. So how the heck are you going to construct a module in that case when it’s too
big to fit on your machine? So what Joel Schlosser has done is he’s added a new device keyword argument to
all the modules in PyTorch. So what does this mean? So if you are constructing a module in PyTorch,
and you pass in device equals CUDA, when you construct it, instead of constructing a module on CPU, and then
moving it to CUDA, what it will do instead is it’ll directly construct the module on CUDA. This patch was
super simple, right? All we did was like, edit the initialization code to actually respect the device. But,
you know, I don’t know why we hadn’t done it before. But you know, Joel actually made it happen.
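Concretely, on a recent enough PyTorch (and on a machine that actually has a CUDA device), it looks like this, skipping the CPU-then-.cuda() dance entirely:

import torch.nn as nn

# Requires a PyTorch version with the device= constructor argument and a CUDA device.
layer = nn.Linear(1024, 1024, device="cuda")
print(layer.weight.device)  # cuda:0 -- the parameter was never materialized on CPU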
And we’re hoping that throughout the rest of the PyTorch ecosystem, people will start following this
convention. And so given an arbitrary module, you can just pass in device and get the module on the
device in question. One of the like cool interactions with this other feature that we’ve been working on
called meta tensors is if you say device equals meta, what you’ll get is you’ll get a constructed module,
but all of the tensors will be not allocated; they’ll be meta tensors saying what their sizes are. And then
you can do post facto analysis on it in this situation. One of the open questions for us with
the NN module design, there’s a few things. So one problem that is coming up for us soon is we actually
do need some sort of functional version of modules because sometimes you’re doing sort of higher order
training, or you’re doing APIs that only work on purely functional programs. And in those situations,
like the very stateful nature of PyTorch NN modules doesn’t work so well. So that’s one thing like given a
module, can we turn it into its functional version? Another open problem that has been plaguing us for a while is
many of the weight initializations in PyTorch are very out of date, like they basically hearken all the way back to
LuaTorch days. And the research has moved on and figured out that there are better ways to initialize weights
than these. And we’re stuck in a hard place because, well, on the one hand, we’d like to update the
initializations. But on the other hand, if we do that, lots of people’s, you know, pre existing models,
might break because, well, they may have expected some particular initialization. We have some ideas
about how to fix this, like imagine some sort of like version that you can specify, hey, I want weight
initialization version three, and that comes with all the updates and you just explicitly opt into it.
But no one has really implemented this yet. It’s something I’m kind of interested in seeing done at
some point. That’s everything I wanted to talk about NN module today. Talk to you next time.
torch.nn
EP25 Mobile-selective-build
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about mobile
selective build. PyTorch is a project that is trying to do a lot of things. And one of the more
unconventional things that the project tries to do is we use the same code base that you use for
doing your, you know, good old fashioned Python training loops in Python on your regular desktop.
And we also use this code base for actually deploying PyTorch mobile models on mobile,
like so that you can run some, you know, image model on your phone and get a result back without
actually having to go back all the way to a server. So this is kind of crazy, actually, because mobile
is a completely different universe than server side programming. And there’s one particular aspect of
it that I want to talk about today, which is selective build, namely the fact that when you are writing
applications that go on mobile, binary size is really, really, really important. On server,
binary size isn’t that important. It is kind of important because if your binary gets too big, like
say four gigabytes big, then a lot of tools like the debugger stop working. But it takes a lot of code to
get to four gigabytes. On mobile, this isn’t really the case. You really, really want your app to be as
small as possible. Because you’ve got people who are downloading your app on really, you know,
shitty cell phone connections. And if your app takes up a lot of binary, then they’re not happy. And
without really having any sort of clamp on the binary size, the easy thing for a software project to do is
just keep going and going in binary size. And so there’s very stringent restrictions about binary
size, people will yell at you if binary size increases too much. And it’s in this context,
that PyTorch designed selective build. So what is selective build in PyTorch? Well, this is the
concept that, hey, PyTorch comes with a lot of operators, right, a lot of support for many, many
different operations. And half the time, you’re not using even half of these operators for any given
model, right? If you’re like doing a ResNet, oh, so old fashioned, but if you’re doing a ResNet,
there’d probably only be 20 or so operations that you actually need out of PyTorch’s, you know,
more than 1000 operators. So what’s the idea? If you are shipping some models to mobile, and you know,
what the set of models you want to run is, well, don’t ship all the operators; ship only the operators
that you actually need to run on mobile, and you’ll get big binary size savings, and everyone will love
you. And also, all of the people who are, you know, frantically working on adding new functionality to
PyTorch, they don’t have to worry about going over some binary size limit, because all that stuff isn’t
actually going to be used. Now, ordinarily, when you are building some application for mobile, typically,
the way you do it is you build everything statically, and you statically link everything together. And
static linking has this interesting property, which is that we know exactly what is being used inside a
statically linked application. So if a function is not being used, we can actually just prune it away.
And linkers will do that automatically in that situation. You can’t do this, by the way,
for a dynamic library, because a dynamic library offers a public API. And anyone else, even people
you know nothing about, could make use of any of the exposed functions in your dynamic API. So usually,
everything has to be put in. So if a static library can be done this way, why doesn’t, you know,
elimination of operators that you don’t need happen automatically in PyTorch? Well, there’s two
reasons. So one is that when we run, when we run models on mobile, we’re running them via an interpreter,
either the TorchScript interpreter, or the lite interpreter, which is a sort of pared down version of
TorchScript that has less support, but, you know, is smaller in binary size and runs a little faster.
So when you have an interpreter, one of the things in the interpreter loop that you need to do is you
need to, you know, look at your op code, which says, hey, run this operator, and then have a giant
switch statement for all the operators that you understand and, you know, have a call to each of
them. And obviously, static linking isn’t going to know that, well, this particular branch, which is
doing some, you know, mish activation or whatever, isn’t actually ever going to be used by your model,
because it can’t know, there’s no way for it to know. So we need to tell the interpreter,
hey, you know, these ops, you don’t need to compile in, you can’t get it automatically with static
linking. But let’s say you wrote your model directly in C++, which is something you can do. And you could
actually use to deploy models, although most people don’t, because it’s a pain in the ass to update native
code on mobile, because you have to, you know, build an entirely new version of your app, it’s much easier
to just push an on-the-wire update for some data that is just your, you know, serialized model. But let’s
say you did do that. Hypothetically, static linking should get you what you want in this case, right?
Well, not quite either. So in PyTorch, we use this operator registration mechanism to make it possible
for people to sort of insert in, it’s like a form of dependency injection, like if you load up the
LibTorch CUDA library, then all calls to torch.add suddenly have the ability to call into CUDA,
as long as they’re passed a CUDA tensor. And this is done via dynamic dispatch. And the important
thing is that in order to make this dynamic dispatch work, we have to register an implementation of the
operator at library loading time. And what happens when you do that? Well, that’s a static initializer
in the library. And once again, the compiler cannot eliminate this, because it doesn’t know if this
arbitrary piece of code that gets run at library startup might actually, you know, do something
important that you can’t dispense with. So okay, by the way, that’s why you need, like, whole-archive if
you’re linking against PyTorch statically, because otherwise the linker will just drop all the static
initializers if nothing in the object file in question is referenced. It’s pretty nutso. But
you know, that’s the way it is. Okay, so we need a way to actually figure out what operators
our model needs, and then apply this to a build of PyTorch, so that we don’t actually ship the ones we don’t need
when we’re building the application for mobile. Okay, so let’s take these in two steps. So first,
what operators does our model need? So if I have a TorchScript model, my TorchScript model is serialized in
some machine readable form. And so at the first level, it’s really easy to figure out what operators a
model needs, right? Like we just go to this serialized format. And for every operator call in it,
we just say, Okay, well, I see an add, so I need add. And then, Oh, I see a convolution. So I need
convolution, etc. Easy to get a list of operators that the model needs. But there’s a problem with
this, which is: what if your operator uses other operators? And this is really, really common in
PyTorch, because we have a lot of like really small, cheap operations that you can use to sort of massage
things into the correct form, like viewing and reshaping. And many, many operators use this. And so
if you’re doing one of these things, well, you also need to be able to track what those uses are,
you need some sort of dependency graph from operators to operators. So how is this done?
Well, the way we do this is we actually have a LLVM based static analysis. So what you do is you take
PyTorch, you compile it with Clang, producing LLVM bitcode for all the object files. And then our static
analysis goes through all of the bitcode and looks for things that look
like operator definitions. They’re easy to find because there’s a specific API call you use to
register the operator. So it just looks for instances of that API call. And then it, you know, spiders that
code until it finds all the dispatcher calls, which mean that, hey, I have a dynamic dependency on some
other operator, and then generates that into a YAML. That’s pretty interesting. Most people don’t want to
compile PyTorch with LLVM bitcode to actually get this analysis graph. So we also have the YAML
checked in for an easy kickstart. If you don’t actually want to, you know, run this pass. By the
way, this pass is supposed to be updated by a bot. But the last I checked, while preparing this podcast,
the last time it was updated was February this year. So you know, if you’re running into a problem with the
open source mobile selective build, like something’s missing and it shouldn’t be, just rebuild from
scratch. The instructions are there. It’s pretty simple. I’ll also link it in the episode notes for this
podcast. By the way, there’s another way to get the set of ops your model needs, which is some sort
of dynamic tracing. And we actually debated a lot when we were trying to decide what to do for figuring
out what ops a model needs. So how does dynamic tracing work? Well, instead of trying to statically
read out the operators your model needs by looking at the TorchScript model, just run the TorchScript
model. And when you run it, you’re going to hit a bunch of operations and record what operations you see.
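(For the read-the-serialized-program route described a moment ago, the first-level list can be pulled out of a scripted model with torch.jit.export_opnames, if I recall the helper’s name correctly; it only reports direct calls, which is exactly why the dependency analysis above exists.)

import torch
import torch.nn as nn

m = torch.jit.script(nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()))

# Direct (first-level) operator calls of the serialized program; transitive
# operator-to-operator dependencies still need the static analysis / YAML.
print(torch.jit.export_opnames(m))
# e.g. ['aten::conv2d', 'aten::relu', ...] -- exact contents vary by version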
And then that gives you exactly the set of operators you need. So no need for, you know, this dependency graph
analysis. Life is easy when you’re dynamic. Of course, there’s a problem with this, right, which is you need
representative inputs for your model. And well, maybe that’s not a big deal. If you’re like deploying these
models, because you want the representative inputs anyway to test that the model doesn’t crash. But if there’s say
control flow in your model, then a single representative input might not actually cover
everything. So you need to make sure you actually fully cover it; it’s like a code coverage problem,
right? You need to actually cover every operator that’s actually used to make sure that you’ve gotten
everything. Okay, so that’s how you get all the ops your model needs. How do we actually apply this to a
build of PyTorch? So as I said, static linking doesn’t let us, you know, do this automatically. So what we
actually have to do, so you have to, you know, take these operator registrations or things that would
otherwise force the compiler to include the code in question, and make sure that we have a way to say, okay, don’t do
that when we don’t need it. So a lot of operator registrations are done via code generation, see one of my
previous podcasts. So in that case, it’s very simple, we actually just feed in the YAML file that says all the
operators we need to our code generation. And the code generation says, Oh, you know, the selective build says that I
don’t need this operator, so I’m just not going to generate a registration call. And if I don’t
generate a registration call, then the code that it calls is now dead, because there’s nothing actually
calling it, and then it’ll get pruned away by the static linker, no problem. Unfortunately, there are some
registration calls that don’t actually get generated by code gen, they’re just done manually via our, you
know, very nice and intuitive m.def or m.impl syntax. So for this, we have a very clever scheme, which is called the
selective name macro. The basic idea behind this macro is that when you build PyTorch, we also dump all of the
operators that are supported into a constexpr string. And so we actually have this constexpr function, which can
basically take in an operator name and say, hey, is this included in the giant comma separated constexpr list
of, you know, all the operators that are allowed or not. And what the selective name macro does is it just
applies this constexpr function to the name that you are registering. So you wrap selective name
around the name you want to register. And if it is in the constexpr list, you let it go through no
problem. And if it’s not, you generate a, you know, basically a dummy type that says, hey, don’t actually
do this registration. And because this all happens in compile time, then the compiler knows, oh, okay, now I’m
just not going to generate any code for this at all. We had to do this a little specially because in C++,
you can’t actually pass strings to templates directly. So you know, we have to make sure this
gets all resolved into a Boolean, which we can then pass into a template. There’s one last detail, which is
actually pretty important when you’re trying to understand how the selective build system works, which
is how this integrates into your build system. So in CMake, everything is fine, you just do a CMake build
of PyTorch, with the particular operators that you want it to ship. And then you know, there’s no
problem. But at Facebook, we actually have multiple apps, and all these apps want to use PyTorch. And so
we actually have this problem, which is that we want different sets of allowed operators, depending on
which app we’re doing. And the build system we use at Facebook, namely Buck, has a constraint, which is
that you’re only allowed to have one copy of any given library at any given time in the build system.
And this is just to make sure people aren’t like, doing some sort of Node.js style disaster where there’s
like a bazillion copies of the same dependency everywhere. But that’s a problem for us, right? Because,
you know, there’s only one PyTorch library, but each of the apps wants a different version
of the PyTorch library in this situation. So what do we do? Well, we cheat, we actually generate multiple
copies of the PyTorch library for each version of the app that we need in this situation. And we don’t
generate a copy of everything, just the relevant parts that actually contain the operators.
This used to be just some glue code, which did the registrations. So it was a very small bit of code
that we like had to recompile for everything. But we’ve actually expanded this to recompile
all of PyTorch. Because as I said, we want selective dtypes, and dtypes are like sort of coded into the
operators themselves. So there’s no like registration mechanism we can use for dtypes, we have to handle
this actually by recompiling the kernels in question. There’s a kind of funny alternate universe, where
instead of like recompiling the entire library for the sets of operators you want to do, you could also
just modularize the library. So you have one library for convolution, another library for add, another library for sub,
etc, etc. So isn’t that like the, you know, good software engineering way to like, you know, deal with the
system, and then you only depend on the libraries you need. Well, yes, this kind of works. And actually, Caffe2 used to do this. And
there’s a problem, which is that one, building libraries takes a while, right, because you have to link them. So it, like, takes a minute a
pop. And so that would be really, really slow. And second, well, people just don’t write code this way. They don’t generate a
1000 libraries for 1000 really small pieces of functionality, and then you know, mix and match them for what you actually want to do. And a lot of the
ecosystem is not set up to do this properly. So for example, we have to load iOS applications into Xcode to actually, you know, work on them. And if we actually generated a library for every operator,
it would crash Xcode, because there’s just too many libraries. So you know, yeah, don’t do Node.js-style stuff in mobile. One final thing I want to say: the selective build for mobile is intended to be something that you don’t really have to worry about if you’re developing PyTorch. But sometimes it rears its head. And the most common situation it rears its head is you’re working on a kernel, and you modify some of its implementation details
so that it’s calling some new operator. And then some guy comes to you and says, hey, my rando mobile, like, application stopped working. And that’s usually because there’s some YAML somewhere that describes the set of ops the model needs. And it’s out of date, right? Because you changed what the dependency structure of the model is. And so now there’s a different set of operators that are needed. And you have to tell the YAML file, hey, this is a new thing, you have to rerun the analysis pass. A lot of these things are checked in for better or
for worse. Fortunately, it’s really easy to regenerate these YAML files. And also the PyTorch edge developers are very friendly and very willing to help in these situations. So you can just reach out and, you know, learn how to do it. And there’s also ample documentation internally for this sort of workflow. Okay, that’s everything I want to talk about today. Talk to you next time.
Mobile-selective-build
EP26 PyObject-preservation
Hello, everyone, and welcome to the PyTorch Dev podcast. Today, I want to make good on a promise
that I made on the very first episode of the podcast, namely, how the heck when we bind C++
ATen tensors to Python, can we make it so that the Python object doesn’t go away when the Python
object goes dead? Namely, how do we preserve the PyObject? This podcast is going to be a little
technical. So the way that I’m going to do it is I’m going to first explain how the trick works,
which will sound really simple, stupidly simple, in fact, but with a lot of complexity underneath.
And then we’ll just go on a wild romp on various aspects of how C Python works and how C extensions
work with Python to explain all of the more subtle moving parts of how PyObject preservation works.
Because yes, it does sound very simple. And it is simple. There’s just a lot of T’s to cross and
I’s to dot. All right, so where should we start? So just to remind ourselves about what the problem is,
imagine you have two objects, two ref counted objects, that are ref counted separately, you know,
so object A has got some ref count three, object B has got some ref count two. And what you’d like to
do is you’d like to set them up so that object B stays live as long as A has non zero ref count,
and object A stays live as long as B has non zero ref count. So let’s imagine that one of these
objects is our C++ tensor with a C++ reference count, and the other object is our PyObject
representing the C++ tensor, and it has a Python ref count. This puzzle basically devolves into
how do we make sure that we actually keep the C++ tensor and the PyObject around at the same time,
linking their lifetimes together. Before we explain the solution, it’s helpful to think about two solutions
that don’t work. One solution that doesn’t work is to have the C++ object have a strong reference
to the Python object, and the Python object have a strong reference to the C++ object. Why
doesn’t this work? Well, you have a reference cycle. Recall from your class about reference counting
that reference counts are very nice, but they have a problem, which is that if you have objects that
refer to each other, those objects will never get garbage collected unless you break the cycle in some
way. So if we have C plus plus refer to Python, Python refer to C plus plus, that’s a cycle that’s
straight out, we will just never garbage collect the objects in that situation. Another solution that
doesn’t work is to have one of the objects have a strong reference to the object, and another object
have a weak reference to the other. So for example, have the C plus plus object have a weak reference
to the PI object, and the PI object have a strong reference to the C plus plus object. This is what
PyTorch does today. And it doesn’t work because if all the references to the Python object go dead,
the Python object will get deallocated because well, the C plus plus object, even if it has references to
itself only has a weak reference to the Python object. So it doesn’t stay live in that situation.
Okay, so how can we solve this problem? Well, we’re going to use a little trick. And the trick
is resurrection in Python ref counts. What does resurrection refer to? So resurrection refers to
the fact that when you’re doing ref counting in Python, if the ref count for an object goes to zero,
you can still resurrect the object from the dead by simply making sure that a new reference to the
object gets taken out while you’re deallocating the object. When this happens, CPython will say,
Oh, object is still live and will abort the rest of the deallocation process. With resurrection as our
tool, we now have enough tools to actually solve the circular reference problem once and for all.
Here's how it works. So in the beginning, we'll set things up just as we do today, where we have a C++
object and a Python object, and the Python object has a strong reference to the C++ object,
but not vice versa. This goes on for a bit while we have references. And at some point,
the Python object is going to go dead, whereas the C++ object is still live, because that's the
situation we're worried about. When the Python object goes dead, we don't immediately deallocate it.
Instead, we look at the reference count of the C++ object and ask ourselves:
is this reference count greater than one? Because, well, if it's one, then it's solely owned by the
Python object in question. But if it's greater than one, that means someone else has a reference
to the C++ object, and that means we shouldn't kill the Python object. So when this
happens, we will abort the deallocation, and we will flip the ownership so that the C++ object
owns the Python object instead of vice versa, thus saving the Python object from getting deallocated.
And since the Python object has no incoming references at that point, having the C++ object own it is the only
way to keep it alive. There's one last thing, which is that C++ reference counting traditionally doesn't
support resurrection, because it’s kind of a difficult thing to do in a thread safe manner.
So what we'll do is: if I ever use my C++ object to take out a new owning
reference to the Python object, and this shouldn't be too hard to do because you had to call some API
on the C++ object to get the Python object in question, then we can just flip the
ownership back, so that the Python object refers back to the C++ object once again. And you can
do this as many times as you want, as many times as the Python object goes dead while the C++
object is still live. And so we wrote this up in a patch, and we put it in PyTorch master. And so now
in PyTorch master, if you assign a variable to the grad field of a tensor (the grad field, by the way,
is stored in C++, so it isn't a good old fashioned PyObject field, it's an actual field in
C++), so you store a tensor in there, and then you delete all references to it from Python,
you will still retain, for example, the __dict__ properties that you put on the tensor in question.
So no more lost PyObjects. So that's it. That's how PyObject preservation works.
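To make the trick concrete, here is a minimal sketch of what that check can look like inside a custom dealloc. This is not the actual PyTorch code; tensor_use_count() and flip_ownership_to_cpp() are hypothetical helpers standing in for the real accessors on the underlying TensorImpl:

// Hedged sketch of the resurrection check, not the real THPVariable dealloc.
static void TensorSubclass_dealloc(PyObject* self) {
  // The Python refcount just hit zero. If someone other than this PyObject
  // still owns the C++ TensorImpl, resurrect instead of dying.
  if (tensor_use_count(self) > 1) {   // hypothetical helper
    // CPython >= 3.9; older code wrote Py_REFCNT(self) = 1 directly.
    Py_SET_REFCNT(self, 1);
    flip_ownership_to_cpp(self);      // hypothetical: C++ tensor now owns this PyObject
    return;                           // abort deallocation entirely
  }
  // Otherwise proceed with normal teardown: finalizers, __dict__, slots,
  // weakrefs, and finally freeing the object.
  Py_TYPE(self)->tp_free(self);
}

The interesting part is just the early return: once the single surviving reference has been handed to the C++ side, the rest of the deallocation machinery never runs.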
Feel like you want a little more, perhaps? Well, let’s dig into a little bit about why this actually
works. And the first question that you might ask is, hey, Edward, so it’s kind of cool that there’s this
PyObject resurrection mechanism. By the way, it was Sam Gross who came up with this technique; he was the
one who told me about it and suggested I actually implement it this way. So why does resurrection
exist in Python in the first place? And the answer is finalizers. What is a finalizer? So in Python,
you have all these objects, and sometimes they go dead. And sometimes you want to clean up after an
object after it goes dead. For example, if you open a file, when the file object goes dead, you might want
to close the file in that situation. Of course, what you really should do in that situation is a context
manager to guarantee the file gets closed. But if you don’t use a context manager, the file will still
get closed when it gets deallocated because of the finalizer. So Python supports arbitrary finalizers,
you can write whatever code you want. If you want to write a Python object and write some finalization
code on it, you can just write the magic method __del__ on it. Cool. So there's
a problem, right? Finalization happens when the object is dead and we're trying to get rid of it, and the
finalization code can do anything. So what happens if you accidentally, or purposefully,
stash a new reference to the object being finalized somewhere else?
Hmm. Well, for a while this was kind of skeevy. And eventually there was this PEP,
Safe Object Finalization (PEP 442), which said: okay, what we will do is resurrect the object when this
happens. So we will make this a valid thing to do, and we'll just mark the object as, oh, this object has
been finalized, so I'm never going to finalize it again. So you have the invariant that an object
only gets finalized once. This way, we don't have to worry about
objects being in strange, half-destructed states and then escaping into the outside world, because we
just run the finalizer, the finalizer resurrects it, we just stop the deallocation, and then we wait until
later, when the object actually becomes dead, to deallocate it. So this is why resurrection exists. But it also
poses a question for PyObject preservation, which is: if finalizers can only run once,
I'd better not run my finalizers when I'm doing one of these resurrection things. And actually,
it's a little difficult to arrange for this to be the case. To see why, let's explain how deallocation
works in CPython. So in CPython, when you define any type of Python object, there are a bunch of
tp_ fields, which define the various behaviors you want. So there's tp_init, which says what to do
during construction. And for our purposes, there's one that's very important: tp_dealloc. What is
tp_dealloc? It just says how to deallocate an object. And so when you write a C
extension with a custom PyObject, you'll typically provide a tp_dealloc that, you know, looks into the C++
fields of whatever it is you're implementing in the PyObject and actually deletes them, so that we
deallocate them. And at the end of the day, it actually also deletes the Python object altogether.
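As a rough illustration (a minimal sketch with hypothetical names, not PyTorch's actual code), a tp_dealloc for a C extension type that wraps some C++ state tends to look like this; PyObject_CallFinalizerFromDealloc is CPython's PEP 442 hook, which runs the finalizer at most once and reports whether it resurrected the object:

#include <Python.h>

// Hypothetical extension object that owns a piece of C++ state.
struct MyObject {
  PyObject_HEAD
  int* payload;  // stand-in for whatever C++ data the PyObject owns
};

static void MyObject_dealloc(PyObject* self) {
  // PEP 442: run __del__ / tp_finalize at most once. A negative return
  // means the finalizer resurrected the object, so we bail out entirely.
  if (PyObject_CallFinalizerFromDealloc(self) < 0) {
    return;
  }
  delete reinterpret_cast<MyObject*>(self)->payload;  // free the C++ state
  Py_TYPE(self)->tp_free(self);                       // free the PyObject itself
}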
Okay, so that's kind of cool. What about when you subclass a Python class in, you know, plain Python?
And this is relevant to tensor, because we don't actually let people use the C-bound object, called
TensorBase, directly; we actually subclass it into Tensor. Well, Python subclasses have their own special
deallocation implementation, called subtype_dealloc, and this deallocation method sort of takes care of all of
the random things that Python objects actually support. So there's a good reason why we
subclass TensorBase into a Python subclass, which is that if we didn't do that, many things that people would
expect to work on objects, such as writing to arbitrary fields on the object, using weak references,
or adding finalizers, wouldn't work, right? Because those things are actually handled
by the implementation of the Python subclass, and we would have to manually replicate them
in our C implementation if we wanted them to work without subclassing. So we've got a problem,
right? When I deallocate an object, I call the tp_dealloc for the most specific subclass
that the object is an instance of, and that's going to be the Python subclass in the case of tensor.
And what does it do first? Well, it runs finalizers. And I don’t want to run finalizers because they might
be resurrecting this object. So what's a poor person to do? Well, we need to somehow override tp_dealloc for
all subclasses of TensorBase, to make sure that they first check whether resurrection is going to happen
and bail out entirely, before the deallocation process has a chance to mark the object as having
been finalized. Do we have a way to do that? Fortunately, yes. In Python, you can define a
metaclass. What is a metaclass? A metaclass is a way of customizing the behavior of classes
when they get created. So if you imagine that a class constructor is something that gets called when you
construct an object, a metaclass constructor is something that gets called when you construct a
class. So what do we do? We define a new metaclass for TensorBase.
And so when we subclass Tensor from TensorBase, the metaclass gets run, and what it does is it
overrides tp_dealloc, replacing subtype_dealloc with our own THPVariable_subclass_dealloc.
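A minimal sketch of what that metaclass hook can look like at the C API level (names here are hypothetical, and the real code in torch/csrc does quite a bit more):

// Our custom dealloc that knows about resurrection (see earlier sketch).
static void TensorSubclass_dealloc(PyObject* self);

// tp_init of the metaclass: runs whenever a new subclass of TensorBase is
// created, and swaps CPython's subtype_dealloc for our dealloc.
static int TensorMeta_init(PyObject* cls, PyObject* args, PyObject* kwargs) {
  // Let the normal `type` machinery build the class first.
  if (PyType_Type.tp_init(cls, args, kwargs) < 0) {
    return -1;
  }
  // Every subclass created through this metaclass now deallocates through
  // our function instead of the stock subtype_dealloc.
  reinterpret_cast<PyTypeObject*>(cls)->tp_dealloc = TensorSubclass_dealloc;
  return 0;
}
// TensorMeta_init is installed as tp_init of a metatype whose tp_base is
// PyType_Type, and TensorBase declares that metatype as its metaclass.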
It actually looks very similar to subtype_dealloc, right? It still needs to clear out slots, it still
needs to deallocate the dictionary, it still needs to run finalizers. But before all that, it checks whether
we are going to resurrect the object by looking at the ref count of the C++ object. It's a little
unsatisfactory, because I actually went ahead and looked at CPython and copy-pasted all the code for
subtype_dealloc to make this all work out. But it works out in the end, because actually a lot of
Python binding code, Cython for example, replicates this logic; because remember what I said,
if you just write a very simple C object from Python, you don't get dictionaries, you don't get slots,
you don't get any of that stuff. If you want all of that working, you have to actually write code for it.
And so Cython, for example, does replicate all this logic so that things behave as expected without you having
to subclass from Python. So that's one of the complications that arise from doing PyObject
preservation. What's another complication? So another complication is that weak references are
a little bit of a problem. So I said earlier that we need to be able to intercept whenever a strong
reference to the PyObject is taken out from the C++ object, because we need to fix up the ownership in
that situation: if the C++ object owns the PyObject, I need to flip it back around so the PyObject owns
the C++ object. And ordinarily, it's easy to interpose on this. But there's one case you can't
interpose on, and that's a weak reference. A weak reference lets you take a reference to an object
that will go dead if that object goes dead. But if the object is still alive, I can use it to
manufacture a strong reference to the object, and there's no way to hook into this behavior. So if
someone's got a weak reference, they can get out a reference to the PyObject, even if I'm still in this
flipped state where the C++ object owns the PyObject. This is mostly harmless, unless the C++
object then goes dead while the strong reference from the weakref stays live, and then you're in this awkward
situation where the C++ object gets deallocated, because there's no resurrection for C++ objects.
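For reference, this is the uninterceptable path being described, using the plain CPython weak reference API (a sketch; the tensor's PyObject is just any object here):

#include <Python.h>

// Sketch: a weakref can mint a strong reference without going through any
// of our interposition points on the C++ side.
PyObject* take_weakref(PyObject* tensor_pyobject) {
  // Weak reference with no callback.
  return PyWeakref_NewRef(tensor_pyobject, nullptr);
}

PyObject* promote(PyObject* weakref) {
  // PyWeakref_GetObject returns a *borrowed* reference (Py_None if the
  // referent is already gone); increfing it manufactures a strong
  // reference that bypasses the C++ object entirely.
  PyObject* obj = PyWeakref_GetObject(weakref);
  Py_XINCREF(obj);
  return obj;
}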
Fortunately, there’s a simple workaround for this situation. You just need to like,
ask to fix the reference direction. And so I added a new method to tensor that lets you do that if you’re
using weak references. But actually, none of our tests failed because of this. So I’m suspecting that
no one’s actually going to run into this in practice. One last thing. So Python has this thing called a
garbage collector. And what it does is it makes it so that if you do have cycles among entirely
Python objects, you can actually garbage collect them, so they're not
lost to the ether forever. By the way, this doesn't apply to C++ shared references:
if you have a cycle there, you're just flat out of luck. So GC cycles are kind of interesting in Python
because we also need to handle them correctly under the assumption of resurrection, right? If I have a cycle
in Python, but it turns out that if I were to deallocate some object in it, that object would have been
resurrected, because some C++ object that's still live owns it, then that Python object needs to be treated as a root,
right? I can't actually deallocate the cycle, because that would just leave everything in a broken state.
But the way garbage collection works, if I try to resurrect at the point in time I'm deallocating,
it's too late, because Python might have already started deallocating
all the other stuff in the cycle: Python breaks cycles using tp_clear, that's the
way you break cycles. So what's a poor person to do? Well, all you need to do is make sure that when Python is doing garbage collection,
any object that is resurrectable gets treated as a root. Ordinarily, a GC just has a fixed set of
roots that it knows to traverse from to find where everything is. But Python is special: it needs to
do a pre-pass before the actual traversal pass in GC to determine what all the
roots are. And this makes sense, because you could have arbitrary references to PyObjects from
random places in C++ that Python knows nothing about, so in general Python doesn't know what your
roots are. So it simply defines roots to be any object whose ref count is greater than the sum of the
ref counts coming into it from other Python objects it can see. So if you just make sure that a resurrectable object gets treated
as a root, and that's pretty easy to do, you just don't traverse its members in that situation, then
you're all good. And so not only do we override tp_dealloc, but we also override tp_traverse
in the metaclass, to make sure we check for resurrection before we traverse into the sub-members.
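A rough sketch of the tp_traverse side of this, with a hypothetical is_resurrectable() helper standing in for the real check against the C++ use count:

#include <Python.h>

// Hedged sketch, not the actual PyTorch implementation.
static int TensorSubclass_traverse(PyObject* self, visitproc visit, void* arg) {
  // If deallocating `self` right now would resurrect it (the C++ TensorImpl
  // is still owned elsewhere), report no outgoing references. The GC can
  // then never fully account for a cycle through this object, so it and
  // everything it references are kept alive.
  if (is_resurrectable(self)) {  // hypothetical helper
    return 0;
  }
  // Otherwise behave like CPython's subtype_traverse: visit __dict__,
  // slots, and so on.
  PyObject** dictptr = _PyObject_GetDictPtr(self);
  if (dictptr && *dictptr) {
    Py_VISIT(*dictptr);
  }
  return 0;
}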
Okay, so that's how PyObject preservation works. I'm hoping to release a little sample open source
project that shows you how to do this trick in a very compact way, because I think this
will apply to any project that is binding C++ objects to Python. That's all I wanted to say for
today. Talk to you next time.
EP27 C++-frontend
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today’s topic is a listener request,
namely a discussion about the trade-offs behind the design of the C++ frontend.
So before we start, I have to first explain what I mean by the C++ frontend, because there are a
number of different ways you can interpret this. In one sense, the C++ frontend is the tensor class
that is inside PyTorch and is used to actually undergird the implementation of all our kernels
and all the plumbing that is in PyTorch. So this is the Tensor provided by the ATen library,
originally developed by Zachary DeVito. And it's a really important piece of what we think of as the
C++ frontend, so I'll spend some time talking about the philosophy there. But there's a second part to
the C++ frontend, and this was added after ATen by Peter Goldsborough. And what it is, is basically
everything else beyond, you know, just the tensor class. Because if you think about PyTorch’s
library, we don’t just provide a tensor, we also provide a module abstraction and an optimizer
abstraction that you can use to easily structure your neural networks. And you know, people use tensors
a lot, but they also use modules a lot. And so that matters a lot when you actually want to write real
code. But we’re going to start with talking about tensors, because that’s simpler, and it sets the stage
for some of the design constraints that happened when we were designing the rest of the C++ frontend.
Okay, so let's talk about ATen. So where did ATen come from? ATen came from this idea that, hey,
we were writing all of our internal code in PyTorch in this very terrible language called TH, where we had
various macros for your tensor types, and it was all done in C, and you had to write your code
and then compile it multiple times for every dtype you wanted it to be supported on, and you
had to manually ref count, and it was all terrible. And so the model behind ATen was: okay, let's use
C++ instead of C, and use the abstractions that C++ gives you to actually make a nice API for doing
manipulations on tensors. But it went a bit further. There were a number of other tensor libraries in
C++ at the time, Eigen being one of the most influential ones, and we didn't want to do that.
The idea that Zach had was that we want to have a tensor type in C++ that is just Tensor: it doesn't record
any dtype information, it doesn't encode any dimension information. And the really important
thing about doing it this way is that now you can write code that is polymorphic over various dtypes and various
dimension sizes without having to template your code. Because, well, when you're writing
C++, if you have a type and it's got some parameter on it, like you're doing a vector and it's got
some, you know, type of the elements in it, if you want to write a function that is generic on the types,
you have to write a template function, because C++ is going to instantiate it for every copy of the
element type you use. And it gets worse and worse, because the templates don’t actually get type
checked, you have to wait until they actually get instantiated with the type in question before they
get type checked. So it’s just much harder to write code in C++, if you are using templates, that is until
C++ concepts come around. But you know, we were C++ 11 at the time. So oh, so much trouble.
Like, and one of the things that makes it really hard for newcomers to C++ to write C++ is the really
horrible, obscure template error messages. So if we just don't put that information in Tensor, if we type
erase Tensor, then people don't have to worry about that. So that was the first main innovation of
ATen: don't do templates, just type erase everything, and it's okay, things will work out in
the end. Another really important philosophy that went into the design of Tensor is that we really wanted it to
look as much like Python as possible. So if you wrote some code in Python, like
tensor.add(b).mul(c), right, that's something you could write in Python, no problem, and we wanted that to be
exactly the same in C++. So people who came in not knowing very much C++, but needing to write their
code in C++, because remember, this was at the time we were trying to start moving all of our Python code
into C++. So we were in desperate need of C++ programmers. But everyone knows how hard it is to
actually find grizzled C++ veterans that know everything about the ownership model in C++.
There’s just like not that many of them. So the closer to Python, we could make the code, the easier
and more accessible it would be for people to start writing kernels in C++. And so one implication of
this is that Tensor, like at::Tensor as seen in PyTorch, is not the traditional notion of a C++ type, which is a
value type, where if I were to do a copy construction on it, an actual copy of the data would
happen. No, it's a reference type. So we actually organize most of the main user-visible types in
PyTorch into two types: a Tensor type, which is the reference type, so if you copy it, you're just
copying the pointer, and then TensorImpl, the impl type, which actually contains all the metadata in
question. And so you'll see this separation in Storage and StorageImpl, and also in modules,
Module and ModuleImpl. So you get reference semantics, equality works the way you expect it to in Python,
and people are pretty happy. One last thing about the C++ API, which is that we want our calls to look a lot
like Python. And for the most part, function calls are the same. But one thing that Python has that C++
doesn’t is keyword argument support. So we needed some way to actually simulate keyword arguments. And
I’m getting my timeline a little bit mixed up here, because we added keyword argument support to the
C++ API after we actually did the initial version of ATen. In particular, the reason why ATen didn't have
keyword argument support was that it wasn't obvious how to do it. And the most important structure
that gets used everywhere in PyTorch, TensorOptions, is designed explicitly to let you do this sort of
keyword-argument-style argument passing, like in Python. How does it work? It's just a struct. It can be default
constructed to have nothing in it, and then you can set various attributes on it via setter methods. So
something like TensorOptions().dtype(...).device(...) will set things up so that you actually get a
TensorOptions with that dtype and device set, but maybe not the other keyword arguments. And we
actually designed TensorOptions to be a value class, so you don't have to worry about someone
mutating it under you: it always functionally returns you a new TensorOptions. It's only two words
large, so it's not a big deal to keep creating new copies of this TensorOptions.
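For instance, this is roughly what that keyword-argument style looks like with the public libtorch API (a small, self-contained sketch):

#include <torch/torch.h>
#include <iostream>

int main() {
  // Build up "keyword arguments" by chaining setters on a TensorOptions
  // value; each call functionally returns a new TensorOptions.
  auto opts = torch::TensorOptions()
                  .dtype(torch::kFloat32)
                  .device(torch::kCPU)
                  .requires_grad(true);

  // Pass the options to a factory function, much like
  // torch.zeros(2, 3, dtype=torch.float32, requires_grad=True) in Python.
  torch::Tensor t = torch::zeros({2, 3}, opts);
  std::cout << t << std::endl;
  return 0;
}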
Okay, so I've established the basic ground rules that the ATen library wanted, right,
which are: no templates, don't do templates, which means we need a type-erased Tensor; and make the
Tensor API look as much like Python as possible. We actually even wrote a manifesto about this,
about writing Python in C++. So with these two constraints in mind, let's fast forward a
little bit in time to when Peter Goldsborough was working on the C++ frontend proper, namely module
support. So at the time, there was a project going on at Facebook research, the StarCraft project, they were
doing reinforcement learning for StarCraft. And they have a problem, which is that, you know, what they
needed to do was they they needed they had a simulator for StarCraft, an actual game instance of StarCraft,
actually, and they needed to feed it information from the reinforcement learning model that they
were training at the time. And they needed this to go as fast as possible, because you know, like the
faster you can be the simulator, the faster you can actually do training. And so CPU overhead really
mattered here. And parallelism and multi threading really mattered here because they were running lots
of simulators. And this was just completely impossible to do in an efficient way in Python.
And so they actually started writing a little layer on top of the ATen library, which, recall, only
had tensors and that's it, all it is is a tensor library, called autogradpp, to make it
possible to do automatic differentiation on these things, and to actually structure modules.
And so at the time, Peter Goldsborough was like, you know, hey, a C++ frontend is a really good idea, and there
are a lot of people who might be interested beyond the StarCraft project. And we took the
learnings from their version of the C++ frontend and built it into the C++ frontend that you can
actually use today as part of PyTorch proper. So we ran into a few questions when we were trying to figure
out how exactly modules should work in C++. Like there are a number of problems. For one, we already
have modules in Python. If we want modules in C++, does that mean the Python module should call into the
C++ modules? Well, maybe that’s not such a good idea, because a lot of people take modules in PyTorch,
they copy paste them into the research code, and they hack on it. This hackability is really good when
you’re writing Python. And if we actually moved all the implementations into C++, then you know, well,
people can’t just copy paste things, right? They actually have to compile some C++, or like look up an old
version of PyTorch where there was still the Python implementation. So we decided we didn’t want to get rid of the
existing Python modules because hackability was really important there. Another question was,
could we write a transpiler to take these Python modules and transpile them into equivalent C++
modules? And that just seemed like too much complexity for things to be worth. So we decided, okay, we’re
just going to reimplement all the modules that are in Python in C++ for better or for worse, because now
you’ve got two versions of the code, you got to update both of them in this situation. We have another
problem when you’re trying to implement modules in C++, which is that, you know, Python has all of
this meta programming stuff. If you recall my previous podcast on torch.nn, I was like, hey, you know, what
does module do? Well, it tracks parameters. And really, the like, most important thing it does is track
parameters, so that you can collect them all up and pass them to your optimizer. But the way Python does
that is by overriding the meaning of setting attributes on the module, so that it can then, on the side,
record in some field what all the parameters of a module are.
Well, how the heck are you going to do that in C++? The answer is: you can't. So you need to adjust the
API a bit. The way the C++ frontend works is that it asks you to call register_parameter when you register a
parameter, and that just sets up the extra metadata tracking necessary to tell what the parameters in
question are.
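Concretely, a C++ frontend module ends up looking something like this (a small sketch using the public torch::nn API):

#include <torch/torch.h>

// Parameters and submodules are tracked only because we register them
// explicitly; C++ can't intercept attribute assignment the way Python's
// nn.Module.__setattr__ does.
struct NetImpl : torch::nn::Module {
  NetImpl() {
    // register_module / register_parameter record these so that
    // parameters(), to(device), serialization, etc. can find them.
    fc = register_module("fc", torch::nn::Linear(10, 5));
    bias = register_parameter("bias", torch::zeros(5));
  }

  torch::Tensor forward(torch::Tensor x) {
    return fc->forward(x) + bias;
  }

  torch::nn::Linear fc{nullptr};
  torch::Tensor bias;
};
// Generates the `Net` holder type, giving Python-style reference semantics
// (the Module / ModuleImpl split).
TORCH_MODULE(Net);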
Another problem, which is similar to the kwargs problem from the tensor case, is that modules also often have a lot of arguments that you want to express like keyword arguments.
And unlike factory functions, which tensor options is sort of oriented towards, which have a fixed set of
keyword arguments to occur everywhere, every module is a little different. So there’s a bit of work in
the C++ API to make it easy to define, you know, options objects that you can, you know, use setters
to set in what the options should be, and then eventually pass it to the module in question to
make things work out. And one last thing: modules, we argued a lot about whether or not they
should have reference or value semantics. In the end, Python and C++ modules should look the same,
so all modules are also split into the Module / ModuleImpl form, and that's why there's a macro
(TORCH_MODULE) that you need to invoke to actually bring the module wrapper into existence.
So what's the upshot? Well, we started off writing the C++ frontend for
tensor. And we had some design principles, namely writing Python in C++. And we extended it to modules
in C++, perhaps a little imperfectly, because modules are a lot more complicated. But we were still trying
to consistently apply this idea to the entirety of the C++ front end. And I would say that’s sort of like
the the main idea, right, like you’re not going to get exactly the best performance that you could have
gotten by writing really idiomatic C++, you’re going to get something pretty good, and certainly much
better than if you were writing Python and had to worry about the GIL. And that's good
enough for a lot of researchers. That being said, there are some performance challenges to writing code in
this way. And actually, Scott Wolchok, an engineer over in core infra who has been on loan to us on the
project, has been working on reducing overhead in our framework. And some of the stuff that raises a lot
of overhead is related to writing Python like C++. So let’s just check out a few of these. So one problem
that we have is that ref counting is really slow. Why is ref counting really slow? Well, Python ref
counting is actually really fast. But there’s a trick behind it, which is that because there’s a
global interpreter lock, Python ref counts are non atomic, because you can just assume that they’re
going to be protected by this lock. In C++, ref counts are typically atomic, because you want your ref
counted objects to work across multiple threads. So you know, you actually implement the ref counts as atomic
things. And incrementing and decrementing atomic fields, that is expensive, because you have to
tell the processor to actually send the cache line back to the main memory in question. Oh my god. So
like, that’s, that’s a huge hit. So excess ref counts are a problem. And one of the difficulties about
writing a code in the Python style, where you only have the tensor concept, which is a pass by reference
type of shared ownership type, is that, well, a lot of the times people are just going to start,
you know, doing ref count bumps willy nilly, because that’s kind of what you did in Python,
where it was cheap. Well, it’s not so cheap in C++. And we’ve actually developed a really interesting
way around this problem. So conventionally, the way you would have solved this problem in C++ is that
you would have, you know, made a strong distinction between the actual thing that contains the
data and a shared pointer to that data. And then you would force everyone to use the right kind of
pointer, whether it's a raw pointer, or a shared pointer, or a unique pointer, or some arena-allocated
pointer, you'd force everyone to do all this juggling around. Our problem is: we've got this
Tensor type, and everyone is expecting to be able to take a const Tensor&, so we have to have an
actual Tensor at the end of the day. Can we reduce the number of ref count bumps going on in this case? And the
answer is yes, because we actually implemented ref counting ourselves using an intrusive_ptr
class. And what we can do is build wrappers on top of Tensor, for example MaybeOwned<Tensor>,
which dispense with the ref counting; since the ref counting ends up being an incref or
decref call, you just skip it when you're in this container type, depending on what's
going on. So for example, if I have a MaybeOwned<Tensor> which is actually just a reference to some
tensor, i.e. it's non-owning, then the destructor of MaybeOwned<Tensor> just leaks the tensor when it gets
destructed: it doesn't trigger the normal destructor of Tensor, which would decref, so the decref is skipped
entirely. And you can actually build a bunch of other things; there's actually a PR out for
ExclusivelyOwned<Tensor> as well, which is kind of like unique_ptr. But unlike unique_ptr,
it's piggybacking off of a shared pointer. So when you know you only have that one pointer,
you don't have to actually incref and decref it, but you can later promote it into a regular shared
reference, very much like unique_ptr in this case. But at the end of the day, it's still
a Tensor, and so you can still forget about all of these pointer distinctions and pass
around const references to Tensor without having to rewrite all your code.
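As a rough illustration of the borrowing idea (a sketch against the c10::MaybeOwned interface; exact spellings and headers may differ across PyTorch versions):

#include <torch/torch.h>
#include <c10/util/MaybeOwned.h>

// The callee still sees a plain Tensor through operator*, but a borrowed
// MaybeOwned never touches the atomic refcount.
double sum_of(const c10::MaybeOwned<at::Tensor>& t) {
  return (*t).sum().item<double>();  // just a reference, no refcount bump
}

int main() {
  at::Tensor t = torch::ones({4});
  // Borrowed: skips the incref/decref pair entirely; the caller guarantees
  // `t` outlives the borrow.
  auto borrowed = c10::MaybeOwned<at::Tensor>::borrowed(t);
  // Owned: holds its own reference, for when no such guarantee exists.
  auto owned = c10::MaybeOwned<at::Tensor>::owned(torch::ones({4}));
  return static_cast<int>(sum_of(borrowed) + sum_of(owned));
}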
So yeah, I would say if we were going to do this project again, we would probably think about not writing all our code in C++,
and perhaps writing it in some language, and then writing a compiler stack to compile down to the actual
machine code we want, and, you know, figuring out how to make it run really fast. Because
compilation time is a huge problem, you don't actually want to be spending a lot of
time compiling. But that’s a huge infrastructure outlay. And I don’t think there’s any way we could
have gotten to the point we are today, not using this concept of writing C++ in Python. So I still
think it was a really good call. It saved us a lot of template headaches. It really made it possible
for a lot of people to write code in our framework in C++. But you know, like, there’s always something
better you can do. That's everything I wanted to say today. Talk to you next time.
EP28 torchdeploy
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about Torch
Deploy, a way of deploying Python code directly in production environments where you can't
afford to be waiting on the GIL. So what is Torch Deploy? Torch Deploy is our answer to a question
that we were asking, which is: hey, it turns out that in a lot of cases, we don't
really care that the Python interpreter is slow. Yes, the Python interpreter is slow,
but maybe it's a very experimental model, or it doesn't matter that much, and we just want to
be able to run it in a multi-threaded environment. That is to say, the only sin CPython committed with
this particular Python program is that there is a global interpreter lock, which means you can't
run it in a multi-threaded fashion. Besides that, Python is fast enough. And this is often true in
a number of cases. And I’ll link to an analysis which was done on a number of models showing that,
hey, you know, Python doesn’t actually really matter as far as performance goes. So if you want
to run a bunch of models in the same process, and being in the same process is pretty important because
it just simplifies management of memory. And you know, you can make sure things get shared in an easy
way, you don’t have to go to shared memory across processes. So single process, but you want multiple
Python threads running in parallel inside this process. How can you do it? Well, torch deploy is
the answer to this question. The use case of Torch Deploy is pretty niche, and we haven't really tested
it that hard in production cases, but it is being tested in CI in PyTorch. And so if you're dealing with
code that sits on the boundary between C++ and Python, namely C++ code that ordinarily
doesn't call into Python but that you want to call into Python, for example dispatch to
Python, a project that I've been working on recently, then you're probably going to run afoul of Torch Deploy,
as Torch Deploy is going to make you think about how to structure your code correctly.
Fortunately, it’s not too hard. So I’m going to tell you a little bit about how torch deploy is
implemented. And then some of the consequences for when you’re designing stuff in PyTorch that might
interact with torch deploy. Okay, so what is torch deploy? So torch deploy is a way to run multiple
Python interpreters in your process, without them sharing any state, so that you can run each of them with
a separate GIL. And technically, Python 3.9 sub-interpreters are also an attempt at doing this sort
of thing. But sub-interpreters try to work with a single copy of Python in your address
space, and it's sort of not complete: they haven't actually gotten to the point where each of the
sub-interpreters has its own state, so you still need a single shared GIL to protect
everything. So Torch Deploy takes a really heavy hammer to the problem. It says: okay, well,
it's too hard to refactor CPython so that the interpreter-specific state is separated out and I can
create as many copies of it as I want. So I'm just going to take the whole honking Python interpreter in its
entirety, and stamp out multiple copies of it in my process. Ordinarily, you can’t do this because, you
know, Python is going to be some shared library. And if you load a shared library multiple times, well, the normal
thing to happen is you only load it once, right? The whole point of a shared library is a shared library,
you only load it once it like shows up in one place. And it provides symbols for all the things that,
you know, it defines as being exported. So what do you do with Torch Deploy? What we do is we build a
special version of Python that's got all of its stuff bundled up, so all the modules and all the Python
code that you need to actually run Python. But most importantly, it's built hiding all of the symbols,
so you don't actually export any symbols directly from it. There's just going to be a single fixed
entry point that we're going to access with dlsym when we dlopen this library. So we have this
blob of code representing a Python shared library that doesn't export any symbols. And what we can do now
is, whenever we need a new copy of the Python interpreter, write it out to a new dynamic library
file. Because, remember, if it's the same dynamic library, then the dynamic linker, the system
dynamic linker, is going to deduplicate all of them. So write it to a fresh library name, and then dlopen
it without resolving any of the symbols, and then manually use dlsym to pull out the one or two symbols
that you actually care about for doing access into the interpreter. And so all of this is
mediated by an interpreter class that sort of represents the small set of things you can do
to actually run code in your specific Python interpreter. And the most important thing that it lets you do
is it lets you take IValues, PyTorch's internal representation of boxed values that can take
any sort of shape or size, and feed them into the Python interpreter, so that they turn into PyObjects
inside. So what does this picture look like? So when you load up torch deploy, and you have multiple Python
interpreters going, each of them has a corresponding dynamic library that is their own copy of the Python. And
because it's their own copy of Python, nothing is shared at all, and so they can all have separate GILs. And it's not just
the CPython library that's in there: you also need PyTorch's Python binding code, because the binding code
links directly against CPython's API, and since we're hiding all the symbols, those bindings can't live in our
main library itself. So those also get compiled into this binary, and we end up with multiple copies of most of
the code in torch/csrc when you're using Torch Deploy. So this is an important segue into some of the
limitations and consequences of Torch Deploy being set up this way, when you're trying to write code in PyTorch.
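The loading trick described above boils down to something like the following sketch (made-up paths and symbol names, not the actual torch::deploy implementation):

#include <dlfcn.h>
#include <cstdio>
#include <string>

// Hypothetical entry point exported by the bundled, symbol-hidden Python
// blob: initializes one private interpreter and returns an opaque handle.
using new_interpreter_fn = void* (*)();

void* load_fresh_interpreter(int idx) {
  // The bundled interpreter image is first copied to a unique file name
  // (not shown): dlopen-ing the same path twice would just hand back the
  // already-loaded copy instead of a fresh one.
  std::string path = "/tmp/deploy_interpreter_" + std::to_string(idx) + ".so";

  // RTLD_LOCAL keeps this copy's symbols from clashing with the others.
  void* handle = dlopen(path.c_str(), RTLD_LOCAL | RTLD_LAZY);
  if (handle == nullptr) {
    std::fprintf(stderr, "dlopen failed: %s\n", dlerror());
    return nullptr;
  }
  // Pull out the single fixed entry point by hand; nothing else is visible.
  auto make = reinterpret_cast<new_interpreter_fn>(
      dlsym(handle, "deploy_new_interpreter"));  // hypothetical symbol name
  return make ? make() : nullptr;
}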
So one really important thing is that because we're loading multiple copies of the Python part of the
PyTorch library when we have multiple Torch Deploy interpreters, it's important that these don't
access any shared state, at least not shared state that can't deal with multiple copies of the library hanging
around. This matters because we don't actually want multiple copies of ATen, the tensor library, or any
of the pure C++ code; that C++ code we want to have shared across all the interpreters. And in
particular, for example, if you have a code inside the Python library, that for example, registers an operator
to the dispatcher, that’s a no go under torch deploy, because remember, you’re gonna have multiple copies of
the torch deploy library, each of those libraries, when you load them, are going to run their static
initializers, and each of them are going to attempt to register whatever operator it is you are trying to
define inside them. And the dispatcher doesn’t like that, right? It only wants an operator to be
registered exactly once. There’s also another problem that shows up when you’re in a situation like this,
which is, let’s say you’re in some C++ code, it doesn’t really have anything to do with Python.
And you need to somehow get to Python. Like for example, you’ve got a C++ struct that was defined
inside PyTorch proper, but it has a possibility to contain a reference to a Python object that might
be associated with one of these PyTorch, these interpreters. And say you need to deallocate that
Py object when this happens. Well, if there isn’t a dynamic dispatch to the correct interpreter, you aren’t even going
to know which interpreter you should actually do the PyDecref on, right? Because each interpreter has its own
state, each interpreter might even have its own representation of the Py object in question. So you
need to make sure you can figure out which one you can actually get. And so in a previous podcast, I talked
about Py object preservation. And I mentioned how there was this thing that we needed to do, which is that when we
flip the ownership so that tensors own PyObjects, we needed to be able to deallocate the PyObject
when the C++ tensor died. And to figure out which interpreter the PyObject is associated
with, we have to make an assumption. And the assumption we're going to make is that for any
tensor in PyTorch, there is going to be exactly one Torch Deploy interpreter that actually has a PyObject
representation for it. This didn't always used to be the case: in a previous implementation, we actually had it so
that every Python interpreter could have its own PyObject, so it was a one-to-many relationship. And that was just kind of a
disaster, because you have to go and deallocate each of the PyObjects corresponding to the C++ tensor, if they
happen to be owned, and you have to take out the GIL for each of them in turn, and there are just lots and lots of
opportunities for deadlock in that situation. But if you can assume that any given tensor only belongs to a single
Python interpreter, well, one, you can store the PyObject on the tensor itself, because it's guaranteed to be unique; and two,
because there's one interpreter, you can also have the tensor remember which interpreter it corresponds to, and then you can always use that to do virtual calls to do things that require the Python API in that situation.
So I've been using this for multiple different things. When we did PyObject preservation, we used the PyInterpreter object, which we store on tensors, and which points us to the correct interpreter for Torch Deploy;
we use it to decref the PyObject when the C++ tensor goes dead. But in a more recent piece of work, dispatch to Python, we're using the PyInterpreter to figure out how to call into the Python interpreter, so that we can take a call to a C++ operator and turn it into a call back into the Python interpreter. So what's the idea? The idea is that we have this dispatcher hierarchy:
it's got all the C++ code, and maybe at the very bottom, you want to override the behavior of an operator and call back into Python. Well, how do you know which Python interpreter to call with Torch Deploy? Good thing the tensors know which interpreter they correspond to. So you just look for a tensor argument that's got a PyInterpreter, and then use that to do the virtual call into the correct interpreter.
So there’s a pretty important corollary to this, which is that once you associate a tensor with an interpreter, it is always associated with that interpreter, even if the, you know, interpreter goes away, like, because we decided to unload it, that tensor is permanently associated with that interpreter.
And that makes it easy to make the recording of the interpreter thread-safe, because there's a hazard: if you have multiple threads, and they're all trying to allocate a PyObject for a tensor at the same time, there's no intrinsic synchronization on this.
But the fact that only one of them can win, and that once it wins it's permanent, means that you can just do a plain old compare-and-swap and force the other threads to fail if they lose the race.
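That race can be resolved with a plain atomic compare-and-swap, along these lines (a toy sketch, not the real TensorImpl code; the field and helper names are made up):

#include <atomic>

struct PyObjectSlot {
  // Which interpreter has claimed this tensor (null until someone wins).
  std::atomic<void*> pyobj_interpreter{nullptr};

  // Returns true if we won the race. Once set, the value never changes, so
  // losers simply go look up the winner's PyObject instead.
  bool try_claim(void* my_interpreter) {
    void* expected = nullptr;
    return pyobj_interpreter.compare_exchange_strong(expected, my_interpreter);
  }
};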
One last complication when doing these sorts of virtual dispatch tricks: unlike traditional C++ code, where you load up all your libraries, stuff happens, and then shutdown happens at the very end,
where it doesn't really matter if you clean up after yourself because the process is going to die very soon, Torch Deploy interpreters can be spun up and spun down.
And when they're spun down, you will unload the dynamic library that's associated with them.
And that's important, because if you have any spare references to functions from that dynamic library, all those function pointers become invalid once the library gets unloaded.
And so we don't actually use virtual methods to implement the PyInterpreter object; we use a homegrown vtable-like implementation with an extra feature that lets you disarm the function pointers when the library unload happens.
So normally you've got a bunch of function pointers, and they all look great; and when you unload the library in question, we replace all of the function pointers with no-op function pointers that live in the base library, so that if anyone tries to interact with the Python interpreter after it's died, we don't just segfault: we can do a no-op in some cases when it's benign, or raise a good error in that situation.
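A toy version of that disarmable vtable idea (all names hypothetical; the real PyInterpreter machinery in c10 is more elaborate):

#include <stdexcept>

// A hand-rolled "vtable": plain function pointers instead of C++ virtual
// methods, so they can be swapped out when the owning library is unloaded.
struct PyInterpreterVTable {
  void (*decref)(void* pyobj);           // release a PyObject
  void (*call_python)(const char* name); // call back into Python
};

// Stubs that live in the base library and therefore survive the
// interpreter library being unloaded.
static void disarmed_decref(void*) {
  // Benign: the interpreter is gone, so just leak the dead PyObject.
}
static void disarmed_call_python(const char*) {
  throw std::runtime_error(
      "tried to call into a Python interpreter that has been unloaded");
}

struct PyInterpreterHandle {
  PyInterpreterVTable vtable;

  // Called right before the library is unloaded: from now on stale callers
  // hit the stubs instead of jumping into unmapped code.
  void disarm() {
    vtable.decref = &disarmed_decref;
    vtable.call_python = &disarmed_call_python;
  }
};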
So there's a lot of tricky stuff going on here, but Torch Deploy is a pretty cool tool in our toolkit for letting multi-threaded Python processing happen in a single process.
That's everything I wanted to say today. Talk to you next time.
EP29 CMake
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about CMake,
or perhaps "want to talk about" is too complimentary a way to put it. Really, what you're going to listen
to in today's podcast is me ranting about how CMake is terrible. And oh my god, it is terrible. And
there’s two parts to today’s podcast. So the first part is going to be short. And it’s basically
like, dude, I know you have this preference about where you want the file to be, but
seriously, just follow this small set of rules when you're adding new files to PyTorch,
and you won't run afoul of the CMake gods, and everyone will be happy, and you will not have
to deal with CMake. And the second part of the talk, which is going to be much longer,
because there’s just so much random shit that is wrong about CMake is like, okay, actually,
you got to do something about the CMake. Like you got to make a change, you got to add a
new library, blah, blah, blah, blah. How do you actually go about understanding the monstrosity
that is PyTorch’s CMake configuration? So in order, first off, what is CMake? So CMake is a build
configuration system. So what it does is you write CMake list files that describe, you know,
aspects of your build system, most importantly, you know, what your source files are, and what
libraries you want to build are, and then CMake will generate some sort of actual, you know, build
file, usually either make or ninja. When you build PyTorch, by default, we generate ninja files, because
ninja is way better than make at running builds. Anyway, so it generates an actual file that it hands
off to some other system that just knows about how to build things quickly. And the reason for this
two stepness is that CMake deals with all the grody information about, oh, you know, what are packages,
what are flags, like, how do you detect things? How do you, you know, generate files for both ninja,
but also Microsoft Visual Studio, depending on what platform you’re on, etc, etc, etc. And then a system
like ninja can be really simple, really short. And it’s just, you know, I’ve got this graphic
dependencies, how do I build it as quickly as possible in the correct order. Okay, so if you’ve
ever written any serious open source software, you may know of the thing where build systems are just
generally a complete disaster. And PyTorch’s build system is no exception. I like to kind of think of
it, this is because, well, we're all here to write software, right? Like, we're all here to write
an awesome deep learning library, and so every minute spent doing CMake is a minute not spent
on that. And that ends up in a sort of tragedy of the commons
situation, where the CMake is terrible, and it legitimately is the case that people would be
more productive if the CMake was better structured, but no one knows CMake, no one has the time to deal
with it. People are just cargo culting changes whenever they need to do something. And so things
just get worse and worse without anything, anyone working on it. So if you are being beaten down by
the establishment, and you don’t have time to raise a revolution, there are some easy things you can do
to reduce the risk of, you know, running into a CMake disaster. And I really only have one rule here.
And the rule here is abide by the existing PyTorch structure, and don’t try to do anything fancy.
And when I say do anything fancy, I mean, like add a new directory to put your files in.
Why do I give this advice? Well, I give this advice because the way CMake is set up, right,
is we have to do a lot of, you know, telling the build system what files to actually compile in.
And so sometimes the list of files that you want to compile in is written out by hand, like one by
one by one by one by one. So in some file in some directories, if you add a new file, you will need
to add that file to some list somewhere that says, Oh, here’s a list of all the files. And in some other
places, it’s done using a glob. And so if you just add the file to the directory, the glob will pick it
up. And in a very, very restricted set of cases, do we do a recursive glob that looks into all
subdirectories. So if you don't want to have to edit the build system, then not adding any
new files is a surefire way to make sure you don't have to edit the build system.
Short of that, if you don't add a new directory and just add a new file, hopefully a glob will
pick it up. But if it doesn't pick it up, just find one of the other files in the directory,
grep for it, and that'll tell you whatever file you need to edit. And, you know, cargo culting it in
that way usually isn’t too painful. Ah, but if you want to add a new directory, well, you’re actually
going to have to understand a little bit about how CMake works if a recursive glob isn’t picking it up.
So just, you know, if you don’t have time to deal with the build system, just don’t friggin add a new
directory to PyTorch. Yes, I’m sorry, like PyTorch’s structure sometimes doesn’t look so good.
Sometimes you really want to add a new folder because you think that it’s going to make things
organized better. And so if you really, really think this is important, then listen to the second
half of this podcast, or try to explain the method behind the madness of CMake lists. But if you don’t
have time, just don’t do it, please. Oh, and one more extra tip. So when you add a new file,
CMake has to actually pick it up. And when a glob is being used, CMake won’t automatically pick up
the change because it doesn’t repeatedly rescan the directory every time you build it, that would
be expensive. So you have to manually re-trigger CMake. And when you’re using setup.py to build
PyTorch, you can just pass the --cmake flag to cause it to pick up the change. This is one of the
reasons of the debate between whether or not a glob is better or a list of explicit files is better.
If you do a list of explicit files, you have to actually edit the CMake list to add the new file,
that’ll trigger a CMake rebuild automatically. But if it’s a glob, you have to like, you know,
pass --cmake. Just a little thing to be aware of if you're ever adding new files to PyTorch.
Okay, so if I’ve duly scared you off of, you know, doing CMake modifications, that’s great,
you can stop listening to this podcast. So now I want to talk a little bit about like,
what the heck is going on with the build system in PyTorch. And so there’s a few things that,
like, are historically important to understand about why the build system is so freaking complicated.
So one is that, and this is an ongoing constraint, and you will have a very hard time getting rid of
this constraint, is that PyTorch needs to be built under multiple build systems. So it’s not just the
open source CMake build: there is also a Buck-based build system that is run
inside Facebook for building PyTorch for server-side use; there is also a second Buck build system,
also for Facebook, for when you're running PyTorch on mobile and other sorts of exotic devices;
there is a third Facebook build system, which is run when you are building PyTorch for running it on
Oculus; and there is also a Bazel build system, which someone requested from us, and so we maintain
it for them, because, you know, Buck and Bazel are basically the same thing, but
if you're using Bazel, you need an actual Bazel build system. So there are a lot of build systems,
and each of them is sort of re-implemented, though we do have some machinery
to deduplicate configuration between them. In particular, there's this concept called
append_filelist, which lets us read file lists out of the Bazel build files into CMake. And we use this in a
few cases, not everywhere, but in a few cases. But in general, like when you are doing a build system
change, it is not just a CMake change, you also need to change all of the other build systems. And that can
be quite a tall order, especially if you don’t work at Facebook, and you have no way of running any of
the Facebook internal build systems. So find your favorite Facebook employee and make them actually
do the build system change for you. The second important thing to know about our build system
is that it is the unholy mashing together of the Caffe2 build system and the PyTorch build system.
Remember when I said that we merged PyTorch and Caffe2? Well, this is one of the things we merged,
right? We took the two build systems and smashed them together, and we didn't really get very far in
refactoring everything. So, for example, you might be wondering where the torch library is defined; it would be
very reasonable for torch to be defined in torch/CMakeLists.txt. Well, it is not defined in torch/CMakeLists.txt.
It is defined in caffe2/CMakeLists.txt. Why is that the case? Well, because it used to not be called
libtorch, it used to be called libcaffe2, and eventually we renamed it to libtorch, but no one
bothered moving the CMake definition from caffe2/CMakeLists.txt to torch/CMakeLists.txt. I really hope these
parts of the podcast eventually become obsolete. But I’m not holding my breath because as I said, no one
really likes working on CMake. And the last reason I would say that our build system is very complicated
is a sort of intrinsic problem with CMake itself. So the CMake model historically is set a bunch of
global variables in a crappy imperative programming language and then stuff happens, right? Like literally
it’s like, you know, set this, set this, set this, blah, blah, blah, blah. Modern CMake involves,
oh, define a dependency graph, which, you know, says the library structure that you want to build.
But, but really like you’re still setting tons of variables along the way to like figure out how
you’re going to set things up. So what makes CMake hard to understand is that like there’s this program
and it’s setting a ton of variables. The order in which these variables are set matters. You’re sort of
stepping in and out of various subdirectories for different CMake lists. And so if you want to
understand what any given definition is, you have to understand the trace of all the possible CMake
files that were included before that might have set that variable in question. So that means that there’s
a lot of non-local action going on. Like I said, the torch library definition is in caffe2/CMakeLists.txt.
Where is the list of files that the torch library includes defined? Well, not in caffe2/CMakeLists.txt;
that one's actually in the much more reasonably placed aten/src/ATen/CMakeLists.txt. So you have to
be willing to follow the breadcrumbs to find where things are defined. Fortunately for you, because CMake is a
really crappy imperative programming language. It also is very dumb. So for example, if you are looking for a
variable that is being set somewhere, you can just grep for that variable, and you will find it, you don’t
have to worry about like, oh, some sort of meta programming thing going on, generating these variables
on the fly, just search for the variable, and you will find where it is defined, I guarantee you. So
modifying CMake or like sort of understanding how the CMake works, usually just involves like, you know,
doing a lot of grepping around to find all the places where a particular variable is set.
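If you want something a bit more targeted than raw grep, here’s a throwaway Python sketch of that tactic (the helper name and the TORCH_SRCS variable below are just made-up examples, not anything in the repo):

import pathlib
import re

def find_cmake_sets(root, var):
    # Look for `set(<VAR> ...)` in every CMakeLists.txt / *.cmake file under root.
    pattern = re.compile(rf"set\s*\(\s*{re.escape(var)}\b", re.IGNORECASE)
    files = list(pathlib.Path(root).rglob("CMakeLists.txt"))
    files += list(pathlib.Path(root).rglob("*.cmake"))
    for path in files:
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if pattern.search(line):
                print(f"{path}:{lineno}: {line.strip()}")

# e.g. find_cmake_sets("pytorch", "TORCH_SRCS")  # the variable name here is hypothetical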
One last note: not everything is in CMakeLists.txt. We also have some .cmake files that contain various
configuration in the aptly named cmake/ folder. And there’s a lot of actually very important stuff
going on there. Like, you know, the stuff that’s responsible for calling our code generation scripts,
see previous podcast. So, you know, be sure to check those out as well. But I don’t really recommend
trying to sit down and read through all of our CMake end to end, although you’re certainly welcome to try.
And if you successfully do it, you’ll have a very good idea of how everything is set up.
But it’s usually better to just use this tactical idea of, you know, like, looking, like, find the definition
that matters. In CMake’s case, there’s actually really only one definition that matters, right? add_library.
add_library in CMake says define a library that is the thing that, you know, I want to build. So
every, like, you know, dynamic library that you see dumped in torch/lib whenever you build PyTorch,
there’s going to be an add_library declaration for that. So you can start there and then start looking
at what things refer to this library, like what properties am I setting on it? What files am I saying
are it? And then start tracing back the variables. And so you don’t have to worry about the ridiculous
folder structure that’s going on. Okay, so I’ve talked a lot about how our build system is terrible.
Let’s say that you are very enthusiastic, and you think you can help fix the build system. How might
you go about doing that? So there are a few avenues that I personally would go about looking at if I were
tasked with this task. So first, I would try to unwind the directory structure, actually try to put the
definitions of libraries in the places that make sense for them. And what you will find challenging
about this is that we actually don’t have that many libraries in open source. So for example, we have
this ATen directory, and you’d expect there to be a library named ATen in open source, but we actually
don’t have an ATen library at all. Why is that the case? Well, we used to, but it turned out that there’s no
reason to have a separate libATen dynamic library alongside the libtorch dynamic library. This is
something that’s useful inside our Facebook build systems, but inside CMake for the open source binary
distribution, it’s not useful. So we actually just merge them together. So there’s a single libtorch.so
that contains all the ATen code as well as the Torch code. So you’ve got this problem, which is that the
physical directory structure doesn’t line up with the dynamic library structure. And that might not be a
big deal if you can, you know, define ATen to be, say, a static library, and then you bundle up a bunch of
static libraries into a dynamic library. But in old versions of CMake, this was kind of buggy. And so
you’ll need to figure out like, um, what the, you know, uh, earliest version of CMake you can use to
actually do this properly is. Second is that there is this concept of modern CMake, right? Modern CMake
says, don’t set the CMAKE_CXX_FLAGS global variable, which twiddles all of the CXX flags for every target
defined in the CMake list, because that’s a global property. And you have no way of controlling the
visibility on a per-target basis. Instead, look for the target_ functions, which actually define,
uh, you know, a property, but only for a specific target. And I would probably start going and trying
to like reduce the visibility of everything. And that’s kind of a like tall order, right? Because
there are so many targets. And there are also a lot of different build configurations you can build
PyTorch under. So it’s a little nontrivial to, like, make sure you’ve gotten them all right.
Something that would be kind of nifty is if there were a way to, um, you know, basically look at the
output of CMake, right? Because as I said, CMake doesn’t actually do any building. It just produces
files that actually build your software in the end. So if there were a way to, like, run CMake, get the output,
and then just, you know, say, okay, I’m going to refactor CMake, and I’m going to, like, ensure that
the output is always the same, or if I change the output, it’s done so in a semantics-preserving way,
right. Then I could, like, iterate on changes to CMake without having to actually go through the
rigmarole of actually building PyTorch under every configuration under the sun; I just need to, like,
you know, make sure that I don’t change what the outputs in question are. So for the simple case of
refactoring CXX flags, if I want to, like, put these into targets, well, I can use the, you know,
output in the Makefile to see, oh, where were these CXX flags applied in the first place,
and just make sure when I do the
refactor, I’m continuing to apply it in all those cases, or maybe, uh, I’m removing it only in places
where I know it’s not necessary. Oh, and one last thing, don’t change things in the CMake randomly and
then pray that it works out. Like, yes, the CMake is really complicated. Yes, there are a lot of
moving parts to it, but, like, fundamentally CMake is a very simple language. Like, it is basically someone
went into the process of designing a language without wanting to design a language. And so like, that’s why
the if statements also look like functions because it was like, Hey, I’m not a language designer,
but I’m just adding features. But the good side of that is that, um, CMake is actually simple and you
can understand it. And so if you need to make a change to CMake, just make sure you actually
understand the change you’re making and then do it. Don’t just randomly make changes and hope it works
out. That’s just going to waste a lot of time when you’re trying to do things. Uh, I could probably rant
about CMake a lot longer, but that’s really all I wanted to say for today. Talk to you next time.
EP30 TorchScript
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about the
TorchScript compiler, also known as the JIT compiler. It’s a little bit hubristic to think
that I could explain an entire compiler stack in a 15-minute podcast, but I’m going to try anyway.
Fortunately, there are plenty of resources for you if you want to dig in deeper than, you know,
the short amount that I’m going to be able to talk about today. In particular, there is a really
good overview document in the JIT directory, also linked in the episode notes, that is basically
going to cover everything that I talk about in this podcast and everything else in more detail.
So my goal here is to sort of just give the big picture, tell you a little bit about the Torch
Script compiler, even if you don’t necessarily know anything about compilers. So it’s going to be a
mix of, you know, here’s what a compiler is, and also, hey, you know, here are some of the interesting
things that are going on in TorchScript specifically compared to other compilers in question.
So I’m going to structure this like you would structure a normal compilers course. Don’t tell
William Bowman I said that. He structures it in the other direction, which is we’ll start from parsing
and then go successively lower as we, you know, progressively lower into simpler and simpler
representations until we get to the interpreter, which actually is responsible for running your
TorchScript programs. And so each step is like a traditional step in a compilers course, where,
you know, first we do parsing, then we do lowering, then we talk about optimizations,
then we talk about actual code generation. Although in the case of PyTorch, we don’t actually generate
x86 code from your TorchScript models, we just interpret them. And each of these steps, we’ll
talk about, you know, some of the things that are going on, and how to understand what they’re doing
from the perspective of a PyTorch developer. So let’s start by first setting the stage for what
TorchScript is about. So what is TorchScript? So TorchScript is a way of taking Python programs that
you’ve written, and re-representing them into a form that, you know, is not Python, but is an actual
honest-to-goodness IR that we can do optimizations on, which we can easily package up and send and run
on, say, C++ services that don’t link against Python at all. You may recall in a previous podcast,
I talked about TorchDeploy, a technology for making it possible to run Python programs from multiple
threads inside a single server process. Before TorchDeploy, there was TorchScript, and TorchScript
took a much more direct route, which was saying, hey, if you want to run your model in a multi-threaded
fashion, we’re just going to get rid of Python entirely. And so in order to do that, we need to
actually have some way of representing our program, and have it be runnable from C++ without a Python
interpreter at all in the loop. Oh, and by the way, we also want to do some optimizations,
like Fusion, to make our programs run faster in this situation. So TorchScript, in other words,
is a graph mode for PyTorch. PyTorch is all about eager execution, but TorchScript actually lets you
take your PyTorch program and put it into a machine-processable graph representation that we can do
transformations on and that we can actually execute in this way. But there’s a step further with the
TorchScript compiler, which is that we want to actually be able to capture the control flow and
other features of people’s programs that otherwise you couldn’t get from, say, just tracing your
program. So in the very first version of TorchScript, we implemented getting PyTorch programs simply by
just running your eager PyTorch mode program and seeing what operators were called, and those were
the operators that we actually put into a trace. So TorchScript wants to be able to handle your beam
search or your while loop or your, you know, if conditional. It wants to handle all of those
things. And so it basically wants to capture a kind of, you know, high fidelity representation of
your program, even if, you know, on a single eager mode execution, you might go down one path or you
might go down different paths. So it wants to capture something that can describe all of the possible
traces of your program in this situation. So with that in mind, what this means is that when you talk
about the TorchScript compiler, you have to talk about an actual parser. That is to say, you know,
we can’t, you know, do the easy trick, easy way out of just tracing your code and getting out a
representation of all the things that got run in runtime. Because as I said, there might be an if
condition, there might be a while loop, and there’s no way to trace all the possible different
versions of it, unless you’re in a language that supports abstract interpretation, which Python is
not. Okay, so, so what does this parser look like? So we’ve got our Python code, and we basically need
a Python parser. And so in fact, there are two parsers that TorchScript supports. There’s one parser that’s
written in Python, and it’s based entirely off of the stock Python ast module that lets you, you know,
take some Python code and blurt out an AST. We also have a reimplementation of this entirely in
C++. So there’s a lexer, that’s a thing that takes in a string and reduces it into a bunch of tokens so that
the parsing stage, which organizes this into a parse tree, can do it more efficiently. We have a lexer,
and we have a parser that knows how to parse Python, implemented in C++. And remember, that’s because
TorchScript is all about being able to run PyTorch programs in contexts where Python is not allowed.
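To make the frontend concrete, here’s a minimal, hedged example of scripting a function with control flow (the exact printed graph varies by PyTorch version):

import torch

@torch.jit.script
def relu_or_neg(x: torch.Tensor, flip: bool) -> torch.Tensor:
    # Data-dependent control flow: a trace would bake in a single branch,
    # but scripting parses the source and keeps both sides of the if.
    if flip:
        y = -x
    else:
        y = torch.relu(x)
    return y

print(relu_or_neg.graph)  # TorchScript IR, including a prim::If node
print(relu_or_neg.code)   # the round-tripped TorchScript surface syntax

That’s the Python-side frontend; the C++ lexer and parser cover the same surface syntax without needing Python in the process.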
As a side note, this actually is very important code because we don’t actually serialize some sort of,
you know, random bytecode format when we want to save TorchScript models to disk. And, you know, remember,
this is one of the things that, like, TorchScript is designed to do. It’s, you know, take your model,
put it into some format so you can load it up into the model server somewhere else. We actually save
honest-to-goodness Python code as our serialization format because it’s easy to debug, it’s easy to modify
if you need to. You don’t need special tools to deal with it. But, you know, that’s only because we’re on
server, and it’s not a big deal to parse Python code when you’re loading up your model. On mobile, where binary
size is at a premium, see my mobile podcast episode, we don’t want to pay that. And so there’s actually a
different version of the serialization format that’s used by mobile, and that’s actually a, like, good
old-fashioned bytecode format that’s easy to parse in, so you don’t need, you know, something that
understands Python syntax to parse it. Okay, so you’ve done the parsing stage of your program, right? And so
given a Python program, now you have this AST that looks a lot like the surface syntax, but it’s, you know,
in tree shape, it’s easy to look at, and it’s got all of the language features from Python that, you know,
you actually support. So, like, if you had a while loop with a break, you know, you’re going to have an AST
that has, okay, here’s the while loop, and then inside it, there’s a statement, and that’s the break
statement. And so the next thing you need to do in any honest-to-goodness compiler class is you want to take
this, you know, sort of direct reflection of the surface syntax as an AST and lower it, de-sugar it into a
simpler representation that’s easier to do processing on. This is just like, you know, the very standard
thing you do in compilers because people want tons and tons of features in their surface language,
right? Like, the more features, the better. Like, invent a new syntax, you know, do all sorts of fancy
things. And as a compiler writer, like, this is a big problem for us because we need to write code that
can work no matter what features you use. And so the easiest way to make our life easy is to, you know,
take all of this, the surface syntax that, you know, all our users want and then condense it down
into a smaller set of syntax that, you know, we only, we only have to worry about when we’re writing our
passes. So, uh, there, there’s this transformation. There’s a bunch of optimization passes because
sometimes we have to do non-trivial analyses to figure out how to like re-jigger things into the
simpler format, but eventually you get to what we traditionally call TorchScript IR. So what is
TorchScript IR? So if you know what LLVM IR looks like, TorchScript IR has a lot of similarities to it.
It’s SSA. That means that for any given variable you define, there is a single definition site for it.
So you don’t have to worry about, you know, you’re like, you’re an optimizing optimization pass. You’re
like, okay, who defined this variable? You don’t have to worry that there’s multiple possibilities,
like one in this if branch and one in this else branch. SSA means that, oh yeah, um, there’s only
ever going to be one of these things. Another thing about, um, the TorchScript IR format is, um, it does
understand, uh, conditionals. These are actually added after the fact, because remember, um, tracing,
you don’t have any conditionals. They all go away. Um, and the way they’re modeled is that, uh, instead
of a good old fashioned, uh, you know, CFG style setup where you have a bunch of blocks and they have
labels and then you have phi nodes for when blocks, um, enter in, uh, from multiple possible entry points.
Instead, what we just do is, it’s more of a structured control flow style
where like when you have an if statement, there’s two sub blocks associated with it that represent the
computation that gets run in the first case and the second case. And they, um, you’re responsible for
passing in the inputs. And then when you, uh, exit, you have to say all the variables you want to
return. And then the if statement itself does return values and it returns all of the, um, sort of values
that get carried out of the loop. So unlike, um, LLVM SSA, uh, we don’t have phi nodes. Instead,
those are sort of done implicitly via what are known in the literature
as basic block procedures in this situation. Two more important things to say about the
TorchScript IR. So one is that, um, although we simplify the aspects of the Python
programming language, so we have less features, um, we still have a really big instruction set.
Every, um, there’s a bunch of, you know, like when you have an IR like this, there’s, there are going to
be operations, primitive operations or prim ops for short, which don’t have an implementation
inside of the IR itself. Instead, the compiler stack defines what these operations should be.
And, um, every operator that’s defined in PyTorch, recall the native_functions.yaml podcast,
that that’s the list of all the operators. Every operator is a valid instruction inside of TorchScript IR.
This is kind of a pain in the neck for compiler writers who don’t really want to, you know,
deal with like over a thousand operators. And hopefully in a future podcast episode, I can talk
with Zachary DeVito about some recent work he’s been doing about Mintorch, which is reducing the set
of operations in PyTorch. But okay. So we have this really big primitive operator set, um, but it’s in SSA
form. We’ve got control flow in a structured manner that’s simplified. So there’s only a few
control flow operations. And one last thing is that this IR supports mutation and aliasing. What do I
mean by that? So when you write PyTorch programs, you can take out views on tensors, right? Like you can
say tensor[0], and that’ll give you the zeroth row of your tensor. And if you
mutate that, like you say, um, x.add_(blah), then the base will get updated as well.
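Concretely, the aliasing behavior being described looks like this in eager mode (a tiny illustrative snippet):

import torch

base = torch.zeros(3, 4)
row0 = base[0]      # a view: shares storage with base
row0.add_(1.0)      # in-place mutation of the view...
print(base[0])      # ...is visible through the base tensor as well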
And TorchScript can handle programs that do mutation. It can handle programs that do views. And we don’t
have a functionalization pass that removes all of these things. So the IR needs to also know about the
concept that some operations might have side effects. No, you cannot move operations around willy nilly
because, you know, if you move a use of a tensor before you mutate it, that’s, you’re going to see
the old version, not the new version, um, that’s going to be semantics changing. So really like what
TorchScript IR is, and maybe this is, um, you know, not the best point in the design space, but it makes a
lot of sense. If your goal is to, um, uh, like package up as many Python TorchScript models as possible
into this representation is, you know, we support all the operators in PyTorch and we support a bunch
of control flow. We support mutation and views and, um, we, uh, but otherwise it’s an SSA format. So
it’s still possible to do optimizations on this. So once you’re in an IR, the next thing you want to do
are optimizations. And we do all the sort of basic optimizations like peephole optimization, that sort of
thing, but there’s two really interesting optimizations that are very specific to PyTorch.
So one is specialization. So what do I mean by specialization? Well, when you write Python code in
PyTorch, typically you don’t give it very detailed types, right? Like for example, you have a bunch of
inputs and they’re just tensors and you don’t really know anything else about what they are. Actually, in
reality, they’re probably all floating point tensors of, you know, dimension three, but you
don’t, we don’t know that when we’re parsing the TorchScript IR. And so there’s this concept called the
profiling executor, uh, rewind a sec. So this is a bad thing for us because if you want to optimize your
code, if you want to generate kernels, the more, you know, about what the D types are, the sizes are,
the dimensions are, the more, you know, about these things, the easier it is to generate good code.
Like let’s say you’re doing like fusion and you want to fuse a bunch of point-wise operators
together. Well, you can’t actually do it unless you know, for example, what the dimensions of things
are, because if the dimensions don’t match, you might actually need to do some broadcasting in
this situation. So what the profiling executor does is it runs through your code on some inputs and it
says, Oh, here are what the types of everything are. And then it introduces this information into
the TorchScript IR. Um, and it does so in an interesting way. So it’s not, um, the way you
would think, which is like, take your IR and then generate a specialized version of it. Instead, we
take the IR as is, and we insert a bunch of what we call guard, um, statements. And what these guard
statements say is if it is the case that something has this type, then do this, otherwise bail out and do
something else. And so inside the, um, segment of the code where the guard is okay, we actually now, um,
are able to optimize under the assumption that, you know, it’s floating point and has these sizes. And at the
same time, we haven’t changed the semantics of the program, because even if you feed it something that it
wasn’t expecting, you’ll still get, uh, a valid result. You’ll hit the bailout path in that situation.
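As a rough mental model (a toy sketch of mine, not the actual profiling executor code), the guard structure is something like this:

def make_guarded(specialized, generic, expected_dtype, expected_shape):
    # Toy guard-and-bailout wrapper: take the specialized path only when the
    # input matches what profiling observed; otherwise fall back to the
    # generic path, so the semantics of the program never change.
    def run(x):
        if x.dtype == expected_dtype and x.shape == expected_shape:
            return specialized(x)   # optimized under the profiled assumptions
        return generic(x)           # bailout: still a valid result
    return run

Another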
interesting optimization we do is derivative splitting. And this is because, um, PyTorch
programs, uh, often are differentiated because you’re doing gradient descent. Now, unfortunately,
um, TorchScript, uh, can’t make use of the standard derivative definitions that are defined in
derivatives.yaml because, you know, those are basically just, they’re just C++. It’s glorified C++
in there. And TorchScript is, you know, this IR, it needs its own IR definition. So, um, unfortunately,
we weren’t able to like put the derivatives in a, um, form that could be used by both TorchScript
and the traditional old eager mode. So there’s a set of extra definitions for doing symbolic automatic
differentiation in TorchScript. But these definitions are not complete because as I said, there’s a lot of
operators in PyTorch and, uh, it, you know, it’s just hard to actually keep coverage with that.
So for the things that TorchScript knows how to symbolically differentiate, um, derivative
splitting bunches them all together so that it can, you know, go ahead and generate derivatives in
those cases. For everything else that it doesn’t know about, it keeps those separate so that we can
run the good old-fashioned autograd system. Yes. Um, so we’re compiling your code, but you know,
we don’t necessarily compile everything away. And in particular, if you’re going to run, uh, AD code
in TorchScript, we still use the eager mode autograd executor in this situation.
And so those, um, things that don’t support symbolic differentiation, they’ll just go
through the normal autograd mechanism. And there’s a very complicated way of making sure symbolic AD
and, um, uh, eager AD work together harmoniously in the situation. And we should honestly write a
research paper about this, but we’ve been lazy and haven’t gone ahead and done it.
Okay. So you’ve optimized your program, right? So we’ve gone from parsing to, uh, lowering to IR,
and then we’ve optimized the IR. What’s the last thing? Well, uh, program is useless unless we
actually run it. So we need to be able to run our programs. And the way this works is, as I said,
we don’t actually, uh, codegen x86 code from your TorchScript programs. Although maybe this would
be a good idea. And, um, some people have looked into it. What we do instead is we just have an
interpreter. So we take our, uh, IR and we compile it into a bytecode format. It’s a very simple bytecode
format. Um, it, all it does is it just does some register allocation and the register allocation is
really dumb because, uh, we, we don’t really, we’re not really storing things in like hardware registers.
We’re just sort of like using the registers in an easy way to keep track of what intermediates
are hanging around. But the thing that is important about the registers is that we use them to know
when tensors die, because we need to deallocate them promptly. And so that’s
something that happens during the final, like, uh, compilation of, um, TorchScript IR into what’s
called code blocks. So we do that. And then to actually execute your TorchScript program, we do
something that should be very familiar to you if you’ve ever started the JVM, which is we just have
a good old fashioned stack machine. So what do I mean by a stack machine? So a stack machine works
is if you want to call a function, you push all the arguments you want, uh, to call the function
with onto some stack, right? And you call the function and that function is responsible for
popping off all those arguments and then pushing the return values onto the stack. Stack-based machines
are very nice because they give you a uniform calling convention that, uh, works no
matter how many, um, arguments and returns you have. Like if you wanted to actually, uh, do it some
other way, then you would have trouble like, you know, finding memory to put all your arguments
or returns depending on what the situation was, because remember the interpreter doesn’t know
anything about what operator is going to execute. It’s, you know, running in a loop and going over
each instruction and being like, okay, now I got to do this one. Now I go, now, now I got to do that
one. And it doesn’t know ahead of time. Oh, this is a function that only takes two arguments.
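Here’s a toy sketch (mine, not the real interpreter) of that calling convention:

# Toy stack machine (illustrative only, not PyTorch's interpreter): each op
# pops its inputs off a shared stack and pushes its outputs back, so the
# interpreter loop never needs to know how many arguments an op takes.
def op_add(stack):
    b = stack.pop()
    a = stack.pop()
    stack.append(a + b)

def run(program, inputs):
    stack = list(inputs)
    for instr in program:       # the interpreter loop
        instr(stack)
    return stack

print(run([op_add], [1, 2]))    # -> [3]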
And what are these arguments that we’re passing in on the, uh, stack? Well, these are IValues,
um, which I’ve talked about in previous podcast episodes, right? It’s a boxed representation of,
uh, either a tensor or maybe an integer or some other primitive formats that just let us work
polymorphically over them in the interpreter. And that’s a whirlwind tour of the TorchScript compiler.
I promised 15 minutes, you got 20 minutes, uh, but that’s everything I wanted to say. As I said,
check out the overview document in the JIT folder. I mean, it contains tons and tons of details,
way more information than I talked about in this podcast. That’s everything I wanted to say today.
Talk to you next time.
EP31 TH
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about th,
the previous library that was used to implement all of the kernels in PyTorch.
This is something of a historical episode because there isn’t really that much code
that is still in th in PyTorch today. But it’s still a kind of interesting historical
example to look at. And we still do have some th code. So if you’re an unlucky person,
and you’re, say, trying to add a new d type or trying to deal with our storage Python bindings,
you might still need this knowledge one way or another. So that’s what this podcast is going to
be about. So what is th? Well, as I’ve mentioned in previous podcast episodes, PyTorch is not a
project that was written entirely from scratch. It took all of its code from LuaTorch, the previous
iteration of the framework, which was a bunch of C code bound to Lua. So we kept the C code,
that’s the th library, and we bound it to Python instead of Lua. And that’s how PyTorch came into
being. When you ask the question, why do we want to port all of our th code to C++? We have to
understand a little bit about how th was put together in the beginning. And the most important
constraint for the construction of the th library is that it was written in C. This posed a challenge for
the library in a few ways. One is that the tensor library th needed a way to write algorithms that
would be generic over multiple d types. Like suppose you’re writing a matrix multiply, and you want it to
work for both floats and doubles. In C++, you could use a template to templatize over the d type in
question, and then instantiate code multiple times for each version of the d type you want. But in C,
there’s no such mechanism, right? I had a friend in grad school who was like, yeah, C++ is a really
terrible language, but it’s really convenient to be able to have a reusable vector container that you
can use on different types. And so th had this problem. The problem was they wanted a bunch of
tensors that were for different d types, but there was no good way to actually write them all out without
actually having to write out all the code, you know, N times where N is the number of d types in your code.
And so the way th decided to solve this problem, and also the reason why th is kind of universally
loathed and something that we’re trying to get rid of, is that it decided that the problem could be solved
with macros. So here’s how th decided to solve the problem. Let’s imagine that you’re writing some C code
for an algorithm, say matrix multiply, and you want to write it in a way that it’s generic over the d type
in question. So instead of writing float or double inside your program, you instead write a scalar,
you say, okay, well, everything is some unspecified scalar type. And don’t ask me how it’s going to be
defined, but it’ll somehow be defined. And you write your code all in this generic way. When you write
functions that should be externally visible, you also use another macro, the THTensor_
macro to say, hey, I’m defining a generic function, I don’t know what its name is, I’m going to tell you
what the name of it is later. So where we’re going here is we’re going to actually
give macro definitions for scalar and for THTensor_ that basically expand these to the
appropriate thing. So if you’re doing a float tensor, then the scalar will be a float and THTensor
will become THFloatTensor. But if it’s a double, then scalar will be double and THTensor will become
THDoubleTensor. And then the trick is, we have C code, it refers to all these macros. And what we will
do is we will define the macros to be float, include the C file, this is very unconventional, right? Normally,
you only include header files, but here, we’re actually going to include the honest to goodness C file, include
the C file, and then undef the macros, redef them to the next D type we want to instantiate with, and then
include the C file again. And we’ll keep including the C file with different settings of the macros
until we’re done instantiating all of the D types that we want. So yeah, there you go. This is the most
important thing to know about th. And in terms of code structure, all the C files that are instantiated
multiple times in this way, they live in generic/ folders. And these aren’t all in TH or THC,
there’s also a folder in torch/csrc for doing Python bindings that also is written this way. And so you
know, whenever you see the generic folder, that just means it’s the C code, C++ now, because we made it
into C++, that gets stamped out multiple times in this way. Doing things this way also meant that it was easy
to generate a new tensor type for every instantiation: we had a struct THFloatTensor and a struct
THDoubleTensor, etc., for each of these things. And those were also instantiated in the same way.
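A toy Python analogue of the trick (purely illustrative; the real mechanism is the C preprocessor re-including the generic .c file with different macro definitions):

# The same generic source gets stamped out once per dtype by textual substitution.
GENERIC_SRC = """
def TH{DTYPE}Tensor_fill(data, value):
    for i in range(len(data)):
        data[i] = value
"""

namespace = {}
for dtype in ["Float", "Double", "Int"]:
    exec(GENERIC_SRC.format(DTYPE=dtype), namespace)

print(sorted(name for name in namespace if name.startswith("TH")))
# -> ['THDoubleTensor_fill', 'THFloatTensor_fill', 'THIntTensor_fill']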
And this also caused some problems when we wanted to write generic code, because well, there’s these
structs are all different in all of these cases. And so one of the things that we did early on when
porting to C++ is we unified all of these different D type structs into a single struct that was
polymorphic, because while we don’t actually need to store floats or doubles directly in the struct,
we only ever store a data, a pointer to the data in question. So that’s something that you can easily
write a single struct that works in all cases for. I actually don’t think this macro instantiating
strategy that the old th libraries was too bad. It’s actually a pretty nice way of adding on a fake
parameterization system to a language that doesn’t natively support it, aka C. And I don’t really,
I can’t really think of other ways you could have gone about doing this. Actually, my PhD thesis at
Stanford was about Backpack, which was this module system we retrofitted onto another programming language
called Haskell. And it also operated by very similar ways. You had a bunch of sort of types and functions
that you left unspecified, and then you instantiated them with an actual implementation later when you
wanted to do the code in question. And why did we do it this way? Well, we did it this way because we
didn’t really want to make major surface changes to the language in question. So it turns out you could
do my PhD thesis in C just with macros. Who knew? There’s a few other things about the th code that are
good to know, although they’re less major than this macro system. So one is that th, because it’s written
in C, has to be manually ref counted, because you don’t have a concept of constructors or destructors,
which C++ programs use to implement RAII. RAII is probably one of the other sort of killer features
of C++ because who wants to do manual reference counting. It’s also a big problem though, because
with the automatic reference counting, you can’t easily tell when you’re doing these ref counts.
And so it’s easy to write code that does a lot of unnecessary ref counting. So, you know, it’s a
double-edged sword, right? Like when you wrote th tensor code, it was easy to get the ref counting wrong,
but at least you could see it all in one place. And then when it’s, you know, all implicit and hidden
away in these classes, it’s easy for people to forget, oh yeah, there’s actually a cost to ref count
bumping willy-nilly. I guess this is one of the reasons why Linus Torvalds still writes all of
Linux in C, because C++ is just this terrible language that like has all of this, you know,
extra stuff that happens automatically, and it’s easy to forget about, and you write really slow code.
Anyway, so we had a manual ref count in C, and that was also a pain. And it was especially painful
when you had error conditions, because you had to make sure you freed all the temporaries when the
error condition fired. Because we were actually, in the old days, when it was C only, we would crash
the process when you hit an error like this. But very early on, when we started porting things to
C++, we were like, okay, we’re going to do everything at C++. And then when you hit an error, we’re going
to raise an exception. So we can recover it from it, and not just crash your Python process when this
happens. One last thing that’s interesting about th, and actually sort of has propagated its way to
our ATen ports, is a lot of the neural network operations that we supported have a lot of buffers
that get passed from forwards to backwards. So what are these buffers? Well, basically,
they’re extra outputs from a function in the forward pass that you don’t actually use. Like from the
perspective of a user, these buffers are invisible, you don’t see them, they just invisibly get passed to
the backwards function, where they get used. And a lot of the times, they don’t actually do anything
useful. They’re just like scratch space that the kernel in question uses. Why do these buffers exist? Well,
it turns out that back when we were in LuaTorch, we didn’t actually have a caching allocator for CUDA.
So allocating CUDA memory was very slow, and it was very expensive. And one of the first new pieces in
PyTorch that also was one of the really important pieces for making our CUDA programs run fast was adding a caching memory allocator.
So in LuaTorch, you know, you really wanted to not have to allocate memory willy-nilly. So if you allocated
this buffer, and then you saved it for later, that was actually a benefit, because you wouldn’t have to
do this allocation again later. PyTorch doesn’t have this requirement. So if you ever see these scratch
buffers being passed around, that’s just useless memory usage, and you should just get rid of it.
So that’s really all you need to know about th. I’m not going to labor on because we have a process
of porting th operators to ATen operators that has gone pretty far. We’re very, very close to getting
rid of all the legacy th code, and no one else is going to have to have the C code inflicted on them.
There was also a lot of legacy code gen that was written specifically for the C library.
We’ve also gotten rid of all of that. So you don’t really have to worry about that anymore
as well. There’s one thing that I regret a little about porting all of our th code to ATEN,
and that’s the loss of static typing in call sites. One of the things that is kind of expensive in modern
PyTorch is when we dispatch: we have to go look at all the tensors and figure out, oh, is it CPU or CUDA,
and go to the right one for the right d type in that situation. th didn’t have this problem because
there was a separate type for every d type: THFloatTensor, THDoubleTensor, etc. And you always wrote code
knowing exactly what your d type was, right, because everything was in one of these C files where you’re
going to instantiate the macros. So calls in th, while they couldn’t get inlined because inlining
isn’t really a thing in C, you could still actually just compile them as normal function jumps without
any fuss and muss. And we have swung back around to wanting to be able to do this in PyTorch proper
when performance matters, but it’s a bit harder because you know we don’t have we don’t really
want to template all our code and so it’s just kind of annoying to actually make sure these things work.
One thing we’ve been looking at is maybe we can use a very small amount of, you know, just-in-time
compilation techniques. No, not the JIT compiler for PyTorch, but like good old-fashioned polymorphic
inline caches that might make it possible to like speed this up. But that’s something just
speculatively we’ve been looking at. Okay, so you know about th. That’s everything I wanted to talk
about today. Talk to you next time.
EP32 XLA
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about Torch XLA,
PyTorch’s integration with TensorFlow’s XLA compiler. What is XLA? XLA is an optimizing
compiler that sits below TensorFlow, our most favorite deep learning framework, and its sort
of purpose in life is several things. So one is it’s intended to have a lot of optimizations,
the idea being that XLA’s intermediate representation, HLO-IR, doesn’t need to have
that many operations, and then XLA will know how to compile them and optimize them in a way,
even though you may have expressed your programs as having lots of itty-bitty operations. XLA is
supposed to be able to fuse them together and give you good performance. The second thing XLA is
supposed to do, and the one that really made PyTorch want to integrate with XLA, is it’s the only stack
that can actually run Google TPUs. TPUs are a deep learning hardware accelerator that is developed
by Google. There’s a lot of free TPUs. Google loves getting people to use them, and if you want to use
them, you’ve got to use XLA. And so if you want to use PyTorch and use TPUs, well, Torch XLA is your
guy. Torch XLA has a lot of people who have worked on it, both on Facebook’s side and on Google’s side,
and two big people I want to call out who were very historically important to this development
are Ailing Zhang on the Facebook side and Davide Libenzi on the Google side. They both have very
big influence on the XLA project. Okay, so how exactly does Torch XLA work? There’s a big, big problem
when you want to take PyTorch, which is ostensibly an eager mode framework, and hook it up to XLA,
which is a graph compiler: it takes in a graph of operators that is already
preset and compiles it into some efficient form. The problem is that PyTorch eager doesn’t actually have
any graph representation. Well, see my previous podcast talk about TorchScript, where if you Torch
script your model, then you can get a graph mode representation of it. But one of the things that,
going into the XLA project, they wanted to make possible was for people to take good old-fashioned
PyTorch eager scripts and run them straight on XLA without having to do very many modifications.
In other words, the goal of this integration was functionality, we wanted, you know, to have as much
stuff work as possible on XLA with as little work as possible that you want to do. This is a double
edged sword. I’ll talk about it more later in the podcast. So how can you run eager mode code directly,
while still feeding it to a graph compiler? Well, the big idea that everyone comes up in the situation
is to use some sort of lazy tensor. Now, I don’t mean lazy in the sense that you had a tensor and you
were going to materialize it at some point, but you’re just waiting for someone who actually needs it to
use it. Torch XLA is really, really lazy: what it does is it doesn’t run anything
when you run your model until the very end when you want to do the optimizer step. And that’s the point
when the trace of operations that you’ve run during this period of time actually gets sent to XLA and
gets compiled. Or, you know, hopefully, we’ve already seen this trace before, because your program maybe
isn’t too dynamic. And that’s when it gets compiled into XLA and, you know, done. So what are we doing?
So you’re running a bunch of PyTorch ops, it all looks like normal PyTorch. But under the hood,
we’re constructing XLA’s HLO IR. And at the very end, that’s when we actually send it to XLA.
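In user-facing terms, the workflow looks roughly like this (a hedged sketch; API names like xm.xla_device and xm.mark_step come from the torch_xla releases of that era and may have changed since):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                    # an XLA device, e.g. a TPU core
model = torch.nn.Linear(10, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 10, device=device)
loss = model(x).sum()                       # nothing executes yet: IR is just recorded
loss.backward()                             # backward ops are recorded lazily too
opt.step()
xm.mark_step()                              # cut the trace: compile and run it on XLA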
Lazy tensors in this way are very reminiscent of DyNet, another framework, where the idea was,
you wrote your C++ code, you ran it, you ran every single example one by one. And then they had this
thing called automatic batching. So they’d look at all the traces, and then batch them up so that you
could run them more quickly. So Torch XLA also runs very much this way. When you run your PyTorch
program for a single iteration, we always are constructing the HLO graph from scratch every time
around. So actually, XLA does support dynamic execution, right? Like if you do something a
little different on the next run, you’ll just get a slightly different HLO IR, and we’ll just compile
it to some other thing in that situation. So what exactly goes into making a lazy tensor work? Well,
there’s a few very important things. So one is that you need to interpose into the calls into PyTorch
when you call a bunch of operators. And where this interposition happens is pretty important,
because we want to also do training on XLA. And that means we need to be able to differentiate our
graphs. And Torch XLA takes the approach of reusing PyTorch’s Autograd engine. And because it reuses the
Autograd engine, what we need to be able to do is we need to run our program lazily, the forward pass,
and then also lazily run the backwards pass to generate the operations for all the things that
the automatic differentiation needs, and only then run the entire code in question. So Torch XLA gets
integrated in the dispatcher, because the dispatcher is the point that’s low enough in PyTorch’s stack
of functionality to observe both the forward and backward forms of the AD pass.
So once you get to the bottom, the XLA dispatch key, that’s just what processes tensors that are
of the XLA device. So we go into Torch XLA. And what Torch XLA does is it takes all the arguments and
figures out how to construct a corresponding HLO IR node for the PyTorch operation that was done.
So basically, there’s a translation of the PyTorch semantics into the terms of the XLA semantics.
And you might imagine that we would construct HLO IR directly at this point in time, but that’s not
quite what happens. What actually happens is there’s a intermediate IR that gets built by Torch XLA,
and it’s intended to be very fast to build. And then once we’re done, we first check if
this IR matches something exactly that we’ve seen before. And if that’s the case, then we don’t need
to do any compilation. We don’t need to translate into XLA’s HLO IR. We can just directly reuse the
pre-computed trace. Otherwise, they do a very simple elaboration into XLA IR. And this just makes it
possible to run XLA programs pretty quickly, even if XLA’s HLO IR isn’t designed to be built very quickly
and repeatedly in this way. And that’s really it. So most of Torch XLA is just the massaging of PyTorch
operators into XLA form, inserting the things that you need, smoothing over differences in semantics.
But deep learning frameworks are all very similar. So in a lot of cases, things match up pretty closely.
There is one place, though, that things don’t match up very closely. And that’s PyTorch’s support for
views. Recall, PyTorch’s support for views means that if I have a tensor, I can take out a view on
that tensor. And then no matter if I mutate the view or the original base tensor, the change is reflected
in the view or the base, respectively. So XLA doesn’t actually natively support this. It has some support
for aliasing and mutation, but not to the degree that PyTorch does. In other words, it doesn’t really know
about strides. Strides are a very PyTorchy, Torchy, Lua Torchy, Torch 7, you know, lineage thing. So how
exactly do we translate these PyTorch programs to XLA? Well, what we do is the functionalization pass
that I talked about in one of my earlier podcast episodes. So what we do is we keep track of all the
aliases when PyTorch makes them. And then when someone updates an alias, we just go and look at all the
other aliases and reapply the update in those cases. And this happens lazily so that we don’t
actually have to keep track of all the aliases that are on the tensor. This works pretty well. And so you
can mutate to your heart’s content, and we are still able to translate to XLA. I mentioned earlier that
Torch XLA’s integration favors functionality over performance. And another way that this is favored
is that XLA has a CPU fallback. Because PyTorch has a ton of operators and XLA, HLO, while cool,
doesn’t have that many operators. One of the selling points of HLO IR is it’s pretty small, and it’s easy
for backends to target. Actually, that’s why a lot of, you know, of the burgeoning new hardware
accelerators often target XLA, because that’s a very easy place to start. And well, you know, if you do XLA,
then you’ve got TensorFlow. And TensorFlow is a very important framework to support when you’re doing
this sort of thing. So PyTorch XLA has a fallback. So what the fallback says is if there’s some op and
we don’t know how to translate it to HLO IR, we’ll just go ahead and immediately run the XLA graph to
get out what the output would be, translate that output into a PyTorch CPU tensor, and then run the good
old-fashioned PyTorch CPU operation, and then go back and put it back into the XLA graph. So, you know,
that’s not going to be very fast, right? Like, you know, you don’t, you’re seeing less of the graph to
optimize, and you have to, you know, go ahead and like, if you were on TPU, you have to move it back
to CPU so you can do the fallback. But at least your program runs. And in a lot of cases, that’s all you
need. You didn’t care that much about performance. You just needed to get it working in the first place.
That being said, sometimes all of these conveniences can make it hard to make your
Torch XLA models go fast. So we’ve had some experience working with people who wanted to
get their stuff running on TPU. And one of the themes that happened is that sometimes their code would
just run really slowly. And why was that? Well, oh, okay, there was, you know, a if statement somewhere
inside their model. And that was causing Torch XLA to have to recompile many, many different traces
every time it went one way or the other in the if statement. And yeah, you have to like rewrite your
model a little so that the traces don’t change over time, so that you can reuse the XLA traces. And
that can be a little challenging. It’s a bit different than say Jax, where Jax provides you this JIT
combinator. And what the JIT combinator says is you’re going to run the JIT combinator once on this model that
you’re going to run. And whatever it is that you traced at that point in time, that’s what you’re going to
have compiled. So there’s no expectation that things are going to work dynamically. There’s no
expectation that, you know, every time you go through a new batch, you’re going to JIT again, like, you
know, obviously, you JIT once and then run it many times. For better or for worse, right? Okay, I want
to talk about some nuts and bolts about general PyTorch development, you might have had your eyes glaze
over, because you’ve never, you know, interacted with XLA. And whatever, like, do I have to care as a
PyTorch developer? And the answer, unfortunately, is yes, because XLA is in our CI. And so if your
PRs are not passing XLA CI, well, we are not going to let you land them. That being said, there are some
peculiarities to the XLA CI. XLA lives in a separate repository, because we have a lot of Google people
who work on it, and they all need commit access. So it’s in a separate repo from PyTorch PyTorch,
which only Facebook people can directly land to. So how did we set up the CI? Well, there’s the right
way to do it. And we did the wrong way. But it was pretty easy, which is PyTorch will pull whatever
the master build of XLA is at any given point in time for your PRs. Crazy, right? Like you’re never
supposed to do that in CI. But that’s what we do. And what makes it work is we have a lot of dedicated
people on call for XLA, like Ailing, like Jack Cao, who, when someone has a PR that’s making a change in
an operator, and that operator is affecting XLA, because there’s some translation in XLA, and now
it’s changed, and it needs an adjustment, you can just sort of send up the bat signal be like, hey,
we need some XLA work. And usually an XLA PR will show up, you know, in short order. And then what
just needs to happen is you land the PyTorch PR. And then once the PyTorch PR is landed, the XLA PR is
landed as well. The XLA CI has some pretty nifty features. For example, they have this thing called
torch_pin. So, like, if you’re making an XLA change, and it needs to be against a specific pull request from
PyTorch, well, you add this torch_pin magic file that says a PR name. And then when your CI runs,
it’ll be run with respect to that PyTorch’s pull request and not master in that situation. And yeah,
sometimes this means that we break the XLA build temporarily when things land. And usually, if that
happens, you just are like, hey, you know, is there an XLA change? And usually there is. So the XLA
change lands, and then everything’s green again. That’s really the most important thing, like just
knowing who to talk to, to resolve XLA errors, and someone will help. Don’t worry, you don’t have to
know everything about XLA. There’s also some cool stuff coming up in the space of XLA integration.
So one thing that Brian Hirsch has been working on is an external code gen in PyTorch PyTorch that XLA
can use. And we’ve actually landed most of the pieces of this. Previously, XLA actually had its own
sort of homebrew code gen with a homebrew parser for native_functions.yaml that generates
all of their definitions, because there were a lot of boilerplate to write, especially with CPU
fallback, right? Because every operator needs to have a CPU fallback. And it’s very, very boring.
You just translate all the tensors to CPU, run the CPU operation, translate them back to XLA.
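The shape of that boilerplate, as a hedged toy sketch (not the actual torch_xla codegen):

import torch

def cpu_fallback(op, *tensors, device):
    # Move inputs to CPU, run the stock kernel, move the result back to the
    # accelerator device. Illustrative only; real fallbacks are generated.
    cpu_args = [t.cpu() for t in tensors]
    out = op(*cpu_args)
    return out.to(device)

# e.g. cpu_fallback(torch.special.erf, x, device=x.device) for an op with no lowering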
So we have a shiny new code gen in PyTorch. And we’ve been, one, trying to make it possible for
people outside of PyTorch PyTorch to make use of our code gen, and also provide a much nicer,
you know, backend generic mechanism for overriding operators in the way XLA wants to. Because actually,
what has happened is XLA is our most famous and most successful backend extender of PyTorch. And
people were actually copy pasting XLA’s kind of janky code gen for their use cases. So Brian’s got this
new thing. It’s pretty cool. We’re working on moving the users from n equals one to n equals two.
And there will soon be lots of documentation and more pitches about it. Another cool thing that’s
coming up is Alex Suhan has been working on refactoring Torch XLA into what we’re calling the
lazy core. Because XLA is this lazy tensor functionality, which, you know, like records what
functions got run when you’re doing in PyTorch. And this is something that a lot of other backends
want to use as well, right? Because anytime you have a graph backend that can’t run things in eager
mode, by the way, don’t do that. Like, hey, hardware accelerators, support eager mode, support streams,
it’s a good idea, really good programming model. But let’s say that you can’t, right? Well, you need
something like XLA’s infrastructure for recording the graph so that you can actually run it. And so lazy core
is the part of XLA that, you know, doesn’t have any of the XLA lowerings, but has that generic
infrastructure for actually recording lazy tensors. And so he’s got a branch that which has split these
out into two pieces. And Brian and Alex are working on merging this into PyTorch core so that generally
people can use it. Okay, so that’s a whirlwind tour of XLA. That’s everything I wanted to say for
today. Talk to you next time.
EP33 Expect-tests
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about expect
tests. Expect tests are a form of tests that have a characteristic property, which is that when the
test fails, you can automatically ask the test framework to update the test for you to accept
the new output. And so sometimes they’re called golden tests, because there’s some golden version
of the output, and you can refresh the golden version based on directly what the test tells
you. So imagine that you’ve got some program, and under some cases, it raises an error, you might
not want to hard code the error as a string to compare against because well, what if someone
edits the error message, then you’ll have to go and find all the places and update you know, the error
message that has been hard coded in those places. And often what people will do is they’ll write a
regular expression instead, in this case, to match for some singular important piece. But you know,
if you completely rewrite the error message, that’s no good either, because the regex will probably fail,
and then you’ve got to go and go to each of the sites and manually fix them all. So what expect tests
say is that, hey, when this happens, when your test fails, you just rerun with EXPECTTEST_ACCEPT=1
as an environment variable or something like that. And then the expect test will automatically
go through all of the sites that were wrong and modify them so that they have the new output in question.
And then for example, you can just run git diff to go look at the changes that were applied and see if
you like the changes or not. It makes it really easy to write tests that track internal implementation
details really closely while still making it not so painful to update the tests as those implementation
details change. Expect tests long predate PyTorch and this podcast. My personal story with expect tests is I
first ran into them while working on GHC, the Haskell compiler. GHC had expect tests and the way they were used
was to test the error messages that GHC gave because, you know, one of the things that Haskell is very
famous for is a very strong type system. And so we work very hard when you have a type error to actually
give you some useful information in this case. And, you know, there are tons and tons of test cases
testing what happens when things are mistyped. And, you know, when Simon Peyton Jones goes and rewrites,
you know, exactly how unification works or, you know, some other major subsystem in the type system, chances are a lot of error
messages are going to wobble. And having accept tests meant that it was really easy to just go ahead and update all the error
messages and then just like eyeball them and see if they made sense or not.
I actually, during my PhD, um, ported, um, a version of this mechanism over to Cabal’s test suite where,
you know, Cabal also, you know, has a test suite where you want to run lots of different, um, Cabal
packages and see if they compile or don’t compile. And it’s kind of a pain to, you know, write exactly what
you expected to happen, and expect tests made this easier to do. I’m also reminded of a conversation that I had with Ron Minsky at Jane Street, um, where he was describing to me some of the stuff they were doing with expect tests at Jane Street. And, um, their model was that, um, unlike
what GHC was doing, which was, we had these files that were the expected values. And so when we refresh
the tests, we just updated those files. What they were doing was they were actually storing the expected
strings inside the test files directly. There are a lot of good reasons to do this. Imagine you’re
writing a test case, right? So you like set up some functions, you do some operations, and then you want
to like check that the output, um, makes sense. It’s much easier for a code reader and a code reviewer
to understand what the test is doing overall. If the actual expected value of, you know, whatever
string comparison is directly in the test file, right? Because you just read what the test setup is,
and then you read what the output is. Now, this is a little challenging if you want to, you know,
do the main property of expect tests, which is that you can actually update the test output automatically.
So what you basically have to do is you have to write some code that knows how to update,
you know, your OCaml code, or in PyTorch’s case, your Python code, and rewrite the source code
so that you actually can put in the new value in question. But if you can get past your distaste for
doing this kind of thing, it’s really, really helpful and makes expect tests a lot easier to understand.
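To make that concrete, here is a rough sketch of what an inline expect test looks like before and after you accept the new output. The helper run_and_capture_error and the error text are made up for illustration; the assertExpectedInline name matches what PyTorch uses, which comes up later in this episode.

# Before accepting: the golden value is just an empty triple-quoted string.
def test_error_message(self):
    err = run_and_capture_error()   # hypothetical helper that returns the error text
    self.assertExpectedInline(err, """""")

# After rerunning with acceptance turned on, the framework rewrites the source file itself:
def test_error_message(self):
    err = run_and_capture_error()
    self.assertExpectedInline(err, """expected a Tensor, got int""")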
One of the, um, sort of complaints, uh, with, um, you know, expect tests that, you know,
are in an extra file is they’re just these random files. And when, um, you know, you have to update these test
files, um, it’s basically, you’ve got this big directory of all these expected things, and they’re all wobbling,
and you have no idea why they’re wobbling or not. Actually, a lot of ONNX tests, when we originally wrote them, were done in this way, because it was easy to do it this way. And I don’t know, I think they’re not very
interpretable, and people decided they didn’t like it very much. But, you know, the answer is just put
the expect test directly in your source code. There’s another important function to doing it this way,
because, uh, when you are putting it directly in your source code, you don’t want your expect test to be too long. And figuring out exactly what you are going to test against, what you’re actually going to record in your test file, that’s pretty important, right? Because you don’t want it to be too long, but you don’t want it to be so short that it doesn’t capture the important things you want to check. So
actually, when you are going in and writing an expect test, you need to think about what exactly it is
you want to output in this situation. And so when I’m bringing up a new subsystem, and I want to write
some expect tests, I’m usually going to actually spend some time designing a text format that describes
the internal state of my system in enough detail that I can actually test the important things, but, you know,
condensed enough so that someone reading over the code can understand it. And that’s one of the first things
that I do. And then, you know, I tweak it as I go ahead and, you know, see more test cases.
So as you can imagine, expect tests have been a little sort of pet project of mine regarding testing
for a while in PyTorch. I added a really simple version of them that just wrote things to files at the very
beginning. People sort of used it, but they didn’t like it very much, because things were out in another file.
And when I was writing GH stack, see my previous podcast about stacked diff development in PyTorch.
One of the things that I needed to do was write a test suite for GH stack. And this was kind of not so
easy, right? Because what is GH stack doing? GH stack is taking a bunch of commits and then pushing them to
GitHub. And like, well, for one, like, how do you even test in this situation, right? You don’t want
your test suite to be creating tons of repositories and issues on GitHub. By the way, we solve this by
like, creating a crappy fake in-memory implementation of GitHub’s GraphQL and REST APIs, so that we had a fake implementation that the tests could actually run against. But that’s a story for
another day. And then once you’ve done all that stuff, you need to actually like stand back and be
like, hey, what the heck happened? Did GH stack actually create the pull requests that were necessary
and push the commits that were necessary? And that was sort of the point where I was like, okay, I’m actually
going to sit down and spend some time writing a module that will help me do inline expect tests.
And then I’ll also sit down and write a representation for the state of Git repositories and pull requests
so that I can write these tests in a straightforward way. And so I did. And the way that the implementation
worked was that when a test failed, we would catch the exception that was raised by the failure. And the backtrace for that exception would contain a line number pointing at the call to assertExpectedInline, which is the name of the method on the test class, so expect tests give you a backtrace with the line number of the thing in question. And then what we do is we go open up the Python file, go to that line
and search for the string in question. And so there’s a convention that I picked for our expect test implementation, which is that we only ever substitute triple-quoted strings. In principle, we could substitute single-quoted strings, but then it might be easy to end up in a situation where you have, like, multiple strings on a line. And so then it’s like, which one do you replace? Triple-quoted strings don’t really have that problem. You’re not very likely to have multiple triple-quoted strings on one line. So we find the triple-quoted string, and we do the substitution on it. And then we write the Python file back out. And that’s basically the crux of how it all works. There’s a funny
implementation detail about pre Python 3.8, which is that prior to Python 3.8, the actual line number for
the backtrace is for the end of the statement in question. So if you have a multi line statement,
it gives you a pointer to the last line of the statement in question. So you actually have to run the search backwards: you’re like, okay, well, starting from the end, look for the string in question, and then do the substitution. Details. But once you do all that, you have an implementation of expect tests: you do a simple good old fashioned string comparison to check if the value equals the string in question. And if you don’t like it, or you want to update it, then you just do this
regex on the Python source file, and it updates it. And then you can go and take a look in your favorite git diff viewer to see what the changes are. And this is really, really easy, and makes it super great for, like, you know, writing tests without having to laboriously write down all the things you expect. So I like expect tests a lot. They’re really powerful. And they let you write tests in a lot less time, especially if you write a lot of tests that involve setting up some state,
running something and then looking at what the results are. A few words of advice for when you’re
setting up expect tests. So I’ve already talked about some of the common problems, right? So one is that
you don’t want the representation to be too long, because then it’s going to be like, oh my god,
like, what is all this stuff. So you want the representation to be actually legible by humans.
And that means you have to spend some time designing it. Some other more basic things that you need to be
careful about when you’re doing expect tests is, one, you need to make sure the output is actually deterministic, right? Like if you’re putting a timestamp in the output, that’s bad because, well,
it’s going to change every time and your expect test is just not going to work. If you’re like writing
the output format from scratch, this is not a big deal, right? You just don’t put timestamps in.
But sometimes there’s non-determinism in your algorithm. Like, for example, automatic differentiation in PyTorch runs in a multi-threaded fashion. So it’s not guaranteed
what order your backward nodes will run. And so if your trace involves recording logs when things get
run, well, just be aware that that is non-deterministic and you might have to do some canonicalization,
for example, to make sure the expect test all works out.
Sometimes you can just sort of mask out the text that you don’t like. So it’s like,
hey, I know this thing is non-deterministic. So before I do the string comparison,
I’m going to go ahead and replace it with some placeholder token. And that token, you know,
will always be the same no matter what I’m doing. By the way, it’s a pretty good idea to make sure
your code is deterministic. It pays off in a lot of other ways. And so ease of use with expect tests
is just yet another payoff. Okay, so nuts and bolts of using expect tests in PyTorch. The default test case that PyTorch provides already contains expect test functionality. So all you need to do is call the relevant function. And the most common one you’ll use is self.assertExpectedInline. assertExpected means it’s going to be an expect test. And Inline means that you’re going to put the string directly inline inside of your source code program. There are also variants that work for when an exception is raised and you want to check what the exception text is; just check the expect test module in PyTorch to see
what API options are available to you. The module that implements expect test itself is actually pretty
self contained. And I copy pasted it between gh stack and PyTorch because I didn’t feel like making a
separate package to do this. But if this is code you are interested in, shoot me a tweet,
and I’ll figure out what I can do about actually publishing it for real.
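To tie the nuts and bolts together, here is a minimal sketch of what this looks like in practice. The internal import path and the printed tensor repr are just illustrative and may vary between releases.

import torch
from torch.testing._internal.common_utils import TestCase, run_tests

class TestExample(TestCase):
    def test_arange_repr(self):
        x = torch.arange(3)
        # If this comparison ever fails, rerun the test with EXPECTTEST_ACCEPT=1
        # and the triple-quoted string below gets rewritten in place.
        self.assertExpectedInline(repr(x), """tensor([0, 1, 2])""")

if __name__ == "__main__":
    run_tests()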
That’s everything I wanted to say for today. Talk to you next time.
EP34 vmap
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about vmap. vmap is a feature that was popularized by Google’s JAX, which lets you write code without thinking about batching, and then automatically makes your code batched. So let’s imagine that you want to,
you know, add some tensors together, and then do a matrix multiply on them, maybe run a convolution
on them. What vmap says is you can write it as if you were writing this computation on a single example, really no batch dimension at all. And then you can vmap over it to so-called vectorize it, that’s what the v in vmap stands for, so that all these operations transform into their vectorized operations. And in many cases, many ops in PyTorch, NumPy, and JAX are automatically batched, in the sense that if you tack on extra dimensions to the beginning of the tensor, the operation’s semantics will say, okay, I’ll just treat those as batch dimensions and process them.
There are a lot of operations that don’t do that. For example, operations that only take a single batch
dimension or operations that change their meaning when you add more or less dimensions. Matrix multiply
is a particularly bad offender in this front. And vMap makes it so that you don’t have to like worry about,
oh, yeah, if I want to add a batch dimension, I can’t use matmul anymore, I have to use bmm instead. And all it says is no,
just write the single example version, and I will automatically translate it into the batch version as
necessary. So how do you implement vMap? There are a number of different ways, but I’m going to talk about the particular
implementation that PyTorch’s vMap implementation uses, because it’s the one I know best, and it’s most relevant if
you want to develop PyTorch. So in PyTorch, when I want to vMap over a tensor, what I do is I introduce a new concept
called a batched tensor. A batch tensor can be thought of as a regular tensor, but with some of its dimensions marked as
being so called batched dimensions, which don’t participate in normal computation in the way that you normally imagine.
So let’s imagine that I’m talking about a square matrix, right, you know, A by B, and I want to batch it, so I have a batch
dimension on it. So ordinarily in PyTorch, if I asked what is the dimension of this, you know, batch by A by B tensor,
I would tell you three. But with a batched tensor, the batch dimension is considered a private implementation
detail, and so I don’t get to see it. So if I ask what the dimension of a batch tensor with one of these batch
dimensions in it is, I actually only get two, because logically, when I’m looking at this tensor, I want to
just be able to do single operations on it. And so when I say, hey, what’s the size of it, I should see only a single
instance in question. But under the hood, what the batching tensor is doing is it’s translating your
operations on this domain, this single example domain into the multiple example domain. And that’s why, you
know, we do need to have a tensor that stores all of the various batches in question, you just don’t get to see
it as a user. This distinction between logical and physical dimensions is very helpful, because it helps you sort of
keep straight what is going on in the logical universe, namely what you see as a user, and what is going on in the
physical universe, aka what operations are actually happening. So, to give another example, when you want to do a sum, a
reduction sum, you can say what dimension you want to do the reduction on, right? So let’s imagine once again, you’ve got a
two dimensional tensor, and you want to do a reduction on the first dimension, so dim equals zero. So if you have a
tensor that’s, you know, a by b, you just say, okay, sum, open paren dim equals zero, and that’ll do the reduction in the first
dimension. But what if this tensor gets batched? Well, if this tensor gets batched, then it’s not correct to write dim equals zero to reduce the
0th logical dimension, because what that’ll do instead is reduce over the batch dimension. And those of you who have seen some of the marketing copy for named tensors may recognize this as, like, a similar problem that named tensors were trying to solve, right? So named tensors’ answer is, okay, don’t say that you want to do a reduction over dimension zero, say that you want to do a reduction over the height dimension, let’s say.
What vmap says instead is, no, no, no, you can still use numeric designations, we just won’t actually ever, you know, make the batch dimensions visible to you. So you can say, oh, I want to reduce over dimension zero. And if you have an a by b tensor, that’ll be a; if you have a batch by a by b tensor, that’ll still be a; maybe a batch one by batch two by a by b tensor, it’ll still be a. And the vmap process will adjust the index, from the logical idea of zero to three or four or whatever it needs to be, depending on where you’ve inserted the batch dimensions, when it’s doing the actual interpretation on the inside.
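As a rough illustration of that logical-versus-physical adjustment from the user’s side, here is a small sketch using torch.vmap (exposed as torch.vmap or torch.func.vmap in recent releases; the shapes are arbitrary):

import torch

def single_example(x):
    # Written as if x were a single 2-D example; the batch dim is hidden inside vmap.
    return x.sum(dim=0)   # dim=0 is the logical first dim, not the batch dim

batched = torch.randn(8, 3, 4)              # 8 is the batch dimension
out = torch.vmap(single_example)(batched)   # vmap shifts dim=0 to the right physical dim
print(out.shape)                            # torch.Size([8, 4])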
And really, that’s all there is to the vmap implementation in PyTorch. So we have a vmap dispatch key; if you don’t know what dispatch keys are, go listen to one of my earlier podcasts about the dispatcher.
We have a vmap dispatch key, which interposes when you want to do an operation and you have one of these batched tensors, which get created when you use the vmap operation, right? So when you vmap over a tensor, on the inside of the vmap, we give you a batched tensor, which will do the batching for you.
And when you do an operation, you hit the vmap dispatch key, and it does the translation from the logical into the physical thing in question.
And then, you know, it redispatches and the physical operations just get handled in the same ordinary way you used to see them handled.
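A batching rule for something like sum is conceptually doing a translation along these lines (a simplified Python sketch of the idea, not the actual C++ batching rule in PyTorch):

import torch

def sum_batching_rule(physical, batch_dim, logical_dim):
    # The user asked to reduce logical_dim on the tensor they see, which has no
    # batch dimension; physically there is an extra batch dimension at batch_dim,
    # so shift the index past it before calling the ordinary sum.
    physical_dim = logical_dim + 1 if logical_dim >= batch_dim else logical_dim
    return physical.sum(dim=physical_dim)

x = torch.randn(8, 3, 4)   # physical tensor with batch_dim=0, logical shape (3, 4)
out = sum_batching_rule(x, batch_dim=0, logical_dim=0)
print(out.shape)           # torch.Size([8, 4]) -- the batch dimension is preserved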
Another way I like to think about this problem is that I’m doing a sort of functional transformation on my API calls.
And this is very much the JAX interpretation of vmap, which is that I’ve got my program, it has all of these calls to add, mul, matmul, whatever.
And what the vmap call does is it transforms this into a corresponding vectorized program, vadd, vmul, vmm, assuming those were actually operations, which they typically aren’t.
But like, if you had a vectorized version of add, um, and a vectorized version of matmul, um, you just translate to those versions, but otherwise your program stays very similar.
So, like, like, in a sort of very mathematical sense, you’re in this sort of world of single example functions.
And there’s this extra world of, um, multi-batched functions.
And there’s a mapping of every function in the sort of single example world into the batched world.
And so, as long as you, like, say how to do this translation, then, uh, you can just take your program of single example calls and then project it into this other world.
And, um, like, if you were a Haskeller, you’d call this a type of functor.
It’s not a functor on Hask, per se, but it’s a functor on, um, valid tensors.
If that didn’t mean anything to you, don’t worry.
But, like, the picture that I want you to have in your head, right, is you’re taking all these function calls and you’re replacing them with vectorized function calls.
And you might do this multiple times if you, for example, vmap multiple times.
This looks pretty different from the physical implementation, right?
Because the physical implementation, um, keeps track of what batch dimensions are on tensors.
And, uh, what it does is it actually, you know, it’s a little more efficient.
It, like, collapses all levels of vmaps into a single batched tensor.
But there’s another implementation you could have done for vmap, which is you have a single batch tensor, which handles a single batch dimension.
And you just repeatedly wrap each time.
And so if you, you know, did a vmap of a vmap of a vmap, you would end up with batch tensor containing a batch tensor, containing a batch tensor, which contains an actual tensor.
And so in this way, you can think of this sort of like as you’ve got this chain of, uh, control where the first call hits the top level batch tensor, which does a transformation and then transforms that operation into a vectorized operation, which then passes into the second batch tensor.
And then when you have, well, the second batch tensor is asked, hey, I’ve got this vectorized operation, can you vectorize it again for me?
And you end up with a vectorized, vectorized operation and so forth and so forth until you bottom out and there’s no more batching to be done.
By the way, this is what, um, when JAX says that its functional transformations are composable, this is what is meant, which is that when you apply the transformation, um, to the operation, you get back a thing that you can apply the transformation again to.
So it’s an endofunctor, in other words. And it’s really profitable to, um, realize that even if the implementation involves these, like, batched tensors, and, um, you know, they’re doing all this bookkeeping and they’re intercepting operator calls, it’s really helpful to think about the actual semantics as just morally replacing these operations.
So whenever, like, I’m in a situation where I’m like, I’m not sure what vmap is supposed to do in this case, instead of like trying to run a batch tensor, like object in my head, instead, I just think about, oh, well, you know, like, what would I like modify these API calls to look like when I did it this way.
And that usually tells me what I wanted the behavior to be.
So to give an example of this, um, a classic problem when you’re doing vmapping is how to handle random number generation.
So let me explain what the problem is.
So let’s say that you’re doing a vmap.
And at some point during the vmap, you make a call out to a random number generator.
So you, like, say torch.randn, give me a buffer full of random numbers, and then maybe add it to one of these batched tensors.
And so there’s a problem, which is, what is the semantics of this? For each batch in the batched tensor, do I separately generate random numbers, and then, uh, you know, perturb them all differently?
So this is like sampling noise, and then you’d want the noise to be different across batch dimensions.
Or am I sampling the noise once, and then applying the same noise to every batch in question, sort of shifting everything exactly the same way.
And so there is probably something that the naive implementation of your code would do, um, that is to say replicate the random numbers, uh, in each case.
But that’s not a good way to think about what you actually want the semantics in the situation to be, right?
So if we think a little bit further, and we say, okay, well, you know, what kinds of transformations to the API calls do I want to have happen in this situation?
Um, we quickly see that replicating the noise the same way everywhere corresponds to not modifying the random number generation call.
So I just do a plain old stock random number generation call.
I modify the add into a vectorized add.
And what that is going to do is broadcast the random numbers, because the random number generation call wasn’t modified at all.
So the buffer is going to be made at the logical size, not the physical size.
And that broadcasting is what causes the random number generation to be reused for every batch.
Whereas the case where I, um, do a new random number generation for every single batch corresponds to transforming the random call into a call that, um, has a batch dimension.
And then I don’t have to do broadcasting when I add things together later.
And so there’s two reasons why this is a really useful way of thinking about it.
So one is that it gives you a way of thinking about how you might actually implement this.
And the way you can implement this is by doing a mode key.
So normally the problem is that, uh, dispatch in PyTorch is based on the types of tensors.
And so randn has a hard time dispatching to the batched tensor vmap implementation, because it doesn’t take any tensors as input.
So it doesn’t know, oh, what the vmap should be.
And we have a way of working around this, which is a so-called mode, um, which is, hey, when you turn on this mode,
like AMP, automatic mixed precision, see previous podcast, all operations are affected by this, even if, you know, there’s no input, uh, dependence at all.
In JAX, this is called omni-staging, if you are curious.
So if you make, uh, vmap a mode, then you can interpose on randn and then, like, look at what the state of your, um, you know, vmapping is.
And then, you know, uh, generate the randn output appropriately.
And this is pretty nice because it turns into sort of the common way to fix this ambiguity, which is,
if you wanted the random number generation to be generated once per all the batches, make sure you generate it before you actually call the VMAP.
So make sure you call it outside of the VMAP.
And if you call it inside the VMAP, we’re just going to assume that you wanted the random number generation generated anew every time because, well, you’re doing it inside the example in question.
And that maps very nicely to the mode-based implementation.
JAX solves this a little differently.
They force you to pass an explicit random number generator object to disambiguate these cases, which does disambiguate the cases and is more expressive.
But if you’re, like, used to a very mutable style of programming, um, moving things before and after function calls sort of makes sense as a way to control when effects should happen.
It’s like flipping a coin, right?
Like if you want to flip a coin once or you want to flip a coin many times inside of a loop, well, you would either, you know, flip it once outside the loop or you would move it inside the loop to flip the coin many times.
So, you know, the analogy of VMAP as a loop also works here, even though there’s side effects involved.
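Here is a sketch of that outside-versus-inside distinction using torch.func.vmap, which in recent releases makes you choose the randomness behavior explicitly rather than guessing (the shapes here are arbitrary):

import torch
from torch.func import vmap

x = torch.zeros(4, 3)

# Noise sampled outside the vmap: every batch element gets the same noise.
shared_noise = torch.randn(3)
same = vmap(lambda row: row + shared_noise)(x)

# Noise sampled inside the vmap: opt in to fresh noise per batch element.
# (Randomness inside vmap is an error by default.)
fresh = vmap(lambda row: row + torch.randn(3), randomness="different")(x)

print(torch.allclose(same[0], same[1]))    # True: identical noise for every row
print(torch.allclose(fresh[0], fresh[1]))  # almost certainly False: each row differs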
So what are some things that are wrong with the current implementation of VMAP in PyTorch?
So there is one big problem, which is that, um, it is not fully composable.
So VMAP is set up in a way that it is composable with itself.
So you can vmap as many times as you want and the batched tensor knows how to handle this.
And it composes with Autograd in one specific way.
Namely, if you wanted to VMAP your code and then run Autograd on it, that’s okay.
And that’s supported by PyTorch.
Um, and this is because dispatch keys have a fixed order, so you can’t reorder them.
Now, the problem is sometimes you want to run Autograd first and then VMAP over the Autograd.
And this is very useful for doing this thing called per sample gradients, which I’m not going to explain in this podcast, but you can look it up if you’re interested in it.
But you might want to compose them in this different way.
And no, it’s not the same thing.
These operations are not commutative.
So, like, whether or not you do VMAP first then grad or grad then VMAP has implications on the performance of your code.
So to solve this, Richard Zou, the original author of vmap in PyTorch, and Horace have been working on a new version called functorch,
where instead of being forced to have a fixed order in which, um, transformations like this are applied in PyTorch, um, instead you just have a stack of transformations, like JAX.
functorch is unabashedly, um, taking a lot of inspiration from JAX and lets you compose them in whatever order you like.
And that’s pretty cool.
And, um, you know, JAX has a lot of good ideas there.
There is a good thing about our implementation though, right?
Which is that because we compress all vmap layers into a single representation, um, we have to go through fewer rounds of, like, translation.
Because we can just do the translation all in one go.
It makes our batching rules a little more complicated, but, um, it reduces the sort of fixed overheads in question.
And so for PyTorch, we do care about this because we’re in an eager mode framework.
We don’t usually ask people to use a JIT combinator to, like, get rid of all these fixed overheads.
So there’s still a utility to this, but sometimes you do want, like, wild flexibility and then being able to compose things in whatever order you want, uh, however you like, um, is a useful capability.
So I hope I’ve explained a little bit about how vmap is implemented, some of the various ways that I think about vmap, and also other sorts of functional transformations in PyTorch.
By the way, there’s an old podcast about functionalization.
You can also think of that as a functional transform in the same sense as VMAP.
That’s everything I wanted to say for today.
Talk to you next time.
EP35 Random-number-generators
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about random number generation in PyTorch.
Random numbers are a very important component of deep learning.
You use them when you initialize your weights.
You use them when you use layers like dropout, which will randomly zero out connections.
And in general, the concept of stochastic gradient descent is predicated on this idea
that you’re going to sort of randomly process batches in your input data set,
and, you know, this randomization, well, how do you do it?
You use the random number generators in PyTorch.
So there are some basic facts about random number generators in any sort of numerics library, PyTorch included, and the most important concept is that although you normally, in idiomatic usage, just say torch.randn and you just get a vector full of normally distributed random numbers, what is actually happening under the hood is that there is a random number generator, an explicit generator object, and you’re just using the implicit sort of global generator in this situation.
But really, you can create these objects explicitly and use them to sort of separate the random
number generation in question.
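For example (the seed and shapes here are arbitrary):

import torch

g = torch.Generator()              # an explicit CPU generator object
g.manual_seed(1234)

a = torch.randn(3, generator=g)    # draw from this generator, not the global one
state = g.get_state()              # snapshot the generator state
b = torch.randn(3, generator=g)

g.set_state(state)                 # rewind and replay the stream
c = torch.randn(3, generator=g)
print(torch.equal(b, c))           # True: same state, same numbers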
And so when you want to know a bit more about how the random number generator in PyTorch is
implemented, you want to look at these generator objects, which are implemented differently
depending on if you’re generating numbers on CPU or CUDA. And these contain the important state and the
important functions for interacting with the state for the various algorithms that you’re going to use
to generate random numbers.
The most important devices in PyTorch are CPU and CUDA, and we use different algorithms for
them. Sorry, so that means that if you train your model or test your model on CPU and then move it to
CUDA, you’re going to get different random numbers. We’ve talked idly about maybe implementing CUDA’s
algorithm on CPU, but no one’s done it so far. So on CPU, we just use a good old-fashioned Mersenne Twister RNG. That’s a pretty high-quality pseudo random number generator. It isn’t
cryptographically secure, but it’s fast to run. A lot of people use it, and it has pretty good
statistical properties. On CUDA, we use a different RNG called Philox. So Philox is used in CUDA because
it has a really interesting property. Its internal state can be entirely represented as a seed and then
offset into the random number stream that was generated by that seed. Why is this an interesting
property? Well, a Mersenne Twister traditionally involves some sort of random number generator state,
and then every time you sample random numbers out of it, this state changes so that you move some of
the random bits around, and then you do the same thing over and over again. And so the state is bigger
than the seed, which typically is just a 64-bit integer, which means that it’s easier to have
a higher periodicity. That is to say when the random number generator starts looping over itself in
that situation. So Philox doesn’t need to have some state that you’re going to put random numbers in.
Instead, it will just calculate the state right off the bat when you start your CUDA kernel based off the
seed and the offset. And this is important because it means that we don’t have to persistently keep a CUDA tensor around representing the RNG state of a Philox generator. Instead, because the seed and the offset are very small, they’re just, you know, single 64-bit integers, we can send them every time we do a CUDA launch directly using the, you know, scratch space that CUDA kernels allow for sending small amounts of data directly to the kernels without having to do a host-to-device copy. So what happens when you use a Philox RNG? Well, we first query the generator object
representing the CUDA RNG, we get out the seed and the offset, the offset tells us how far along we’ve gone in the random number stream, we send these via our CUDA kernel launch to CUDA, and you use curand_init to initialize a local scratch space. So okay, I lied, there is a scratch space, but you just reinitialize it from scratch. And this is okay, because what’s going to happen right after that is CUDA is going to, like, use it over and over again, because you’re going to do something like fill an entire buffer full of random numbers, so you can
amortize the cost of this state initialization. And then back on the host side, the host is supposed to
statically know how many random numbers your algorithm is going to use. And this is usually not too hard to figure out.
Like for example, if you’re, you know, filling a random vector full of random numbers, the amount of random numbers you’re going to use is
exactly the length of that vector times, you know, however many random numbers it takes to generate a
single element. So you increment the offset by however many random numbers you would have used.
And so the next time you launch a kernel, you’ll start at the next part of the random number generation
stream. And you don’t have to worry about, you know, reusing old numbers in the old case.
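Conceptually, the host-side bookkeeping amounts to something like this (a toy Python sketch of the idea; the real logic lives in the CUDA generator implementation in C++):

class PhiloxStateSketch:
    # Toy model: the entire RNG state is just a seed plus an offset into the
    # counter-based stream that the seed defines.
    def __init__(self, seed):
        self.seed = seed
        self.offset = 0

    def next_launch(self, numel, randoms_per_element=1):
        # Hand (seed, offset) to the kernel launch, then advance the offset by
        # however many random numbers that kernel will consume, so the next
        # launch starts at a fresh part of the stream.
        args = (self.seed, self.offset)
        self.offset += numel * randoms_per_element
        return args

gen = PhiloxStateSketch(seed=42)
print(gen.next_launch(numel=1024))   # (42, 0)
print(gen.next_launch(numel=1024))   # (42, 1024)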
There’s also some fancy stuff for handling CUDA graphs. This is a bit of a digression,
but I just want to put it out there, which is that CUDA graphs, which are a way of recording a bunch of CUDA
kernel launches, and then launching them directly without having to pay for kernel launch costs or any
of the, you know, sort of code that PyTorch has to run to actually get to the CUDA kernel launch.
Those hard code the parameters that you launch kernels with. And so what that means is that
the seed and the offset are traditionally hard coded into the kernel launches. And so if you want to then
rerun these kernels later via a CUDA graph, you would replay exactly the same random numbers. So there’s a little trick that we do, which is, when you’re doing CUDA graphs, there’s an extra bit of CUDA memory that we use to add an extra offset, which you can use to basically bump your, you know, CUDA graph’s otherwise fixed seed and offset to some other offset, because you want to run your code again, but with different random numbers the next time. Okay, digression over. At some point, I’m going
to do a podcast about CUDA graph support in PyTorch, but this is not that podcast.
So we have generators, we have a CPU generator, we have a CUDA generator,
these generators use the, you know, impl idiom that tensor and storage also use. And you may notice
that CPU state and CUDA state are pretty different. So in fact, there’s two different generator classes,
and you know, they, they inherit from a common interface, but this interface doesn’t actually have
a virtual method for getting random numbers. And if you think about it, this makes sense because,
well, you know, like what good is a virtual method that like directs you between CPU or CUDA
when like on CUDA, you can’t even call virtual methods. Like that’s just not a thing you want to
do in CUDA. So like, although like standard object oriented design would say, oh yeah, you know,
you want some method that can get you a different random number, depending on what generator you’re
using. In reality, what you need to do is you need to refine the type, you need to figure out which
kind of generator you have at the very beginning of your kernel. So you cast the generator into a CPU generator or a CUDA generator, and then just directly access the fields based on what you need. And so that’s how most of our kernels are written, right? So you hit the kernel, you have this type-erased generator, you figure out what generator it is, now you have a, you know, more specific CPU generator,
and then you use the fields directly. One random side note, our random number generators
do have locks on them. And we never really agreed whether or not PyTorch’s generators are thread safe
or not. Historically, we did protect them with a mutex. This is like back to the TH days. So
they’ve kept the mutex as time has gone along. One common anti pattern, which you should be careful
about, is the mutex is just protecting the RNG state. So if you’re like doing something like Philox,
you don’t actually need to hold on to the RNG lock for the entirety of your CUDA kernel launch.
You just need to take out the lock and then update the offset and then you don’t need the lock anymore.
So, you know, try not to like lock the entire things, right? The lock is just for accessing
the internal state. But at some point, we should probably figure out how to get rid of the locks,
because they’re not really adding much. You probably should deal with locking concurrent access to a
generator yourself if you’re sharing a generator across multiple threads. In Python, this is hard to
do because, you know, there’s a global interpreter lock, so you’re usually not running on multiple
threads anyway. And that’s most of the important stuff about the generator state in PyTorch.
Right. There’s these generator classes. They contain the state necessary for generating random numbers.
And then various kernels use that state to actually, you know, run the algorithms and output, you know,
random floats or random doubles or whatever it is that you need to do. There’s some interesting stuff
also on the front end, which is how to generate random numbers given a, you know, like sort of uniform
a set of random bits. Right. Like, for example, if you want to generate a random double, you can’t just
take a, you know, a random integer and then cast it into a floating point bit pattern directly, because
that would just be totally non-uniform. Right. Because, like, most of a double’s bit space is taken up encoding NaNs. So you’d get NaNs most of the time. So there’s, like, a bunch of algorithms for doing this sort of thing. And I’m not really going to tell you about all of them. You can, like, read through the source code and check them out yourself. They’re actually pretty short
and they have cool names. And like, you can read the Wikipedia articles about how these things go.
There is one thing that is kind of interesting that I do want to point out. And that’s when we want to
generate normally distributed values. So, like, your good old fashioned torch.randn, um, we use this thing called the Box-Muller transform. And the way the Box-Muller transform works is that, um, you sample two uniform doubles between zero and one, and then you sort of look at what the, uh, angle and the, um, length of the vector pointed to by these things are, and you can use that to get out the, uh, normally distributed samples. But the thing is that to do one of these Box-Muller samples, you have to first sample two doubles and you get out
two new doubles. And that’s a little awkward if you, you know, only wanted one normally distributed
double. So the way that this, um, works is actually most of our RNGs have an extra little bit of state,
which is a cached normally distributed value. And so, um, it’s like, okay, well, I got these two random numbers, but I only needed one of them. The next time you ask for one, I’ll give you that instead of having to, like, sample two doubles to produce only one, which would be bad.
And, you know, you want to reduce the amount of RNG you chew through in this case. That’s like,
that’s why there’s, you know, these, um, next normal, uh, fields on the generator state. It’s for
dealing with normal numbers and normally distributed numbers and, you know, normal distribution is really
important. So like, it’s worth special casing, this kind of situation. Another thing that is
kind of interesting about, um, you know, like transforming these random numbers is that, um,
the boundary conditions can be pretty nutty. Like, you know, people actually care, uh, when you’re
sampling a floating point number, if you’re zero inclusive or exclusive, and if you’re one
inclusive or exclusive, and this is like, because like dividing by zero is pretty bad.
And like, yes, maybe this only happens, you know, uh, one in every two to the 32 times, but like,
yeah, that’s bad. And we’ve had a bunch of like very nasty bugs where like, if you like run the thing,
like 40 million iterations, like once upon a time, it gives you an impossible value. And
you know, over time we’ve fixed a bunch of these. So that’s another thing that like,
you have to be careful about when you’re working on random numbers.
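Here is the basic transform in plain Python, just to make the shape of it concrete (a sketch; the real kernels are vectorized and cache the second sample, as described above):

import math, random

def box_muller(u1, u2):
    # u1 must stay away from 0 so that log(u1) is finite; this is exactly the
    # kind of boundary condition discussed above.
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)   # two normal samples per call

u1 = 1.0 - random.random()   # uniform in (0, 1], so the log is safe
u2 = random.random()
z0, z1 = box_muller(u1, u2)  # a real generator would cache z1 for the next request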
Okay. So that’s most of everything I wanted to say about random numbers. There’s one last thing
I wanted to say, which is: what if you want to, like, take your own RNG and then sort of re-implement all of the functionality in PyTorch on top of it? Like, you know, basically plug in your new, uh, cryptographically secure RNG instead of Mersenne Twister and then, like, get out normal numbers and, you know, exponential distributions and all that stuff. Well, this is something that Pavel Belevich needed to do for, um, CSPRNG, which was specifically for cryptographically secure random numbers for, um, some of the crypto projects that are going on on top of ATen. And, um, so this is kind of tricky, right? Because as I said
earlier, there’s no virtual interface for getting numbers. If there was a virtual interface for getting
numbers and the performance was acceptable, you could just, you know, virtualize the generator
object and then swap out your own generator object whenever you wanted like a CSPRNG, or just, you
know, want to do something besides Mersenne Twister, but we can’t do that because that’s too slow.
We need direct access to the generator state when we’re doing one of these vectorized things,
because we’re doing it in a fast loop and, you know, we need everything to inline in that situation.
So what’s actually happened is all of our transforms, our random number transforms, are templates. And, um, so once you define your custom, uh, RNG, you, um, instantiate all the templates for your RNG. And then that ensures that everything gets inlined and you get a fast implementation in that situation. And so all you need to do is just make sure your generator has a, um, you know, distinct dispatch key. And that will make sure you dispatch to your particular, um, you know, random number algorithms instead of anything else. That’s a pretty nifty use case of
the dispatcher. Um, Sebastian and I used to argue a lot about whether or not generators should have
dispatch keys or not, but like, this is pretty nice. So I like it personally, at least. Okay.
That’s everything I wanted to tell you about RNGs. Talk to you next time.
EP36 TensorAccessor
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about TensorAccessor, a way of accessing elements of tensors when the dimensionality and dtype are known.
In previous podcasts I’ve talked a little bit about the API design principles behind our C++ API
and one of the characteristics of tensor in C++ is that it is completely type erased. You get to
know you have a tensor but you don’t know what its dtype is and you don’t know what its dimensionality
is. Doing things this way makes polymorphism easy because you don’t have to write templated code
but this type erasure has costs, namely performance costs and it’s for this reason that like other
C++ libraries that do tensor computations often do in fact encode this information directly in the type. So for example Eigen, a very well-known and popular library known for its fast implementations of kernels, uses fixed dimensions inside the tensor type itself. So what’s the problem? So the problem is when you don’t know what the dimensionality of a tensor is and you don’t know what the dtype of a tensor is,
in order to do operations on this tensor safely, you have to do dynamic checks.
So if you want to, you know, retrieve an actual element like an honest goodness single element
from the tensor in question, you are going to have to say what dtype you want to fetch it into, like float or double. And technically speaking, unless you provide an unsafe API, you need to test that the dtype of the tensor actually matches the type you want to read the element out as. Otherwise, you can silently read out complete garbage in this situation. And so if you think about, like, the data pointer API on tensor, it
actually does in fact do a dtype check whenever you do this. Similarly, when you want to index into a
tensor, well, if you don’t know what its dimensionality is, then you have to actually write code that knows how to
loop over all the indices you want to do and multiply with the strides in question. And so, you know, because
you don’t know how many dimensions there may be, right? So you can’t write a fixed index calculation in this
situation, you have to write a loop that can handle all the sizes in question. And so if you’re using tensor iterator, you know, it’s doing a lot of hard work to make sure you can write an algorithm that works with arbitrary dimensionality, and that’s cool. And tensor iterator is kind of complicated, but it does all that for you.
But if you’re just writing a good old fashioned kernel, you probably don’t actually need this
generality, you’re probably only writing a kernel that works for some set of dimensions, etc. So if you want to do lots of low level manipulations to the data in your tensor, and you don’t want to go through all the overhead that tensor involves, and yes, you could write a loop over a tensor and then say directly x[index] = blah. But trust me, you really don’t want to write your kernel that way. It’s really, really slow, because each of these indexing operations is actually going to give you another tensor back, even if it’s actually a scalar, a single number, you’re going to do an entire dynamic allocation in that case. So if you want to do this sort of
thing fast, what do you do? And so the sort of like very easy way to handle stuff in a situation
is to get out a raw pointer and do the manipulations on it. It’s the obvious thing to do, right? Because
you know, what are C arrays? Well, C arrays are, okay, they’re not exactly the same thing, because, like, the type size is different. But, like, a C array is basically a pointer to some memory. And then, you know,
you just operate on the memory. So what do you do? So if you have a tensor object, you can call data
pointer to get out a raw pointer, that is going to give you a fixed d type. So it’s going to check
what the d type is. And then you can just poke it, you know, index into it the same old way you’d have
indexed into any sort of array, and, you know, work with the data in the tensor that way. There are a few
implicit assumptions that are going on when you do things this way. So one is that you are probably
assuming that the data in question is contiguous. Why are you probably assuming that the data in question
is contiguous? It’s because handling strides is actually a pain in the ass. And so you probably
aren’t going to go through all the rigmarole of doing strides exactly correctly, with the pointer
in question. If you do it this way, you’re probably more likely to just, you know, directly compute some
linear index, or you know, you have a one dimensional tensor, and you just can index directly, and you’re not
going to handle that. So whenever I see kernels that are written directly using raw data pointers,
I usually assume that they are assuming contiguous inputs. The only exception is if I’m like FFIing out
to some external library, where they have to take a data pointer, and then they take the strides as separate arguments. So, raw pointers: very easy, but typically only used for contiguous tensors.
But what if you want to do some accesses, and you happen to know that you want to handle strided
things directly, you don’t want to actually go through the process of taking a possibly non contiguous
tensor, you know, allocating memory to contiguify it, and then run your kernel on it. Contiguifying a
tensor, by the way, you know, is kind of slow, and it uses up memory. So if you can just directly fuse
your computations directly on the input tensor, that can save you quite a bit of computation. And this is
what tensor accessor is for. So what is tensor accessor? Tensor accessor is a specialization of
tensor, where the dtype and the dimension of your tensor are fixed. However, we don’t make any claims about the sizes or strides. So the sizes and strides continue to be, you know, sort of carried dynamically in the class in question. And so if you look at what the representation of tensor accessor is, it’s very simple: it consists of a data pointer, a pointer to the sizes, and a pointer to the strides. In fact, tensor accessors are really lightweight, and they don’t involve any dynamic allocations, because they’re also non-owning. Unlike regular tensors, which, you know, guarantee that the data pointed to stays live for the lifetime of the tensor in question, they’re non-owning, so they’re really cheap to allocate. And lastly, right, as I said, they have statically known dtype and dimension;
the statically known dimension is important, because it means that we can implement index calculation
without doing any loops, right. So like how it’s implemented in PyTorch is, it’s actually a recursive
template, where, you know, like the tensor accessor for n is computed by doing the tensor accessor for n
minus one, and then, you know, adding on the indexing for the last dimension that we’re processing. And
then there’s a base case for tensor accessor, 1d tensor, where you can just linearly index in that
situation. By the way, this is a nice thing about being in C++: in the bad old days of TH, these fast indexing operations were manually specified for every dimensionality. So there’s, like, a 1d fast index,
a 2d fast index, 3d fast index, 4d, and so forth. Tensor accessor also optionally supports declaring
the pointer as restrict. What that means is a pointer that’s restrict is guaranteed not to alias with any
other pointers that are in scope. And sometimes that can unlock easier compiler optimizations. We use this
very rarely, but it’s often useful in CUDA, where non aliasing is a useful guarantee. There’s also a
variation of tensor accessor called packed tensor accessor. So I said tensor accessor is non owning. So
it, you know, contains a pointer to the sizes, which are actually stored in the good old fashioned
traditional tensor in question, and a pointer to the strides, which are also stored in the old tensor
in question. But sometimes we want to send these like, you know, raw pointers plus metadata to CUDA
kernels. And with CUDA kernels, you have to send all this information. If you have this pointer to some random CPU
memory, well, of course, your CUDA kernel is not going to be able to access it because CUDA kernels can only
access CUDA memory.
So you have to pack everything up into the parameter list that, you know, is going to be sent along with the CUDA
kernel launch, and packed tensor accessor basically just packs all of the sizes and strides along with the data
pointer directly into a, you know, compact representation. Remember, it’s fixed dimension, so we can allocate
precisely the amount of fields we need to actually do this sort of thing. And then, you know, you can ship them all to CUDA
all at once so that CUDA can then use these to compute the indexing. And for CUDA, like computing indexing is pretty cheap because,
well, you know, it’s CUDA, and you’ve got tons and tons of little processors that are doing these computations in parallel.
You’re more likely to get hosed by, you know, memory bandwidth, because you know, you’re accessing stuff all over the place.
So let’s just step back a moment. So suppose you’re writing a kernel in PyTorch, and you need to actually do some
manipulation on the data in question. Well, um, there are a few things you can do, right? One is you can like directly
use the tensor API. And that’s okay, if you’re going to just call a bunch of other like sort of accelerated
operations, but it’s a bad idea if you actually want to do like element by element operations. Then there’s raw pointers,
which are sort of the easy and obvious way to do things, but they don’t, uh, do any of the bookkeeping
for strides for you. So usually people only use them when they assume contiguous inputs. So you’ll see, you know, um, a call to contiguous() on the input and then get out a raw pointer and do something with it.
And finally, tensor accessor knows about sizes, knows about strides, and so can let you do fixed
dimensionality indexing on tensors that might have, you know, wacky layout without having to do the,
you know, sort of indexing math all by yourself. It’s handled for you automatically under the hood.
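The index math it does for you is just a dot product with the strides; here is the idea checked against a non-contiguous view in plain PyTorch (a sketch of the concept, not how the C++ accessor is actually written):

import torch

base = torch.arange(12.).reshape(3, 4)   # contiguous 3x4 tensor
view = base[:, ::2]                      # non-contiguous view: sizes (3, 2), strides (4, 2)

idx = (2, 1)
# What an accessor computes under the hood: storage offset plus index-times-stride.
flat = view.storage_offset() + sum(i * s for i, s in zip(idx, view.stride()))
print(base.flatten()[flat])              # same element as view[2, 1]
print(view[idx])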
One, uh, current limitation of tensor accessor is that we don’t define any operators on them. So once you go from a tensor to a tensor accessor, uh, you can’t, like, view the tensor and you can’t, for example, reshape it. Actually, we had an old version of packed tensor accessor called THCDeviceTensor. That was part of the THC library. And this, uh, tensor did have a bunch of operations on it. And there’s no reason you can’t implement these operations; in particular, anything that’s a view is a really good match for tensor accessor, right? Because tensor accessors are non-owning anyway.
So you’re usually just fiddling around with the size and strides. So this would be a really nice
feature to add to PyTorch. No one has really done it yet, but it would be useful. Another thing that
I’ve been thinking about is, um, sometimes we get to know that a tensor is some dimensionality,
um, fairly early in the stage of a sort of multi, uh, operator composite function. And it would be
nice to not have to keep, you know, doing the dimensionality check, uh, locally at the kernel site whenever you need to use it. Like, it would be nice to do it once and for all at the beginning
of a composite kernel and then pass on this information statically to the kernel you’re going to call
later on. Of course, this is rife with difficulties, right? Like if you want to be
polymorphic over the D type in this way, your kernels have to be templated, but it’s a kind of
interesting problem about, you know, like how can you write code that doesn’t need to be template
instantiated, but can still propagate, um, type information like this. And so maybe, you know,
having some sort of fixed-dimensionality-but-not-fixed-dtype tensor type might let
you do that, but I don’t know. That’s something that I’ve been thinking about. That’s everything
I wanted to talk about today about tensor accessor. Talk to you next time.
EP37 Anatomy-of-a-domain-library
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about the anatomy
of the domain libraries that we also work on here at PyTorch, namely TorchVision, TorchAudio,
and TorchText. I’m not going to talk about the libraries in particular, any one library in
particular, but TorchVision is definitely the most well-developed and most featureful of the domain libraries. So many of the things that I’m going to say are based off of things that I know about
TorchVision. All right, so here’s a question. Why do we have domain libraries? Like why isn’t PyTorch
just one giant repository that contains, you know, how to do distributed computing and how to do
operators and how to do profiling and tons and tons of stuff? And then, you know, what’s wrong with, you know, throwing in a little bit of, you know, image processing operators or, you know, text processing
models? Like, you know, multi-head attention is in PyTorch Core. Why can’t everything else be?
And so there’s a few reasons why the domain libraries exist as separate libraries from PyTorch
Core. So one is that in particular domains like image and text and audio, there are often very domain specific gadgets. These gadgets don’t really make sense in any other context. Like, for example,
in Vision, you need to have a JPEG decoder because you are commonly working with images that are in JPEG
and you need them to be actually pixels so that you can start doing, you know, deep learning on them. And
it would be pretty strange for PyTorch to come with a JPEG decoder and a, you know, wave decoder and,
you know, every, you know, file format known under the sun. So the domain libraries exist because
there’s a lot of extra stuff that you need to actually do work in one of these domains. But we
just don’t want to keep shoveling everything into the main PyTorch library because that makes your wheels
bigger. And it just, it can get pretty out of control, especially because there are lots and
lots of things you might want to do. So domain libraries give us an easy escape valve where we can say,
oh yeah, you know, this stuff is great. We want to support it, but it just doesn’t go in the main
PyTorch library. It’s going to go in one of these extra libraries. And yes, sometimes there are operators
that like, you know, might be useful in multiple domains, but usually it’s pretty obvious where they
should go. The point about JPEG decoding is also another good point because another thing that you
often need when you’re working in a domain is there are a bunch of other libraries that you might
actually need, like FFmpeg or LibAV or Pillow, et cetera, et cetera. And once again, it would be
pretty suboptimal if, you know, when you installed PyTorch, it also got all of these dependencies along.
So another good metric for, you know, should I make a domain library or, you know, should I not
is, you know, are there any dependencies you need? If there are no dependencies, well, PyTorch might be a
good place to put it because, you know, PyTorch tries to keep a very slim dependency set, only the
bare minimum that you actually need. We actually even got rid of our NumPy dependency in 1.9. This
was accidental. We didn’t actually mean to do this, but when people realized this is what happened,
we said, okay, sorry, we broke a bunch of people’s code, but like, it’s better for PyTorch to not actually
have a required dependency on NumPy. So if we don’t even want to depend on NumPy, well, we certainly
don’t want to depend on FFmpeg, and domain libraries also let us do it this way. Another reason why
domain libraries exist is they actually have different contribution models than PyTorch main
repository. If you’ve ever submitted a pull request to PyTorch PyTorch, you may notice that after, you
know, code review and all that regular stuff, someone actually has to go ahead and import that
diff into Phabricator. That’s Facebook’s internal CI system. And then only then, like, there’s some,
you know, complicated landing process that, you know, if you’re external to Facebook, you’re not really
privy to, but eventually, maybe a week or two weeks later, your PR gets merged. Oof, that takes a really
long time. Hopefully, like, it’s not too bad. One of the things that I’ve worked on a lot is making it
easy for open source people to work on PyTorch. But yeah, that can be quite a bit of a lift.
Unlike PyTorch PyTorch, all the domain libraries don’t directly sync with Facebook. So we actually
have many external contributors who have direct commit access to these repositories, you can land
stuff a lot easier. And sort of, there’s a sort of calculation we’re doing here, which is that
why does PyTorch PyTorch, you know, insist on every, you know, commit you land, also immediately going out
to Facebook production when you land it? Well, it’s because PyTorch has a lot of moving parts, there are a
lot of systems that depend on it. And so we can help de-risk this by like continuously deploying
our changes, and like just seeing as soon as possible, when things break, because there’s a lot of moving
parts, there are a lot of interactions. It’s better to, you know, learn about them early rather than in some
weekly release where like, oh my god, there’s so much stuff, and we broke everything, and no one has
any idea what broke what. But domain libraries, one, have less applicability, right? Like you’re not going
to use a text library if you’re doing a vision processing task, unless you’re like, you know,
doing labeling or something like that. Deep learning is all sorts of, you know, interesting cross-pollination.
But furthermore, there’s way less code in there, right? Like it’s mostly stuff that
is specific for the domain in question. And so it’s not so bad to just periodically sync.
In this case, it can be a little bit troublesome, but it’s less bad. And so, you know, sort of these
repositories live on separate ends of the scale. So yeah, if you want to move fast, and you want to like,
be able to like, you know, sort of work on things very rapidly, it’s usually a lot easier to do that
inside domain library than outside. Okay, so that’s a very developer specific viewpoint on domain
libraries. And the next question I want to answer is, what does a domain library do, right? Like,
so when I talked about what PyTorch is as a project, well, what do we do, we give CUDA accelerated
operations that have automatic differentiation, and you know, a bunch of like extra stuff to make it
possible to do stuff around it, like, distributed and stuff like that. So domain libraries are really
very similar to many of the things that we do in PyTorch core, right? So one of the bread and butter
things for a domain library is it implements operators, like ROI align that don’t make sense
in a general context, but are very useful in the, you know, context of the domain in question.
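As a rough sketch of what calling one of these domain-specific operators looks like from Python, here is ROI align via torchvision.ops (the exact signature may vary a bit across torchvision versions; the shapes here are just for illustration):

import torch
from torchvision.ops import roi_align

# One feature map with 16 channels at 32x32 resolution.
features = torch.randn(1, 16, 32, 32)

# One region of interest in (batch_index, x1, y1, x2, y2) format.
boxes = torch.tensor([[0.0, 4.0, 4.0, 20.0, 20.0]])

# Pool the region into a fixed 7x7 grid, as detection heads expect.
pooled = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([1, 16, 7, 7])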
Actually, in the old days, even Torch Vision used to be a pure Python project. So actually,
these operator implementations would just become compositions of stuff you found in PyTorch core.
But as time went on, you know, there’s a need to have accelerated kernels. And so Torch Vision and
most of the other domain libraries are proper C++ libraries. And they come with actual optimized operator
implementations for various situations. And these are also done with autograd support, because obviously,
you want to train your models. And yes, we provide CUDA kernels, because GPU acceleration is a really
important thing of what, you know, makes deep learning tick today. So that’s very normal. But
there’s also some operators that you’ll find in a library like Torch Vision that are unusual, like not
sort of what you’d expect to see. So for example, one of the things you need to do a lot in a domain is you
need to be able to encode and decode the file formats for your domain, like, you know, the JPEG example I gave
earlier. And as I said, you know, what a domain library is doing for you is it’s getting all the
dependencies. So most of our domain libraries don’t actually implement the nuts and bolts of
encoding and decoding, because there are plenty of good open source libraries for doing this. But what
you know, the domain library is going to do is it’s going to take care of getting the dependencies for you,
either, you know, like, because there’s some other conda package that does it for you, or maybe it’s some
library that’s very annoying, like SoX. And, you know, if you had to install it yourself on
Windows, that would be really annoying. But fortunately for you, Torch Audio actually just bundles it with
the binaries in question. So you just can use Torch Audio directly and get those implementations. And,
you know, sometimes we even, like, create custom objects for representing various concepts in them;
there’s an API in PyTorch called TorchBind for representing these things. And so you know,
it’s both the data model, as well as operations for working on them.
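From the user's side, the bundled decoders end up looking roughly like this (a sketch assuming the torchvision.io and torchaudio APIs; the file paths are placeholders):

from torchvision.io import read_image   # torchvision ships a JPEG/PNG decoder
import torchaudio                        # torchaudio bundles SoX and friends

# Decode an image straight into a uint8 tensor of shape [C, H, W].
img = read_image("photo.jpg")

# Decode an audio file into a float waveform plus its sample rate.
waveform, sample_rate = torchaudio.load("clip.wav")
print(img.shape, waveform.shape, sample_rate)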
There are a bunch of other things, though, beyond operators that a domain library does. So for example,
domain libraries often come with models, and, especially important, they come with pre-trained
weights. Pre-trained weights are wonderful, right? Because not everyone can be Google and have a
bazillion, you know, TPUs to, like, train your model. Well, yeah. So, you know, pre-trained weights let
you, you know, if you don’t have that many GPUs, you can, like, use something that someone trained on a big
data set, and then like, go and fine tune or like, you know, try to put things together that way. So,
you know, in Vision, there are plenty of vision models, like the good old-fashioned ResNet, but also a lot of
more modern models. And, you know, Torch Vision, the intention is to actually track, you know, the models as
things go on, and just be a one stop shop, like, okay, you’re a researcher, you need a, you know,
reference implementation, because you want to compare against some baseline, cool, Torch Vision’s
got you covered. Or, you know, maybe you want to take some model and then tweak it. Well, you can also
look in Torch Vision and get the models that way. Similar to models is data sets, right? We talk a lot
about in deep learning how, you know, models are the stuff we’re training, but a model is only really as
good as the data you feed it. And there are a ton of, you know, well known data sets that, you know,
are done for various tasks. And Torch Vision makes it easy for you to like, get all those data sets in a,
you know, uniform API, and then feed them to data loader, which you know, you can use to kick off the
rest of your PyTorch program. And you know, like, they even have reference scripts, right, to like, show
you how to do the end-to-end training you actually need to establish a baseline, or to do some sort of,
you know, ablation study or something like that. There’s a few other things that like are not as
obvious. So one is that, as I said, domain libraries often need various dependencies, and they take care
of making sure all these dependencies are available for you. And one of the important things that, you
know, makes this possible is we actually do distribute binary packages for domain libraries,
right? Like this is, this is probably like one of the hardest things about like running a domain
library is when release rolls around, and you need to build binaries. And like, building binaries is
very complicated, because you need to do it on all the platforms, and you need to get all the
dependencies, you make sure they’re linked correctly, and stuff like that. And so working inside a domain
library, that is one of the things that they do for you. It’s also one of the reasons why it’s a little,
it’s a little hard, we’ve been stuck at three domain libraries, plus a few experimental ones for a
while, because it is a lot of bring up to get all the packaging going. But it’s one of the value
adds of, you know, working inside one of these domain libraries. And finally, and this used to
not be true, but it is increasingly becoming more true, is our domain libraries are compatible with
deploying to mobile. At some point, I should do a podcast about, you know, what’s going on with mobile
and PyTorch. But like, suffice to say that, you know, you can take your PyTorch models and run them on
the phone. And we are doing this at Facebook. And domain libraries, right? Well, they contain stuff for
doing images and audio. Well, those are very much the types of things you might want to do on your phone.
And so actually, you know, Torch Vision is compatible with actually running on the phone, despite being in
a separate repository. You know, that’s kind of ridiculous. And I don’t have time to talk about
how that all works. But it’s pretty cool. And it’s another one of the things that a domain library does
for you. So I talked a lot about, you know, why the domain libraries exist and what they do for you.
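To make the models-plus-weights-plus-datasets story concrete, a minimal sketch with TorchVision (older versions spell the weights flag pretrained=True, newer ones weights=...):

import torchvision
from torch.utils.data import DataLoader

# A reference model with pre-trained weights.
model = torchvision.models.resnet18(pretrained=True)

# A well-known dataset through the uniform datasets API...
dataset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True,
    transform=torchvision.transforms.ToTensor(),
)

# ...fed to a DataLoader that drives the rest of your PyTorch program.
loader = DataLoader(dataset, batch_size=32, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)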
And I want to go back and reexamine this question, which is, well, you know, it sounds great to have
the domain libraries being these separate modules that are external from PyTorch PyTorch. What did we
give up when we did this? And in particular, the thing we gave up is that these libraries have to be
loosely coupled with PyTorch. This should be a familiar conundrum to anyone who has ever had to deal
with a system where you had multiple components that had different release cycles,
right? Like if you are in the situation, you’re not in a mono repo where everyone is running off of the
latest version of everything all the time. Well, you know, you can’t just land a change to some base
library like PyTorch PyTorch, and then expect to be immediately able to use it in your library,
right? The base library has to go ahead and do a release. And then you have to go and update your
stuff to actually use it. That being said, PyTorch is not very ABI compatible. So whenever we do a
new release of PyTorch, we always do new releases of all the domain libraries as well.
So we do have some level of coupling, right? Like so if you’re looking at like TorchVision CI,
it actually runs against PyTorch nightlies, right? And because the APIs that the domain libraries use
don’t change that much, most of the time this is working. And actually, the PyTorch main CI
itself also actually tests against TorchVision. So in one of the CI jobs, we will go and we will
build TorchVision from scratch. Remember, TorchVision is not that big of a library. It only does stuff for
vision. It’s not like, you know, a gargantuan library like PyTorch is. So it doesn’t take that
long to compile. And then we can quickly test and make sure that stuff works. But there are some APIs
in PyTorch which sort of move a lot, that we change a lot, like TensorIterator. And, you know,
it would actually be kind of useful to be able to use these tools in domain libraries. But then stuff
will break all the time. So people just don’t do that. They only work on the stable APIs. This would
be kind of nice to like make some improvements on. Like maybe sometimes you might want to, you know,
write a new binary operation that’s very specific for vision. But today, mostly, if you need something
like that, you’re just going to go land it in PyTorch itself. And so you know, it’s just a little
hard to coordinate changes across multiple repositories. So people generally
have evolved the code to not require this in this way. I’m almost done with talking about domain
libraries. One last thing I want to say is that, you know, when you’re working on domain libraries,
the users matter a lot. So here I’m talking to, you know, you developers who, you know, like
writing code and don’t know that much about machine learning, right? So when you’re working on
domain libraries, right? You’re very close to the actual research that’s going on in the domain, right?
Because I talked about how like the libraries provide models, they provide data sets. And like, so you need
to actually be keeping track of what’s going on on the research side. A really good example of this is
Francesco Massa, the main maintainer of TorchVision. Francesco does a wonderful job taking care
of TorchVision. And he also does research on the side, or maybe half, like, you know, one side is
TorchVision and the other side is research. There are a lot of really cool papers that Francesco has been
a part of, and you know, this is like, I think of this as one of the, like, big reasons why TorchVision
is so successful: that we have someone at the helm who, you know, knows a lot about implementing
framework stuff, but also knows a lot about the research stuff. Me, I’m, you know, always in core,
like, you know, C++, you know, core abstractions land. And I actually don’t have to train models very
often in my job function. But you know, in domains, you gotta be doing that sort of thing.
That’s everything I wanted to say for today. Talk to you next time.
EP38 Default-arguments
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to do something a little
different. Normally when I do these podcasts I talk about various aspects of PyTorch, but this
time I want to talk instead about a general programming languages concept in this case,
namely default arguments, which is interesting in its own right and has interesting implications
on various design problems that we have in the PyTorch library. So to start, I have to explain
what a default argument is. Chances are you know what default arguments are, but I just want to
spell it out for a moment. So default arguments are a feature in many programming languages whereby when
you are writing a function, you have a bunch of arguments and some of the arguments don’t have
to be specified. Instead they have defaults, usually specified at the definition site, and those
defaults, if you don’t specify the argument, the argument takes on the value from that default. So
Python supports default arguments. It’s in fact the only way to implement overloads in Python without
using some sort of like fancy decorator business or anything like that. The point of default arguments
in language design is one, they’re a very compact way of defining overloads, right? Like so normally if
you want to write a function that can take three arguments, four arguments, or five arguments,
you have to write each of these overloads as separate definitions and implement them differently.
But default arguments say, oh, I can define the four overload version in terms of the five just by filling in
the last argument with the default and then calling the five in that case. So it lets you write one definition
instead of N definitions. Another important function of default arguments is they give you the ability
to retroactively add more functionality to an API without breaking backwards compatibility with any clients
who were using it before, right? So if you have a bunch of people who are calling the function with two arguments,
you want to add a third optional argument, well, if you make it a required argument, all those call sites break.
But if it’s defaulted, if it’s optional, then all those original call sites keep working, and they will just,
you know, do what the default functionality is in that case. And we use this feature a ton in PyTorch,
because every once in a while, we want to add a few more knobs to, you know, some sort of function or other.
And, you know, if we had to create a new function, every time we wanted to do this, we’d have tons and tons
of functions, and it would be hard to find things. So being able to add extra features onto the existing
names, because the names are a limited namespace is very useful for us. And this is inextricably tied to
another language feature that Python supports, which is keyword arguments, right? So keyword arguments lets you add
new functionality, and also do so in a non positional way. So you like can just say explicitly, what variable
name you want to specify for the argument in question. So put it in other words, default arguments are a way of
canonicalizing multiple overloads to the maximum arity a function may take. Let’s unpack that statement.
So what do I mean by canonicalization? Well, you know, to canonicalize means to put something in a form
that is the same, no matter how you express it, right? So when I take a two argument function,
and then fill out its default arguments so it’s a five-argument call, I’m doing a canonicalization process,
I’m canonicalizing all my function calls so that no matter how many arguments they took, I always see
them with five arguments. And arity is the technical term for number of arguments a function takes. So max
arity just means that we always canonicalize to the maximum number of arguments. I’m
emphasizing this because I’m going to flip this around later in the podcast. One more thing I want to say is that
default arguments imply overloads, but overloads are a more general concept than default
arguments, right? So like in C++, you can define overloads manually, anything you could have written using default
arguments, you can also do using overloads. And in some languages, there’s just no overloading at all.
So it’s not a question of do they support default arguments or not? No, no, no, there’s just no
overloading. And common reasons why people don’t want to put overloads in their language are it makes
type inference more complicated, or you know, it just sort of makes it lets people write code that
might be too complicated, you know, it’s too overloaded. So Haskell is a good example of a language
that doesn’t have overloading. But another one is golang. They also don’t believe in overloads.
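To make the "canonicalize to the maximum arity" framing concrete, a tiny Python example (the function here is made up):

# A made-up op with two defaulted arguments.
def add_scaled(x, y, alpha=1.0, out=None):
    result = x + alpha * y
    return result if out is None else out

# Every call reaches the body at the *maximum* arity: missing trailing
# arguments are filled in with their defaults before the code runs.
add_scaled(1, 2)             # behaves like add_scaled(1, 2, 1.0, None)
add_scaled(1, 2, 3.0)        # behaves like add_scaled(1, 2, 3.0, None)
add_scaled(1, 2, alpha=3.0)  # keyword spelling of the same canonical call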
So default arguments are pretty handy. We use them a lot in PyTorch. But we’ve also had a lot of trouble
that has come from, you know, sort of taking the very Python centric approach to default arguments.
So I want to explain a few of the problems that we’ve run into over the years. So one problem,
which I have also mentioned in the serialization podcast, is that we have a forward compatibility
problem with TorchScript serialization. Okay, so what’s going on here? So when you write a PyTorch
model, and you TorchScript it into some sort of like representation for the IR in question, when we
serialize it, in old versions of PyTorch, we serialize these with all the default arguments written
out, right? Why do all the default arguments show up when you serialize? Well, remember that default
arguments are canonicalization mechanism, right? So by the time we’ve gotten to the TorchScript IR,
there’s actually only one representation for any given call. And that’s the canonical form,
which is the max arity, as I said. And when we serialize, and we look at these function calls,
well, they’ve already been canonicalized to max form. So the simplest and easiest thing to do
is to serialize them back out into the TorchScript model format with all of the arguments, because
that’s what the input IR had. And this is a forwards compatibility problem. So forwards compatibility
refers to when you do something, does it work with previous versions of the software? And so the
problem is, if I add a new optional argument to a function, I will start serializing code that has
this argument explicitly filled in. But old versions of PyTorch won’t have that argument, and they will
choke when that new argument shows up. So this is a sense in which like canonicalizing in this way,
like reduces the amount of, you know, implementations in the back end that are possible, right? Like
previously, if I had a function that could only deal with four arguments, as long as I passed it only
four arguments, it would be fine. But once I pass it this fifth argument, even if it’s the default value,
even if it would have behaved exactly the same way as the four argument version,
I’m stuck because, you know, the back end, the server doesn’t actually know that this is the case.
So the TorchScript serialization FC problem is one manifestation of troubles with default arguments.
But actually, there are other manifestations as well. So let’s talk a little bit about XLA and
backend extensibility. So backend extensibility says that you can define your own custom device on PyTorch,
like XLA or, you know, anything else, and then define implementations for all the operators in PyTorch.
And how do you define these implementations? Well, you define the max arity implementation for any
given function. So if a function has a bunch of defaults in it, you have to write a function that
handles all of the defaults. So what do you think happens when I add a new defaulted argument to a
function in PyTorch? Well, the backend extensions all break, right? So like, if you’re
remembering that, like, things break in XLA, well, that’s because usually
people are adding new things to the schema. And because our current API, if we’re doing backend
extension requires you to implement once again, the max arity implementation, whenever this happens,
someone has to go to XLA and add support for new argument in question. And that can even be just as
simple as like testing if it’s the default value. And if it’s not raising an error, but they have to
intervene because the APIs require you to provide the max thing. So it’s, it’s strictly BC breaking
from the perspective of the server. One last example is let’s say that you’re in FX. So FX is our
transformation framework in PyTorch. And you want to, you know, do a bunch of ad hoc transformations on
your model to like get it into some other form. Maybe you want to shard it or, you know, you know,
you want to view some things. Very common feature for FX passes is they’re very specific. They’re
very domain specific. So you’re not like trying to write a general pass that’ll work in all cases.
There’s probably some particular use case you’re looking at, and you’re going to ignore most operators
and only the few operators you really care about are the ones you’re going to do. And so if you’re
doing one of these FX passes and previously an operator like had two arguments, you might write your FX
pass under the assumption that when this operator shows up in your IR, there’s going to be two arguments
in it. And once again, if I add a new optional argument to it, and so now it gets canonicalized in
the IR to have three arguments. Well, oh, no, all your old, you know, code doing this transformation pass
is broken because, well, it doesn’t know how to deal with this third
argument, even though this third argument, most of the time, if you use the default values would have
been semantically equivalent to the two argument version. So all three of these examples are the
same, they’re just, you know, different sides of my three-sided dice, which isn’t a thing, but
right, it’s, um, the problem is default arguments are really good for maintaining client compatibility,
they’re really good for maintaining compatibility with the caller of code, but they’re really bad at
maintaining compatibility with the so called server, the implementer of the code, right? Because,
uh, under the sort of Python model, uh, you have to deal with all the arguments, because immediately
what happens once you have called one of these defaulted, uh, functions is you get all the
arguments and now you’re expected to handle them all. Okay, so how did TorchScript solve the
serialization FC problem? I think I claimed in a previous podcast that it wasn’t solved. It actually
is solved now; the fix landed within the last few months. And they did a very, very cool and useful
hack. And this is canonicalization to low arity. So what do I mean by that? So imagine that, you know,
I’m doing one of these calls. So I might go through Python and Python is going to go ahead and canonicalize
to max arity because I don’t have a choice. That’s how default arguments work in Python. And it’s going to
wend its way through my system. And eventually I’m going to get to serialization time. And I’m going
to be like, Hey, I need to write out some code that represents this argument in question. What should I
write out? And so what, uh, canonicalization to lower arity says is, Hey, let’s look at the arguments
and see if they’re actually the defaults. And sometimes, you know, they’re going to be dynamically
computed. So I’m not going to know. So I, I have to actually, you know, pass in a real tensor.
A lot of times they’re constants. And so I can just compare the constant against what the default
is supposed to be. And, Oh, look at that. The last two arguments are actually the defaults.
And so in this situation, what I will do is I will chop off those arguments and serialize the lowest
arity, um, version of the function that accurately describes the semantics of the call in question,
right? So basically like, look at the suffix of arguments, all the defaults get dropped and there you
go. And so you can see that this solves the FC problem because even if I, you know, call some
code and it fills in the default that was new in my new version of PyTorch, and then, you know,
wasn’t supported by the old version, as long as it’s the default value, PyTorch will know to remove that
argument, um, in the end. And then I will end up with a, you know, lower arity function that my old
extension knows how to do. And we can actually play this trick again for like all the other cases we
haven’t yet, but like one of the reasons I’m recording this podcast is, um, a recent realization
that we should apply this same technique in those other cases. So like if you’re a backend extender
and, you know, you have written a function that only knows how to deal with some amount of arguments,
ideally we would, you know, chop off the defaults so that your code would still work in that situation.
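Here is a rough, runnable sketch of that chop-off-the-trailing-defaults idea; the schema representation is made up and this is not the actual TorchScript serializer, just the shape of the trick:

# Each schema entry is (argument_name, default), with a sentinel for
# arguments that have no default.
NO_DEFAULT = object()

def canonicalize_to_lower_arity(args, schema):
    # Walk the argument list from the back, dropping any trailing argument
    # that is exactly its schema default.
    trimmed = list(args)
    for value, (_name, default) in zip(reversed(args), reversed(schema)):
        if default is NO_DEFAULT or value != default:
            break
        trimmed.pop()
    return trimmed

schema = [("x", NO_DEFAULT), ("y", NO_DEFAULT), ("alpha", 1.0), ("out", None)]
print(canonicalize_to_lower_arity([1, 2, 1.0, None], schema))  # [1, 2]
print(canonicalize_to_lower_arity([1, 2, 3.0, None], schema))  # [1, 2, 3.0]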
And of course, um, this is kind of hard to do in C++, but we are working on this new Python,
backend extensibility mechanism called Torch Dispatch. And there we actually can do this,
and it’s not too hard to do, and we should do it. So there you have it, right? So default arguments
are this way of canonicalizing your function calls to their max arity form, but max arity is bad
for servers, right? It’s good for clients, it’s bad for servers. And so what you want to do instead
is if you are transitioning back across the sort of abstraction boundary to the extensibility point
on, on the backend, um, a good technique to apply in this case is to re-canonicalize back to lower
arity, chopping off the default arguments that are not necessary. There’s like a sort of meta lesson
that I took from this, right? Which is that, you know, we designed our, uh, API, our JIT schema API
off of Python language design because PyTorch from the very beginning was a Python language, um, library.
And so we assumed that overloading was possible. We assumed all these things and doing that,
you know, gave us a very nice, easy to use API for users. And it was kind of bad for backwards
compatibility and forwards compatibility. Right. And, you know, when a lot of people complain
about, um, how Golang like doesn’t give you any toys and like, it doesn’t let you do overloading
and, you know, it’s really ugly writing code in Golang. Um, but I kind of do think Golang has a point,
right? Which is that it’s simpler to do backwards compatibility and forwards compatibility if you don’t
have any of this stuff, right? Because if you don’t have default arguments, then like, if you want to
add a new version of the function that has another argument, you’re just going to make a new
function for that. And you just don’t run into any of these problems, right? Like the language design
of Python, um, puts you into a situation where you have to remember to re-canonicalize to lower
arity. But like, if you have just separate functions, you don’t have to deal with that. But of course,
doing it this way is ugly and verbose. And so at PyTorch land, we want the best of both worlds. So,
you know, we need to strike a balance and the hack of, you know, going to lower arity is a pretty
good balance in my opinion. One last thing, which is that my PhD thesis was basically on exactly this
topic. And I was very happy. I didn’t have to worry about overloads because Haskell doesn’t have
overloads. And like, once again, like very easy. And we, we had to deal with like other stuff like
type classes, type classes. Oh my God, such a, such a pain. All right. So that’s it for today. Um,
I want to explicitly credit, uh, Dmytro Dzhulgakov. Um, uh, we had a chat before this podcast recording
and he helped, uh, me solidify some of the things that I want to say here. Um,
that’s all I wanted to say for today. Talk to you next time.
EP39 CUDA-graphs
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about CUDA graphs, an NVIDIA mechanism for reducing kernel launch
overhead and, you know, sort of putting all your CUDA kernels together into one megakernel
that you can run really fast.
So why does CUDA graph exist, right?
So to understand this question, we have to think a little bit about how the CUDA programming
model works.
So the way the CUDA programming model works, and see my previous podcast about enough CUDA
to be dangerous, the way the CUDA programming model works is we have a bunch of kernels that
the CUDA, you know, GPU knows how to run.
And you run your host code, regular old CPU code, and you figure out what kernels you
want to run, and you queue them on a stream.
And, you know, like whenever the CUDA driver gets these kernel launches, it actually goes
ahead and runs them on your GPU.
And so if your data is really big, and, you know, like it takes a long time to run various
things in the GPU, after a short launch latency, the latency that it takes to get to the first
CUDA launch, then you will basically just queue a bunch of kernels to be run on the stream.
And, you know, CUDA will just go ahead and try to, you know, run them as fast as possible
when the previous work gets done.
But sometimes, um, your code is too small, and it runs too fast, or maybe NVIDIA’s graphics
cards are way too fast.
And you’ve got a problem, which is you just can’t keep up with the GPU, you can’t feed it
enough to keep it utilized.
And, um, you know, when you’re in this regime where your tensors are really small, and you
have a lot of itty bitty kernel launches, the kernel launch overhead actually can be pretty
killer.
And so CUDA graphs are a solution for this problem.
What a CUDA graph lets you do is it lets you take a whole bunch of kernel launches and bundle
them up into one giant mega kernel launch, so you don’t have to deal with the kernel launch
overhead.
And, you know, you can, you’ve gotten rid of all that overhead, you’ve gotten rid of the
overhead of running the host code, so your CPU overhead is also lower, your CPU utilization
is also lower.
And then you can just go ahead and, uh, you know, run this over and over again.
Okay.
So that’s the concept behind CUDA graphs.
But if, um, I told you, Hey, uh, I need you to go implement CUDA graphs for me.
Um, you might think about it a bit and then you might realize, actually, this is not so easy
to do, right?
Like, so normally, um, and like, if you’re, say, MLCompute at Apple, uh, you know, this is
what you actually did.
Normally what you would imagine is, Hey, you know, I want some sort of graph representing
the entirety of the computation that I want to do.
And then I’m going to feed it to some sort of, you know, internal engine, et cetera.
And that’s going to, you know, go ahead and, you know, compile it into one mono-kernel that
you can go ahead and send to NVIDIA.
But no such graph representation exists for CUDA, right?
Like CUDA was designed from the very beginning as a streaming, uh, API.
And so what’s actually going on, right?
Is like in PyTorch, we’ve got loads and loads of CUDA kernels all over the place.
They, they don’t even necessarily have to be, um, you know, like have a, uh, publicly visible
name.
They can be in an anonymous name space and they’ve got all of these, like, you know, parameters
that you’re calling them with, right?
Like all the tensors that they want to operate on various, you know, parameters that you’re
passing on the parameter buffer to the kernel, like, you know, whatever scaler you want
to multiply things by or anything like that, like how the heck would you actually assemble
a graph like this?
And so CUDA graphs, like, you know, many other wonderful technologies, such as the JIT Torch
Script tracer, require you to go and run your CUDA kernels first and record a CUDA graph that
you actually then can run again in the future.
That being said, there is an API in CUDA graphs for explicitly, um, building CUDA graphs and
doing modifications to them after the fact, but that’s not the preferred
way of generating a CUDA graph.
The preferred way of generating a CUDA graph is to actually run your code once, and then you
actually get a bunch of CUDA kernel launches.
And by the way, like when you do these CUDA kernel launches, um, you know, we’re going to record
everything about how you launch them, right?
So like what tensors you’re passing to them, what parameters you’re passing to them, all of
that, we’re going to just record as is.
So that means that it’s totally hard coded.
Like if you use some CUDA memory inside your region of CUDA, uh, calls that memory is going
to be the very same memory that a subsequent run of the CUDA graph is going to use.
Because remember, um, Nvidia has no idea what the meaning of the parameters you’re passing
to the CUDA kernels are.
Like it’s totally flexible.
You can, you can pass anything you want.
You can pass any structs you want.
So CUDA has no way of actually just swapping out pointers if you want it to like, you know,
use different memory the next time you run it.
So when you’re doing CUDA graphs, you have to like, you know, make sure that you allocate
your memory in a persistent way so that the next time you want to run your code, you can
reuse that memory for that.
So the model behind CUDA graphs, right, is that you, you run your CUDA code, um, with a
special setting on the memory allocator so that, you know, it gets kept for later.
And then, uh, once you get done, you get this CUDA graph and, um, for whatever the input
CUDA tensors are, you have to go fill them in with whatever the new inputs you want to
run.
And in that situation, you can say, okay, NVIDIA, go run your CUDA graph, and bang,
bang, bang.
It’ll go ahead and run the kernels exactly as they did previously.
Oh yeah.
And one last thing, because, um, you know, how exactly do, um, CUDA graphs know, uh, what
kernels to actually record?
Well, actually they’re stream based.
So remember, um, the stream in CUDA is this queue that keeps track of all the operations
and what ordering they need to run in.
Right.
So if you put things on the same stream, they’re guaranteed to run in the order they got put
in the stream.
Of course, if you have multiple streams, then they can run in any order.
And it’s a little hard to use streams correctly, uh, because like, it’s a very like fine grain
form of parallelism and like sometimes physically your GPU just can’t do it, but it is a useful
API.
And so CUDA graphs, um, when you record, you’re not recording globally, every CUDA launch, you’re
actually recording CUDA launches on specific streams.
Um, and, um, PyTorch is not that great at being very stream friendly.
Like, so, you know, PyTorch by default runs on the default stream.
The default stream synchronizes with everything.
It’s very easy to use.
You don’t have to worry very much about it, but, um, like, you know, sometimes you want
to have streams and then you have to actually write your code differently.
And it’s easy to get this wrong because if you forget to do it and someone runs your code
on the default stream, chances are things are just going to work out.
So, you know, Michael
Carilli, who is the NVIDIA guy who has been working a lot on CUDA graph support in PyTorch.
Um, he’s also had to fix a bunch of stream bugs, especially in our Autograd engine, um, to
make everything all work out.
So that’s basically most of what you needed to know about CUDA graphs, right?
So they, um, they are a way of running a bunch of CUDA kernels all together at once and, um,
they hard code all the parameters.
So that just leads to some, you know, UX problems that you have to be aware of if you want to
use them.
I want to recap something that I talked about in the random number generators podcast, which
was about the Philox random number generator, because this has a very interesting
interaction with CUDA graphs.
This is kind of bonus material.
So like, I’ve already said the most important thing about CUDA graphs, but this is, I think
this is interesting and I want to talk about it a bit.
So I said that, you know, everything gets hard coded and in particular, um, the random number
state gets hard coded when you run your CUDA graphs.
Okay.
Think about it.
Right.
So what I said in the RNG podcast is that the CUDA RNG state actually lives on CPU.
It doesn’t live on CUDA.
It lives on CPU.
And you, um, just, uh, you pass the seed and the offset directly in the kernel parameters.
And then, uh, on the CUDA kernel, it actually sets up the Philox state and then does sampling
on it.
And it’s pretty cool.
And it’s very nice.
And it’s a complete disaster for CUDA graphs, because what that means is you’re actually going
to get the same random numbers every single time you run your CUDA graphs.
And okay, maybe that’s okay, but like, usually that’s not okay.
And you really do want different random numbers every time.
So how the heck do you solve a problem like this?
So clearly you need some way of actually feeding in what part of the sequence or the seed or something
like that inside CUDA memory, because, well, you know, you’re going to totally hard code the,
um, you’re going to hard code the parameters, right?
So it can’t be anything passed in the parameters.
Well, there’s only two ways you can pass information to a CUDA kernel, either by the parameters or
by memory on the CUDA device.
So if it can’t be in the parameters, well, it has to be on the device, but then, uh, how
exactly can you get it to the device?
Like, do I have to, um, you know, when I launch my kernel, uh, first do a host-to-device copy
of the RNG state to CUDA memory, and then, uh, run the kernel that way.
Uh, that doesn’t sound so great.
Um, to be fair, it wouldn’t be that bad because remember it’s all async.
And so, um, you can trigger this, uh, well, as long as the host memory is pinned, which
is not too hard to arrange, you could just trigger it asynchronously and then like have the transfer
happen whenever like CUDA gets around to doing it.
But there’s a better way to do it.
And the better way to do this is to pass in a pointer to a little bit of CUDA memory that
doesn’t say what the seed or the offset should be, but instead is an offset correction.
So what’s the idea?
So we’re going to put on a restriction.
The restriction is that, um, if you want to use CUDA graphs with RNGs, you have
to reuse the same seed, because we’re sending the seed up with the parameters.
So the seed is hard coded.
We can’t do anything with it.
But what you just want to do is, right:
When I do subsequent calls to the CUDA graph, all I want is to, you know, advance the random
number stream, however far, uh, you know, I had advanced, you know, via my previous consumption
as well.
Right.
So there’s only this, you know, extra bit of information, just the offset that I want in
the situation.
So what I can do is, uh, so when I’m running normal PyTorch code and there’s no CUDA graphs
involved, I’ll send a little bit inside the parameters field saying, Hey, this is a non-capturing run.
You can just do use the seed and the offset directly, and you don’t have to do anything
about it.
But let’s say that I am in capturing mode.
Then I’ll do a different bit and I’ll send a pointer to the memory.
That is the offset that I want to apply, and say, Hey, um, when you compute the RNG
state, use the seed, use the offset, but also use this extra offset read out from memory
to like do the adjustment.
And at the very beginning, the adjustment is zero, right?
Because like whatever the seed and the offset were at the time I was recording is the correct
one.
But then later when I want to rerun the CUDA graph, all I need to do is do a, you know, uh,
host to device, um, setting of that little bit of offset to be whatever the current state
of the RNG is.
And now I can run my CUDA graph and, um, the CUDA graph is going to read out the, um,
you know, the offset from this memory and now offset the random numbers exactly how I need
them to be.
And there’s one last thing I need to do this, right?
Which is I need to know how many random numbers my CUDA graph consumes, but that’s not too hard
to figure out.
You just record what the RNG state was at the beginning and what the RNG state was at the
end.
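A toy, pure-Python simulation of this bookkeeping, just to pin down the idea; none of these names are the real PyTorch or CUDA internals:

# The "CPU-side" Philox state: a seed plus an offset.
class ToyPhiloxState:
    def __init__(self, seed):
        self.seed = seed
        self.offset = 0

rng = ToyPhiloxState(seed=42)
offset_correction = [0]   # stands in for a little slab of CUDA memory

def toy_kernel(seed, base_offset, correction_cell, n):
    # seed and base_offset are frozen at capture time, like CUDA graph
    # parameters; only the correction cell can change between replays.
    start = base_offset + correction_cell[0]
    return [hash((seed, start + i)) % 1000 for i in range(n)]

# "Capture": bake in whatever the RNG state was at capture time.
capture_offset = rng.offset
captured = lambda: toy_kernel(rng.seed, capture_offset, offset_correction, 4)
consumed = 4   # measured by diffing the RNG state before and after capture

# "Replay": point the correction at wherever the CPU-side offset is now, so
# every replay draws fresh numbers even though the parameters are frozen.
for _ in range(3):
    offset_correction[0] = rng.offset - capture_offset
    print(captured())
    rng.offset += consumed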
Um, this was not obvious to us at the very beginning.
And, um, you know, Natalia and I clearly, like, spent a while thinking about how to actually
solve this.
Um, but I think this solution is very elegant.
Um, and it’s just, you know, once again, it comes out of having to solve the problem of,
well, CUDA graphs, hard code, everything in the parameters, actually in an old version,
apparently someone was actually going into the CUDA graph post facto and editing all of
the RNG parameters to update them to the new thing.
This was terrible.
It was a bad idea, and we needed to solve this problem.
Okay.
So that’s the end of the fun technical digression.
So CUDA graphs.
So like, how can you actually use them in practice?
So we’re working on landing the last PRs that actually give a nice user API, but there is
something, you know, that is very important about CUDA graphs, right?
Which is if you want to deploy them, you want to use them in a production setting, you need
to be able to run your code, um, you know, initially to actually get the CUDA graph in question.
And so this is why like, um, things like torch deploy are actually very important for CUDA graphs,
right?
Because like, if you want to use CUDA graphs to like do say GPU inference, because that’s
a situation where overhead matters a lot, you still need to bootstrap the CUDA graph at
the very beginning.
And then, you know, then you can run it.
And, you know, if you can run Python code in your environment,
and that’s what torch deploy is all about, then you can just run the slow Python code to get
the CUDA graph, but then pass it off to some C++, uh, you know, engine that just repeatedly
runs the CUDA graph, uh, in the future.
Right.
And that, that’ll be really good.
And, you know, you use Python for the slow initialization and then everything
else doesn’t even need to touch Python at all.
And that’s like, I think one of the main draws of CUDA graphs.
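The user-facing API was still being landed when this episode was recorded; as a sketch, the capture-and-replay flow with the torch.cuda.CUDAGraph API that eventually shipped looks roughly like this (requires a CUDA device, and details may differ across versions):

import torch

model = torch.nn.Linear(64, 64).cuda()
static_input = torch.randn(8, 64, device="cuda")

# Warm up on a side stream so one-time lazy initialization isn't captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture: every kernel launched in this block is recorded into the graph,
# with its memory addresses baked in.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay: copy new data into the captured input buffer, then rerun the whole
# bundle of kernels with a single launch.
static_input.copy_(torch.randn(8, 64, device="cuda"))
g.replay()
print(static_output.sum())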
All right.
That’s everything I wanted to say for today.
Talk to you next time.
EP40 Functional-modules
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about functional
modules, a way of taking NN modules and turning them into purely functional stateless versions
that you can pass parameters into explicitly. Before I start with this podcast, I had to explain
why this is something that we’ve been thinking about recently. So one of the projects that is
going on in PyTorch is Functorch. Functorch is a sort of replication of Jax’s functional transforms
but on top of PyTorch. And one of the problems that is challenging for Jax is the way they have set up
these functional transforms like grad and vmap require you to explicitly specify what arguments
you want to vectorize over or differentiate over. And this makes it challenging to do a NN module style
API like what PyTorch has. I have a previous podcast about how NN modules are designed. The short version
of it is that, you know, why do NN modules exist? They exist because we want an easy way of keeping
track of all the parameters for various modules in question. And so rather than forcing people to like
remember what all the parameters are, you can just put them as properties in the module, and then the
modules will collect them all together. And then you can pass them to say the optimizer when you want to do
the steps. This is really, really convenient. And you know, NN modules are a very enduring part of PyTorch’s
front end API design. So what causes the problems with the functional API in question? Well, to answer this
question, let’s look at the sort of very most basic operation that you can do on a PyTorch program, namely compute
its gradient with respect to the parameters. Now, if you think about how this is done in PyTorch normally,
what you do is, you know, you have your modules, you get your input from your batch, you feed into the
modules, out pops out some final loss, and then you do dot backward on it, right? It’s a very imperative API,
the dot backward triggers the automatic differentiation. And then all of the parameters get a grad field
populated. And that’s what the optimizer will read out for when you actually want to, you know, do the
step update. So there’s no need to know anything about the parameters in question ahead of time, or no
need to actually, you know, collect up a list of all the parameters. Everything will just get put where you
need them to be directly on the object itself. And so when you want to do optimizer updates, all you need
to do is iterate over the list of all parameters. And of course, you know, what does NN module do, it lets you
easily get a list of all the parameters. Now let’s flip this over and think about what it would look
like to have a version of grad, which is an actually functional API, because this is what Jax provides,
we also have a functional version of grad. And sometimes it’s very convenient, because you don’t
want to actually be mutating your tensors, you just, you know, want to get the sort of mathematical
conception of a gradient, right, take a function, and then compute the function that gives the gradient
for you. When you’re doing sort of higher order business, this is often the easiest way to
conceptualize your program. So in this setting, instead, what you have is you have a function,
and you say, okay, I want to differentiate the output of this function, with respect to some of the inputs
of the function. And now the implicitness of NN modules is a downside, because well, you know,
your function normally takes in a bunch of arguments. And if you have a function that takes
in everything as arguments explicitly, you can just say, okay, I want to differentiate the first
argument, and the second argument and the third arguments, which would just happen to be the
parameters in most cases. But with an NN module, these arguments aren’t arguments at all, they are
living implicitly inside of your NN module objects. And unless you have a pass that knows how to look
into the NN modules and say, hey, actually, there’s also live inputs, input arguments, in this module
object you pass into me, there’s no way that it actually will know about these things. And so it will
look to the sort of, you know, function as if these are just tensors that you’re accessing, you know,
sort of from out of scope, they’re like free variables from your function. And, you know,
normally, you don’t differentiate with respect to free variables, except, like, you know, the whole
point of training your model is to, you know, do differentiation with respect to the parameters.
So actually, if you use torch.autograd.grad, you can do this, and the correct thing will happen.
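A small example of that torch.autograd.grad trick: the parameters are never passed into any function explicitly, but handing them to grad as inputs lets the engine find their uses in the recorded history (the model and loss here are just for illustration):

import torch

model = torch.nn.Linear(3, 1)
x = torch.randn(5, 3)

# From the function's point of view, the parameters are free variables...
loss = model(x).pow(2).mean()

# ...but we can still differentiate with respect to them, because autograd
# matches them by object identity against the recorded history.
grads = torch.autograd.grad(loss, tuple(model.parameters()))
print([g.shape for g in grads])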
And there’s a trick that the autograd engine does in order to make this all work out, which is that
when you do a .grad, you have to specify explicitly what arguments you want to differentiate with respect
to. And it doesn’t matter if you actually pass them to the function or not, because you don’t pass
in a function, you just pass in the output. And then the autograd engine knows, like, for every input you
passed in, look for, you know, the uses of it in the history. And that’s how things get
implemented. So there’s no, there’s no higher order function per se. Instead, we’re just sort of relying
on, you know, very detailed knowledge of the object identity to, like, work out what the function it is
that you wanted to differentiate was in that situation. And this trick works okay for grad.
And it doesn’t work so great for, say, Jacobian. So if you like, try to do this for Jacobian,
it just doesn’t actually work. You can’t compute Jacobians on functions that involve NN modules.
There’s also other examples of this being a problem. So another example is when you want to ensemble
models. So what is ensembling? So ensembling is the idea that more heads are better than one. So
if you had one network that, you know, was computing the answer to your problem, well,
it might improve the performance. If you have multiple copies of this network, all with different
parameters, and you run them all on the input, and then you sort of decide based on some voting mechanism,
which one you like best. And sometimes this actually is helpful. And there’s some theorems
that talk about, like, you know, idealized situations like this, where they show, yes,
in fact, doing an ensemble is strictly an improvement over each of the models individually.
So when you want to ensemble like this, you would ideally want to run the computation vectorized
if all of the modules in question were exactly the same, right? Because each of them is doing the
same thing. And you just really want to vectorize over the parameters. So you’d like,
you have this parameter, but it’s not just a single parameter, it’s a stack of parameters,
one per each of your modules. And that’s what you want to vectorize over. So there’s another
functional transformation that lets you do this. It’s a vmap. But to vmap a function,
you have to pass in what arguments you want to vmap over. And once again, if these parameters are
actually parameters in your NN module, there’s no way to pass them in because your NN module is just
directly accessing the parameters on that module. And, you know, your vmap has no way of sort of
interposing in on it. Because the way most of these transformations works, the way that like a grad
transformation works, and the vmap transformation works, right, is that when you say you want to
differentiate with respect to or vectorize with respect to some argument, we take those arguments,
and then we wrap them up in some sort of special object, like a batch tensor or gradient tensor that
says, hey, we want to do some extra work when you do operations on this. And, well, if those things
are completely in the middle of nowhere, on top of a module, there’s no way to actually update them.
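As a sketch of what the vectorize-over-parameters version looks like once the parameters are explicit (which is exactly what functional modules are trying to give you), using the Functorch vmap prototype; the import path and availability may differ depending on the version:

import torch
from functorch import vmap  # the Functorch prototype discussed above

num_models, in_features, out_features = 4, 3, 2
x = torch.randn(8, in_features)

# A stack of weights and biases, one slice per ensemble member.
weights = torch.randn(num_models, out_features, in_features)
biases = torch.randn(num_models, out_features)

def linear(weight, bias, inp):
    # Plain functional code: parameters arrive as explicit arguments.
    return torch.nn.functional.linear(inp, weight, bias)

# vmap over dim 0 of the parameters, broadcasting the same input to each model.
ensemble = vmap(linear, in_dims=(0, 0, None))
print(ensemble(weights, biases, x).shape)  # torch.Size([4, 8, 2])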
So how do functional modules fix this problem? Well, a functional module is a proposal that says,
okay, given this NN module, what I want to do is I want to split it. And the way I want to split
it is I want to first take out the parameters, right? Because one of the most important things
a module does is give you, you know, a way to track all the parameters. And then I want to somehow,
and I’ll give a example of how you could implement this, somehow have a version of the forward code
for each of these modules. But instead of accessing the parameters that were stored on the modules
themselves, instead get the parameter values from an extra argument that is passed in explicitly to the
modules in question. And so you can see that if you have a way of, you know, taking a regular NN module
and turning it into this functional version, that also solves your problem of V mapping or grading over
it, because, well, the parameters are now explicit arguments. So you can just, you know, V map over
them or grad over them, and you’ll get the thing you actually want to do. So how exactly could you do
this? So Alban, you know, has this very simple way to do it, right? Which is, if you want to, you know,
run a module like this, you need some sort of dummy module, you get in all your parameters,
you sort of edit the module to replace the parameter settings with the explicitly passed parameters,
and then you just run forward. And you know, if you need this operation to be idempotent,
you should reset the state of module to whatever it was before when you’re done. So that’s a very
cheap and cheerful way to implement modules in this way. And of course, you know, it might also be useful
given one of these functional modules, and then a list of its parameters, it might be useful to
reconstitute it back into an original NN module if you don’t need this functional version in this case.
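A cheap-and-cheerful sketch along those lines: temporarily splice the explicitly passed parameters into the module, run forward, then restore the originals (purely illustrative; PyTorch later grew helpers in this spirit, e.g. a stateless functional_call):

import torch

def functional_forward(module, params, *args, **kwargs):
    # Swap the passed-in tensors over the module's registered parameters,
    # run forward, then put the originals back so the module is unchanged.
    originals = dict(module.named_parameters())
    def owner(name):
        prefix, _, attr = name.rpartition(".")
        return (module.get_submodule(prefix) if prefix else module), attr
    try:
        for name, value in params.items():
            sub, attr = owner(name)
            delattr(sub, attr)          # drop the registered Parameter
            setattr(sub, attr, value)   # install the plain tensor
        return module(*args, **kwargs)
    finally:
        for name, original in originals.items():
            sub, attr = owner(name)
            if hasattr(sub, attr):
                delattr(sub, attr)
            setattr(sub, attr, original)  # re-register the original Parameter

lin = torch.nn.Linear(3, 2)
new_params = {k: torch.randn_like(v) for k, v in lin.named_parameters()}
out = functional_forward(lin, new_params, torch.randn(4, 3))
print(out.shape)  # torch.Size([4, 2])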
So this is a possibility. We’re not super keen on it. One of the reasons why it’s a little fuzzy to
work with is it sort of, it gets rid of this notion that NN modules are objects with a, you know, sort of
persistent identity, right? Because, you know, NN modules are built out of, you know, good old-fashioned
Python object-oriented programming. And in, you know, object-oriented programming, when you have a object,
you know, that object has some distinct identity, and it’s not fungible with another object that just
happens to have all the same properties, but, you know, is a different identity of object,
right? Like, if you mutate one of them, you don’t expect the other one to get mutated in this case.
But with a functionalization API, you’re expecting to be able to, like, take these modules and then,
like, decompose them into their parts or recombine them back into, you know, an NN module.
And you’re expected to sort of not necessarily care that the new NN module you got back is not the same
thing as the one you had before. And that is a little bit different from how the existing APIs
in PyTorch work. There’s also other ways you could go about dealing with this problem, right? So another
idea, which is a sort of API idea is, um, imagine that you are writing one of these functions, right?
And instead of directly, um, instead of directly calling into the module via some, you know, sort of
global variable, instead, you might be required to pass in the module as an argument into the function
in question. And so the module, right, has a bunch of code, but it also is a glorified container that
contains a bunch of tensors. And so you ought to be able to say, Hey, I want to V map over one of the
parameters in the module in question, or I want to grad over one of the parameters. So, so like sort of,
instead of just having to, like, deal with a tensor or a list of tensors, a module of tensors is also fair
game. Of course, once again, you still have to make sure that, if you’re doing any wrapping
or anything like that, you actually make use of the wrapped version of the tensor in the internals
of your function. And this is why I sort of like the, the trick that we do in PyTorch for grad,
which is we say, actually, um, the sort of association of a tensor input as actually being an input is
independent of the uses in question, right? There’s some independent weak map that keeps track of the
things that are going on. That might actually be a better way of implementing, uh, like this extra
behavior rather than wrapping objects in this way, because then you can make sure that all uses of the
object, no matter if it happens to be stashed somewhere else, will be run with that metadata in
question. So it’s very different than how Jax and how PyTorch actually implement a lot of the things right
now, which is like, you know, create a tensor, which wraps the other tensor in question.
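As a toy sketch of that side-table idea, not how functorch or PyTorch actually store this metadata; the mark_as_input helper and the level field are made up for the example:

import weakref
import torch

# Weak map from tensor to transform metadata: entries vanish when the
# tensor itself is garbage collected, so the table keeps nothing alive.
_transform_inputs = weakref.WeakKeyDictionary()

def mark_as_input(t, level):
    # "level" stands in for whatever per-transform metadata you want to attach.
    _transform_inputs[t] = {"level": level}

def input_metadata(t):
    return _transform_inputs.get(t)

x = torch.randn(3)
mark_as_input(x, level=0)
print(input_metadata(x))   # {'level': 0}
del x                      # the entry goes away with the tensor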
One downside to this weak map approach is it puts a lot of, um, uh, stress on how well your language
supports weak references, because like if you just used a normal map and, you know, when you, uh, whenever
you like did new operations, you kept, uh, adding things into this map, you would obviously leak memory
in the situation because you’d never deallocate anything. So you need to make sure that, you know,
when tensors go out of scope inside your program, they also get removed from the weak map. Maybe some
sort of hybrid approach where, you know, inputs are done via the weak map, but, um, intermediate
results are done by actual wrapping. Maybe that is an easy way to make sure that the memory management
works out okay in this case. As a parting note, I want to mention how the Jax ecosystem deals with this
problem. So, um, Jax can’t do NN modules the same way PyTorch does. And so they have a library called
Flax, which, um, you know, gives a module like abstraction and sort of the key idea for their work
is they just want to completely avoid, um, the Python object-oriented insanity. So they’re just
sort of translating, you know, the code you write, which looks kind of object-oriented, but is done
via data classes under the hood into usual good old-fashioned pure function calls that Jax knows
how to transform in an easy way. And so Flax actually has its own version of VMAP, which directly
takes the module as an argument in this situation. Okay. So that’s what’s going on with functional
modules in PyTorch. If you have any thoughts, this is very much something that is in progress. Um,
Richard and Horace have been working on it. So if you have any comments, please let us know on the issue
that I will post in the podcast notes. That’s everything I had to say for today. Talk to you next time.
EP41 Double-backwards
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about double
backwards, the way that PyTorch implements higher order differentiation in PyTorch.
What’s higher order differentiation? Well, normally, we think of differentiation as just
the thing we do in order to figure out, you know, how we want to update our gradients and parameters.
And, you know, as machine learning people, we just leave it at that, right? Like,
it’s just an optimization problem. But, you know, differentiation comes to its roots in
calculus, right? Like, it talks about the rate of change of quantities. And if you can talk about
the rate of change of a quantity, you can talk about the rate of change of the rate of change of
a quantity, and so forth and so forth. So like, you know, in high school calculus, right, you can
have a function that models position, differentiate it, and you get velocity, differentiate it again,
you get acceleration, etc, etc. So what are some use cases for higher order differentiation in deep
learning? Well, there are many use cases of this, actually, it’s actually a very popular feature,
although it doesn’t show up in like simple models. So one good example of this is this concept called
gradient penalty. The idea behind gradient penalty is that sometimes when you are working on your model,
you will have an example that causes the gradient to have a really, really, really huge step. And maybe
that’s bad, right? Maybe you just don’t want to do that. Maybe you want to make sure that any given
input doesn’t influence the state of your parameters too much. And so the bigger the gradient,
the worse the solution is. Well, if you’re just, you know, doing a good old fashioned single order
differentiation on your program, then there’s nothing you can do, right? Because you just compute the
gradient. And then well, you got your gradient, maybe you can just clip it before you actually apply it.
But what you can do, if you have higher order differentiation, is you can actually apply a
penalty, you can say, hey, so I want to reduce this loss, but I don’t want to reduce this loss,
if it will cause the gradient to blow up too much. So I can have a like combined loss that takes into
effect both the, you know, loss in question, whatever it is that I want to train on, you know,
the accuracy of my network, but also will successively penalize, if you know, the gradient gets bigger and
bigger. And I can then, you know, via the magic of automatic differentiation, find the exact
quantity that will minimize my, you know, sort of joint loss involving the true loss, as well as the
penalty on gradient. And how do I do this? Well, I have to do this with higher order differentiation,
right? I have to first differentiate my program to get the gradient. And then I have to use the
gradient with my regular loss and differentiate again, to find out how I can minimize this combined
loss. Another example of higher order differentiation being useful is in meta-learning. So what’s the
concept behind meta-learning? Well, meta-learning, as the name suggests, is learning to learn. So it’s all
about, you know, training a neural network to train a neural network really good. And what does this
often look like? Well, you know, normally, when you think of how you differentiate a model, you have a
training loop. And what you do is you, you know, you run your model forwards, you run the model backwards,
you get the gradient, you apply the gradient to the optimizer, and then you go back to the loop and you go
again. And then there are going to be some hyper parameters associated with this training loop. And
typically, you just have to find those by like, just trying a bunch of things, you know, like,
change the hyper parameter, and then try again. Well, in meta-learning, what you’ll do is you’ll,
you know, run this training loop. And then this training loop itself, you will run an optimizer to
optimize, you know, some hyper parameter, maybe some aspect of the model architecture. And in that
sense, the entire training loop is embedded inside a bigger training loop, which is training, you know,
the overall, you know, how well the neural network learns in this case. And once again, you know, you
have to do a normal gradient computation inside the inner training loop. And then the outer training loop
needs to, you know, do a gradient again, on the inner training loop. And one last thing, it’s commonly
the case that you might need to compute a Hessian, when you are, you know, doing some mathematical
applications. That’s the square matrix of second order partial derivatives. Second order means you need
a higher order differentiation to actually compute this value. So you just can’t do it unless you have
support for this. Okay, so how does double backwards actually work in pytorch? So this is going to be a
long explanation. So I’m going to take it in parts. So the first part is I want to first explain how
regular AD works. Chances are, if you’ve, you know, seen any in-depth tutorial on PyTorch,
you already know this, but it’s going to set the stage for double backwards. Next, I’m just going to
introduce how exactly the double backwards user API works, because it says something about the
implementation. And then finally, I’m going to tell you how double backwards is implemented. Okay, so let’s get
started. So how does regular automatic differentiation work? So what’s the model? So the model is you have a
bunch of parameters, these parameters are marked with requires_grad=True. And then whenever you do operations that
involve these parameters, you record information about what operations were done, you know, in the literature,
this is called, you know, writing it to the Wengert list that, you know, records all the operations. So we record
operations as we execute them. And then when finally, we call backwards on the loss, we traverse this graph in
reverse order, and run the operations in, you know, sort of backwards order, computing the derivatives, propagating it
through until we get the gradient in the end. There’s a lot of math that explains why it goes backwards, and you know, what
exactly the meanings of these operations are, but this is not sort of relevant for just understanding how double backwards
works. There’s two other details about this process, which are worth noting. So one is that whenever we like process this graph
backwards, we actually eagerly deallocate the recorded gradient info, whenever we’re done processing it. And why is this the case? Well, because
normally, in a normal training loop, you run backwards once, and then you just use the grads that are accumulated into the
parameters to actually do the optimizer update. So you don’t actually need this, you know, sort of reverse graph
anymore, right? Like once you’ve used it, you’re done with it, and you don’t need it anymore. So we can save memory by just
deallocating it as we go along. And it’s also really useful because if there are reference cycles, well, deallocating the
grad info can break those reference cycles. Second, is that there’s something very interesting that goes on when we run the
backwards, which is that the backwards formulas for various functions may involve uses of the inputs in
question, right? Like if you multiply a times b, the gradient is grad a times b plus a times grad b. And if a was a
parameter, well, technically, requires grad equals true says that you’re supposed to record grad info for this
situation. But we don’t do that, because like, it’s very unlikely that you’re actually going to, you know, run
backwards again, right, you’re going to throw everything away, and then run your pytorch program
again, on the next batch in question. So by default, we disable the propagation of grad info, when we’re
actually executing backwards. Okay, so hopefully, you can see where this is going. So when you want to use
double backwards in pytorch, the user API for it requires you to do two things. So one is it says, okay,
first, you have to pass this flag called retain graph, what does retain graph do? It says, don’t get rid of
the grad info as you process the backwards in question. Why, you know, is it important to retain
the graph info? Well, it’s because you know, when we do a double backwards, we might need it again in
that situation. And the second thing they tell you to do is to pass in create_graph=True when you
run backwards. And what does that do? It says, okay, actually, please do record gradient infos as you
compute the gradients through the backwards graph in question. And once again, why is that useful? Well,
it’s because you’re going to want to differentiate it through later. And so what double backwards then
says is, okay, so you you run backwards with these two arguments. And then at the end of doing the
backwards, you get a grad, but this grad actually has a grad info on it, it has recorded all of the
history necessary in this case. And you can now use it as part of, for example, a gradient penalty,
right? So now that you have the grad, you can add it to your loss. And then this entire mondo thing,
you can actually just go ahead and do another backwards on it. And this is why we call it double
backwards, right? You call backwards once you get some grads, you do some stuff with the grads,
and you call backwards again. And that’s the double in question. Sometimes I find this process a bit
mind bending. And one of the things that like sort of helps me retain my sanity when this happens is I
imagine that actually, when I run the backwards the second time, I don’t actually care about the first
backwards, as in I can reason about the second backwards without making reference to the first
backwards. Why is that the case? Well, let’s imagine that instead, we were doing a functional
transformation on our program. So once again, I’m using the sort of JAX terminology, but it’s
really useful because it gives a good idea intuitively of how this all works. So in my basic PyTorch
program, I write explicitly a bunch of operations that perform the forward pass forward operations
one by one by one, right? Like, you know, take my parameters, you know, do some convolutions on them
with the inputs, etc, etc, until I get a loss. And this is my program. And then in PyTorch, you just
have to write dot backward, and then it gives you the backward. But when we, you know, tell people about
how AD actually works, we say, you can imagine this backward call expands into a second program,
like imagine copy, pasting in the second program after your first program that goes ahead and runs
all the steps, but backwards and with all the operations replaced with their gradients. And so,
you know, this, this composite program involves running a bunch of stuff forwards, running stuff
backwards. But if you look really carefully at all the operations in question, they’re just good old
fashioned, you know, operators on PyTorch tensors. The backwards functions are not anything special.
when you take the gradient of a multiply, it just uses multiplies and adds. When you take the gradient
of a convolution, you get, you know, convolution backward, but convolution backward is a good old
fashioned function. And more importantly, it itself has a gradient. So you can differentiate that as well.
And so whenever I have a double backwards that happens in the situation, I just imagine this,
you know, big graph, right, that has forwards and backwards. And then I just forget that I
ever knew that, you know, this was a separate forwards and backwards. I just imagine that some poor grad
student in the 80s had to like manually derive all the backward steps themselves. So I’ve just
written it all out. It’s this opaque program. I know nothing about it. And then I just apply
automatic differentiation to this program again. And well, lucky me, you know, what does AD know how to
do? Well, it knows how to handle any, any program that consists of a bunch of operations that,
you know, primitively, I know how to differentiate, that gives me my double backward program.
And actually, when I’m like reasoning about what graphs look like in double backwards, I like, you
know, writing a simple gradient penalty example, you know, writing out the backwards, and then writing
out the backwards of that, you know, forwards-backwards program, and that gives me a graph. And I can use,
usually use that to reason about some of the weirder things that happen in this situation.
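For example, a bare-bones gradient penalty along the lines described earlier might look like this; the tiny model and the penalty weight are placeholders, and it uses torch.autograd.grad for the first backward, which is the usual way to write it:

import torch

w = torch.randn(3, requires_grad=True)   # stand-in for real parameters
x = torch.randn(5, 3)

loss = (x @ w).pow(2).mean()             # the "real" loss

# First backward: create_graph=True records grad info for the grads
# themselves, so the result is itself differentiable (and the graph is kept).
(grad_w,) = torch.autograd.grad(loss, w, create_graph=True)

penalty = grad_w.pow(2).sum()            # penalize large gradients
total = loss + 0.1 * penalty

# Second backward: differentiates through the first one (double backwards).
total.backward()
print(w.grad)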
So on the one hand, we’re done, right? Like double backwards is just, you know, doing backwards
again, you know, what’s the big deal. But actually, there’s a reason why higher order
differentiation is kind of mind bending to implement from a, you know, if you’re just purely looking at
it from a Wengert tape perspective. So one of the things that is really mind bending is that when you
do higher order AD, you actually need to reuse things from the graph of the first AD. That’s
why we had to do retain graph, we’re not allowed to throw away any of the grad infos for the original
program. Because when I’m looking at my backwards program, well, you know, one is eventually things
go back to the loss. And sorry, not the loss. Eventually, things need to make use of various
parameters that may have been defined by the original graph in question. So going back to the multiply
example, right? The gradient doesn’t only make reference to the gradient of x, and the gradient
of y, it also makes reference to x and y, like derivative formula says, hey, you need to know what
these quantities are from the original network to actually compute the gradient in the situation.
And if those things require grad, then when I use them in the backwards graph, then I need to,
you know, go keep going past them, right? Like, like, it’s a data dependence. And when I do backwards
on my like, like composite program, that sends me back to the original graph. And so that’s pretty
important. And because it’s very interesting, what happens in the situation, which is that when I go and
I traverse parts of the backwards graph that were used again, for the backwards in question, I have to
flip it again. I’m not explaining this very clearly. So I’m just going to leave you with a very impressionist
picture of what happens. So you have a forwards graph, right? When you differentiate it, the graph
sort of turns upside down, because your backwards graph is exactly the same thing as the original
graph, but going back in the reverse direction. When you differentiate that graph again, well,
you flip it back. And it looks just like your good old fashioned forward graph, except it’s actually
the linear approximation, because, you know, you’re not only doing the flipping, that’s the part where
we’re doing reverse mode AD, but you’re also taking the linear approximation. So what happens,
in other words, what happens in double backwards situation, is you end up having to recompute the
original forward graph, but with different parameters for, you know, what the inputs are, because they’re coming
from different places in your graph. And this is one of the reasons why symbolic automatic
differentiation, which might want to be done by systems that aren’t tape based, has such a, you know,
sort of, like, it’s actually really tricky to do it all correctly, because there’s all of this stuff going
on. And you can’t just assume that, like, you know, when you had some program that you compiled
for forwards, that’s it, that’s the only thing you need to compile it for the sort of transformation can
reuse things, you know, unpredictably. And one of the really nifty things about PyTorch’s design for double
backwards is it can support an unlimited number of double backwards operations, as long as you don’t ever
clear the graph when you do these things. And sometimes in other situations, when you want to,
like, optimize, you have a problem, which is that you need to know ahead of time, how many times you’re
going to differentiate your program, because if you don’t know that, then you can’t actually, you know,
safely get rid of variables, because they might get reused at higher levels in question. All right,
so I won’t claim that you will completely understand double backwards at this point. But hopefully I’ve
given you the main idea, right, which is that, you know, you don’t have to think about double backwards
as this mystical thing. It just is running backwards again on a program that happens to have been generated
partially by backwards. But this actually causes some very intricate behavior, if you actually want
to dig into it, but at a very high level. And when you look at the implementation, that’s all there is
to it. All right, everyone, that’s everything I wanted to say for today. Talk to you next time.
EP42 Intro-to-distributed
Hi everyone and welcome to the PyTorch Dev Podcast. Today I’m doing something a little special, which is that I have Shen Li from the Distributed team over at PyTorch here to come talk to us about PyTorch Distributed. Shen, do you want to introduce yourself?
Hello everyone, this is Shen. I work on PyTorch Distributed Package. Super happy to be here.
All right, Shen. So I just want to get started. Can you just explain to us what Distributed Training is and why it’s so important for PyTorch?
Of course. I would say Distributed Training is using multiple GPUs or machines to collaboratively train the same model.
By the way, this is just my personal view, not an official performance definition. I can try to elaborate on that statement.
Yeah, tell me a little more.
Yeah, let’s start from the motivation side. Why do we need multiple GPUs or machines to train one model?
Well, it is because driven by the advances in deep learning applications, people are using larger and larger data sets to train larger and larger models.
It’s possible that the data does not fit in one machine, or maybe the model does not fit in one machine.
Or even if they both can fit in one machine, you might still want to leverage more resources to finish training within a shorter period of time.
So that’s why we might want to use multiple GPUs or machines to train a model.
Okay, so let’s say that, you know, I happen to have a giant cluster of machines in my back pocket, and I want to use them all.
How do I go about doing that?
So there are, like, a lot of different tools out there.
It depends on, like, whether you are a framework developer or an application developer.
If you are an application developer, you can choose the right tool in PyTorch.
There are DDP, RPC, pipeline, etc.
If you are a framework developer, then there are, like, a lot of different things you need to consider to make sure that the distributed training can work efficiently.
So when going beyond one GPU and one machine, the communications are, like, inevitable.
And communications are usually very slow.
And so when you’re working on that, if a communication blocks computation, there will be, like, low device utilization and, hence, low efficiency.
So that’s, like, one challenge you need to handle if you want to work on distributed training.
Would you say that dealing with the cost of communicating over nodes is the biggest problem when working on distributed training?
I guess that’s one of the biggest problems, because the main delay of distributed training comes from two sources.
One is, like, computation, and another is communication.
And there are actually a lot of tools just trying to handle that.
One fortunate thing is that since communication and computation are using different resources, so they can actually overlap.
They can basically run concurrently.
So that’s, like, one benefit we can try to explore to speed up things.
So earlier in the podcast, you told me that if you were a user of distributed, you had a bunch of options for what you could do.
So what are, like, I think one of the things that people find bewildering about distributed is how many things you can do, like, how many different options for setting things up.
Could you just, like, tell us at a high level, like, how these all get put together, how you decide to choose one or the other?
Oh, sure.
Yeah, there are, like, a lot of different options.
For data parallel, there are vanilla data parallel.
Hang on a sec.
So tell us what data parallel means in this context.
Oh, sure.
With vanilla data parallel training, each GPU holds the replica of the model and consumes a split of the input data.
And models are synchronized using communications.
So basically, models are replicated and the data is sharded.
And the entire model gradients and parameters are communicated across replicas to make sure that they are synchronized.
So this is, like, vanilla data parallelism.
And vanilla model parallelism is the opposite, where data will flow through all devices and model is sharded across devices.
And the communication is only responsible for transmitting the activations and its corresponding gradients at model sharding boundaries.
And, of course, there are, like, more complicated data parallel and model parallel schemes.
And there are also, like, hybrid parallel schemes that combines both data and model parallelism.
So that’s, like, a very high-level description of data and model parallelism.
And beyond that, there are, like, advanced versions.
Like, in PyTorch DDP, it’s a vanilla data parallel plus some optimization.
I can try to go a bit deeper into that.
So, as I mentioned, like, communication might be one of the main things people need to deal with if you’re working on distributed training.
And one, like, natural thing to optimize distributed training is try to overlap communication with computation as much as possible.
And because communications are one of the main sources of delay in distributed training,
and, overall, the communication delay will also, like, grow with the cluster size.
And since they are using different type of resources, it’s often possible to run them concurrently.
And, actually, existing distributed training technologies like DDP are using such optimizations.
What DDP does is that, say, when you are synchronizing the gradients of layer i in the backward pass,
you can just go ahead and do the computation on layer i minus one to compute the gradients.
In that way, the computation and the communication can overlap.
So, are you saying that you, like, sort of, there’s a train of computation going on?
So, at any given time, each of the layers is processing a different set of data in this situation?
Not exactly.
So, yeah, we can try to open up the backward pass a bit and see how the communication got plugged into the backward pass.
So, in the backward pass, we have a layer, we have a model of multiple layers, right?
And the backward pass is going to flow from the last layer all the way back to the first layer.
And it’s like a stack of layers.
And then the communication’s responsibility here is trying to make sure that the gradients on all the model replicas are the same after the backward pass.
So, how we can do that?
Like, one solution is that we just run the local autograd engine and make sure all the gradients are ready on each process.
But they’re going to be different because they are consuming different data, input data.
And then we can basically run, say, an all-reduce to communicate the gradients to make sure that they are the same.
But this is going to be slow because you see that there’s no overlap at all for computation and communication.
Because I’m waiting for the gradients to get completely computed before I start the next batch of processing.
Is that right?
Yeah, exactly.
And in this case, basically, the GPU is going to be busy for a while to do the computation.
And then when you start the communication, and then the computation resource on the GPU is going to be idling, just waiting for the communication to finish.
So, basically, at any individual point of time, there’s only one type of resource that is busy.
And this is bad.
This is what we are trying to avoid.
Okay, so how do I fix it?
So, in DDP, the solution is that we are organizing the gradients of the model into buckets.
So, for example, if you have, say, 20 layers in your model, it’s possible that you organize, say, last five layers into one bucket, and then the next five layers into another bucket.
And then, when you finish computing the gradients for the last five layers, you can put the gradients of those five layers into the bucket, and then kick off the communication of that bucket.
And at the same time, you can continue to do the computation for the gradients for the next five layers.
So, in this case, the computation of the next five layers and the communication of the last five layers will be run in parallel.
So, to put it in other words, the thing that happens, right, is that although your model has a lot of parameters, we manage to compute some of these parameters before other parameters.
And so, if we can go ahead and start updating those parameters that were already done computing before we’re done with everything else, we can get ahead of having to wait for everything to be done and then doing the synchronization in that case.
Yeah, yeah, exactly.
That’s a very good summary for data parallelism.
And going back to your question, like, what are the options of distributed training in the market?
And there are, like, other things like pipeline parallel, sharded data parallel.
They actually, many of them are actually exploring the same basic idea of trying to overlap communication with computation, just like what DDP does.
But they do that for different things, like for pipeline parallelism.
What pipeline parallelism does is that for every mini-batch, you’re going to divide one mini-batch into multiple micro-batches.
And then the model is basically sharded across multiple devices.
And then you’re going to fit the first micro-batch into the first model shard.
And then when finished the computation on that, you’re going to move the activation from the first device to the second device.
And then you can fit the second micro-batch into the first model shard, and et cetera.
So the pipeline is basically going to run.
And in this way, it is able to basically keep multiple devices running in parallel.
And also, when you are doing computation on, say, micro-batch i, you can concurrently launch the communication for the activations generated by micro-batch i minus one.
So in this way, the pipeline can work and make computation and communication overlap.
But it’s actually based on a similar idea of trying to overlap things.
So to summarize, it seems that, like, first, at a high level, you have to decide what you’re going to parallelize over.
Are you going to parallelize over the data?
Or are you going to parallelize over the parameters?
But even once you’ve made that choice, there are a bunch of optimizations you can apply for overlapping computation.
And all those optimizations result in tons and tons of possibilities for how you can go about doing your distributed training.
And I’m guessing, like, it’s different depending what model you’re trying to train as well, right?
Right, right. I think that’s a correct statement.
And one thing I want to add is that, yeah, initially, you need to make a decision.
Whether you need data parallelism or model parallelism.
And usually, when models are small, data parallelism will be sufficient.
And when models are large, you usually want to combine model parallelism with data parallelism.
Because the data set is usually very large.
And if you just have, say, one model replica in the entire cluster, it is possible to get up to a higher speed.
But usually, having a higher data parallel width will also help you speed up training a lot.
All right.
So I want to turn our attention now towards the state of distributed in PyTorch.
Because I think the discussion that we just had could apply to a distributed framework anywhere in, you know, like TensorFlow or PyTorch or any of the other deep learning frameworks.
So what is different about PyTorch distributed?
Like, how did PyTorch distributed come to be the way it is today?
We started working on PyTorch distributed, I think, since 2019.
And the first feature we developed is distributed data parallelism.
And at that time, data parallelism is the most dominant distributed training technology.
And so far, it’s still the most dominant distributed training solution.
And later, with the advances in the community, as people started to deal with larger and larger models, we started to realize that vanilla data parallelism is not sufficient.
Because it’s possible that the model won’t fit into one GPU or one machine.
So we started to think about, oh, we need to add a model parallelism feature into the package.
And that’s when we started to think about, oh, how do we do a generic model parallel solution in PyTorch?
And we come up with the idea of adding, say, RPC remote procedure call.
Basically, what we are trying to do here is to make sure that everything that users usually use for, say, forward, backward, and optimizer in local training can be represented using distributed APIs.
And that’s also what we are focusing on when developing the RPC package.
So basically, with the RPC package, what you can do is you can wrap some part of the model into a user function and use RPC to launch that function on a remote worker.
And RPC will be responsible for things like serializing and deserializing the tensors and also take care of the distributed autograd.
And also there is a distributed counterpart for the local optimizer, where the distributed optimizer can automatically reach out to all the participating processes and get their parameters updated.
So that’s like the second step we take after data parallelism.
And after that, we are also starting to build more and more higher layer features on top of RPC in PyTorch because RPC is a very low level raw API, which is flexible, but it’s not that easy to use.
If you want to use RPC, you will have to do things like decompose your model and write a lot of code to make it work.
And ideally, we want to make sure that when you have a model that can train locally, the same model, maybe with a larger size, can train in a distributed environment as well.
So we need to have higher level APIs to make that happen.
And things we added so far are like pipeline parallelism.
And we’re also working on things like intralayer sharding to make sure that you can not only shard in the model based on the operator boundaries, you can also say shard one operator and play that across multiple processes.
One of the themes in my podcast has been that PyTorch, you know, originally was designed as an eager mode framework.
And so whenever we build any features, you know, we always try to figure out how it can work on eager mode first and then other modes of operation come later.
And, you know, some of the things you described, right, like building the higher level API for distributed, you honestly have a harder job than some of your competitors who, you know, can assume there’s a graph representation because you need to work with eager mode PyTorch.
Yeah, that is true.
And actually, that’s the ongoing discussion in the team.
We are collaborating with the compiler team and we are thinking about, like, at which layer we should be able to extract the graph from the forward pass.
And based on that, divide the model into shards and do the model placement.
So far, we don’t yet have a good answer.
But things like TorchFX and JIT IR can definitely be helpful.
But we haven’t decided yet whether those should be the solution where we build on top of or we need something else.
All right.
Well, thank you very much, Shen, for joining me.
I’m hoping that we can do more of these interview style podcasts in the future.
Thanks, everyone, for listening.
Talk to you next time.
Talk to you next time.
Bye.
EP43 API-design-via-lexical-and-dynamic-scoping
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about lexical scoping, dynamic scoping, and how these programming
languages concepts relate to library design in PyTorch, specifically with regards to backwards
compatibility and other questions.
When I talk to people about working on PyTorch, sometimes I get questions from people who knew
me before I joined the PyTorch project as a Haskell developer working on compilers.
And they’d ask me if I was doing any programming languages stuff here in machine learning land.
And I’d always be very happy to answer people and say yes.
In fact, I use programming languages concepts all the time as a developer on the PyTorch project.
And today’s podcast about lexical and dynamic scoping is an example of how I use these concepts
from programming languages to reason about some actually fairly complicated API design questions
that, you know, as a Python library, PyTorch has to answer when we want to, you know, talk
about how we’re going to design an API in question.
So to start with, I need to explain what is lexical scoping?
What is dynamic scoping?
So lexical scoping is, so when we talk about scoping, we’re typically talking about how to
resolve what the meaning of a variable is.
So when I have a function, and I refer to the variable x, you know, how do I know what
x is?
Lexical scoping says that the value of x is whatever is lexically closest that defines the x in question.
And when I say lexically closest, I mean, imagine you’re looking at the source code of your program,
you see the x, you know, your eye wanders up outside of the enclosing blocks until you find a block that actually defines the x variable in question.
And that definition is going to be the one that your actual use of the variable is going to point to.
In contrast, dynamic scoping is a form of scoping where the reference to x doesn’t actually refer to, you know, whatever is lexically obvious.
Instead, there’s a concept of an implicit, you know, global variable, if you can think of it that way, which sort of gets changed whenever you do an assignment.
So what the value of x will be is not what you saw, you know, in the lexical scoping, but in fact, whatever the caller to, you know, your function, set the variable to be when you, when you before you actually called in the function.
So you have to look at the call stack to figure out what the value of a dynamically scoped variable is.
And so very concretely, in the Python programming language, there’s no native support for dynamic scoping, but a lot of the use cases that people use context managers for, you know, that’s the with statement where you can say with blah, and then inside of this block, something different happens because of the context manager.
Context managers are a very easy way to like implement dynamic scoping, because what you do is, when you enter the context manager, you set some global variable to some value, when you exit, you reset it to its original value.
And that’s basically equivalent to having done a dynamically scoped variable assignment.
And of course, you know, regular old variable references in Python are done lexically.
If you import modules and use identifiers from those, that’s also done lexically.
Okay, so up until this point, this is something that, you know, you might have gotten told about in your programming languages class in undergrad.
So what the heck does this have to do with PyTorch API design?
So the first thing I want to talk about is a sort of case study in what happens when you want to change the semantics of a library, or in this particular example’s case, the Python language itself, and why, you know, whether or not you choose to do this with lexical or dynamic scoping,
has pretty big implications on how usable the thing is.
So here’s how the case study goes.
So back in Python 2, the Python developers made a bad decision.
And the bad decision they made was that they defined the slash operator to mean integer division.
This was a very understandable mistake to make because languages like C defined a single slash to be integer division.
But what they found was that, like, lots of people were using Python to, like, calculators and stuff like that.
And they’d always ask things like, what is one divided by two?
And Python would helpfully or unhelpfully, from your perspective, say zero.
And that was very unexpected.
So the Python developers decided, okay, we want to change what the meaning of division is.
We want to change it from integer division to true division, so that if you divide one by two, you don’t get zero, instead you get 0.5.
Obviously, this is BC breaking.
So how are you going to deal with a problem like this?
Well, you want some way, when you have a BC breaking change, to let people opt into the new behavior before it becomes mandatory, and then only at some later point in time, namely Python 3, make it required.
So, you know, there’s this intermediate time when you can change the meaning of your program to switch from, you know, integer division into true division.
So how exactly did Python do this?
Well, Python actually needed to introduce a special mechanism called a future import to make this happen.
So the way the future import worked was that there’s this special module called future, and you could say, from future import division, and then what that would do was it would change the meaning of all of the slashes inside your current module to go from integer division to true division.
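Concretely, under Python 2 the opt-in looked like this (in Python 3 the import is a no-op because true division is already the default):

from __future__ import division

print(1 / 2)    # 0.5 with the future import; 0 under Python 2's default
print(1 // 2)   # 0; double slash is explicit integer division either way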
Now, if you’re like me, and you’re thinking, you know, why the heck do I have to introduce an entirely new language feature?
So future is not a module.
It is like a special language feature that changes how the Python bytecode interpreter interprets your program.
And why the heck do they have to introduce this new feature?
Why couldn’t they just have set, well, like something like, okay, instead of importing division from like the normal module, import division from the, you know, like true division module.
The same way, you know, if I had a function, you know, if I had a function, and I wanted to change the function semantics, I could have a v one of the module, and a v two of the module, and I could just pick which module I imported that function from to get one version or the other.
Well, the reason they needed to do this was because the division operator actually isn’t a function.
What division in Python desugars into is a call into a magic method, and whether it desugars into a call into the magic method div, or the magic method truediv, depends precisely on your version of Python, and whether or not you import future division.
So, in effect, the way that the meaning of division was defined was not by lexical scoping, which, in fact, in some languages like Haskell, the meaning of division is lexically scoped.
It’s provided by this prelude module that, like, is implicitly imported by your program, and that’s how you tell what the meaning of division is.
That’s not the case in Python.
Division always desugars into one of these method invocations, and method invocations, well, they’re not really lexically scoped or dynamically scoped.
Instead, it’s a form of dynamic dispatch where you ask the object what the meaning of the operation should be.
And so to change the method invocation that happens in this case, you actually need some actual, you know, juice from the language itself.
And so that’s why the future mechanism exists.
So, Python had this problem.
The problem they had was that they wanted to change the meaning of a method invocation in a backwards incompatible way, but they had no way of letting people opt into it one by one.
So, they introduced a language feature letting you change the meaning of the method from one thing to another.
In PyTorch, we often want to make BC-breaking changes to methods, but unfortunately for us, there’s no way to implement a similar future-style mechanism inside PyTorch.
You just can’t do it because it requires language support, and Python didn’t give us language support to do this.
The best approximation for this is to have some sort of global flag, which you can use to toggle between the old behavior and the new behavior in question.
But notice, this is very different from what future import division does, right?
Future import division only affects the division operators inside your module.
If you import some other module that’s using old-school integer division, that integer division stays the same way that it used to be.
So, it’s very local.
You can reason about what the meaning of division operators is simply by just looking at the top of your file.
With a global flag, you don’t actually know what the meaning is without walking up the call stack and looking for someone who actually set the global at some point in time.
And so, we actually try very hard not to do this in PyTorch.
And the reason why we do that is going to become clear in my second case study.
Case study two, device context manager.
To explain this case study, I have to first explain what a device context manager is.
And this is a little tricky because there’s no such thing in PyTorch, but it is a thing that has been requested over and over again by many different users.
So, here’s what this hypothetical mechanism would do.
When you write PyTorch programs, you often want to write your program in such a way that you have both CPU code and CUDA code.
So, what does this look like?
Well, you know, like you have your script.
You want to debug it and test it on CPU.
And then at some point, you want to rerun it again on CUDA.
And if you know anything about, like, PyTorch’s API, we don’t exactly make this easy to do.
You have to actually plan your program out and, like, explicitly, like, you know, parameterize over the device in question.
And then, you know, toggle that with your options.
If you just sort of write, like, really plain straight line code, you’re probably ending up hard coding that it operates on CPU or CUDA.
So, the device context manager is this concept that lets you write the naive code, like, allocate a bunch of tensors with no device argument, do a bunch of operations on them, and then implicitly change the meaning of the factory function.
So, that if you, you know, use this context manager and say, hey, set the default device to be CUDA, then whenever you do any inner calls to the factory functions in question, they will actually produce CUDA tensors instead of CPU tensors.
So, this is a decent example of dynamic scoping in action, right?
Like, when you use one of these context managers, it’s not just the, like, local calls to factory functions that are in your module that would be changed from CPU to CUDA.
It’s also all the inner calls to, like, all the modules you might be instantiating and everything else.
And this is kind of desirable, right?
Because, like, one of the things that people find very annoying about how things have to be done today is you have to, like, plumb the device you want down recursively into all of the, like, creation functions that you’re doing.
And in this case, this is, like, all of the submodules in your modules.
By the way, we used to not actually let you plumb device down, but Joel Schlosser very recently landed a patch to PyTorch that makes all modules take a device argument so you can change what the device is, you know, at module construction time.
Before that, you had to actually always construct your module on CPU and then move it onto the device you wanted.
And that’s kind of inefficient, and a lot of people didn’t like having to do that.
So anyway, so this device context manager would let you change, for example, where your modules get allocated without having to actually explicitly pass in this device argument.
And so a lot of people would like this.
It would make things very convenient, and we don’t want to do it.
Why don’t we want to do it?
Well, the reason we don’t want to do it is because of the fact that it, you know, actually recursively goes down and all of your calls in the call set change their semantics, right?
This is, like, both a blessing and a curse.
The blessing of it is that you don’t have to coordinate with anyone to change the device.
You just set this context manager, and then magically the meanings of all of your factory functions change.
The curse of it is you don’t have to coordinate with anyone.
So if someone writes some code that, like, assumes that torch.empty is just going to give you a CPU tensor because when I tested the code on my machine, it gave me a CPU tensor.
Like, you know, how difficult could this possibly be?
That code is going to unpredictably break.
And in practice, this code unpredictably breaks because we have a janky version of device context managers called set_default_tensor_type, which you can actually use to change the default tensor type from CPU to CUDA.
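For reference, that global toggle looks roughly like this; it needs a CUDA build to actually run, and it changes what every caller gets back:

import torch

torch.set_default_tensor_type("torch.cuda.FloatTensor")
x = torch.empty(3)   # now a CUDA tensor, even in library code that assumed CPU
torch.set_default_tensor_type("torch.FloatTensor")   # put it back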
Please don’t do this.
We really hate this function.
We want to get rid of it.
But this one, people always post forum posts being like, hey, I did this thing.
And, like, my code, some code library code that I’m calling doesn’t work.
So the, like, problem with untyped dynamic scoping is that it is a global tax on all code written in your library.
If you have primitive function calls that are modulated by some dynamic scope by a context manager, everyone who writes library code is obligated to make sure that their code works under all possible settings of the context manager.
So in this case, whenever I write a bare torch.empty and not bare torch.empty device equals CPU, I’m obligated to make sure that this will work even if you do a CUDA device.
And maybe this is, like, possible, and maybe this is even the right tradeoff to make.
But historically, PyTorch doesn’t have this requirement.
And so a lot of code is not written under this assumption.
And so if you wanted to add a device context manager and you wanted to do it right, and when I say right, I mean, like, this context manager actually works in, like, 99% of all the situations you use it in.
You actually have to go and painstakingly audit all of your Python code to make sure that it’s actually doing the right thing in this case.
Blech.
So, like, you know, dynamic scoping leads to unpredictable effects because it, like, lets you reach into code that wasn’t expecting to be modulated.
Sometimes this is a good thing, right?
Like, it saves you from having to explicitly pass arguments around.
If you’re Emacs, you know, actually, like, you love dynamic scoping because it makes it so easy to just set some variables and then use them later inside somewhere else without having to muck about with function signatures.
But, like, this implicitness also comes with a cost.
Okay, I have one last case study, and this relates to Torch function and also a sort of new mechanism proposed by NumPy for handling factory functions.
So, a little bit of backstory here.
So, Torch function is this thing where you can write an object, you put a Torch function magic method on it, and then whenever you pass these objects into torch.cat, torch.add, any of the functions in the torch namespace, we’ll actually just call this magic Torch function method so that you can override the meaning of operations involving tensor subclasses.
So, this is very useful, and you can use it to implement all sorts of interesting tensor-like objects without having to actually, like, you know, monkey patch all of, you know, PyTorch’s functions to, you know, do something different in this case.
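For instance, a minimal tensor subclass that just logs every torch.* call it sees might look roughly like this; LoggingTensor is a made-up example class, not anything in PyTorch:

import torch

class LoggingTensor(torch.Tensor):
    # Log every torch.* function that sees a LoggingTensor, then fall back
    # to the default behavior.
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"called {func.__name__}")
        return super().__torch_function__(func, types, args, kwargs)

t = torch.randn(3).as_subclass(LoggingTensor)
torch.cat([t, t])   # prints "called cat"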
But there is a problem, and the problem is Torch function is predicated on the idea that any given function operation takes in an actual tensor as an argument.
Because the way it, like, does dispatch is in the very Pythonic dynamic dispatch style, we look for an object that has a Torch function on it, and that’s the Torch function implementation we call.
So, what happens when you have a function that doesn’t have any tensor arguments?
And an example of that is a factory function, right?
Torch.empty, which just takes in a list of sizes and gives you a tensor in question.
So, custom classes have a problem, which is they need to also somehow override these factory functions, but they have no way of doing so because their standard mechanism of overriding is via dynamic dispatch.
But there is no dynamic dispatch in this situation.
So, there are a bunch of ways to solve this problem.
As the saying goes, if the mountain won’t come to Muhammad, Muhammad must go to the mountain.
So, if you, you know, want dynamic dispatch and the factory function doesn’t have dynamic dispatch, well, turn it into a call that does have dynamic dispatch.
So, we have a bunch of functions on tensors like new_empty and new_zeros, and, you know, you can use those in place of the good old-fashioned torch factory function in the main namespace.
And that will indeed work.
And then you just have to define those things in your Torch function to get things going.
And this just preserves the same property, right, which is that you are using the objects that are lexically in scope to do the dynamic dispatch to get to the implementation you want.
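As a quick illustration, the new_* factory methods dispatch on the tensor you call them on, so a subclass gets a chance to intercept them even though no tensor appears among the "arguments" of the corresponding torch.zeros call:

import torch

t = torch.randn(2, 2, dtype=torch.float64)
z = t.new_zeros(3)    # same dtype/device as t, dispatched via t
e = t.new_empty(4)    # uninitialized, same dtype/device as t
print(z.dtype)        # torch.float64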
There’s an elaboration on this idea, which is a NumPy proposal at this point in time, which instead of directly, like, creating new variants of methods for tensors for all the factories’ functions instead, wrap them up into a module call.
So, given a tensor, you can extract out a module that corresponds to the, you know, type of module that you would have called the factory functions on, but this one is specialized for the subclass in question.
So, what does this look like?
So, I’ve got a tensor, I want to create a new tensor, so on this tensor, I call the module accessor, which gives me a Torch module, something that looks like Torch, so it’s got empty, and it’s got ones, and it got zeros on it.
But this module is special, because if I call zeros on this module, I will actually get a tensor that is of the same subclass as whatever my original tensor that I got this module out from, from the beginning.
So, same idea, right?
Use the lexically scoped values to get out the module and then do the dynamic inspection on the module itself.
So, you just don’t have to, like, shove everything into the method namespace.
Of course, there’s another way to do this, and that’s using a context manager.
And this is actually more likely than you might think.
So, in previous podcasts, I’ve talked about Functorch, a method for doing, you know, functional transformations on PyTorch programs.
And in Functorch, there’s a very natural place where a context manager would be applied, and that’s when you use one of the higher order combinators, like vmap, to actually do an operation on a tensor.
So, when I enter the vmap, what I’m effectively going to do is I’m going to basically turn on the vmappiness.
And what that also means is that I might very reasonably want to override the behavior of all the factory functions as well implicitly when I do this.
And this is actually very natural, and, in fact, in JAX, this concept is called omni-staging, where, in previously, JAX only did data-dependent control flow, but at some point in the future, they realized, hey, actually, it’s really useful to be able to, you know, override the behavior of all these free functions, and so, you know, let me just go ahead and do that.
And so, that’s called omni-staging in JAX.
So, which of these is the right thing?
Well, if we look back to our previous case study on Device Context Manager, PyTorch said, hey, you know, we want explicitness.
We don’t want, we’ve got all this code that’s been written already that doesn’t think that you’re going to, like, change the meaning of things under your feet.
So, like, you know, let’s just make sure that you keep doing things explicitly.
And so, we don’t really want to add this context manager.
But then, when we look at this, you know, Torch function module case, you know, there is a solution that you can do to, you know, stay with the lexical attitude, which, honestly, is PyTorch’s attitude.
But you can also see that there is a lot of merit to doing the dynamic scoping.
And these problems of backwards compatibility don’t, you know, they’re not as pressing because although you might not have written your code so that it works correctly under CPU or CUDA, with VMAP, well, you know, you’re explicitly asking for VMAP in this case.
So, one is, you’re probably going to, like, make sure all the code you’re calling is stuff that works correctly in this case.
And two is that VMAP actually, you know, is very carefully written so that, like, the code on the inside looks exactly like you’re doing a single example case.
So, it really is supposed to work, even if you, like, change out the semantics of everything.
It’s just, you’re just, you know, adding these batch dimensions in a way that, like, your code should be indifferent to.
So, what’s the right answer?
Well, I don’t really know.
When I talk to people and they ask me for a device context manager, you know, I used to call over Greg, and Greg would be like, no, we’re not going to do this because everyone’s code is not going to work in this case.
Well, maybe.
If you’re willing to put in the work to make this all work correctly and all the library and all the ecosystem, I think, you know, some dynamic scoping might actually be pretty helpful.
But there’s a lot of work, and I want to see this work actually, you know, have an honest attempt for this.
That’s everything I wanted to talk about for today.
Talk to you next time.
EP44 pytorch-probot
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about PyTorch
ProBot, a simple bot based on ProBot that we use at PyTorch to do various operations
on GitHub. So what’s the point of having a bot on GitHub that will do actions for you
automatically? Well, as some members of the Rust community have put very eloquently, a
bot is a really good way of codifying otherwise very mechanical, very easy, but like, you
know, time consuming tasks that humans would otherwise have to do into an easy to do framework
that will do it automatically for you, right? So it’s kind of like a lint rule, right? Where
when you have a linter in part of your CI, you don’t have to then manually say, hey, you
know, I think that line length is too long in your pull request, the machine will automatically
do that for you. And you know, you can save human bandwidth for things that actually matter.
Well, ProBot is exactly this, its goal is to automate things that are otherwise easy to
do. And, you know, save our time for doing things like actually reading the issues you
send all of us. There’s a few pieces of functionality that we currently have implemented in PyTorch
ProBot. There’s three. The first is what we call CCBot. So CCBot is very simple. When you
have an issue on PyTorch, you can add labels to it. Well, CCBot lets you maintain a subscription
to any number of labels that you want to get CC’d on. And then when someone labels an issue that way,
well, CCBot will edit the issue and CC you on it. So that’s very useful, because otherwise, there
isn’t actually a good way to watch a label on GitHub. And you don’t actually want to be, you know,
wading through all of the issues on GitHub. And if you are, you know, even if you are pretty good about like
looking over the issue list by hand, if you have a lot of labels you want to keep abreast of,
well, that’s a pretty complicated, you know, search query that you need for that situation. So it’s
easier to just get them all in your inbox, and you can decide what you want to do with them. I subscribe
to a lot of issues this way. Really, there are too many issues in PyTorch for any one person to
process. So CCBot is a really good way of making sure you get CC'd on the stuff you're interested in,
even if you're not keeping an eye out for them at the ingestion point. The second piece of
functionality that Probot does is a label bot. So what label bot does is if x is labeled with blah,
then also label it with blah. And one of the like, use cases we do for this is for high priority. So
the way that high priority works in the PyTorch repository is once again, we have a lot of issues,
a lot of these issues are very minor and don’t really matter that much. And to make sure
we don’t lose the important issues in the big sea of issues, we have a high priority label. So what
you do is, when you think something is something that should actually get fixed, you can label it
with high priority. Now, the problem is, people don’t necessarily always agree on you know, what
high priority is. And we also have a socialization problem, which is like, you say you’re new to the
PyTorch project, and you know, you want to know whether or not something is high priority or not,
how the heck are you actually going to know this, right? Like, you’re going to be like,
oh, well, I don’t really know what it means to be high priority. And then you might be conservative,
and you might not mark an issue as high priority when actually it is high priority. And
the problem is no one else is reading the issue, because you were the one who was supposed to triage
it. And then we just lose that issue to the sands of time. So the idea behind the label bot is well,
whenever someone marks something as high priority, we also add a label triage review. And what that
means is that in our weekly triage meeting, we need to go over this issue and discuss why we think it’s
high priority. And, you know, what we're going to do about it. Actually, not so much
what we're going to do about it, but just, you know, why is it high priority? And the function of
this, because most of the time when people label things as high priority, they stay high priority,
like I’d say 90% of issues are like that. But the point of this is that everyone can easily see,
hey, here are all the high priority issues that are going on. This is what we collectively as a team
think of as high priority. It's a really good way of socializing issues in this way. But we couldn't do
this if, you know, when someone labeled something as high priority, we didn't also say,
please review it in the triage meeting. And finally, there's a new feature that was developed
by Eli Uriegas and Sam Estep for triggering CI jobs when a label is added. So one of the things that we
have as a problem is we want to build on a lot of configurations, but actually building all those
configurations is pretty expensive. So we don’t want to actually build everything initially. And if there
is some exotic configuration that you think your PR actually needs testing on, well, you can add a label
to your pull request, and that will trigger extra tests. And how are those tests triggered? Well,
that's done by ProBot once again. So that's really it; ProBot's logic is not that complicated. And I just
want to talk a little bit about the probot framework, which I decided to use after much humming and hawing,
you’ll see why in a moment, and also some sort of meta points about how probot was designed. And you
know, why I think these design ideas are actually good ones for the framework. So first, why probot?
And actually, this was a not an easy choice for me, because probot, the framework is a JavaScript
framework. And well, you know, we’re PyTorch, we’re a Python shop. So I would have ideally liked it if I
could have written my bot in Python. But probot won me over by a number of pretty useful features that I,
in fact, did appreciate a lot when I was developing this extension. So for one, when I'm
developing with the ProBot framework, I can actually run my Node app locally. So like, you know, I'm
hacking on my laptop, I got my source code, I made a change, and I can run it. And then you know, I’ve got
my GitHub app going. And I can actually associate it with a real GitHub repository, and do you know,
smoke testing by modifying the GitHub repository, that triggers some hooks, which get bounced to my
local instance on my laptop, processed, whatever, you know, API calls I’m going to do, and then actually
see it show up on GitHub. And the way this is done is they have this, like, reflector service:
if you install one of these dev instances, you register the reflector
service as the host name, because typically your MacBook isn't publicly addressable. And then it
bounces the request back to your, you know, local instance, which
has subscribed to the reflector directly. That's pretty awesome. And it made developing very easy,
because normally, when you’re developing these hooks, it’s very annoying to like generate synthetic events,
because, oh, you know, you got to like, go and trigger the hook and then download it from GitHub,
and then save it to some fixture, blah, blah, blah. Here, you can just like directly just muck around
with the repository and see what actually happens with Probot on the fly. That’s pretty nice. Probot also
has some opinions about testing, mostly, you know, based on mocking. Mocking isn’t my favorite way of
doing testing, because it’s very manual, you have to like, you know, create the fixtures, create what the
outputs are, and you know, get that all going. But it’s a very convenient way when you’re dealing with an
external service like GitHub. And no, you don’t actually want to be hitting the actual GitHub endpoint API, if
you're actually, you know, running your test suite. Of course, what would be even better is if someone wrote a crappy
reimplementation of GitHub, with support for the GitHub API and the GitHub hooks and the GitHub
notifications, so that I could like just stand up a like local copy of GitHub, and then you know,
test against that. Well, so I can always dream, I actually have like a very small implementation of a
very small fragment, that is what I need for implementing ghstack. But, you know, you can hear
about that in my ghstack podcast, which is in the past. And finally, Probot, you know, had existing
documentation for how to deploy it on AWS Lambda. And this was very attractive to me, because
when I was developing ProBot, it was kind of one of these things where it's like, okay, I want to build
this thing. And then I want to forget about it and not have to worry about it ever again. And if I had
to like, stand up a server, and then actually, you know, maintain the server over time, well, I’d have to
take software upgrades and like, you know, kick the server when it goes down. Oh, don’t want to deal with
that. But if it's on Amazon Lambda, that's great. And I don't have to worry about it. Well, I mean,
I do have to worry about it if Lambda, like, changes how their API works. But at least I don't have to
worry about running a server. That's the so-vaunted serverless, which, you know, a lot of people are
like, actually, you know, we've gone too far, you know, serverless is not so great. But I think in this
particular case, serverless was a really good call, because it just reduces the maintenance costs.
And that really, like, gets me to the meta points, right? Like one of the, like, enduring goals with
the design of PyTorch ProBot was that I wanted to have as little maintenance as possible
on this. And so one answer for that is to, you know, put this as a serverless deployment so that I
don’t have to worry about administering the server. Another thing is that ProBot has no state,
there's no database, there's no persistent state. The CCBot is an interesting case,
which is, like, you know, we need to know who we're going to CC when a label is
added. And the way ProBot actually does this is we have this GitHub issue, and the GitHub issue's body
contains the text of all the subscriptions. And so what ProBot just does is it loads up the GitHub issue
on startup time. And then like, that’s, that’s how the state gets managed. And this is very,
very simple. We can install a webhook for listening to issue updates so that when the issue gets updated,
we know to redo it. And then when the Lambda instance dies, well, you know, the next time
it spins up, we’ll just refetch it again from GitHub, no biggie. So we’ve like offloaded the state onto
GitHub, which you know, is a big company and actually in the business of running a bunch of web servers and
databases to maintain GitHub. And now we no longer have to maintain a database ourselves. I can’t
stress how useful that is. And of course, we give up some stuff to do this, right? Like, for example,
you can't actually subscribe to labels on CCBot, unless you actually have write permissions on the
PyTorch repository, because otherwise, you can't edit this issue. But like, we hand those out like candy.
So like, if you want to like, do one of those things, just ask, or you can just ask one of us to
like add you to the CC list. And we can do that for you as well. And another thing is that, you know,
we don’t want ProBot to be this thing where it can break in an unpredictable way, and then like go into
an infinite loop, like repeatedly adding labels and everything. So ProBot is designed to be idempotent.
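ProBot itself is JavaScript, but as a rough Python sketch of what idempotent means here, using the standard GitHub REST endpoints for issue labels (the owner, repo, and token below are placeholders, not the real bot's configuration):

import requests

OWNER, REPO, TOKEN = "pytorch", "pytorch", "<token>"  # placeholders

def ensure_label(issue_number, label):
    # Add `label` only if it is not already present, so that processing the
    # same webhook twice (or running two copies of the bot) has no extra effect.
    base = f"https://api.github.com/repos/{OWNER}/{REPO}/issues/{issue_number}/labels"
    headers = {"Authorization": f"token {TOKEN}"}
    current = {l["name"] for l in requests.get(base, headers=headers).json()}
    if label in current:
        return  # already labeled: this delivery is a no-op
    requests.post(base, headers=headers, json={"labels": [label]})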
So I might accidentally deliver the webhook again, or, like, be running multiple copies of ProBot, which is
what I was actually doing at some period of time. (ProBot can be deployed on Lambda; it can also be
deployed on GitHub Actions. I tried deploying it on GitHub Actions, and at the time, GitHub Actions had
a really long latency, like it took up to a minute before the GitHub Action ran. And I really
liked adding a label and seeing it instantly show up, like, you know, less than a second later. So Lambda
was the only way to do it. But I had both of these running at some time.) And if the bot wasn't
idempotent, then, you know, like, bad things could have happened in this case. But like,
if it’s idempotent, that does make the types of operations you’re allowed to do with a bot,
you know, less complicated. But it also just, you know, makes it harder to have accidents with the
framework in question. And finally, I talked a little bit about ProBot's testing framework. So
I was wondering if I would just do live testing and then call it a day. And I was like, in the end,
no, I actually want to test this code, there is some non trivial parsing code associated with like
CC bot. So I went and like, got the testing set up, I like figured out how to do testing in node JS,
which like was kind of annoying, because I’m not really a node person, I don’t really know about
like npm. But like, it’s all there. And that was really nice, because I doubt anyone else would have
spent the time to add the testing framework. So it’s like making sure the initial infrastructure
exists beforehand, is really helpful when you want to hand the project off to someone else.
And they’re probably not actually going to like add tests unless there’s already a testing framework
there. So I think that paid off. ProBot can use developers, for example, something that I’ve wanted
to do for a really long time and haven’t done because I’ve just never gotten around to it is we
have a CC bot, and it works for issue labels, but it doesn’t work for pull requests, because I just
never set up listening for labels on pull requests. So that would be a really nice feature to have so that
people could also tag pull requests, and you could get CC’d on them in that case as well. All right,
that's everything I want to talk about today. Talk to you next time.
EP45 Memory-layout
Hello everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about memory
format in PyTorch. To answer what memory format is, I want to talk a little bit about
how tensors are laid out in memory. Tensors are multi-dimensional quantities. You can have
as many dimensions as you want in a tensor, and that distinguishes them from good old-fashioned
vectors, which are one-dimensional, or even matrices, which are two-dimensional.
And so, although there's this dimensionality, although you can have cubical data or just
arbitrarily high-dimensional data, when it comes down to it, when you want to actually store this
data in your memory, well, memory is linear, right? Standard CPUs let you address memory via a
numeric pointer that is laid out entirely linearly, right? Like you’ve got address zero,
then you’ve got address one, two, three, and so forth and so forth. So when you have a multi-dimensional
tensor semantically, you need a way to linearize this into some actual concrete ordering in memory
that doesn't have any concept of dimensions; it's strictly one-dimensional. And this linearization
is the layout of a tensor in question. To give an example, it’s helpful to look at the two-dimensional
case. So in a two-dimensional matrix, let’s imagine, for example, a matrix where reading left to right,
top-down, I have one, two, new line, three, four. So this is a square matrix, one, two, three, four.
I want to figure out how to lay this out in memory. And there are actually two reasonable ways you can go
about doing this. So one is you can read out the rows and sort of paste out the rows side-by-side in
memory. So when I read this to you left to right, top-down, right? And so I said one, two, and then
three, four after the new line. And so you can actually just lay it out in this order. So in
memory, you see one, two, three, four, like directly laid out in this way. And this is what we call C
order or row major order because you first do the rows. And this is what like the layout you’ll get
with PyTorch. And like with any sort of C programming language where you do a multi-dimensional array,
this is exactly how it’s going to go. But there’s another choice, right? Instead of reading from left to
right, top down, I could read top down first, and then left to right. I could do the columns first.
And this gives you so-called column major layout. So when laid out, I would have one, three, two,
four, right? Because the first column is one, three, and the second column is two, four. And so in this
case, like the layout on disk is different. And in fact, if you were writing in Fortran, this is in fact
the order that your arrays would be. Which one is better? Well, it depends. I mean, it depends on
what kind of algorithm you want to do. And like either of these could be valid representations for
your data in question. Of course, it’s often, there is often a convention, because when people write
kernels, they usually want to make an assumption about how things are laid out. And so for example,
in PyTorch, the convention is you just assume things are row major, unless, you know, like
something special has happened. Another example of layout, and this one is much more germane to deep
learning is in image processing. So in images, what is the typical thing that you need to do? So an image
consists of a bunch of pixels, of some height and some width. And typically images have multiple colors.
So you need a channel dimension that represents, you know, this is the red color, this is the green color,
this is the blue color. And of course, because you’re typically doing, you know, gradient descent on batches,
you also have a batch dimension, so that you have a bunch of images sort of stacked up on top of each other.
And so the standard representation for an image in PyTorch is what we call NCHW.
So what that means is first, the first dimension is the batch dimension. The second dimension is the
channel dimension. The third dimension is the height dimension. And the fourth dimension is the width
dimension. So if you imagine, if you want to imagine what this looks like in memory, for a moment, let’s
forget the batch dimension, the channel dimension comes first. And when the channel dimension comes first,
or is the so called outermost dimension, that’s the thing that changes least frequently when you are
going through the actual linearization in order. So just to like, you know, go back to the example,
right, so you’re going to have a red and a green and a blue channel. So what is the image going to look
like in memory if you have an NCHW layout? Well, first, you're going to see all the reds,
right, like all the red pixel values, you know, for each row going down your tensor.
And then you’re going to see all the greens, green, green, green, green, right, for once again, the rows
going down the image, and then you’re gonna see the blues, right. So what you can imagine in memory is you’ve
got this region for red region for green and region for blue. And why is it like that? Well, that’s because the
channel is first. So it is the thing that changes least frequently. Of course, there’s another layout that is
commonly used. And that layout is called NHWC. So instead of channels being first, channels are last.
And so in this particular case, right, because the channels are the innermost dimension, it's
the thing that changes the most frequently. So in fact, if I looked in memory, what I'd see is RGB,
RGB, RGB, RGB. So like, you know, every time I’m, you know, handling a pixel, I’m going to put down all the
values for each of the channels before moving on to the next pixel. So it looks like, you know,
what an actual, you know, LED screen would actually look like in this situation. And of course, which of
these are better? Well, once again, it comes down to the algorithms. But it turns out that in, you know, some
convolution algorithms, they're just more efficiently implemented in NHWC. So that's, like, sort of why a lot of
people want to be able to use channels last layout: to make their code run faster because of these special
kernels. So we could just stop here and say, okay, well, Ed, you know, great, I know what NCHW is, and I
know what NHWC is. So now, you know, whenever I get an image tensor, I just need to know if it's one of these or the
other, and aren't I done? And that is okay. But, um, you know, when we wanted to add support for both of
these tensor types, we had a problem. And the problem we had was we didn’t want to actually force people to
keep track of, you know, what layout their tensors were, right? Like they could do this, and they were
doing this. But it was a pain in the ass to actually have to deal with this for all of our operators. Like
just think about convolution for a second, right? Convolution needs to know what we’re actually going
to, you know, do the convolution over with regards to the channels, and what we’re not going to do over.
So if you have an NCHW tensor, you need the convolution to operate with the channels in the
first position. And if you have an NHWC tensor, you need to have the convolution operate with channels in
the last dimension. And these are different algorithms. And you need to actually tell convolution
what type of tensor you actually pass in. And tensors are these very dumb, you know, like
n-dimensional arrays, they don’t actually have any semantic content. So that’s something you’d have to
keep track of externally from the tensor. And that’s a pain. And we didn’t want to have to do that.
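To make the two orderings concrete, here is a small sketch with a tiny fake image, where the integer values just encode which channel each entry belongs to:

import torch

# one image, 3 channels, 2x2 pixels; value // 4 tells you the channel
img = torch.arange(12).reshape(1, 3, 2, 2)

# NCHW (contiguous) memory order: all of channel 0, then channel 1, then channel 2
print(img.flatten().tolist())
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# NHWC-style order: the channels are interleaved per pixel, like RGB, RGB, RGB, RGB
print(img.permute(0, 2, 3, 1).contiguous().flatten().tolist())
# [0, 4, 8, 1, 5, 9, 2, 6, 10, 3, 7, 11]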
So what did we do? Well, to answer this question, I have to take another detour and talk a little bit
about how we implement this linearization under the hood in PyTorch. And this is done using strides.
So what are strides? So I said that, you know, layout, memory layout of a tensor, of an n-dimensional
tensor, is all about taking your, you know, various elements, and then laying them out in a linear
sequence of addresses in memory. Well, strides are a way of computing, given any given coordinate in the
logical tensor, where does it physically lie in the actual linear memory address layout. So let's just talk a little bit
about, for example, C layout, which is what PyTorch does. So in C layout, right, the outermost dimension,
the dimension that comes first, is the one that changes least frequently. Or in other words, like to
get to the next element, the next slice in that dimension, you have to jump a bunch of elements
further, right? That was the R, R, R, R, R, G, G, G, G, G, G, B, B, B, B, right? So you need to like jump
four R’s to get to the G’s, and then another four to get to the B’s. But on the other hand, if you’re an
innermost dimension, one on the very end, well, you just, you know, can look at the next element and see
what the element is in that situation. And this is the concept that strides do. So stride says,
for any given dimension position, how much do I have to advance the physical memory pointer
to get to the next element corresponding to that dimension? So if your innermost dimension is fast
moving, then, sorry, if the innermost dimension is the one that changes, you know, all the time
contiguously, then I say the stride for that is one. Because if I like want to move to the next
element, I just go to the next physical memory layout, it’s all laid out contiguously. Whereas
if I’m on the outermost dimension, and I want to, you know, jump really far, then I might give it a
stride of say four, if this was a, you know, size four tensor in the outer dimension. And that just means,
hey, to get to the next element, you have to jump four elements ahead. So going back to our like
original example, one, two, three, four, that square matrix, right, the strides for this in C layout would be two, one. To
get to the next element within a row, you only need to look at the next spot in your contiguous
memory, so that stride is one. But to, you know, move a row down, you have to jump past the
entirety of the row. And so that's why the stride is two, because two is the size of the row that
you have to skip across to get to the next thing. And of course, if you have Fortran layout, then your
strides are simply one, two, because when you want to move down to the next row, well, each column is
now laid out contiguously. So you just advance it by one. But if you want to see the next column, well,
those are not laid out contiguously, and you have to jump. And so the stride in that case is two, right?
So C layout is two, one, the strides are in decreasing order. And Fortran layout is one,
two, the strides are in increasing order. And in fact, you can flip between these two strides just
by using transpose in PyTorch, which doesn't do a copy, it just, you know, fiddles around with the
strides and then gives you a new tensor with those different strides. Okay, so what the heck does this
have to do with memory layout? Well, we had a very clever idea to make memory layout work. So PyTorch
originally only supported NCHW, and all of our convolution operations assumed that you would put
the channels first when you call them. So what we said is, hey, let’s just double down on that. So the
user visible API, the logical view on tensors, always requires channels to be in the first position right
after batch. But if you want to use channels last layout, well, no one said that the NCHW logical
layout had to correspond to NCHW physical layout, right? It could, and that would be the case when
the strides are strictly decreasing. But it could also remap to a physical layout that actually lays
things out in NHWC. I'm not going to tell you what the strides are in this case, because it's not the
obvious one. It’s not the permutation from NCHW to NHWC. It’s the reverse permutation because reasons.
Try deriving that by yourself if you’re actually interested. And so by doing it this way, right,
like the physical memory layout is what the kernel actually cares about, because like the kernel,
like is going to run faster because of something that it’s doing regarding memory locality. But at
the same time, we can still give the same user experience where like a convolution always takes
an NCHW tensor, and it just might happen to be one of these weird transpose tensors that is represented
differently in physical memory. Some things to know about internally how we implement this.
So although we store strides, and in principle, you can calculate whether or not something is NCHW or
NHWC from the strides, it's kind of expensive to do this. So we actually have this giant bit field on
tensor that like has all the common memory layouts that you want to often test for, like when you’re
doing convolution. And these are all just pre computed based off the strides to make access fast. I kind of
hate this design, but it is very expedient. And it indeed does have performance benefits.
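Here is a minimal sketch of both ideas with current PyTorch APIs: the strides of the little two-by-two example from before, and a logically NCHW tensor whose physical layout is channels last.

import torch

x = torch.tensor([[1., 2.], [3., 4.]])
print(x.stride())      # (2, 1): C / row-major order
print(x.t().stride())  # (1, 2): transpose just swaps the strides, no copy is made

# A 4-D image tensor: the logical view is always NCHW, but the physical
# layout can be channels last.
n = torch.empty(2, 3, 4, 4)                    # contiguous NCHW
cl = n.to(memory_format=torch.channels_last)   # same logical NCHW view
print(n.stride())    # (48, 16, 4, 1)
print(cl.stride())   # (48, 1, 12, 3): channels move fastest in memory
print(cl.is_contiguous(memory_format=torch.channels_last))  # True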
There’s one last interesting thing about memory layouts done in this way that I want to tell you
about. And this is the ambiguity problem. Let’s imagine that I have a one by one tensor. Well,
the strides for this tensor are one, one. Why is it one for rows? Well, because there is no next row to go to. And even if there were
more rows, I would only have to advance by one to get to them, because the size of the row is one. So like, you know,
advancing it is easy. When I have strides that are like one one, where I have one of these one size
dimensions, I cannot tell what the layout is, I cannot tell if this is row major or column major, because the
strides just don’t have any information for me. And this is a problem. Because one of the things that we need
to do when we are doing memory layouts is we need to propagate memory layouts, right? Like it’s no good
if I feed in an NHWC tensor, expecting convolution to get it and use my efficient, you know, channels last
kernel, if somewhere in the middle, I have an operator that takes in one of these tensors, and then just
calls contiguous on it. And the meaning of contiguous is put it in NCHW format. So it’ll go ahead and do
that. And then well, sucks to be you like you’ve just lost all the optimization opportunity. And so
when you have tensors, which like lose this layout information, you might actually make the wrong
choice and turn it back into an NCHW tensor if you, like, expand the size. This has happened; Natalia
Gimelshein fixed a bunch of these cases when we were originally trying to figure out how to do this. And
like, most of the time, the way we resolved it was like, there was some extra data, there was some
other tensor that we could rely on to get the information that we needed. There’s also some
conventions you can do when you’re writing out the strides, because actually, you have a lot of degrees
of freedom when a stride is for a size one or size zero tensor, right? Like, if your size, if your tensor is only
size one, it doesn’t matter how big or small your stride is, because you’re never going to actually use it,
you only ever multiply it with zero. And you know, you never multiply it with one, because that would
imply there were two elements. I had a proposal for solving this problem called layout permutations,
where the idea was, instead of only storing the strides, we also store a layout permutation that says
exactly what the permutation is. This would also solve the ambiguity problem, because when I have strides
one, one, I would also know via the permutation, if it was zero, one, or one, zero. But we never implemented
this because it was kind of a lot of work. And we solved most of the most pressing problems
by just manually fixing them. So that’s it about memory format. Memory format lets you, you know,
move around your dimensions and get faster kernels. That's everything I wanted to say for today. Talk to you next time.
EP46 Reference-counting
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about how we do
reference counting in PyTorch. You might think of reference counting as something that isn’t
all that interesting, especially in C++ where there are plenty of classes like shared pointer
that allow you to do reference counting without having to think very hard about it.
Well, there’s actually a lot of subtleties doing reference counting in PyTorch and I want to talk
about a few of the things that are going on here. So one of the very first things that you figure
out when you look into reference counting in PyTorch is that we don’t actually use shared
pointer for most things. Instead, we use this thing called intrusive pointer. Intrusive pointer is the
term of art for reference counting schemes which store the reference count for an object directly
on the object itself. So this is in contrast to shared pointers in C++, which work on any type of
object, and the way they do that is the reference count is stored in what's called a control block,
which is allocated separately from the object in question. Of course, if you use make
shared, the control block and the actual object in question will be allocated together in one allocation.
But in general, when you have a shared pointer, it’s actually two pointers. One pointer to the
control block and one pointer to the actual object in question. So that’s a little wasteful and it also
makes it difficult to take a raw pointer and convert it into an owning pointer. So in PyTorch, we implement
all of our reference counting using intrusive pointer. So the intrusive pointer stores the ref count on the
object. You have to inherit from an intrusive pointer base which says, hey, here’s where the ref count is,
here’s the memory layout that intrusive pointer expects. And then intrusive pointer is just, in fact,
the actual, you know, smart pointer class that handles the reference count increment and decrement
when things go in and out of scope. So the tensor type that you all know and love is exactly simply
a wrapper on top of an intrusive pointer to the tensor impl, which actually contains the tensor data in
question. And tensor impl has a very minimal API. And then tensor, the wrapper class actually has a ton of
extra methods defined on it, which, you know, lets you do all the good old fashioned method calls that you
want to do in PyTorch. So the ref count on intrusive pointers is atomic. So it means that PyTorch does work
correctly in a multi-threaded setting. But it also means, like its shared pointer brethren, atomic
operations are actually quite expensive. And that also means that intrusive pointer bumps are also
expensive. Why, by the way, are atomic ref count bumps expensive? Well, the reason is that when you do an
atomic operation, your processor has to actually bounce the, you know, cache line, which, you know,
it previously could just directly operate on, back into main memory to make sure things get consistently
seen by the other cores in question. And that communication is quite expensive. In contrast,
Python does a lot of ref counting, and people don’t generally think of, you know, increasing or
decreasing ref counts in Python as very expensive. And that’s because Python ref counts are actually
non-atomic, and they’re protected by the global interpreter lock. So, you know, the interpreter only
runs in a single-threaded fashion. And, you know, increments and decrements that are not atomic,
that are not locked, those are very cheap to do. So because tensor ref count bumps are very expensive,
we actually go through quite a lot of trouble to avoid actually doing ref count bumps when we can.
And in fact, in PyTorch, when we write functions, like we write operators, typically the lifetime of
tensors is very, very regular, right? In particular, is that, you know, when we call a function with a bunch
of tensors, those tensors are going to stay live for the entirety of the function. Because, you know,
what are these functions doing, they’re not storing things in data structures, they’re not
destroying anything, right? They’re just reading in the tensors as input, and then doing things with
those. So in fact, everywhere in PyTorch, where you know, you don’t actually want to steal a tensor in
question, we just pass around const tensor ampersand, which is just a very convenient way of writing,
hey, pass in this tensor, and don’t actually, you know, do a reference count bump when you pass it in
in this way. Now, if you’re a veteran C++ programmer, you might be thinking to yourself,
hey, why are you doing a const reference to a shared pointer type, which actually points to the object in
question? Isn’t that a double indirection? Shouldn’t you just be passing a, you know, tensor
impl star, or some sort of direct pointer to the object in question in this situation? And really,
the answer is, yes, you would be right. In an ideal world, this is what we would do. But remember that
tensor is a type that has a lot of methods on it. And tensor impl is a very bare bones type. So, you know,
when we were originally writing out the A10 library, we had this problem, which is that, well, you know,
these tensors that people want to take in a non owning fashion, well, these people still want all of the
methods, all of the, you know, useful, convenient stuff that’s only on tensor and not on tensor impl to
be available in the situation. And if you pass a tensor impl star, well, you're not going to get any of
that information. So, you know, at the very beginning, we were like, okay, well, we’re just going to const
tensor ampersand. And, you know, that’ll be very easy and convenient to do. And you’ll get all the API
that you had before. And then the rest is history. So like, everywhere you look in PyTorch, you’re going to see
const tensor ampersand all around everywhere. There’s also a little bit of nuance here, which is that if you have a
const tensor ampersand, you might be thinking to yourself, hey, you know, maybe I should just pass it by value. And
that, you know, also, whenever I get to move into the tensor in question, doesn’t that, you know, save me a reference
count bump in that situation anyway. And certainly, if you are dealing with a function that wants to take
ownership of the tensor in question, this is certainly a good thing. But once again, most of the functions
in PyTorch are borrowing from the tensor, they don’t actually take on ownership. And there’s this funny
business with the Itanium ABI, which says that if you have a non-trivial class (an intrusive pointer is a
non-trivial class, because it has a destructor that's responsible for decrementing the ref count
when it exits), you must put it on the stack so that I can take a
pointer address to it. So I'm not allowed to pass in an intrusive pointer to a tensor impl directly
inside of a register, it always has to be on stack. It’s a kind of crappy thing about the ABI. It actually
is one of the reasons why unique pointer is not a zero cost abstraction, you pay for using unique pointers
instead of raw pointers that you just manually alloc and dealloc. But you know, basically, whenever you
say const tensor ampersand, that's basically what, you know, people were doing anyway, when they were
forced to put their intrusive pointers on the stack. So it’s no worse, really. So taking stock where we are
right now. So we’ve got tensor, tensor is a reference counted type. It internally is represented as an
intrusive pointer to a tensor impl, which actually contains the actual data for the tensor in question.
Reference count bumps in pytorch are atomic and therefore expensive. And in order to get around
that, most people pass around tensors as const tensor ampersand. By the way, this const on the const tensor
ampersand means that you’re not allowed to mutate the reference itself, right? So like if I had a
tensor x, and I pass it into a const tensor ampersand, you wouldn’t be allowed to, you know,
set x equal to y. And that would change what the binding was at the top level. What it does not mean
and what something that is very easy to get confused about is it does not mean that the tensor itself is
const, and we’re not allowed to mutate it, you’re allowed to mutate whatever you want. Const correctness
on tensor is not actually a thing. And this is because when we say const tensor ampersand,
we mean a const reference to a mutable tensor, not a reference to a const tensor, which in, you know,
shared pointer parlance would have been shared pointer, open angle bracket, const tensor,
closed angle bracket. That's just sort of not representable if you just say tensor, because
tensor is already, you know, an intrusive pointer to a tensor impl. So you'd have to like come up with a
different type, like const tensor in that situation, which, you know, might not be a bad idea. And
there’s an issue about this, and someone should go about and implement this at some point in time.
A funny problem happens occasionally, when you’re working with this tensor type, which is that sometimes
you have a tensor impl star. Remember, one of the perks of doing intrusive pointers is you can pass
around a bunch of raw pointers to the objects in question. And then you can always easily convert
these into real, honest-to-goodness shared pointers. You can't easily do that with a shared pointer,
because, well, you know, you need to somehow get at the control block. That’s why enable shared from this
is a thing that, you know, is an extra bit of information that records where the control block
is. So you can always get to it when you need it. So your problem is, you've got one of these raw tensor
impls, and you want to pass it to one of these const tensor ampersands that I said are all over the
code base in PyTorch. And here's the problem: to do this, you actually need an honest to goodness
tensor class. Although the tensor class is, you know, representationally equivalent to a raw pointer,
because at the end of the day, it contains a C10 intrusive pointer. And what is a C10 intrusive pointer?
It’s just a raw pointer with a bunch of specialty structures. C++ does not allow you to actually
interchangeably, you know, convert between these two representations. So like, you’re kind of stuck,
right? To actually pass a tensor impl star to a const tensor ampersand, you have to somehow
manufacture a tensor. But manufacturing a tensor, you know, ordinarily gives you a ref counted owning
object that is obligated to destroy the tensor, you know, decrement the ref count when the tensor goes
out of scope. So it seems kind of like you’re out of luck, right? Like you want to create a non-owning
const tensor reference, but you can’t do it, because well, you know, you have to make a tensor and tensor is
getting in the way. So Scott Wolchok had a really good observation about how to solve this problem,
right? So remember that the problem is that if we created tensor, well, one is that, you know,
ordinarily, you have to increment the ref count when you create a tensor, but you could imagine
skipping that. But then when you destruct the tensor, the tensor will actually decrement the ref count,
right? So you've got two ref count bumps you need to somehow get rid of. But intrusive pointer actually
has a condition in its deallocation. And the condition says that we only decrement the ref count,
if the intrusive pointer actually is non-null. If the intrusive pointer is null, we skip the
decrement altogether. And this behavior in the destructor gives us an out, right? What it says
is that if I manually clear the intrusive pointer before the destructor of tensor runs, then the
destructor of tensor will see that the pointer is null, and it’ll skip the decref. So all I need to do is be
able to release an intrusive pointer, nulling out the value on the inside without decrementing the
ref count, and I can get by scot-free. And this is the idea behind tensor ref.
So how does tensor ref work? So tensor ref is a class, it contains a tensor as its member, but it’s intended
to be a non-owning version of tensor. So you are able to construct these without incrementing ref count
bumps. And when you destruct these, no ref count bumps happen. On construction, what you do is you
take a tensor, and you take the raw pointer for that tensor, and you manufacture a new tensor object
without actually incrementing the ref count. Intrusive pointer actually has an API for doing this. It’s like
don’t increase ref count tag in the constructor. It used to be private, but you know, we made it a little
less private so that we could do this particular thing for tensor refs. And then when we destruct
the object, well, destructors for child classes run before parent classes. So in the child class
destructor for tensor ref, what we do is we release the pointer. So what release does is it sets this
intrusive pointer to null and skips the ref count bump. And now the parent destructor, which, you know,
is going to process the members in the class in question, namely the tensor, will see that, well,
it's a null pointer, so there's nothing to do. So you've bypassed the increment ref count and decrement
ref count in both cases. And once again, what was the point of doing all of this? Well, now I have a way
of, given a tensor impl star, I can create a const tensor ampersand, right? I do that by creating
one of these tensor refs, which internally contains a const tensor ampersand. And that’s the way that I
can actually then call these functions without having to do any reference count bumps. So this is a
pretty good, cool idea. And we actually never implemented it. And the reason we never implemented
it was because, well, you know, tensor ref is an entirely new class, C++ doesn’t have dot overloading,
that is to say, there’s no way to say, hey, given a class, here’s what the meaning of all dot foo
operations means, because then I could just forward it to tensor. So actually, we’d have to code
generate all of the same methods that used to live on tensor on tensor ref as well. That was kind of a
pain. And so no one has gotten around to doing it. However, Meghan Lele has been working on a similar
concept, optional tensor ref. So what is optional tensor ref? Well, optional tensor ref is for those
situations where you want to optionally pass in a tensor to one of the kernels in PyTorch, or maybe there’s
no tensor at all. Previously, we implemented these as a std optional tensor, but there’s a problem with
this implementation. Do you see it? std optional tensor with no extra references or pointers or
anything like that implies that you’re getting an owning reference to tensor. So in fact, to call a
function like this, you have to do a reference count bump. That’s bad. And you know, we kind of mess this
up. And we’re trying to fix it with structured kernels. So optional tensor ref doesn’t have this
problem. It also is a little more efficient than optional tensor, because with optional tensor, the optional
class is obligated to store whether or not the optional is full by a separate Boolean. But you know,
we can actually just represent that as a null pointer tensor inside optional tensor ref. And finally,
optional tensor ref doesn't have the API problem, because, well, you expect to have to use arrow notation
whenever you’re accessing an optional object, because you don’t know if it’s null or not. So there’s a lot of
stuff that goes into reference counting in PyTorch. And if there’s one thing that I want you to take away from this
podcast, it’s that atomic ref counts are expensive. So avoid them whenever you can. That’s everything I wanted to say for
today. Talk to you next time.
EP47 torch.use_deterministic_algorithms
Hello everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about the
determinism mode in PyTorch, which allows you to run PyTorch computations in a fully
deterministic fashion, so that if you run PyTorch programs again, you’ll get the same result.
What's the point of determinism? Asks no one who ever had to debug a very hard-to-reproduce
problem because the problem was non-deterministic. Deep learning programs are already very hard to
debug because, well, there’s a lot of stuff that is going on that you don’t really directly have
access to as an engineer, and so if you forget to add a constant in some area, you might not find out
about this except that your network trains a little bit more slowly. And so if you’re doing a four-hour
long training run and only into hour three do your gradients start exploding and you get lots of NaNs
everywhere, well, it’s going to be very painful to debug this problem in most circumstances.
So determinism makes at least part of the problem easier, which is that if your program is bit-for-bit
deterministic, if it always produces exactly the same bits every time around, then, well, okay, maybe
you don’t find out about the problem until three hours in, but at the very least, every time you try,
you will get the same result. Because the only thing that’s worse than a problem that only happens
three hours into your training is a problem that only happens once every 10 times three hours into
your training. Ugh. So torch deterministic was a proposal that was made by Sam Gross a very long
time ago, and we didn’t really do very much with it because there’s a very annoying thing you have to do
to make this feature actually come into being, which is you actually have to audit all of the
operations in PyTorch and look for ones that are not deterministic and then do something appropriate
in the situation. So lots of kudos to Kurt Mohler, who actually picked this up and, you know,
saw it all the way through to the end. He's the one who made it all happen.
So what's the basic concept behind torch dot deterministic? Well, there's a few things to
talk about what you want out of a way of running your programs deterministically. First off, you
don’t want determinism to be on all the time. Why? Because being deterministic is actually quite
expensive. There are a lot of algorithms where if you allow for a little bit of non-determinism,
they can run much, much faster. And when you, you know, make things run deterministically,
well, you’re going to get quite a bit of a slowdown. And so, you know, there’s actually a
very delicate balancing game PyTorch does with regards to its defaults. Do we give people, you
know, all the knives and run really fast and make it easy for them to do things that are wrong?
Or do we actually, you know, try to prevent errors and, you know, try to make sure people,
you know, don’t do the wrong thing. And sometimes we trade off performance for, you know, making it hard
for people to do the wrong thing. But determinism is not one of those things. We do not
give you determinism by default. You have to ask for it explicitly. Another question is, you know,
one of the things about non-determinism in your network is you might not even know about it
when one of these things happens. So there's sort of two parts to Torch Deterministic.
So one is just letting you actually, you know, use deterministic algorithms whenever they are
available. But second is just identifying when non-deterministic code is being run. So you can
know, oh, yeah, this training run is using that function. That function is non-deterministic.
Maybe it doesn’t even have a deterministic implementation, but at least I know about it.
And I can, you know, ask my system to error if I use something that doesn't actually have
a deterministic implementation. So the framework level implementation of Torch.deterministic is pretty
simple. There is a context manager that, you know, you can use to turn on the, you know,
warn-or-error-on-non-determinism behavior. And then everywhere in our code base where we're about to do a
non-deterministic operation, there is a line that just says, hey, alert that, you know, this is
non-deterministic. And then depending on the setting, if, you know, you’re supposed to error or if you’re
supposed to warn, you'll get one or the other. And in some cases, if determinism is requested,
we can route to a different algorithm. And there’s just an if statement that does that.
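A minimal sketch of those user-facing knobs is below; which operators are covered, and whether they error or route to a deterministic fallback, depends on your PyTorch version (and on CUDA you may also need to set the CUBLAS_WORKSPACE_CONFIG environment variable).

import torch

# Error out whenever a non-deterministic operation is about to run...
torch.use_deterministic_algorithms(True)
# ...or only warn about it instead:
# torch.use_deterministic_algorithms(True, warn_only=True)

# The older, convolution-specific cuDNN flag mentioned above:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False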
Very, very simple. So really most of the juice is in deciding one, that this was a good idea,
and two, actually following through and editing all the algorithms. There’s actually not that many
different types of non-determinism in the library. So one of the most common ones is the backing
libraries that we use, especially in CUDA, are often non-deterministic in some situations.
So the, like, classic example of this is convolution. cuDNN, you know, has a lot of algorithms,
and some of its convolution algorithms are non-deterministic. And in fact, there’s even,
prior to the, you know, generic determinism flag, there was a cuDNN deterministic flag,
which was specifically about, you know, foregoing use of the faster algorithms so that, you know,
you would use one of the, like, deterministic convolution algorithms. In other cases, you know,
that are not library code, the most common reason for non-determinism is an atomic addition. So to
explain why atomic additions can cause non-determinism, it’s important to know a little bit about floating
point numbers. Floating point addition is not associative. Repeat after me, floating point addition
is not associative. If you actually think about, you know, what it would take to actually write a
floating point implementation, this kind of makes sense once you think about it. Because, like, let’s
imagine that, you know, you have a, you know, very, very small quantity, and you keep adding it until
eventually it becomes a larger quantity, like, you know, 0.1 plus 0.1 plus 0.1 plus 0.1 and so forth.
And eventually, you know, you can get something fairly big until, you know, your precision falls
off the cliff, and you don't have enough precision to, every time you're adding plus 0.1,
you know, actually fit it in. Because there's a limited number of bits, right? The whole
point of floating point is you can change the amount of precision you have. So, like, if you’re close to
zero, you get a lot more precision. If you’re a really big number, you have less precision. But if
you’re adding, you know, 0.1 to, you know, a trillion, well, you’re probably not actually, you
know, like, there’s just no space to represent it in the floating point representation. Well, if you,
you know, do a bunch of these additions, and you get to a reasonably large number, and then you add it to
a large number, you know, you can expect to see the contribution of all those incremental 0.1s.
But if you start with a big number, and then you keep adding 0.1 to it, and each time you don't
actually bump the floating point number, because, well, there's, you know, no change, because you
don't have the precision to represent it, well, you know, clearly, you're gonna get a different result
in these situations. By the way, there’s some kind of interesting ways to work around problems like
this, even with limited precision. One interesting way is to sort of randomize whether or not you round
up the result, depending on if, you know, it's just too small for the precision. So like, say,
you’re adding one with a million, and you don’t have enough precision to represent one, but you do have
enough precision to represent 10. Well, maybe you would just increment only a tenth of the time,
non-deterministically. Of course, this is really terrible for determinism. So let’s get back on track.
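To make the order dependence concrete, here is a small float32 sketch: the same thousand small values either do or do not show up in the total, depending purely on the order of the additions (the exact printed values assume standard IEEE float32).

import torch

big = torch.tensor(1e8, dtype=torch.float32)
small = torch.ones(1000, dtype=torch.float32)

# Accumulate the small values first, then add the big number.
a = small.sum() + big

# Start from the big number and add the small values one at a time;
# each individual 1.0 is too small to survive next to 1e8.
b = big.clone()
for v in small:
    b = b + v

print(a.item(), b.item())  # 100001000.0 vs 100000000.0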
So because floating point addition is not associative, if you have any operations that
are like, hey, run a bunch of stuff in parallel, and then atomically accumulate it into some buffer,
which, you know, is doing some sort of reduction, well, that’s going to be non-deterministic if you
do it the obvious way, which is with an atomic, you know, addition in the situation. So most cases of
use of atomic add in CUDA, those are non-deterministic. And so it's, like, actually
super, super simple, half the time, to figure out if something is non-deterministic. Does it use atomic
add? Oh, it uses atomic add? Well, it’s probably non-deterministic. And that’s really all there is
to the feature, right? It’s, you know, a context manager that sets some global variable that triggers
the behavior, and then a bunch of code everywhere that, you know, says whether or not something is
deterministic or not. One of the things that we would, you know, accept patches for (and in fact,
in the last half, some work that some folks did for some of our internal training workloads
was they added support for a lot of deterministic versions of operations that didn't previously
exist; kudos to them, great work) is, yes, in some cases, we will just hard error if you ask for
deterministic. And if you provide a deterministic version of the algorithm that we can use in place
of it, even if it's slower, that just, you know, makes the deterministic feature more generally
applicable. I also think of torch.deterministic as a really good sort of API model for other types of
things in PyTorch, which we might want to do. In particular, there is nothing intrinsically wrong
with non-deterministic operations. It just happens that sometimes you want to know if you’re about to
stumble into one of these things. So it would be nice to be able to just easily set some flag to then be
told about all these situations, maybe as warnings to just find out about all the sites or errors if
you like absolutely, absolutely cannot abide with a non-deterministic result. Well, there’s actually a lot
of other behaviors in a framework like PyTorch, which have a very similar property. For example,
Natalia Gimelshein is working on a version of this, but for CUDA synchronizations. What's a CUDA
synchronization? Well, that’s when you have some CUDA computation, and you’re like, hey, GPU, please
finish all your compute, and then send me the result back to CPU, so I can go look at it. That’s, you know,
something that will happen implicitly, sometimes in your PyTorch program, it can trash your CUDA
performance. And it would be really nice to know when this has happened. And it’s as simple as just
making sure everywhere these synchronizations can happen, we have a test. And so you can set a flag
and then, you know, have it raise a warning or raise an error. So once again, super simple, but it goes a
long way. So this is like, I don’t know, it’s not very glamorous, but I think it adds a lot of value to
our users. And so I want to encourage people to work on this kind of feature, because hey, it really does
pay off. Okay, that's everything I wanted to say for today. Talk to you next time.
EP48 gradcheck
Hello everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about GradCheck,
a mechanism for automatically testing the correctness of derivatives written for functions.
Where to start on this podcast? Well, I will tell you all about calculus and derivatives
and finite differences, but before I get there, I want to talk a little bit about testing.
When the word testing comes to mind, there’s a bunch of different possibilities for what you
might mean in a situation like this. Perhaps the very first set of tests an enterprising programmer
writes, or to put it more precisely, the first set of automated tests, because who hasn’t ever,
you know, written some code and then just run it directly in the REPL to see if it worked or not.
The first automated test one writes tends to be of the form, write a single test case with some input,
and then, you know, test that the output is what you expect it to be.
And for many types of programs, this works pretty well. And you know, if you are a proponent of say,
test driven development, the model is that you’re not supposed to write any code before you start
writing your test cases, and you go one by one by one until you get there. Of course, writing tests in
this way manually can be a bit irritating. And it’s especially bad if you’re working on a numeric
library like PyTorch, where typically the input is going to be some random number, and the output is
also going to be some random number. And it is very unenlightening if you write your tests as, you know,
if I feed in a bunch of these floating point numbers, then I get these other floating point numbers,
right? It’s just difficult to maintain. And it doesn’t even work that well. If for example,
your precision is slightly off, or you make a change that changes some of the epsilons of it all.
And now you have to go manually update all of your tests. Now, of course, in a previous podcast,
I talked about expect testing, where the idea behind expect testing is that you can automatically
update your test cases when things change. But in this particular podcast, we’re going to look at
a different form of testing, namely property based testing. What is property based testing? Well,
property based testing is based on the idea that instead of specifying individual input output pairs,
you say, hey, here is a property that I expect to hold for all inputs, or maybe conditional on some
properties on the input being true, I expect to hold for all inputs. And so I will simply randomly
generate inputs, and then see if this property is upheld by the function in question. When they teach you about property based testing in, you know, say undergrad, the canonical
example is reversing a list, right? So what are some properties you can test for reversing a list?
Well, if I have a list, and I reverse it, and I reverse it again, that should give me back the original
list, right? So reversing is its own inverse. Of course, this is very boring. And a lot of people
come out of this, and they think, you know, okay, well, property based testing isn’t that interesting.
But I want to tell you today, that grad check is an example of property based testing. And it is
really, really effective at testing programs in PyTorch. Okay, so how does grad check work? Well,
the problem grad check is trying to solve is that in PyTorch, we are a library that implements
automatic differentiation. And the problem with automatic differentiation is we basically have a
bunch of mathematical functions, and we never write what their derivatives are. And in math,
the derivative of a function is a well defined concept, there is one correct answer, modulo subgradients
and stuff like that. And of course, there’s always a possibility that when you write how to translate a
primitive into its derivative, you wrote the translation down wrong. And so this kind of error
is what the property based testing of grad check is trying to figure out. So for a moment, let’s
remember what the definition of a derivative is, there are a number of ways to formulate this. For example,
in your calculus class, you may remember some formula involving limits, and f of x plus dx minus f of x divided
by dx, something like that. But another way that I like to think about derivatives is, you know, suppose I
have some function, and, you know, it might be very wiggly, it might have very strange behavior. And at any
given point, if I zoom in enough, the function starts looking more and more like a straight line,
right? Like I keep zooming in, zooming in, until, you know, all the wiggles go away, right? I’m just
looking at a single segment. And now it looks like a line. And so the derivative, what the derivative tells
us is what the linear approximation of a function is at any given point on the function, right? So if I’m
asking derivative at, you know, some point, then like, that’s going to give me, hey, you know, like,
it’s flat, or it’s curved upwards, or it’s curved downwards, those are the various different derivatives
you can have. So it stands to reason that you can calculate a derivative simply by zooming in
sufficiently on the function, looking at two points, and then turning those points into a line. And that’s
exactly what the method of finite differences does. It says, you can numerically compute a derivative by
simply just taking two points on the function that are very close to each other, and dividing the difference in their values by the distance between them. And that’ll give you, you know, the slope of the line at that
point. And as I said, that’s a very good linear approximation, essentially, the smaller and smaller
you make the delta. Of course, PyTorch doesn’t compute your derivatives this way, it would be
heinously slow to do so and also not very accurate. Instead, you know, when we write derivative formulas,
what we’re doing is writing down what’s called the analytic derivative, which is, you know,
like in math, in calculus, right, you had a bunch of rules for, you know, given a bunch of functions,
how to convert them into derivatives, and the analytic derivatives are just simply directly
writing down those rules for the automatic differentiation system. So one of the themes in property based
testing is that if you have some way of implementing a function in two ways, or if you have some way of
representing a property in two different ways, if you have two ways of doing the same thing, then a very
easy to set up property in this situation is just to say, hey, when I do method A, and when I do method
B, they need to give the same result. And if they do, well, that’s good for me. This sort
of comparison against a reference implementation is very, very convenient to test because you don’t have
to know anything about the outputs, you just need two implementations, and you can compare if they work
together or not. And so in the situation of differentiation, what gradcheck does is gradcheck
says, hey, I have two ways of testing what a derivative is, I can do the numerical method where I just,
you know, take finite differences and see what it looks like just by looking at the points. Or I can
take the analytic solution, the one that I’m trying to test the system under test, and just directly
compute it based on the symbolic formula in that case. And all I need to do is compare these two
formulas on a bunch of random inputs. And if they always agree with each other up to some tolerance,
then I know that I’ve implemented my derivative correctly.
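For reference, the property-based test described here is exposed as torch.autograd.gradcheck; a minimal usage sketch (the choice of torch.sigmoid and the tolerances are just illustrative) might look like this:

import torch
from torch.autograd import gradcheck

# gradcheck compares the analytic Jacobian computed by autograd against a
# finite-difference estimate, on a random double-precision input.
x = torch.randn(4, 5, dtype=torch.double, requires_grad=True)
assert gradcheck(torch.sigmoid, (x,), eps=1e-6, atol=1e-4)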
Of course, there’s a complication. And it’s easiest to explain this complication in two steps. First, when I
describe this function to you, you were probably imagining a squiggly line in 2d space. And you know,
the derivative was just some straight line that was tangent to the line at this point. But first, if we
think about neural networks, neural networks aren’t, you know, there aren’t only two parameters in a neural
network, that would be a very impressive neural network called linear regression. Instead, there are
many, many dimensions for all the parameters. And they all, you know, do a lot of computation up until one
point. And so a more accurate way to think about a neural network is that you have some sort of surface,
some like high dimensional surface, but, you know, it’s easiest to visualize this in 3d space. And what
you’re trying to do is you’re trying to find the gradient, which represents the, you know, orientation
of a hyperplane, the plane which is tangent to the surface at some point. But that’s just complication
one. Complication two is that when you look at the individual functions that are used inside a neural
network, and these are the ones that we actually want to do grad checks on, remember, because they’re the
ones we’re writing derivatives for. It’s not just a, you know, surface, because that might be some sort of
function, which takes in some high dimensional space, and produces another high dimensional space.
So really, to model what this transformation looks like, in a linear way at a neighborhood,
you need a thing called the Jacobian matrix. It’s kind of hard to describe what a Jacobian matrix does.
But one of the explanations that I read on Math Overflow, that I quite like, is imagine you
have your vector space, and you’re looking at one point in the vector space, when you perform your
operation on this, you map this point in the vector space to another point in the destination vector
space. And furthermore, all the points in the neighborhood also get mapped at the same time when
you do this. And you’re looking for a single matrix that describes how these points distort,
move around, etc. And it’s a matrix because, you know, we’re talking about linear approximations of
functions. Okay, so where am I going with all this? Well, it turns out that even when you have an
n dimensional input and an n dimensional output, you can compute the Jacobian. And the way you compute the
Jacobian is, you can do it both analytically and numerically. If I’m doing it numerically, you simply
just, you know, take all your inputs, you change one of them to be perturbed slightly. And then you
keep doing this for every single input until you eventually get to the end, right. And so every time
you perturb a different input slightly, you’re getting another column of the Jacobian, I actually always
forget whether or not it’s rows or columns, but I did look this up for the podcast. Similarly, when we do
the symbolic derivative using our backwards formulas, all we need to do is for every output, because
remember, this is a possibly n dimensional output, try saying, hey, what’s the derivative, what’s the
gradient that affects this particular output, or the next one or the next one, and this one will help us
reconstruct the Jacobian row by row. And so now, well, suppose you didn’t get all of that, right? At the
end of the day, we’re setting up this property based test. And what are we doing? Well, we’re just,
you know, taking our two implementations, which know how to compute the Jacobian, and then just checking
if the Jacobians actually equal each other. Or at least, that’s what we used to do. So Alban Desmaison and Jeffrey Wan came up with a pretty interesting technique for making this faster, because what’s
basically happening is you’re repeatedly running your finite difference slash your backwards derivative on
every single input slash output until you’ve read out every row slash column of the Jacobian. And then
you’re just doing the check on the Jacobian. So this is like very precise, right? Like the Jacobian
is fully specified once you read out each entry, because it’s linear, right? And the magic of linear
functions is they can be fully characterized as a matrix of the appropriate dimensionality.
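To make the row/column picture concrete, here is a rough sketch (not PyTorch’s actual gradcheck code) of reading out the full Jacobian both ways for a function from one tensor to one tensor:

import torch

def numerical_jacobian(f, x, eps=1e-6):
    # One column per perturbed input element, via central finite differences.
    x = x.detach().double()
    cols = []
    for i in range(x.numel()):
        d = torch.zeros_like(x).reshape(-1)
        d[i] = eps
        d = d.reshape(x.shape)
        cols.append(((f(x + d) - f(x - d)) / (2 * eps)).reshape(-1))
    return torch.stack(cols, dim=1)   # shape: (num_outputs, num_inputs)

def analytic_jacobian(f, x):
    # One row per output element, via a backward call with a one-hot grad_output.
    x = x.detach().double().requires_grad_(True)
    y = f(x)
    rows = []
    for j in range(y.numel()):
        g = torch.zeros_like(y).reshape(-1)
        g[j] = 1.0
        (row,) = torch.autograd.grad(y, x, grad_outputs=g.reshape(y.shape), retain_graph=True)
        rows.append(row.reshape(-1))
    return torch.stack(rows, dim=0)   # shape: (num_outputs, num_inputs)

x = torch.randn(3, dtype=torch.double)
print(torch.allclose(numerical_jacobian(torch.tanh, x), analytic_jacobian(torch.tanh, x), atol=1e-5))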
But remember, this is property based testing, right? We’re not even getting full verification
that anything is right, we’re testing that the gradients line up on various points in the function
space. And in fact, it’s sort of the least of our worries whether or not any particular entry of the linear approximation is correct, like we don’t really need to check it in all that detail.
Realistically, we will figure it out in the end. So this leads to the idea behind fast grad check,
the idea behind fast grad check is, hey, we have this matrix, this implicit Jacobian matrix. And
previously, we were, you know, painstakingly reconstructing each of the rows slash columns,
because that’s what our things gave us. But in fact, we don’t need to reconstruct everything.
Instead, all we need to do is compute some sort of linear combination of this with some randomly
sampled vector. And well, as long as the resulting vectors are similar, then we know that the matrices are very
likely to be similar. The reason why this works involves a bit of math. And I encourage you to look
at the quoted resources, which talk about this in more detail. But essentially, what’s going on is
we’re computing either a JVP or a VJP, depending on whether or not we’re doing a backwards formula,
that’s the VJP, or we’re doing finite differences, that’s the JVP. And what you’ve got in this case is
you’ve got the Jacobian multiplied with a vector on one side or the other, depending on which case you are in. So you just multiply by another vector on the remaining side, and if you do this consistently in both cases, you end up with a scalar of the form u transpose J v, which is a very small quantity and very easy to compare. Another analogy for this situation, which might be useful if you remember it from your
probabilistic programming classes, is Freivalds’ algorithm. So that’s where you’re given two matrices A and B, you’ve got some claimed product C, and you want to see if this result is actually the correct one. The naive way to do this is to actually just go ahead and do the matrix multiply between A and B and then compare the elements point-wise against C. But what you can do instead is multiply by some random vector: by associativity, you can multiply B by the vector first, then multiply A by that result, which gives you a plain vector, and compare it against C multiplied by the same vector directly. And then you don’t have to actually do a full matrix multiply; you’re only doing much cheaper matrix-vector products. So that’s the main idea behind gradcheck.
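A rough sketch of that Freivalds-style check (the sizes and tolerances here are just illustrative, not anything from the episode):

import torch

A = torch.randn(512, 512, dtype=torch.double)
B = torch.randn(512, 512, dtype=torch.double)
C = A @ B                      # the "claimed" product we want to verify

r = torch.randn(512, 1, dtype=torch.double)
# A @ (B @ r) costs two matrix-vector products instead of one matrix-matrix product.
lhs = A @ (B @ r)
rhs = C @ r
print(torch.allclose(lhs, rhs))   # True, up to floating point error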
If you are very interested in automatic differentiation, I highly recommend learning what Jacobians are, and what JVPs and VJPs are. Unfortunately, a podcast is not a very good vehicle for mathematical understanding. So if you didn’t really understand all
that, that’s okay, you’re gonna have to spend some time with the textbook. It’s just the name of the
game. But a higher meta principle here is that property based testing is pretty cool. Yes, it can be
hard to do correctly sometimes, right? Like you need to make sure you, you know, run your random samples
deterministically, so that you always get the same result in CI. And you also need to make sure you design
your properties and your random numbers really well. Because if you don’t, you’re just going to get
nonsense. But gradcheck is an example of a really elegant way of using math, where we’re basically taking advantage of the fact that, you know, there’s this adjoint relationship between the two different ways to do derivatives, and then using that to basically
test all of our functions. So we basically don’t do tests for gradients by hand, we just rely on grad check
to tell us if we got it right or not. Okay, that’s everything I wanted to say for today. Talk to you next time.
EP49 Asynchronous-versus-synchronous-execution
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk a little bit more about blocking versus non-blocking APIs in PyTorch
and its implications on various design questions in PyTorch.
In the CUDA podcast that I gave a long time ago, I mentioned how CUDA is asynchronous.
That is to say, when you do operations on CUDA tensors, they don’t execute immediately.
The program will actually return from the function you call before it’s actually done doing the operation in question.
Instead, in the background, your GPU will be chugging away doing the actual computation in question,
and your actual Python program is allowed to run ahead and figure out what the next thing that it needs to do
in order to execute the next operation is.
This is in contrast to CPU execution, which is synchronous.
So when you ask for a CPU operation on something,
well, you’re going to wait until the CPU operation is entirely done before moving on to the next thing in question.
Asynchronous execution for CUDA is pretty nice because it means that we aren’t bottlenecked
by the Python interpreter overhead so long as we queue enough work for the GPU to do.
You just have to wait until you’ve started up actually doing computation,
and then any further overhead from the Python side program can be covered up as long as, you know,
you’ve got enough work to do because you’ll probably hit and queue the next piece of work before the GPU is actually ready to do it.
On the flip side of the coin, it’s nice for CPU to be synchronous because, well, it means that, you know,
once you actually have a CPU tensor, it’s actually got the honest-to-goodness data.
So if you want to, say, FFI it out to some other system by passing on the raw data pointer,
there’s nothing special you have to do.
It just works.
And, of course, it’s a lot easier to implement a synchronous API than an asynchronous one
because then you have to decide all sorts of questions about, you know,
how exactly you’re going to notify the threads that something is ready,
how exactly you’re going to, like, queue things to execute,
and, all in all, it just removes a bunch of implementation complexities that you have to deal with.
By the way, CUDA, by default, polls, but, you know, you can actually change out
how exactly it does the synchronization between the thread that’s actually executing things
and your native thread by toggling a special configuration in the CUDA API.
So both of these paradigms make sense.
When you operate exclusively in CPU or exclusively in CUDA, you know, there isn’t too much to worry about.
But there are a bunch of places in our API where we interact between CPU and CUDA.
And this is the point at which it actually is a little non-trivial
to deal with the impedance matching between these two paradigms.
So to look at one particular example, let’s look at the non-blocking argument on the to method on tensor.
So what does this do?
Well, it says when normally we have a conversion from a CPU tensor to a CUDA tensor or vice versa,
we will wait until the conversion is completely done before returning from this function.
And non-blocking says, actually, don’t bother waiting.
Just go ahead and return immediately while the, you know, CUDA driver is doing this asynchronous update.
And let me go ahead and do other things in the Python program.
So it doesn’t take too long to realize why we don’t default to non-blocking execution.
Let’s think about the CPU to CUDA case.
So the CPU to CUDA case is not such a big deal, right?
So you have some memory and you want to transfer it to CUDA.
And, you know, like your CUDA kernels are already going to be asynchronously executed
after this particular hosted device copy happens.
So what’s the big deal?
Well, there are two problems.
One is that when CUDA does memory transfer, it needs to actually have the memory in some
location so that the GPU hardware can actually direct memory access it out of the RAM in your
actual CPU.
And so to do that, you need some special memory called page locked memory.
And the way you get that is using a pin memory allocator in PyTorch.
That’s from the CUDA API.
So you can’t do non-blocking CPU to CUDA or vice versa operations by default.
You actually need your CPU tensors to live in pinned memory.
And pinned memory isn’t free because, like, when you pin the memory, you’re saying to the
operating system, you’re not allowed to, like, move it to swap.
You’re not allowed to move it around.
And so it reduces the amount of flexibility your operating system has to deal with your
CPU memory.
So by default, PyTorch doesn’t allocate pinned memory.
By the way, Caffe2 did allocate pinned memory by default.
But PyTorch doesn’t do that.
And so you need to make sure you, like, ahead of time, actually pin things if you want to
use non-blocking.
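A minimal sketch of what that looks like from the Python side (assuming a CUDA device is available):

import torch

# Allocate the CPU source in pinned (page-locked) memory so the copy can be
# a true asynchronous DMA transfer.
cpu_t = torch.randn(1024, 1024, pin_memory=True)

gpu_t = cpu_t.to("cuda", non_blocking=True)   # returns immediately
# ... Python is free to queue more work here while the copy is in flight ...

# Don't overwrite cpu_t until the transfer is actually done.
torch.cuda.synchronize()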
But that’s not even the end of your troubles.
So if you do a CPU to CUDA operation on pinned memory, you will have some, you know, thread
in the CUDA runtime going ahead and copying the data from CPU to CUDA.
What happens if someone goes ahead and mutates that CPU tensor while this transfer is taking
place?
Well, you’ll get nonsense in this situation, because it’s not like we went ahead and made
a copy of the CPU buffer before we did the transfer, right?
The whole point of non-blocking is to make things run faster.
And, you know, the way it actually makes things faster in this particular case with pinned memory is we get to avoid actually having to do a staging copy into pinned memory before we do the
operation in question.
So we’re reading directly out of the source tensor, zero copies.
And that means you actually need to make sure that the tensor sticks around, doesn’t get
deallocated, doesn’t get overridden until you’re done doing this memory transfer.
And of course, ordinarily, it would be safe to override the CPU tensor immediately after
the .to() operation returns, except remember, you said it was non-blocking, right?
So it’s going to return immediately, regardless of whether or not the copy is finished or not.
The reverse situation is even worse.
So when you have CUDA going to CPU, ordinarily, you know, once again, this will block until
everything has been copied into CPU.
If you specify that to be non-blocking, then we will immediately return, we will have given
you a CPU tensor, but the CPU tensor will be filled with garbage until some indeterminate
time in the future when the device to host copy finishes.
And in fact, the only way to properly wait for this transfer to finish is to either do
a CUDA synchronize, which is just a blocking operation waiting for everything in, you know,
CUDA to make its way back to CPU.
Or if you want to be a little more fine grained about it because you’re running multiple streams
or multiple other types of concurrency, you can set up an event on CUDA, which will trigger
after this copy is done.
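And a sketch of the trickier device-to-host direction, using an event to know when the result is actually safe to read (again assuming a CUDA device):

import torch

gpu_t = torch.randn(1024, 1024, device="cuda")
# A pinned CPU destination buffer for the asynchronous copy.
cpu_t = torch.empty(1024, 1024, pin_memory=True)

cpu_t.copy_(gpu_t, non_blocking=True)   # returns immediately; cpu_t is not yet valid

done = torch.cuda.Event()
done.record()        # recorded on the current stream, after the copy was queued
done.synchronize()   # block until everything up to the event has finished
print(cpu_t.sum())   # now safe to read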
So there are a lot of caveats here, and it is not easy to use these APIs correctly.
But, you know, one of the philosophies of PyTorch is, right, like give a simple API, not an easy
to use one necessarily.
And so we give people all the tools they need.
We have reasonably simple semantics.
And in this case, you know, you’re kind of on your own to make sure you do everything
correctly.
And there is performance to be gained here.
So people will use non-blocking to get that performance in the situation.
There’s been a longstanding idea running around that no one has implemented yet to sort of make
the situation a little better.
And it’s called async CPU, right?
So I talked about how CPU is synchronous.
And one of the reasons why it’s synchronous is it’s just easier and more efficient to implement
because you don’t need any blocking mechanisms.
But there’s nothing stopping us from having a CUDA-like asynchronous execution model, except
all the execution is happening on CPU.
So we dubbed this async CPU.
The idea behind async CPU is it would be a different device, distinct from the CPU device.
You would share all the kernels that regular PyTorch uses.
But when you do an operation, instead of immediately going ahead and running the CPU computation to
the end, we would put this in some sort of queue for some worker thread to actually execute the
actual operation on.
And once again, the idea is, you know, if you have multiple threads and you have a lot
of work to do, you may be able to successfully have the control thread run ahead and, you know,
make up for the fixed overhead of doing all the synchronization correctly in this multi-threaded context and, you know, once again, cover up the latency from executing Python programs.
An added benefit, which is, you know, sort of drawing from the discussion we just had,
is if we had an async CPU tensor, we could give a user-friendly API for CUDA-to-CPU
non-blocking copies, right?
So what you would do is you would say CUDA-to-CPU doesn’t return a CPU tensor.
It returns an async CPU tensor.
And you can now just directly run operations on it and rest assured that those operations
would only ever actually execute once the device-to-host copy had actually finished.
So the async CPU idea has been around for a long time.
And for the longest time, we never implemented it.
And there was a good reason why we didn’t implement it, right?
Which is that adding a new device to PyTorch is a lot of work, right?
We’ve got so many operators.
And, you know, if you had a new device like async CPU, well, yes, you can reuse all of the
kernels that you, you know, had for the CPU thing, but, you know, async somehow.
But you still have to actually handle computing the metadata for the tensor you are going to
return from the async CPU operation.
To explain this in more detail, it’s useful to think about what are blocking versus non-blocking
operations on CUDA tensors.
So we’ve already established that doing something like a device-to-host transfer, aka what would
happen if you called, say, item on a CUDA tensor, is blocking, right?
We have to wait until we get the actual data in a CPU before we can do anything with it.
But there are also a lot of methods on tensors which are not blocking.
For example, I can take a CUDA tensor and I can ask for what its size is.
And this doesn’t actually cause us to synchronize with the GPU waiting for all the operations to
finish.
Why?
Well, it’s because the size information is maintained on CPU, right?
It’s not something that’s stored in CUDA.
It’s stored on CPU.
Many things are like this.
In fact, you know, if I ask you a question like, hey, here are two CUDA tensors.
Do they overlap in memory?
Well, I don’t need to actually do a synchronize with CUDA because I have my CUDA data pointers
and I can just look at those and the sizes and the strides and figure out if there’s a overlap
or not.
So the problem with async CPU, right, is that whenever you want to do an async backend, you
need to actually say what the output like size and strides and everything else is without
actually running the kernel in question.
And that would have been a lot of work.
You would have to do it for every operator.
And so no one really wanted to do the work.
And so async CPU never became a thing.
Fortunately, there is a project called metatensors, which allows you to run the operations without
doing the computation in question and figure out what the output tensors, size, dtype, everything
like that looks like.
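A small sketch of what meta tensors give you today: operations on them compute only the output metadata, with no actual data involved.

import torch

a = torch.empty(128, 256, device="meta")
b = torch.empty(256, 64, device="meta")
c = a @ b          # no computation happens; only sizes/dtype are propagated
print(c.shape, c.dtype, c.device)   # torch.Size([128, 64]) torch.float32 meta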
So basically assuming that you have something that is like metatensors, you actually basically
have most of the pieces you need for doing asynchronous CPU execution.
And you just need to like stick a code gen on the problem to generate fast unboxed kernels
that like put the arguments on the queue and ship them off wherever else to actually execute.
So async CPU is a project that has probably finally gotten its time, thanks to metatensors.
And, you know, it just needs someone to actually go ahead and work on it.
Stuff gets really weird when you’re in the asynchronous world, though.
So I want to give one more example of non-blocking making things very complicated.
And that’s in the CUDA caching allocator.
So the CUDA caching allocator is a way of, you know, allocating CUDA memory without actually
hitting CUDA malloc, which in old versions of CUDA was very slow.
So we maintain this big pile of CUDA memory.
And, you know, when you ask for an allocation, we look in it, find a free spot that, you know,
has enough space, and we give that to you.
And similarly, if you give us back some memory, you free some memory, we just return it to the
pool so someone else can use it.
So the hazard in the CUDA caching allocator is what happens if someone returns some memory
to the CUDA caching allocator, which, by the way, happens entirely CPU-side.
There’s no synchronization involved.
And then the CUDA caching allocator goes ahead and hands out the memory to someone else.
But at the same time, you are still executing the asynchronous CUDA kernels that were expecting
the CUDA memory to be live.
So you’re in one of these very awkward situations where I can have some CUDA memory in the CUDA
caching allocator.
According to the state on CPU, it is free.
But actually, we are still operating on that memory in a bunch of backlogged async CUDA kernels
that are executing.
Oof.
Now, ordinarily, this doesn’t cause any problems.
Because remember, CUDA is organized into these streams.
So if you are only operating on a single stream, well, if you say, OK, now I’m going to reallocate
this memory for someone else and trash it, that trashing operation happens in the stream
and will happen after all the original CUDA kernels that were waiting to work on the original
data before you get there.
So, you know, the race is averted.
But that’s only true if all those operations are on the same stream.
And as I said, we support multiple streams in PyTorch.
And so you can actually end up with the data showing up on a different stream, you know, and
then there’s no guarantee of synchronization.
So to handle that, we, you know, force people to also record stream information when they run
their kernels.
And this is how we insert the necessary events to make sure that we actually wait for all of that outstanding work to be done before the caching allocator hands this memory out again.
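In user code, the recording looks roughly like this (a sketch assuming a CUDA device); Tensor.record_stream is the hook that tells the caching allocator a tensor is still in use on another stream:

import torch

side = torch.cuda.Stream()
x = torch.randn(1 << 20, device="cuda")   # allocated on the current (default) stream

with torch.cuda.stream(side):
    y = x * 2               # x is consumed on a different stream
    x.record_stream(side)   # don't let the allocator reuse x's memory until `side` is done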
Okay, so that’s been a whirlwind tour of async and synchronous execution and how to put them
together.
That’s everything I wanted to say for today.
Talk to you next time.
EP50 Multithreading
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about
multi-threading in PyTorch. Threads are a mechanism for running multiple computations
in parallel and it’s no accident that in PyTorch we make extensive use of threads to make computations
run faster because, well, as you may know, the thing we’re doing most of the time is running
lots and lots of very similar CPU computations and so it actually is typically embarrassingly
parallel and we can often take advantage of multiple threads to make things run faster.
That being said, threading is a surprisingly tricky and surprisingly subtle problem and in this podcast
I just want to talk a little bit about some of the things to be aware about when working with
multi-threaded code in PyTorch. To start off, I want to talk a little bit about how you as a user slash
developer typically interact with multiple threads in PyTorch. There are, of course, APIs in PyTorch
which implicitly use multiple threads in the course of their execution without any work from you at all.
For example, when you run data parallel to run multiple computations in parallel over multiple
GPUs on your device and then run Autograd backwards on it, in fact, Autograd will automatically
parallelize the backwards passes of each of your GPUs to run on separate threads because without doing
that we would actually be unable to saturate your GPU devices.
Of course, all the operators in PyTorch that you call may or may not use multiple threads, and there is a mechanism in PyTorch called set_num_threads that lets you tell PyTorch how many threads to use when executing various operations, whether to use lots of threads or only one thread, because maybe you’re using the threads for something else and you don’t want PyTorch using up all the cores on your system.
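Concretely, the knobs on the Python side look like this (a minimal sketch; note that the inter-op setting generally has to be changed before any parallel work has started):

import torch

torch.set_num_threads(4)           # intra-op: threads a single operator may use internally
torch.set_num_interop_threads(4)   # inter-op: how many operators may run concurrently

print(torch.get_num_threads(), torch.get_num_interop_threads())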
As an operator implementer, the typical way you can parallelize code is using a handy-dandy function called parallel_for, and we’ll talk a little bit more about how that exactly is implemented in a bit.
And there’s a few other bits and bobs for places where you interact with multiple threads.
For example, there’s a little thread pool that has got a work queue attached to it in C10 that you can queue various things to run at some later point in time.
Our RPC system makes use of this extensively.
And there’s also fork join parallelism support in TorchScript where, you know, although Python, you know, doesn’t normally support multi-threaded execution, more on that in a moment as well, you can do fork joins and when run in the TorchScript interpreter, they’ll get run in parallel in that situation.
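The fork/join API mentioned here looks roughly like this in TorchScript (a small sketch; the functions are just illustrative):

import torch

@torch.jit.script
def heavy(x: torch.Tensor) -> torch.Tensor:
    return (x @ x).relu()

@torch.jit.script
def run_two(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    fut = torch.jit.fork(heavy, x)   # schedule asynchronously on the inter-op pool
    b = heavy(y)                     # do other work on the current thread
    a = torch.jit.wait(fut)          # block until the forked task is done
    return a + b

print(run_two(torch.randn(256, 256), torch.randn(256, 256)).shape)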
So, in other words, there’s multi-threading all over the place in PyTorch, and oftentimes you don’t really have to think very hard about it because there’s usually some pattern or some pre-existing way of handling it that makes things work out.
Except when you do, and then we get all these bug reports about how PyTorch is running slower, or PyTorch is using too many cores, or, you know, PyTorch isn’t respecting the number of threads people are asking about to give it.
And don’t forget about, you know, just straight-up crashes and other mishandling from handling threads.
There’s a lot to chew on on the subject of multi-threading.
So, we’re going to just sort of walk through some of the things to be aware about in PyTorch.
No discussion of multi-threading in PyTorch would be complete without a brief reminder that Python is not a multi-threading-friendly programming language.
Of course, there is a multi-threading module in Python, and you can, in fact, run your Python computations in multiple threads.
You just won’t get any parallel speed-up from it because Python has this thing called the global interpreter lock,
which means that at any given point in time, there may only be a single thread running instructions in the Python interpreter.
So, say goodbye to your ideas of, you know, popping open multiple threads and then, you know, running your Python code in each of them to make things run faster.
We only are able to get a parallel speed-up when we are not holding the global interpreter lock, which, fortunately, is most of the code in PyTorch written in C++.
This is a very important thing to keep in mind because it also means that, in some cases, when we do need people to be able to write Python code that runs in parallel, we have to do very strange things to it.
Of course, I’m not really going to talk about multiprocessing.
I’m not going to talk about data loader.
We’re just going to focus on multiple threads.
But it’s good to have this idea about the GIL in the back of your mind.
So, one of the ways to taxonomize the uses of parallelism in a library like PyTorch is to distinguish between what we call inter-op parallelism,
that is, running multiple ops in parallel, versus intra-op parallelism, where we have parallelism inside of an operator.
Inter-op parallelism is kind of your good old-fashioned parallelism that you would imagine in a, say, web server or, you know, RPC service,
where, you know, you’re getting a bunch of requests from the external world.
These requests are all coming in concurrently, and you just need to have enough threads running to service all of these requests.
And you don’t really want a single thread servicing every request because, well, you know, that’s not using up all the capabilities in your system because your system has multiple physical cores.
So, you want to parallelize over the logical workload.
So, inter-op parallelism refers to parallelism that sort of is external to PyTorch.
It is sort of the parallelism that is over what models you’re running or how you’re running those models.
There is some level of inter-op parallelism in PyTorch.
When I talked about, say, for example, fork join parallelism in TorchScript, that counts as inter-op parallelism because, you know, TorchScript can run multiple TorchScript interpreters in parallel, each of them firing off various operators.
Intra-op parallelism, on the other hand, is the kind of parallelism that I talked about at the beginning of this podcast where, you know, when we’re doing tensor operations, we have a lot of data we want to work over.
And so, you know, when that data is sufficiently large enough, you want to split it up into various pieces of work and then just have multiple threads working on it.
And that’s what APIs like parallel for are.
They’re just a way for kernel writers to say, hey, you know, I’m writing this code and, you know, I think it’s pretty chunky.
So, I think it would be useful if this main loop got parallelized and, you know, maybe it’s like a point-wise operation.
So, it’s embarrassingly parallel and I can just have each of the threads working on their own little chunk of memory.
No problem.
So, we’ve got all of these APIs for working with threads.
And so, how do we actually, you know, run this computation on threads?
And to think about this question, we have to say, we have to ask a question that is basically, what are the thread pools in PyTorch?
So, just to briefly talk about what a thread pool is, slash why they exist.
A thread pool is just this concept of a number of threads that sort of are allocated once by the system and then hang around to, you know, deal with the work that you want to do.
So, it’s called a thread pool because you’ve got this pool of threads available to do work for you.
Why do thread pools exist?
Well, they mostly exist because we don’t really trust the operating system to do a good job in efficiently allocating and deallocating the threads.
Because, like, a very simple way, and in fact, you might do this in languages with better native support for threads, like in the language itself, is you might imagine just spinning up a new thread whenever you want to do a piece of parallel work and then just finishing it when you’re done.
Unfortunately, you know, operating system threads are specified to have a minimum amount of stack, and of course, they have a bunch of operating system context, and so it’s actually pretty expensive to, like, spin up and spin down threads all the time.
So, instead, we just have a pool of threads.
We spin them up once, and then we just reuse them as much as we need for the rest of the things we want to do.
Some other conventional wisdom that comes from working with thread pools includes the idea that you want one thread per physical core in your system.
Now, this conventional wisdom is a little bit of a mixed bag.
So, first, let me tell you where this idea comes from.
So, this idea comes from various applications where latency is a problem, and you don’t really trust the operating system thread scheduler to do a good job of making sure that your threads get scheduled in a prompt manner.
There are a number of reasons why this mistrust is reasonable, but one of them is because the operating system doesn’t really know any specifics about the workload that your application is doing.
And so, it does do preemptive, you know, threading when you have more threads than physical cores, and it’s actually reasonably efficient in throughput-heavy applications, but, you know, there is a quantum for when the operating system scheduler is willing to, you know, switch a thread to some other thread.
And, you know, if you have an application where your latency requirements are smaller than that quantum, well, it sucks to be you.
You better go ahead and implement your own thing.
Similarly, operating system threads have some fixed cost for context switching.
That’s why if you have too many threads in your system, that also causes the operating system to thrash because it’s spending all of its time context switching.
And if you know something special about the workloads you’re doing, well, maybe you can do a little better than having to context switch in this situation.
So, having a thread pool is just common sense when it comes to doing a multi-threaded application.
The cost of creating threads and destroying them is the first thing that will show up in a profile if you write a system in the naive way.
And that leads to a problem.
What’s the problem?
Everyone and their dog has their own thread pool.
So, let’s talk a little bit about all of the thread pools in PyTorch.
So, there are a few ones that are sort of very classic.
So, the classic thread pool that we use for a lot of things is the OpenMP thread pool.
OpenMP is a compiler extension for conveniently writing parallel applications.
You may have used it before with the #pragma omp compiler pragma.
Although, in PyTorch, you shouldn’t do that.
You should use a parallel for instead.
It solves a number of problems, like, for example, actually using the correct number of OpenMP threads when you’re in a subthread in this situation.
But, OpenMP is very common and we use it to do basic parallelism on all of our threads.
And it’s very easy to get started.
If you just look it up online, you can see that, you know, how to use this thing.
And that’s one thread pool.
I mentioned earlier that Autograd has its own thread pool, which we use to make sure we can saturate GPUs when we’re executing them.
It wouldn’t really make sense to run these in the OpenMP thread pool.
There’s really no way to, like, drive the OpenMP thread pool with the types of workloads that the Autograd threads have.
And also, we also have some really crazy stuff implemented in the Autograd thread pool for dealing with re-entrant Autograd.
That’s Autograd where we call into some custom Python function and then that function itself calls into the Autograd engine again.
We have this problem where we need to preserve the C stack, but the C stack has limited space.
And so if you keep calling into Autograd again and again, you’ll run out of stack space in this situation.
And finally, there’s also a C10 thread pool.
And this is what we use to do interop parallelism.
It’s, you know, sort of our own implementation of a thread pool.
You can put work onto it and the work gets processed by a thread when it’s ready.
The JIT uses it and also distributed uses it.
Although distributed also happens to fire up a bunch of its own threads for various tasks that it needs to do.
And, of course, we use a number of libraries to do various acceleration for many of our operators like mkldnn and nnpack.
And all of these libraries also need a thread pool of some sort because, well, you know, being able to parallelize your operators is really, really helpful.
For some libraries like mkl, they just used OpenMP.
And so we actually just get to share that thread pool with our own uses of OpenMP.
But there’s also some applications that have their own thread pools and some applications that, you know, to their credit, allow you to explicitly specify what thread pool you want to use.
The fact that libraries come with their own thread pools that they want to use makes it difficult to change what the thread pool implementation is.
So OpenMP is not the only game in town when it comes to, you know, sort of lightweight multi-threading inside of operators.
There’s also another library by Intel called TBB, Thread Building Blocks, which is an alternate implementation of thread pools that has some nice properties.
And TBB is cool, and Christian Puhrsch actually spent some time looking into whether or not we could use it in PyTorch.
And in the end, we couldn’t because, well, mkl is compiled against OpenMP.
And so, well, we are stuck using OpenMP because, well, you know, that’s just what we’ve got to do.
So I hope this proliferation of thread pools explains to some small degree why when you ask PyTorch to set the number of threads to blah, it’s actually not so simple a thing to implement.
Because it’s not just a matter of, like, going to the one place where the one true thread pool is set and changing the number of threads there.
No, we have to go to every thread pool and modify them.
And if we forget one or someone, you know, slips in a new thread pool when we aren’t looking, then this thing won’t be respected.
And so we’ve had a lot of bugs over the years, you know, sort of fixing cases where the knobs for changing the number of threads doesn’t work.
But I think it’s working right now in master, which is nice.
Okay, so we’ve talked about how to use threads and when you queue parallel work, how it actually gets executed via thread pools and how many thread pools there are.
So what else is there to worry about about multithreading?
Well, there’s also just a ton of other random stuff.
Let me just go through some of it before we finish up this podcast.
So one is that PyTorch will occasionally fork itself.
And the reasoning for this is because, as I said, Python, you know, doesn’t support multiple threads.
And so people often use multiprocessing to deal with this.
And on Linux systems, people often use fork multiprocessing to deal with this situation.
What do I mean by forking?
Well, when you have a process and you fork it, the process turns into two processes: one that continues, you know, at the same point it was originally, and a child that has got exactly the same program state as before, but executes the other branch of the conditional. Well, almost exactly the same state.
It just doesn’t have any of the threads that the original process had.
And this is a big problem because what if those threads were doing something important?
So fork based multiprocessing is fundamentally broken in the presence of threads, but that doesn’t stop people from accidentally trying to use it when they use multiprocessing.
So that’s why we always tell people to, you know, try the other multiprocessing option, spawn, which actually creates a new process from scratch, rather than trying to fork the original process.
But, you know, people do it, and, you know, the CUDA runtime, in fact, internally makes use of threads.
So if you fork while the CUDA runtime is initialized, it’ll just be completely broken.
And we also have some logic explicitly checking for when this happens, so that we can give a better error message than just hanging on users when this happens.
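For reference, the recommended pattern looks something like this (a sketch; the worker body is just illustrative):

import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # CUDA is initialized inside the child process, never in a forked parent.
    x = torch.randn(4, device="cuda")
    print(rank, x.sum().item())

if __name__ == "__main__":
    # mp.spawn starts fresh interpreter processes ("spawn" start method)
    # instead of forking the possibly multi-threaded, CUDA-initialized parent.
    mp.spawn(worker, nprocs=2)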
Some more fun stuff.
So we really like thread local state in PyTorch.
Thread local state is basically a really convenient way of adding an argument you pass to every function without having to actually modify every function to add that argument.
So, like, whenever we have things like automatic mixed precision, or other, like, modal type things, those are implemented using thread local state.
Because if you did it with a global variable, well, then, you know, these things wouldn’t be thread safe.
Because you couldn’t have multiple threads with different settings of AMP being turned on versus AMP being turned off.
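A tiny sketch of the pattern in plain Python (the AMP flag here is purely illustrative, not PyTorch’s real implementation):

import threading

_state = threading.local()

def set_amp_enabled(enabled: bool) -> None:
    # Each thread sees only its own value, so two threads can run with
    # different settings at the same time.
    _state.amp_enabled = enabled

def amp_enabled() -> bool:
    return getattr(_state, "amp_enabled", False)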
The problem with thread local state is thread local state is specific to a single thread.
So what if you, say, fork off into another thread, or you have some work and you put it off onto another thread because you want to run it under parallel_for?
Well, you’re not going to preserve the thread local state in that situation.
And sometimes that’s the wrong thing to do, because morally, you actually wanted to preserve the thread local state in the situation.
We’ve had a number of bugs over the years, where we, like, forgot to preserve one piece of thread local state or another.
At this point, most state gets preserved by parallel_for, but there are some places where we don’t want to do it for performance reasons.
There’s an issue tracking this.
It’s kind of annoying.
Something to be aware of when you’re relying on thread local state inside code that runs inside parallel blocks.
One last thing, multi-threading is sort of the bane of every computer science student, because it’s really, really hard to write multi-threaded code correctly.
Scratch, computer science student.
Bane of any engineer, honestly.
And in PyTorch, we don’t really, like, do very much with multi-threading.
So if, say, for example, you looked at the tensor object, we don’t give any multi-threading guarantees on it, besides that reading from a tensor is okay for multiple threads.
Reading and writing from multiple threads, no good.
Writing, definitely no good.
With a little caveat that if you’re writing into the actual data in the tensor, well, I suppose we can let that slide, even if you’re racing a bit.
Because it’s just, you know, numbers, you know, who cares if it gets corrupted?
It’s just, you know, stochastic gradient descent in that situation.
So, multi-threading.
It’s kind of complicated.
There’s a lot of thread pools.
There are a lot of ways to blow your foot off.
We get a lot of bugs related to multi-threading.
But if you’re writing any serious, you know, high-performance computing library, it’s something you have to know about.
So, hopefully this podcast has given you a little taste of, you know, what some of the PyTorch world problems and multi-threading are.
That’s everything I wanted to say for today.
Talk to you next time.
EP51 Multiple-dispatch-in-torch_function
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about multiple
dispatch in Torch Function and how you can use it to make sure your Torch Function implementations
play nicely with others. So if you don’t know what Torch Function is, I highly recommend
go and listen to the Torch Function podcast that I did a few weeks ago. The short version
is Torch Function is a way to overload the meaning of Torch Functions when you make your
own custom subclasses in Python. And so when you’re writing a Torch Function, there is an
interesting problem, which is what if you subclass tensor one way and you say, I want the behavior
to be this, and someone else subclasses a tensor another way and says, I want the behavior to be
some other thing, and I pass both of these tensors to the same operation, like say I add a logging
tensor with a unit of measure tensor. What is supposed to happen in this situation? If we look
at the behavior of Python in situations like this, on normal method overloading, we realize Python is a
single dispatch language. And so traditionally, there is a distinguished argument, the self-argument,
for which you actually do the implementation on. So let’s say that I have, I’m adding two objects
together, a plus b, well, what will happen is I will call the add magic method on a because Python
orients towards, you know, preferring the first argument in this situation. And a is responsible
for checking if it actually understands how to deal with the second object in question. If b is, say,
a subclass of a, chances are, a is going to just go ahead and treat b as if it were an a without using
any of the extra behavior from b. Of course, this can be horribly inflexible sometimes. And so Python
added another way to handle situations such as what if you said one plus some object instead of some
object plus one? Well, clearly, you can’t override the underscore underscore add on the one literal. So
what Python also has is the right-side versions of these magic methods, such as __radd__, which say: if the method isn’t implemented on the first object in question, try again with the second object, looking for the other implementation, __radd__ instead of __add__. And so what will happen is when you say one plus some object, first Python will attempt to run the operation using the implementation from one; one is going to say, I don’t know how to add this some object thing, so I’m just going to return NotImplemented. And then Python will try again with the second argument, calling __radd__ on that argument. And this time it will work, and you’ll actually get a successful dispatch in this situation.
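A tiny illustration of that protocol in plain Python (the Meters class is just a made-up example):

class Meters:
    def __init__(self, value):
        self.value = value

    def __add__(self, other):
        if isinstance(other, (int, float)):
            return Meters(self.value + other)
        return NotImplemented            # let the other operand have a go

    def __radd__(self, other):
        # Called because int.__add__ returned NotImplemented for `1 + Meters(2)`.
        return self.__add__(other)

print((1 + Meters(2)).value)   # 3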
So to recap, in stock Python, most method dispatch is single dispatch, and if you have a normal method on an object, that’s what’s going to happen. But sometimes there is a need for multiple
dispatch. And Python has this sort of convention, which is, you know, well, try the operation on all
of the objects in question. And you know, if one of them says, I don’t know how to do it, try it on the
other one. So binary ops and, you know, ops with many tensor arguments are galore in, you know, the torch library, right? Like whatever we had to do to deal with addition in Python, well, we can also add two
tensors together. And so when torch function was originally designed as array function in the NumPy
ecosystem, it was designed with an extra mechanism for making sure multiple dispatch would work in
this situation. Here’s how it works. And remember, it works very similarly how Python simulates multiple
dispatch and certain magic methods. When you call an operator that is torch function overloaded,
the first thing we do is we collect up the classes of all the tensor arguments, because
that’s all of the possible implementations of torch function that may be used in this situation.
We look and see if any of these classes are subclasses of other classes. This is important because,
well, let’s say that I have an A and I have a B that inherits from A, and I want to add A and B
together, it’s better for me to try the B first, rather than the A first, because B might have some
special handling that overrides the behavior of a stock A operation. Other than that, I pick some
arbitrary order to run the torch functions in, subclasses first, and then I go ahead and run them
one by one. And the first time one doesn’t return NotImplemented and actually returns an actual result, that’s when I actually return that result for real. However, torch function implementations
can say, I don’t know how to deal with this, and pass on the baton to some other class that might
be able to handle it later in the implementation. Unlike stock Python, we don’t have special versions
of torch function if you are in the first argument or the second argument or third argument. Torch function
is a class method, so it can always be called no matter what, where the class in question lives in the
argument list. So, you know, as an actual implementer of torch function, you’re responsible for going over
the arguments, and making sure if they are actually your object in question, or if they’re a normal
tensor, or God forbid, there’s some other class that you don’t know how to deal with.
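A minimal sketch of that responsibility (a made-up subclass, not any particular PyTorch class): inspect the types you were handed, and step aside with NotImplemented if you see one you don’t understand.

import torch

class MyBackendTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        # Only handle plain tensors and our own subclass; anything else gets
        # a chance to run its own __torch_function__ instead.
        if not all(t is torch.Tensor or t is cls for t in types):
            return NotImplemented
        return super().__torch_function__(func, types, args, kwargs)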
So let’s imagine that I’m writing a logging tensor. And a logging tensor is very simple, because it just
prints something and then just wants to go ahead and run whatever the operation was before. So logging
tensor is kind of universal, right? It works in any situation. And so we don’t need to be very restrictive
about what kinds of other subclasses we can deal with. So logging tensor might go ahead, look through all
the arguments, find the logging tensors that are in them, log what their values are, and then go ahead and
unwrap them and call the function again, on the same arguments as before. Remember, calling the same
function as before makes sure that if there are other subclasses involved, those can get a chance at it; the
logging tensor just removes itself from the picture. Or let’s say you’re some very special tensor that is
implemented like, as a back end into some accelerator, or some custom back end, well, you’re probably not going to be
able to deal with arbitrary subclasses. So what you should do in the torch function is when you are
processing it, you should go through all the types that were passed in and check that they are all exactly
your type or maybe, you know, a tensor type. And if you see anything you don’t support, you should return
not implemented instead of raising an error or anything like that. This is not super obvious to do when you’re
just copy pasting code. But if you keep it in mind, it’s actually pretty simple. It’s just a little bit
of extra error checking that you need to add to torch function and make it compose well with other
implementations of torch function. And of course, it’s not a magic bullet, right? At the end of the day,
someone needs to be able to handle all of the arguments in question. So if you know, you have a bunch of
extensions, and none of them know how to deal with each other, then well, that’s fine, you’ll just get
an error saying that there wasn’t any torch function that actually implemented this. The key thing about
multiple dispatch is that you can retrofit new functionality onto the system that you may not
have had before. So imagine that you know, someone’s gone ahead and built a torch function subclass that does
some extra behavior. And then you’re a further extender and you’re like, oh, this is a great idea, but if only I had another class so that I could customize the behavior even more. Well, that class knows about the first torch function
implementation, and it can write generic implementations that work in both cases. And in this way, you can post facto
add more functionality onto the system that you know, perhaps the original implementer of some class didn’t
anticipate. And this is one of the things that people like a lot about multiple dispatch. It’s this ability
to solve the expression problem by just, you know, giving people a place to put the definition of how feature A interacts with feature B. So multiple dispatch in this way is kind of cool. And remember that
I said that we always run subclasses before their parent classes, because you know, they're more specific,
but otherwise, the order of the multiple dispatch is unspecified and PyTorch is allowed to pick
whatever order it wants. But in general, most operations you’re going to do on a tensor aren’t
commutative. And so it’s kind of, it’s a bit tricky if you know, you actually are going to run
these in any arbitrary order, and you still want them to be well specified. So what really is going to
happen most of the time is most of your operations that, you know, don't know about each other are just going to say
NotImplemented when they see something they don't support. And it's only really the things that, you know, know about each
other, they’ll have a very specific ordering in mind. But there is a situation when you do want to be able to make custom
subclasses of tensors, and you want them to be composable, and you want control over the order in which they run. And this is called
functorch, aka JAX-style composable transformations on functions. One way to think about what functorch does is it creates a bunch of new
subclasses like batch tensor and grad tensor, which, you know, imbue the meaning of operations with different things, right, like
batch tensor takes in what used to be a single example series of operator calls, and turns them into batch versions. And a
grad tensor takes what used to be a simple forward-only series of calls, and then also computes the backwards at the same
time when you execute those calls. The composition of these passes matters: it matters if you do a vmap and then a grad,
which is traditional good old fashioned, you know, training over batch, versus a grad, and then a vmap, which is a more
exotic type of training called per sample gradients, where you actually compute a gradient for every single sample, you don’t
average them all together into one big loss. And the whole pitch about functorch is that these transformations are
composable. So you, you know, grad can work with vmap, vmap can work on grad, and you don’t want these to actually have to
know about each other, right? Like you can specify these transformations individually, and then, you know, put them
together in whatever order you like. So how the heck does this play well with a multiple dispatch system, like we just
described before with torch function? Well, remember that I said that although the order we call methods is
unspecified, there is one thing that is guaranteed, which is we are always guaranteed to run the subclass method before
the parent method. So let’s say that I want to do some composition of operations, say a vmap first, and then a grad. Well, if I want to
make sure that I handle the gradients before I do any v mapping, then all I need to do is make sure the gradient
class subclasses the vmap class. And of course, I might do it the other way, right? I might want to have the vmap class
subclass the gradient class. And so really, what I want to happen in this situation is I’m actually just going to
dynamically create new classes for whatever sequence of compositions I want. So if I want to do a vmap and then
grad and then a vmap, well, I’ll just, you know, have a vmap one that inherits from grad zero, that inherits from vmap zero,
or, you know, like, whatever. Fortunately, Python is a very dynamic language. And so it’s pretty easy to allocate classes on the fly.
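A rough sketch of that trick, hedged heavily: this is not functorch's actual implementation, just an illustration of generating a fresh subclass chain with type() so the most derived class (the outermost transform) would be dispatched to first:

import torch

class GradBehavior:   # hypothetical stand-in for a "grad" handler mix-in
    pass

class VmapBehavior:   # hypothetical stand-in for a "vmap" handler mix-in
    pass

def make_transform_stack(*behaviors):
    # Innermost transform first; each new class subclasses the previous one,
    # so the last (outermost) transform becomes the most specific class and
    # its __torch_function__ would be consulted before the inner ones.
    cls = torch.Tensor
    for i, behavior in enumerate(behaviors):
        cls = type(f"{behavior.__name__}{i}", (behavior, cls), {})
    return cls

# A grad-inside-vmap composition, generated on the fly:
Stacked = make_transform_stack(GradBehavior, VmapBehavior)
print([c.__name__ for c in Stacked.__mro__])
# e.g. ['VmapBehavior1', 'VmapBehavior', 'GradBehavior0', 'GradBehavior', 'Tensor', ...]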
So you’ll have some implementation of this class. But when a user wants to actually use it, they actually
have to, you know, set up this inheritance hierarchy that says what order the transformations relate to
each other. But you know, this is not something they have to write any code for, you can just do this for
them on the fly by generating the classes. And the wonderful thing about this is it says, hey, you know,
functorch is this cool thing. It’s got all these transformations, they’re composable with each other.
And in fact, the torch function, multiple dispatch mechanism, or really the dispatch to Python dispatch
mechanism, but they're one and the same, they're literally implemented using the same code.
This mechanism is general enough to make this work. So we don’t actually have to add any extra level
or stack or anything like that to make the multiple dispatch work out in the situation. That’s pretty
cool. And something Richard and I didn't expect when we were trying to work out what to do in the
situation. It also answers some questions we had, which is what should happen if you have, you know,
some functional transforms that aren’t nested in each other and are leaking between each other.
And this would correspond to a subclass A of some parent and a subclass B of some parent, but A and
B aren’t related at all. And you know, remember what I said about torch function, what you’re supposed to
do is check your types and make sure you actually understand everything that is in there. So if you get
some type that isn't related to your current class hierarchy, you're supposed to return
NotImplemented. And so we'll correctly get the right error case in this situation, which is that
well, this is not something that’s implemented, you haven’t said how these two passes interact with
each other. So we’re not going to guess one way or another. So what’s the upshot? Well, Python doesn’t
have native multiple dispatch, but torch function and torch dispatch, aka dispatch to Python, both implement a
form of multiple dispatch for handling what happens when you pass multiple different subclasses to a
function. It’s pretty simple, but very powerful and good enough to express all sorts of things,
including JAX-style composable transformations. That's everything I wanted to say for today.
Talk to you next time.
EP52 Batching
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about batching, a fundamental concept in PyTorch and many other numeric computing libraries.
Batching is one of those very fundamental characteristics in PyTorch.
And if you’re listening to this podcast and have gotten this far, you probably know a thing or two about it.
But let me just take some time to explain, you know, what is so important about batching.
The concept behind batching is that when we do operations in PyTorch, like adding or subtracting or multiplying, we don’t do them on single numbers.
Instead, we do them on batches, on arrays of numbers.
And when we do an operation, we do it many times over.
In a deep learning context, when we do a batch computation, we might be doing the same operation,
doing the same series of matrix multiplies, convolutions, whatever, on multiple inputs all at once,
called a batch of inputs, before, you know, computing a loss and doing stochastic gradient descent in the situation.
Batching has a long history.
The concept of computing on arrays or vectors
comes from all the way back from this language called APL, where everything was an array and you sort of only could ever do operations on it.
APL’s concept of defining operations that worked on arrays rather than single elements
sort of paved the way for most modern numeric computing libraries, PyTorch included.
The most important thing about batching is it lets you amortize the overhead of whatever interpreter loop or top-level programming language you’re using.
Because when you ask for an addition or a multiply, you’re not just doing it on one element, you’re doing it on many, many elements.
And so if your batch size is large enough, if your array is large enough, then, well, you’re going to spend most of your time in some sort of optimized C code
that’s handling the actual processing for each element, rather than, you know, wasting all your time in the interpreter, you know, repeatedly looping over something.
So, you know, like, basically, at a higher level of abstraction, if you write code that operates on, you know, arrays rather than on single elements,
we can just do a lot better job at executing it eagerly.
This characteristic of batching shows up all over the place.
For example, in the automatic differentiation community, prior to the rise of deep learning,
many AD systems would actually, you know, perform AD at the level of individual operations on single numbers.
And, well, this would actually lead to quite a lot of performance problems, because, well, you know,
you’re tracking these fine-grained, you know, autograd histories through every single element in, you know,
maybe some sort of physics simulation.
And so when we do autograd in PyTorch, we actually track automatic differentiation at the level of batched operations,
not individual operations.
And that reduces a lot of the record keeping we have to do, because, well, if you have a 10,000 size array,
we still only record one piece of information for the autograd of operations involving that array.
Given the importance of batching for running code efficiently,
you might imagine that it would be easy to write batched code in PyTorch.
And, well, you’d be half right.
So in a previous podcast episode, I talked about a vmap, a new feature in PyTorch for taking code that’s written in a per-example way
and converting it into its batched version without requiring any changes.
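For concreteness, a small hedged example of what that looks like with the vmap API (here spelled torch.vmap, as in recent releases; at the time of the episode it lived in the functorch package):

import torch

def per_example(x, w):
    # Written as if x were a single example: a plain 1-D dot product.
    return torch.dot(x, w)

xs = torch.randn(8, 5)   # a batch of 8 examples
w = torch.randn(5)

# vmap turns the per-example function into a batched one for us.
batched = torch.vmap(per_example, in_dims=(0, None))
print(batched(xs, w).shape)   # torch.Size([8])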
Vmap is pretty cool.
You should go listen to that podcast if you’re interested in it.
But, you know, people were writing PyTorch models way before the days of vmap.
And there were, you know, simpler ways of writing batched computations.
Namely, you just took operations that knew how to handle batch operations and you strung them together.
And so if you’re just doing simple operations like, you know, point-wise operations,
this wasn’t too difficult because, well, if you add together a tensor of size 2 with another tensor of size 2,
that’s the same thing if you turn it into a batch where you take a tensor of size n by 2
and add it to another tensor of size n by 2.
Nothing changes in the way you write the operation in this situation.
But it’s a little too much to ask for every PyTorch operator to work in the same way.
And in fact, if we look at all sorts of operators and we try to classify what their batching behavior is,
you’ll quickly find that there is a lot of variation.
So there’s a few cases that are very regular.
So one is this point-wise situation, right?
And in fact, there’s a sort of more general case of this,
which are just functions that take arbitrarily many batch dimensions
and functions that are willing to broadcast.
Broadcasting, by the way, is this thing where if you don’t provide enough elements
compared to someone else’s batch, we will so-called broadcast the element.
Namely, we’ll stamp out multiple copies to sort of match up the size in question.
This is really useful, for example, if you have an array and you want to add 2 to it.
Well, 2 is not the same size as a, you know, 10 by 10 array,
but we’ll just broadcast 2 into a 10 by 10 array that just contains a lot of 2s
and then add them together.
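A tiny concrete version of that broadcasting example:

import torch

a = torch.ones(10, 10)
# The scalar 2 is broadcast ("stamped out") to a's shape before the add.
print((a + 2).shape)      # torch.Size([10, 10])

# The same rule lines up a length-10 vector against every row of a.
row = torch.arange(10.0)
print((a + row).shape)    # torch.Size([10, 10])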
So functions that, you know, take many batch dimensions and are willing to broadcast,
these are typically just the point-wise functions.
And these are very well-behaved and it’s very easy to, you know, do batch computations with them.
Some functions, however, only take one batch dimension.
And you’re going to have to actually kind of look at the documentation
to figure out if this is the case or not.
There really isn’t any rhyme or reason.
A lot of this behavior is simply inherited from the old days in LuaTorch where, you know,
like someone was writing in the kernel and it was for neural networks.
And, you know, usually you only have one batch dimension in neural networks,
just the batch of the inputs you’re processing over.
So they didn’t really need more batch dimensions.
So you’ll have some functions that only take one batch dimension.
Some of these functions, you know, are even like they will take an optional batch dimension.
So if you just leave that dimension off,
they’ll just assume that you just wanted to operate on a batch size of one.
And some functions are really weird.
Like take, for example, torch.matmul.
Depending on the dimension size of each of its inputs,
it might do a matrix multiply, it might do a dot product,
it might do a matrix vector product,
or it might do some sort of batched computation.
And there’s like a bullet list saying what happens in each of these situations.
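Roughly what that bullet list boils down to, shown by output shape (this just exercises the documented torch.matmul behavior):

import torch

v = torch.randn(3)
m = torch.randn(3, 3)
b = torch.randn(5, 3, 3)

print(torch.matmul(v, v).shape)   # torch.Size([])        1-D x 1-D: dot product
print(torch.matmul(m, v).shape)   # torch.Size([3])       matrix-vector product
print(torch.matmul(m, m).shape)   # torch.Size([3, 3])    matrix multiply
print(torch.matmul(b, b).shape)   # torch.Size([5, 3, 3]) batched matrix multiply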
So it’s no surprise people really like using vmap,
because, well, vmap just sort of abstracts all this information away.
But, you know, we have to pay the piper somehow.
And so the cost of implementing vmap is we actually have to write
all of these batching rules to like figure out, you know,
how exactly to put things together.
And I talk a bit more about that in the vmap podcast.
What I want to talk about today is I want to compare and contrast
batching operations with how it’s done in NumPy.
Because in NumPy, actually, over the years,
NumPy has developed a little more structure on batching and broadcasting operations.
And they call this structure ufuncs.
And I just want to explain what a ufunc is,
because it's a pretty useful concept.
And a lot of the API concepts in PyTorch were taken from NumPy.
We don't actually have a direct concept of a ufunc.
But it’s one of the things we’re considering adding in the near future.
So a ufunc, short for universal function,
is NumPy’s way of referring to any function that, you know,
has a number of well-defined properties that make it work very regularly.
And what do I mean by that?
Well, ufuncs are functions that have batching behavior.
So that is to say you can add more dimensions to their beginning.
And you can, you know, broadcast them if the dimensions don’t line up exactly.
And they also support some amount of typecasting.
So if you pass in some inputs that don’t have exactly the same types,
a ufunc will know how to promote the type
and, you know, like get some common type to do the computation in.
So the concept behind a ufunc is really just, you know,
taking some primitive operation like an add between two elements
and then turning it into a vectorized operation
that can actually operate on as many dimensions as you want.
And if this sounds familiar to you,
it should because tensor iterator is basically an implementation
of, you know, turning C++ functions
into what are basically ufuncs in PyTorch.
We just don't call them ufuncs.
And, you know, actually ufuncs in NumPy
have a bunch of other interesting properties.
For one, they have a bunch of other variants.
So when you talk about NumPy.add, there’s actually also a NumPy.add.reduce.
And what that function does, it's a function attribute hanging off of the NumPy.add function,
is it takes, you know, your reduction dimension and reduces it using the operation
that is the one from the ufunc, namely addition.
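For example (using plain NumPy; np.add.reduce and np.add.accumulate are standard ufunc methods):

import numpy as np

x = np.arange(6).reshape(2, 3)    # [[0 1 2], [3 4 5]]

# .reduce folds the ufunc's binary op along an axis; np.sum is built on it.
print(np.add.reduce(x, axis=1))       # [ 3 12]
print(np.sum(x, axis=1))              # [ 3 12]

# Other ufunc methods exist too, e.g. a running accumulation.
print(np.add.accumulate(x, axis=1))   # [[ 0  1  3], [ 3  7 12]]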
So how come ufuncs aren't just an internal implementation detail in NumPy?
I mean, you know, tensor iterator is something that you have to know about if you’re a PyTorch
developer, but it’s not a user-visible concept.
I asked Ralph Gommers, a NumPy maintainer and one of our collaborators at Quansight, about this.
And he gave me some very interesting information about ufuncs.
So ufuncs are not that great for users because users find it a little strange to take a function
and then take an attribute on it and then say np.add.reduce.
That’s kind of weird.
But because ufuncs are introspectable and, you know, they have a very regular structure,
they can be used in other libraries to do things that, you know,
sort of wouldn’t be possible with just plain NumPy.
So for example, scipy.special consists mostly of ufuncs that are easy to define,
and they just reuse NumPy's machinery to take these, you know, functions and then turn them into ufuncs.
In the same way that, you know, downstream users of PyTorch might want to use tensor iterator
to, you know, make point-wise style operations.
But another example of a consumer of ufuncs is Numba.
So Numba is an optimizing compiler for Python that basically, you know, can take code that is just
written in Python and then compile it to CPU or CUDA.
And so when you write a NumPy operation and it is a ufunc operation, well, Numba can actually
easily lower that into their IR because they know, hey, well, ufuncs all operate the same way.
So if something is a ufunc, Numba just needs to know, you know, what the single element operation is
and then otherwise can use a common lowering behavior in the situation.
One of the reasons why I personally have been thinking about NumPy ufuncs recently
is because we're looking at how to rationalize our operators and sort of reduce the amount of
boilerplate we have to write in this situation.
And, you know, actually exposing this concept of ufuncs as a concept in our operator library
is one way of saying, hey, all of these operators have the same behavior, so you can treat them
in the same way.
In fact, there's an even more general concept than ufuncs called a generalized ufunc.
And these generalized ufuncs basically make it possible to define things that aren't just element-wise
operations like add or subtract, but things that actually do non-trivial transformations
on dimensions like matrix multiply.
And the concept is still kind of the same.
You need to define what the, you know, sort of core operation is, right?
Like a matrix multiply takes your dimensions and removes the inner dimensions and, you
know, puts the outer two dimensions together.
But then, once again, because we’re in a batched universe, you might want to actually batch this
operation.
And so the generalized ufunc says, okay, and then, you know, you can tack on as many batch
dimensions as you want.
And so once again, if something is a generalized ufunc, then you know at least that the batching
is handled in a very regular way.
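A quick illustration with np.matmul, which is a generalized ufunc whose core operation maps (n,k),(k,m) to (n,m); everything to the left is treated as batch dimensions:

import numpy as np

a = np.random.rand(7, 4, 3, 2)
b = np.random.rand(7, 4, 2, 5)

# The trailing (3, 2) x (2, 5) is the core matrix multiply; the leading
# (7, 4) dimensions are batched and broadcast automatically.
print(np.matmul(a, b).shape)   # (7, 4, 3, 5)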
So, you know, the combination of these two things means that, you know, it’s not as important
to have something like vmap because, well, as long as your operators are one of these things,
then, you know, you can rely on it acting the same way.
Although, well, it’s still kind of nice having vmap because not everything is going to be a
ufunc.
Not everything is going to be a generalized ufunc.
And so, you know, if you just don’t know, if you don’t have time to read the documentation,
you know, vmap will just make it easy.
You just don’t have to worry about it.
So that’s it for batching.
So batching is how we make PyTorch as an eager library efficient because we can amortize
the overhead of Python over doing computations over many, many elements.
PyTorch is not very regular about how batching is done on operators.
It’s a very per operator thing.
Some operators take many batch dimensions.
Some operators only take one batch dimension.
Some operators don’t take a batch dimension at all.
But there is some structure to our operators.
And one way to think about it is, is an operator implemented using tensor iterator or not?
But some other ways of thinking about it, because PyTorch is very similar to NumPy, is, you know,
what things are ufuncs?
What things are generalized ufuncs?
That’s everything I wanted to say for today.
Talk to you next time.
EP53 DataLoader-with-multiple-workers-leaks-memory
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about a famous
bug in PyTorch, issue 13246, aka data loader leaks memory when num_workers is greater than
zero. This is my apology for not actually knowing how to do a data loader episode because
the subject of data loaders is deep and vast and I should probably do an interview with
Vitaly Fedunin, our main developer working on data loader right now. So instead I'm going to just
talk about this particular issue which is of interest to anyone who’s ever you know trained
any models in PyTorch and talk about all the things you need to understand to know exactly what is going
on with the issue, why the issue happens, and why the various fixes for it work. So at the end of this
podcast hopefully you’ll know about all of these things. So to start I should explain what exactly
this bug looks like from the perspective of a user. So imagine you’re trying to train some model in
PyTorch. There’s a bunch of things that are important to training a model but in particular we want to
look at how exactly you are getting data into the model in the first place right like your data is
going to be some data set of images or audio files or whatever depending on your domain you need to
somehow load it up into memory and then actually feed it into your model to do the training in question.
And so that process of loading the data is done by the aptly named data loader which is responsible
for you know getting this data from wherever it is doing some pre-processing on it and then formatting
it into PyTorch tensors so that we can actually use it for actual you know processing. So the bug looks
like this. So data loader has this feature called num workers which lets you parallelize the data loading
process over multiple processes. This is pretty handy because sometimes you are CPU bound on the
you know number of pre-processing steps you can do and so farming them out to a bunch of you know
separate processes can help make sure that your actual model you know stays full of data because
maybe your GPUs are actually running way faster than the process of loading your data. This is very easy
to accidentally end up in, and so, like, parallelizing the data loading process can help in this situation.
So what you do is you’ve got your data loader and you say okay I want the number of workers to be you
know 8 or 18 or you know however many you think you want your parallelism to be. You start running your
model. It starts training. Everything looks okay. You know it’s using a lot of memory but it’s within the
bounds of your CPU system and you know you start doing iterations one after another and at some point
you run out of memory. And so you run it again and you look at the memory usage and you notice the memory
usage is slowly going up as you are running your training process and you’re like huh there must be
some sort of memory leak in the data loader and so the issue’s original name was data loader leaks
memory when num_workers is greater than zero. You'll also notice that if you don't set the number of
workers to you know something big then the leak quote-unquote doesn’t actually happen. So that’s what
the bug looks like. Now to explain where this bug comes from, because in fact it's not
technically a PyTorch problem, it's a problem with CPython and it's actually a very difficult problem to
resolve at the CPython level. We have to talk about a lot of concepts. So one is I need to explain you
know what exactly is going on with data loading and multiple workers and why we want to do it and
how this is set up. Two we need to talk a little bit about how process creation works on Linux, what
fork is, what copy-on-write page memory is, and finally we have to talk a little bit about the CPython
runtime, namely what is reference counting and what is it all about. Eagle-eyed listeners, forgive my
mixed metaphors here, may notice that in fact we have talked about many of these things in previous
versions of the podcast but I’m going to just talk about them over again today because it’s important
to understanding what is going on with this so-called memory leak in data loader. Okay so let’s first talk
about data set and data loader. So as I mentioned data loading is very important for deep learning
training and sometimes it’s hard to make sure that you you know have enough data to actually keep your
model busy on it and so that’s why people often want to do parallelization. Now how exactly does
parallelization work in PyTorch’s existing data loader design and this is important because the way we set
things up here contributes to the likelihood you’ll run into this problem. So the first thing to remember
is that data loader was originally designed to be something that just works in a single process.
So people just you know looked at it and tried to make something that you know would be reasonably
idiomatic and make sense if you wanted to load things from a single process. So the way things tend to work
in the data set is well you’ve got some data set so you need to run some constructor for it which you
know initializes some stuff about the data set more on this later and then depending on whether or not
you’re doing one of these iterable style or map style data sets there’s some way of actually
fetching data when you want to get it in the data set in question. So a very common separation say if
you’re doing training on an image model is in the constructor for the data set you load up a list of
file names say for all of the file images that you know might be in your data set and you know that’s
helpful because it can tell you you know how long the data set is and you know what are all the possible
like indexes that you can sample in this situation. And then when you actually index into the data set
to get something that’s when we actually load data from the image in question. So what does this look
like right? So you like have your constructor, you say okay well load up all the file names, store it as a
property on the object, and then inside the iterator for the object read out that
property do some stuff with it you know read out the actual image give it to the user. So this is like
the obvious way you would go about writing a data set in a single process case and one of the things
that data loader wanted to do was we wanted to like keep this same code working but just on multiple
workers. So how exactly do we do that right? Because like we’re accessing this data that was constructed
in the data set and you know like what’s going on with all the workers in question.
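To make that concrete, here is a hedged sketch of the kind of dataset being described; the class, paths, and preprocessing are hypothetical, but the shape (file names gathered in the constructor, items loaded lazily in __getitem__, workers added via DataLoader) is the standard map-style pattern:

import os
import torch
from torch.utils.data import Dataset, DataLoader

class ImageFolderDataset(Dataset):
    def __init__(self, root):
        # Runs once in the parent process; stores a big Python list.
        self.root = root
        self.filenames = sorted(os.listdir(root))

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        # Runs in the worker processes; it only reads self.filenames,
        # but (as explained next) even that read touches reference counts.
        path = os.path.join(self.root, self.filenames[idx])
        with open(path, "rb") as f:
            data = f.read()
        return torch.tensor(list(data), dtype=torch.uint8)

loader = DataLoader(ImageFolderDataset("/path/to/images"), num_workers=8)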
Intuitively what’s going on is we actually are able to access these properties on the data set
from each of the workers in question even though you know we only allocated them once in the parent
process. So how exactly does that work? To answer that question we need to know a little bit about how
multiprocessing works and in particular how multiprocessing with fork works. So fork is a core
primitive in the Unix style operating systems and what it does is it takes some process and it makes
a copy of it literally a copy. So that’s why we call it a fork because you know previously there was
one process now there are two processes and they are exactly the same. Well, except for the fact that
when you do the fork syscall one process gets back zero and the other gets back the child's process ID, and that's how you tell if
you’re the parent or the child. Now this might sound kind of crazy pants right like why would you go
through all the trouble of you know copying all the memory from the first process into the next process
like what’s up with that? Well it’s kind of useful right because maybe there’s a bunch of memory that
you set up beforehand and then the code after the fork wants to make use of it and so well you need it in
the parent process and you need it in the child process. And in fact forking is very cheap in
operating systems like Linux because of an optimization called copy on write. So remember
when I talked about shared memory in PyTorch and I said hey you know normally each process has its own
memory but in some circumstances you can share memory between processes and that’s how shared memory CPU
tensors work and that's also how shared libraries in your operating system work. Well, like, a single
library is loaded up once into physical memory but then mapped into multiple processes via virtual memory
mapping on your operating system. Well the same applies when you do a fork. So when you do a fork we
don’t actually go ahead and eagerly clone all the physical pages we just make a copy that refers to the same
physical page. Now of course each of the processes, that is the child and the parent, could go ahead and start
mutating these pages, and the, like, sort of semantic meaning of a fork is that you actually did get a copy.
So if we don’t actually make a copy when someone writes to it we have to then actually materialize
the copy, and that's why it's called copy-on-write. It's free as long as you only read it, and if you start
writing into it well now we’re going to start doing copies on these pages. So going back to the data loader
well you know so what’s happening when we have multiple workers is we just fork the python process.
Every process still gets access to all the stuff that you initialized in the constructor for the data set
and as long as you don’t write to it which you know like intuitively you’re not doing any writing
to the you know like list of file names right you’re just reading from it then you know you should be able
to share this memory without actually having any problems, right? Well, there's one last piece
of the puzzle and that’s python reference counting. In python things that look like read-only operations
like oh give me you know the field of this object and assign it to a variable these so-called read-only
operations actually do writes under the hood to the memory in question. And what are these writes
for? They're for reference counting. Reference counting is a way of ensuring that we know how many outstanding
references there are to any given piece of data so that when the ref count goes to zero we know we can
deallocate it. What this means is that if you you know read some field out of an object and assign it to a new
variable that didn’t exist before we’re obligated to increase the reference count of the object in question
and that’s a memory write. So hopefully you can see where this is going so putting all the pieces
together. So why when we you know run the data loader initially there’s not very much memory used
even though we’ve spawned off all these workers. Well that’s because of the fork copy on write
optimization, which says that, hey, when you immediately fork the process we don't need to
use that much memory because we can just you know share the pages between the processes. Of course if we
start writing to those pages and that’s what happens when python reference counting comes into play
then you will start actually you know writing to the pages and forcing them to be materialized. And so
as you go through your data set as you process more and more items you will start touching more and more
reference counts causing more and more pages to get copied to your child processes until in the worst
case scenario every child process is using as much memory as the parent process. And sure that’s not a
big deal if your parent process was only using you know 10 megabytes of memory but it is a pretty big deal
if your parent process was using 4 gigabytes of memory and you know 4 times 10 worker processes that’s 40 gigabytes
you’re probably out of memory at that point. So what can you do about this situation? And actually we can just
examine various, you know, things that led to this problem, and each of them sort of suggests a way to solve this
issue. So we might say hey, the problem is that we're doing this Python reference counting, and, you know, like, if we had some way of sharing data between processes without requiring you to increase the reference count when you access them, that would prevent us from paging this copy-on-write memory into, you know, copies in the child processes and save us from a lot of memory usage. Well, that's not so easy to do with pure Python objects,
but it's easy enough to do with other types of objects like numpy arrays and PyArrow arrays. These objects are reference counted per se, but the data in question (each individual integer that's stored in a numpy array, or, as people were doing in workarounds for this issue, the strings stored in a numpy array) is not itself reference counted.
So as long as you don’t actually like take out a new reference to your numpy array then you can just you know index out a subset of the numpy array and that will actually just you know be an operation you can do without incurring any reference count bump.
Of course even if you actually cause a reference count bump on the numpy array you might still get lucky if say the actual data for the numpy array was allocated out of line and so you you know like they lived on different pages so you only cause one page to come in but not the rest of your data.
Although I wouldn’t count on that just make sure you don’t increase the reference counts on the shared data you’re accessing.
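A hedged sketch of that workaround, with made-up file names: pack the strings into one numpy array before the fork instead of keeping a Python list of string objects:

import numpy as np

# A Python list of str objects gets a refcount bump on every access from a
# worker, which dirties copy-on-write pages. A single numpy array of
# fixed-width strings keeps the characters in one buffer instead.
filenames = np.array(["cat_0001.jpg", "cat_0002.jpg", "dog_0001.jpg"])

# Indexing copies the characters out of the shared buffer; the individual
# entries are not separate Python objects sitting on shared pages.
name = str(filenames[0])
print(name, filenames.dtype)   # cat_0001.jpg <U12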
There’s a bunch of other things you can do right like you can use c types to allocate raw data.
You can also use any other library that you know basically wraps around a raw c representation of the data in question that doesn’t involve real python objects.
Another conceptual fix to this problem is to say hey um this you know accessing a shared memory is kind of you know bogus right like um the first rule of designing distributed systems is shared memory is bad right.
You want explicit queues; you want to be explicit about saying what communication you do between processes.
It’s a lot easier to debug it’s a lot more scalable it you know prevents problems like this.
And so that's what the sort of data loader rewrite that Vitaly Fedunin and Erjia Guan have been working on, and specifically the data pipes concept, is about: instead of having this monolithic data set object that, like, does everything that you want to do, we'll have a bunch of composable data pipes which you can hook up with queues.
And they do various stages of processing. The most important thing is it's functional, and so you don't actually have any shared state, right? Like when I want to feed something from one data pipe to another, I have to do it via an explicit queue.
And that would prevent this problem. Now there's one more way of solving this problem, which isn't even mentioned on the issue in question, but which I discovered recently thanks to some of my colleagues at Facebook.
So another way you could solve this issue is you could literally go into CPython and say hey these objects I just don’t want you to increase the reference count anymore right.
I want to somehow make these objects immortal, and so, you know, in CPython, if you access an immortal object, it will just skip the reference count entirely.
If you can somehow do that, right, then you could actually use honest-to-goodness normal Python objects in the good old-fashioned data loader API, and that's, you know, kind of attractive, because it is kind of a pain to go and pack all your strings into numpy arrays.
Well it turns out there is a fork of the CPython interpreter called Cinder developed by folks at Facebook.
I can talk about this because Cinder is actually open source you can go download it and try it out and Cinder implements an API for immortalizing the entirety of your Python heap.
So the way it works is at some point in time you can say hey I think everything on this heap is going to be live for the rest of eternity and Cinder will go ahead and you know mark all those objects as immortal
and now you will no longer do reference counts on them which means that if you then fork and have workers access that memory they can access it willy-nilly without worrying about reference counts.
So there you have it one of the most famous quote-unquote memory leaks in data loader.
It’s probably affected everyone who’s done any non-trivial processing with data loaders in PyTorch.
I'm not going to say that, you know, PyTorch exactly is blameless here; although this is technically a CPython and fork interaction problem, we probably could have done a better job designing the core abstraction in PyTorch to make it harder to actually accidentally run into this case.
But it’s a pretty interesting problem one that you know is likely to show up if you do any other sort of multiprocessing and I hope this was an interesting podcast and gave you a little bit of insight about some of the complexities and interesting internal workings of working with data loaders in PyTorch.
That's everything I wanted to say for today. Talk to you next time.
EP54 Half-precision
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about reduced
precision floating point formats, namely float 16 aka half precision and bfloat 16 brain floating
point. Float 16 and bfloat 16 are important alternative precisions for floating point numbers
as opposed to the ordinary 32-bit floating point representation, which are often used in deep
learning applications to speed up computation in cases where the extra precision afforded by 32-bit
floating point numbers is not necessary. I’m not really going to have time to give you a complete
lowdown about everything there is to know about IEEE floating point numbers or how these formats
are set up, but I do want to give a little bit of a working knowledge about some of the important
points of floating point formats and also how they affect how we write code inside PyTorch,
because something that you very often have to do, for example, when you’re writing a kernel
is you’ll write a normal implementation, the normal mathematical way for 32-bit floating point and for
64-bit floating point, but then for half precision, you need to do something special. And we’ll talk
about why you often need to do something special in these situations and what kinds of things you have
to pay attention to. So to start off, let’s talk a little bit about floating point numbers, what they
are, to understand why half precision does something a little unusual with floating points in the normal
sense. So floating point numbers are a way of representing decimal numbers, because if you're
familiar with normal computation on computers, we love integers. We use integers to represent
everything, but sometimes some things can’t really be represented as integers, right? Like numbers with
decimals on them, for example. And so the float in floating point numbers refers to the fact that we
change the precision by which we’re representing numbers, depending on how large the number is.
To understand what I mean by this, let’s think about a situation where you don’t care about floating
point precision, namely storing currency. So, you know, in US dollars, you have number of cents and
you have number of dollars. So, you know, I may have $10.46. And there’s always some amount of cents
associated with any number, no matter how big the quantity of dollars I’m talking about, like a
million dollars or a billion dollars. There are still only ever two decimal digits of precision that I need
to record the number of cents in question. There’s never like sub cent quantities in typical monetary
transactions. So this is a prototypical example of a fixed point number, where you want to fix
the decimal point at two, you know, two digits of precision, no matter how big the number in question
is. Of course, if you’re doing something like doing a measurement between how far you are between two
cities, or for example, representing a neural activation in a neural network, if your
quantity is in the millions, you don't really care about those two digits of precision after the decimal
point. So the idea behind floating point is that you don’t have to, you know, store significant
digits based on where the decimal is, you let it float, and you just store a fixed number of significant
digits. And just what those digits are depend on how big your number is, right? So if you’re talking about
a million, then you might store significant digits for the millions, the hundred thousands, the ten
thousands. But if you’re storing something like one, then you might store the, you know, first decimal
place and the second decimal place, and the third decimal place, it floats along with you when you have
the number in question. So given this basic specification of floating point numbers, there are basically two
major parameters that you can vary when you’re defining a floating point representation, right? You can say
how many bits you're going to use on the significand, aka, the, like, significant digits that are in your
number, and how many bits you're going to devote to representing the exponent, which basically just says
how big the number is, right? Are you talking about ones or thousands, or millions or billions? And so we can use
this to sort of understand what’s going on with 32 bit floating point numbers, 16 bit floating point numbers,
and also brain floating point, because they all actually have the same semantics, just different
settings for these parameters. So the significand for 32 bit floating point numbers is 24 bits. So that's a lot of
digits of precision. And so one of the sort of observations that drives lower precision floating
point is that, well, you don't actually need all those significand bits. So 16 bit floating point numbers only have
11 bits of significand, and B float 16 only has eight bits of significand (seven of them explicitly stored). If we talk about the exponent
instead, well, 32 bit floating point numbers have eight bits of exponent, 16 bit floating point numbers
reduce the amount of exponent you have. So you only have five bits of exponent. And B float 16 actually keeps
the number of bits for the exponent the same as 32 bit floating point numbers. So another way to like think about the
difference between float 16 and B float 16 is float 16 sort of was like, okay, well, we need to chop off
16 bits from our representation to, you know, reduce it in size by half, we’re going to chunk some of it off
of this thing. If we can, we’re going to chunk some of it off of the exponent. And then you know, we have a nice
balance. B float 16 was like, we want all of the exponent bits, we want the same what we call dynamic range,
the same, you know, max and min values we can represent in floating point numbers, and we’re willing to chop off
tons and tons of actual precision off of the actual, like, you know, digits in question, the significand,
to get it. So why use half precision, or B float 16 numbers? Well, as I mentioned before, they use half the
space of memory that a 32 bit floating point number uses. So this has a number of implications,
right? We need to store the values of tensors in memory. And so if you can store a number in half the
space, well, you’ve basically doubled the number of parameters you can store in your model. And
furthermore, you know, not just, you can store more numbers in your RAM. But when you’re actually
like loading up this data into your processors to actually compute on it, well, that’s half as much
memory bandwidth you need in this situation as well. And oftentimes, one of the primary costs of doing
deep learning inference or training is just getting the freaking values out of memory in the first place.
And of course, if you only need to compute on 16 bits of data, instead of 32 bits of data,
that means less silicon. And you can, for example, vectorize more easily for the same amount of
silicon. Now, sometimes, you know, the memory benefits, I would say, for half precision are the
primary benefits. And the computation benefits do help sometimes, but also sometimes they happen not to
matter. And we’ll see an example of this when we talk about CUDA support for half precision.
So let’s talk specifically about half precision for a moment. So what are some things to know about
when you are writing code that needs to operate in half precision? So one of the like things you first
figure out that’s very, very obvious is you are way, way, way more likely to overflow your floating
point number than if you were doing 32 bit floating point numbers. A float 32 can store values up to about 3 times 10 to the
38. That's 38 zeros after, you know, the three. I don't even know what quantities can
go that high that I normally deal with in a day to day basis. In contrast, a float 16 can only go up to
65,504. That’s it. If you go much higher than that, they’ll just go to infinity in float 16. So yeah, got to be super,
super careful. Because the dynamic range of half precision floating point numbers is smaller, when you want to do
training with networks, and you want to use half precision instead of float 32, you often need to tune your
hyper parameters differently, because, well, you need to make sure you don't actually go outside of the dynamic
range supported by half precision. One of the most common ways people use half precision is in fact, not by
making their entire network operate only in 16 bit floating point numbers; that's often just too much, it's like
too little precision, and your dynamic range is just going to get messed up in a lot of situations. But instead, by
using something called automatic mixed precision, which just says, well, there’s some operations that are very
unlikely to go outside of the dynamic range you want, and will only cast to float 16 and make use of the benefit,
the lower memory usage in those situations. It also helps that automatic mixed precision is super easy
to use, you literally write your network as if you’re writing it for 32 bit floating point numbers, and then
you just turn on a flip switch that like automatically switches it without you having to do anything.
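Concretely, the switch is the autocast context manager (plus a GradScaler for the backward pass); this is a hedged sketch with a made-up model, assuming a CUDA device is available:

import torch

model = torch.nn.Linear(128, 10).cuda()                 # hypothetical model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(32, 128, device="cuda")
target = torch.randint(10, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():                         # safe ops run in fp16
    loss = torch.nn.functional.cross_entropy(model(x), target)
scaler.scale(loss).backward()                           # scale to avoid fp16 underflow
scaler.step(optimizer)
scaler.update()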
Half precision has been around for a while, and it's been available especially in NVIDIA
CUDA. There's actually really no silicon for doing half precision computations on Intel CPUs. And so you're
most likely to see use of half precision inside CUDA programs. But actually, there’s a little bit of
nuance to this, which is that you might imagine that like, you know, you put your tensors in half
precision, and then you do operations on them. And you’d expect to see, you know, actual like half
precision silicon being used. But in fact, in PyTorch, we don’t use any of CUDA’s half precision
intrinsics, which would let you actually use the half precision operations directly in the hardware.
Instead, we convert everything into 32-bit floating point numbers and do the computations at higher
precision. Why do we do this? Well, it’s because for many of our operations that we implemented for
half precision, they are in fact, not compute bound, they’re bandwidth bound, and we spend more time
reading the data out from memory than we do actually doing the computation on it. And in these situations,
it doesn’t matter if we waste time doing conversions to and back from floating point,
because, you know, we’re still waiting on the next block of memory. And so we can just, you know,
do things in higher precision. And so a lot of computations in CUDA operate at this higher
precision internally, only converting back to float 16 when you need to write it back out into
memory. Remember, this is still a win, because you're using half as much memory and
half as much memory bandwidth. So what you typically expect is for a computation on half
precision to be twice as fast as a computation in 32 bit precision. And that's just because you're
literally reading out half as much memory. That being said, in some situations, you are somewhat
compute bound. A good example of this is when you’re doing matrix multiplies. And so when you do matrix
multiplies, in fact, there is this thing called TF32 that newer NVIDIA GPUs implement, where they do the
multiplies inside matrix multiplies in an internal format. And in fact, they don't do it in half precision,
they do it in a special precision that has 11 bits of significand and eight bits of exponent, sort of like a combo
of float 16’s precision and B float 16’s dynamic range. And this happens entirely internally. So you
don't see it, you're just feeding in float 32s and getting out float 32s. But it makes things run
faster. And you know, you hope that the numerics don’t change too much. So to summarize, half precision,
the dynamic range is way, way smaller. So you're mostly likely to see people converting to half precision at
very, you know, localized spots in their code, where they know they don't actually need that level of dynamic
range. And you mostly only ever see half precision in CUDA on NVIDIA GPUs. Okay, let’s talk a little bit
about B float 16. So as I said, B float 16 is they just took their float 32, chopped off enough
significant digits until they, you know, could fit in 16 bits, and they kept exactly the same dynamic
range. So B float 16 is actually very easy to emulate, right? Because you can use normal
32 bit floating point hardware to run it, you just, you know, sort of zero out all of the digits that are
below the level of precision that B float 16 would have given you, and then just run the normal float
32. So people did a number of experiments with it, and showed, hey, you know, B float 16 is great,
because, you know, we got rid of all of those, you know, pesky, like, you know, very fine detailed
digits in the numbers. And turns out, it didn’t matter at all, like, you know, our model still converged,
because we weren’t actually making use of that precision in any good way. And so B float 16 has shown up
in a lot of places. True to its name, brain floating point, it was originally designed by folks at Google
for use inside the TPU. But since then, it’s shown up in a lot of places, in particular, on the latest
Intel CPUs, starting with Xeon, there’s actually silicon for doing B float 16. So unlike half precision,
which only ever usually shows up in CUDA, B float 16 shows up in a lot of places: it shows up on TPUs
and on your CPU. So if you're looking for some lower precision training, it's probably going
to be B float 16. In fact, Intel has been working with us to extend automatic mixed precision to support
B float 16. So originally AMP was something developed by NVIDIA for CUDA only for half precision, and
Intel’s, you know, given us a patch that turns it on for CPU, and does exactly the same thing except
using B float 16 instead of half precision. Unlike in the CUDA situation, where we were typically memory
bound, we are often compute bound on CPU. And so sort of the silicon we’re using is in the AVX 512,
you know, vector instruction set. See also my, you know, previous podcast about vectorization.
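The CPU flavor of that same autocast switch looks roughly like this (hedged: assumes a PyTorch recent enough to have torch.autocast with a CPU device type, and a made-up model):

import torch

model = torch.nn.Linear(128, 10)     # hypothetical model, on CPU this time
x = torch.randn(32, 128)

# Same autocast idea as on CUDA, but targeting CPU and bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)   # torch.bfloat16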
And there’s just, you know, a lot of built in support for actually doing these computations
fast. Okay, so I’ve told you a lot of facts about float 16 and b float 16. What does this matter if
you’re doing code in PyTorch? Well, it mostly only matters if you’re writing kernels. And so when we
write kernels in PyTorch, we typically try to write them in a generic way that works for any type in
question, right? So typically, it’s templated so that you can do it in float, and you can do it in
double. And for most use cases, float versus double doesn’t really matter. You can write the same
algorithm in all of these cases. But when you have float 16, or b float 16, now you actually have to
pay attention to how you’re doing your internal computations. And in fact, we have two concepts
for like basically internal computation types, which are important when like, you know, using the low
precision floating point would result in catastrophic loss of precision, and you’d basically get wrong
results. So the first concept is the acc type template, acc_type. What this does
is it gives you an accumulator type corresponding to the number in question. So for example, if I had
int 8, the acc type of int 8 is int 64. Because if I'm, you know, summing together a bunch of 8-bit
integers, I will very quickly overflow 8 bits. And so we do the accumulation in 64 bits so that we can
actually, you know, get the real value in the situation. Similarly, when we do
accumulations on half precision floating point numbers, we need to accumulate them in 32-bit
floating point. Because as I said, you’re really likely to overflow 65,000 if you don’t actually do
this at a higher precision. This is very, very common, right? Like I mentioned matrix multiply
using this TF32 thing. They only do that for multiplies. The accumulate still happens in 32-bit floating
points. So like the idea of needing to accumulate at a higher precision is common all over
the place. We don't accumulate in double precision for float on CUDA because double hardware
is really, really slow. But in fact, on CPU, acc type still goes to double in this situation.
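To see why accumulating in the storage precision is a problem, here is a small illustration in NumPy (just demonstrating float16 arithmetic, not PyTorch's actual acc_type machinery):

import numpy as np

# Near 2048 the spacing between representable float16 values is 2, so adding
# 1 in float16 gets rounded away entirely:
print(np.float16(2048) + np.float16(1))    # 2048.0

# Accumulating in float32 and casting back at the end keeps the result:
acc = np.float32(0.0)
for _ in range(3000):
    acc += np.float32(1.0)
print(np.float16(acc))                     # 3000.0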
The other concept we have is op math. And that just says what internal computation type we're going
to use. And this takes advantage of the fact that on CUDA, we're typically memory bound, not compute
bound. So in fact, most of our internal operations happen in floating point precision. And this is good
for precision purposes, because if you do all your internal computation in 32-bit floating, and only
convert to 16-bit floating point at the end, you're not going to have as many, like, sort of catastrophic
cancellation or loss of precision events from every intermediate operation in question. Of course, if
you’re running enough operations, you might still want to do them in half precision, because you might be
compute bound in that situation. So that’s most of everything that I wanted to talk about with half
precision. There’s one last thing that I want to put in your brain, which is that reducing the number of
significand bits or exponent bits is not the only way to, you know, sort of reduce the amount of memory
that your parameters use. There’s another way you can do it, which is you can represent your parameters
as integers entirely. And that’s called quantization. And it’s another very interesting way to reduce the
memory footprint and compute costs of your models. That’s everything I wanted to talk about today.
Talk to you next time.
EP55 Tensor-subclasses-and-Liskov-substitution-principle
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about tensor
subclasses and the Liskov substitution principle. If you haven’t seen it already, I recently
posted the State of PyTorch Core September 2021 edition, which basically talked about
all the things that were going on inside PyTorch Core right now. And one of the things that
you may have picked up reading over this is that we actually got a lot of stuff going
on related to tensor subclasses. That is to say, you know, subclasses of tensor that
add more different kinds of behavior for any sorts of things you might want to do. And
there’s a ton of things going on here, like linear operators, like debug tensors, like
functorch. And I wanted to pull open the cover on one of the things that we've been thinking
about when designing how this ecosystem should look like, and that’s the Liskov substitution
principle, which says some things about when it is permissible to subclass some object and
when it is not permissible to. Okay, so let’s just dive straight into it. So what is the Liskov
substitution principle? So you may have learned this in, you know, your undergrad class about
object-oriented programming. And the definition you heard probably sounds something like this.
If S is a subtype of T, then any T may be replaced with S without altering any desirable properties
of programs that were previously using T. That’s a bit of a mouthful. So let’s look at an example.
Let’s suppose that we have some class that implements, say, bags. So bags are sort of
unordered collections of items. But unlike sets, you can have multiple copies of an item in a bag,
right? So like I might have three apples and two oranges. And if I had a set, I could only say that
I have an apple and an orange. But in a bag, I could say I have three apples and two oranges.
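A minimal sketch of this bag, together with the set-by-subclassing move described over the next few sentences (class names are mine, not Liskov's):

from collections import Counter

class Bag:
    def __init__(self):
        self._counts = Counter()

    def insert(self, item):
        self._counts[item] += 1

    def count(self, item):
        return self._counts[item]

class SetLikeBag(Bag):
    # Reuses Bag's implementation but refuses to insert duplicates, which
    # silently changes the counting behavior that callers of Bag rely on.
    def insert(self, item):
        if self._counts[item] == 0:
            super().insert(item)

b, s = Bag(), SetLikeBag()
for fruit in ["apple", "apple", "apple", "orange", "orange"]:
    b.insert(fruit)
    s.insert(fruit)
print(b.count("apple"))   # 3
print(s.count("apple"))   # 1: substituting the set breaks counting code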
Now, if I have an implementation of a bag, I can easily reuse this implementation to implement a set.
All I have to do is subclass it and say, well, whenever I insert things into the bag, if I already
have the thing in the bag, I’m just not going to insert it in that situation. No problem. So this
subclassing works, I can use inheritance to implement sets in this way. And it violates the
Liskov substitution principle. Why does it violate the Liskov substitution principle? Well, imagine that
you’ve got an algorithm and you know, it wants to do some sort of counting of objects. And so it was
using a bag inside its algorithm to like put things in and then at the very end, read out what the counts
of things should be. If you replaced the bag with a set, which is what we were sort of thinking about when we said a set is
a subtype of a bag, then when I ran this algorithm, I would only ever
count up to one for any given item that I was looking for. And that probably isn’t what my algorithm
wanted to do. Barbara Liskov gives another example, which is that in the old days, when people were sort of
just figuring out this object oriented programming thing, people would make claims like queues and
stacks are subtypes of each other. Why did they say that? Well, you know, a queue and a stack have a
push operation and a pop operation. And so you know, the methods are the same. So well, you know, they’re
structurally indistinguishable from each other, right? Like they just have the same methods. So you can use
one or the other. And Barbara was like, well, but that doesn’t make any sense, right? Like if I had a
program, and it’s using a stack, and then I replaced the stack with a queue, my program is going to do
something totally different, like because, you know, last-in-first-out and first-in-first-out are
totally different ways of doing things. And probably my program wouldn’t work at all if I
replaced my stack with a queue. So the moral of the story behind Liskov substitution principle, and why
you know, like we love to teach it in the undergrad CS curriculum, is because it shows people that,
hey, subclassing is not the same thing as subtyping or behavioral subtyping, as Liskov liked to call it
in the later days, right? Like just because something has the same interface doesn’t mean they’re actually
substitutable. You actually have to say something about what the behavior of the program is in these
situations. So I remember learning about the Liskov substitution principle and thinking to myself,
well, that doesn’t sound too complicated. You know, like, this seems like a very simple thing
to abide by. You know, what’s the big deal? And well, maybe it is. But in fact, you know, I would say
LSP has spawned a ton of debate all over the internet about like, what exactly is meant by this. And
it’s not exactly straightforward to apply the principle in every case. In fact, there are some very
embarrassing situations where very famous software projects have violated LSP and discovered it to their
detriment later. Ralf Gommers relates a very fun story from NumPy’s history, which is that there’s this
class in NumPy called NumPy.matrix. It’s a subclass of ndarray. So it was at least originally intended to
be usable in any situation where an ndarray was. And it’s basically a specialization of ndarray for the
matrix situations, right? 2d. And what they did was they were like, okay, well, because these are matrices,
we’re going to make multiply, like just the normal asterisk operator, mean matrix multiply in the situation.
Well, even though NumPy.matrix has the same API as NumPy ndarray, it totally violates LSP because,
you know, anywhere I had some NumPy program that was originally expecting to have an ndarray
and expecting the star operator to give me pointwise multiplication, if I sub in a NumPy.matrix,
I will suddenly get matrix multiplication. And I’ll probably just get errors in this situation. And my
program will not behave the same way. And like it will have none of the, you know, properties that I
wanted to have. So as a result, like every, you know, like serious NumPy function in the ecosystem
first casts everything to ndarrays, just so that, you know, they don’t have to worry about someone
passing a NumPy matrix. You really shouldn’t use NumPy matrix if you can get away with it.
So what I think makes LSP so controversial is that we said that you can replace any T with an S
without altering desirable properties. But we didn’t really say what is meant by desirable
property. What Barbara, at least, meant by properties was that if you were only using the
API defined by the supertype, you couldn’t see the difference between using a T versus using an S.
And this is a very reasonable definition, especially in an academic context. But in actual programming
languages like Python and C++, there are a lot of ways you can interact with an object. So if you say
every operation that was possible on the supertype needs to be preserved by the subtype, well, in practice,
there is basically no change you’re allowed to make. Like, as a simple example, in Python, I can ask what
the type of an object is. And if I subclass my type, then I will get a different subclass in this
situation. And therefore, it is observable that there is a difference in the situation. And therefore,
no subclass is a true subtype in this situation. And to take the flip side perspective, I could say,
well, you know, programs are meaningless. It doesn’t matter what a program does. All I need is for it to be
type safe, or for it to not raise exceptions. And so as long as it cracks like a duck, as long as it
implements all of the methods that I expected on the original object, I have no obligation to you
to make the subclass actually behave in any reasonable way. And so a lot of, you know, monkey
patching and duck typing in Python sort of is based on this idea, right? There’s no spec, you just subclass
the object, override a bunch of stuff and pray that something reasonable happens.
So clearly, there is a solution to this problem. And the solution to this problem is that we shouldn’t
use concrete implementations of objects as the definitions of our supertypes. Instead,
we should use some sort of abstract specification, and use that as the basis for deciding what behavior
is allowable or not. And this is definitely, in my opinion, what Liskov had in mind when she said,
well, you know, the LSP is all about not being observably different when you talked about it in
terms of the super type. But of course, this was in simpler times when, you know, we didn’t have tons
tons of ways to break encapsulation on objects. But of course, defining an abstract specification for
what a tensor is supposed to be is not so easy. Of course, it’s easier than defining an abstract
specification for what a widget factory is supposed to be because, you know, tensor has its roots in
mathematics. And one could say mathematics is, you know, very much in the business of sort of abstracting
away, you know, differences between objects. But at least in PyTorch, you know, we don’t have
anything written down. It’s all based off of an informal understanding of how code tends to work
with tensors in practice. And that means that you really are, you know, sort of rediscovering what it
means to be a tensor every time you make a tensor subclass. Of course, there are some tensor subclasses
where it’s not so hard to make a determination in this way, right? Like, for example, there are a lot of
types of tensor subclasses like logging tensors, or finite tensors, or nan tensors, where it’s kind of
easy to see that these obey LSP, because all they do is they do the same thing a normal tensor would have
done, but then with a little extra behavior on top, like printing out what operators were called,
or, you know, testing if all the elements in the tensor are finite. And so the spec here is that while
everything that like is the tensory behavior, that’s part of the abstract specification, and all the other
things like the logging behavior, or whether or not we throw exceptions or not, that’s sort of external
to the tensor specification. And most code that you write is going to, you know, be indifferent to those
extra things, the extra logging or the error reporting. It’s indifferent to the error reporting, by the way,
because in Python, you can actually throw exceptions, unlike in languages like Go, where all exceptions
have to be handled manually. If you had to handle exceptions manually, then throwing an error would not,
in fact, be a, you know, easy to add piece of behavior on top. Then there are some types of
objects which mostly obey the Liskov substitution principle. But if you poke hard enough at implementation
details, maybe not. And a good example of this are the linear operators from GPytorch, authored by
Max Bellendot. What are these things? Well, the basic concept is that tensors traditionally store all of the
data corresponding to them. But sometimes there’s special linear algebra structure associated with
the tensor. And so if you store only that, or you like store that there is in fact this structure at
all, in the first place, a lot of linear algebra operations can be run faster. So a very simple
example of this is if you have a diagonal matrix, you don’t need to store all the matrix, which is
mostly zeros, you can just store the diagonal and you want to multiply a diagonal matrix with another
matrix. That’s only linear, right? Because you just zip through the diagonal and you’re done.
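As a tiny sketch of that point, using plain PyTorch tensors rather than GPyTorch's actual linear operator classes:

import torch

d = torch.randn(1000)            # store only the diagonal: O(n) memory
M = torch.randn(1000, 1000)

# Materializing the diagonal into a dense matrix and doing a full matmul...
dense = torch.diag(d) @ M
# ...versus exploiting the structure, which is just a row-wise scaling,
# with no matrix multiply at all:
structured = d.unsqueeze(1) * M

print(torch.allclose(dense, structured))   # True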
So these also sort of obey the Liskov substitution principle in a very, you know, tight way because,
well, a diagonal matrix is still a matrix, which is still a tensor. So there’s still this is-a
relationship, and mathematically, you know, anything you can do with a tensor, you can also do on a
diagonal matrix. And even if you don’t have a kernel for it, what you can do, you can just materialize
the diagonal matrix into a normal dense tensor, and then do the operation. But there’s still some stuff
that doesn’t work, right? Like, you can’t get out a data pointer to the contents of a diagonal tensor,
and then expect, you know, the first N elements to be zero, right? Like, if I give you a
data pointer, it’s not going to be this dense contiguous representation. And it’s not really going to, you know,
behave the same way you would have expected with a normal strided tensor. And this is sort of okay,
right? Like most code written in PyTorch and Python doesn’t involve poking at raw pointers. And so for the most part,
you can generally assume that code is going to behave okay, in this situation, you might still have to audit
your code if you know, like maybe you’re back ending to some external C kernel. And finally, there’s tensor
types that don’t really obey LSP at all, like nested tensors, which want to change the type of size in
tensor so that it doesn’t return just a tuple of integers, but it actually returns some nested
structure, saying what the size of all the various dimensions in your tensor are. And so technically,
facilities like torch function allow for this, you can define a torch function on an object that doesn’t
subclass from tensor at all. So there isn’t even any subtype relation, besides the, you know, the Python
duck typing relation that all objects participate in. But it’s still rough for a tensor like this,
because you might still want to use like code that was written on normal PyTorch tensors in this
situation. And so you’re appealing to an even smaller subset of the tensor language, an even, you know,
more relaxed set of invariants and properties that like generalizes for both nested tensors and normal
tensors. And it’s just generally hard to figure out what this is supposed to mean. Things get even
hairier if you actually honest to goodness subclass from tensor, because from our C++ side,
we actually have a very strict contract about what fields in the C++ implementation have to be filled
in, you know, with actual values. And there’s very specific concrete machine types associated with them.
And anyone who subclasses from tensor is obligated to fill these in, in a reasonable way. And sometimes
it’s not so easy to do. But because we want to be able to inline accessors to these fields on tensor,
we have this very strict, you know, behavioral requirement, that sort of makes it a little
difficult to create subclasses of tensor. That’s why you have to use __new__
instead of __init__: it’s because that, you know, underlying C++ tensor object has to
be allocated all in one go. There are many other subtleties that I could talk about. But I do want
to relate this discussion back to LSP for one particular aspect, which is: what should the
behavior of custom tensor subclasses be when you mix two different subclasses together? Like, say,
I have a debugging tensor, and I add it to a diagonal tensor, like what exactly should happen
in the situation. Zachary DeVito had a good comment the other day about what it means to be
compositional, what it means to be compositional is that you don’t need to look at the cross product
of any interaction between classes to understand what things are going to do, right. So if you have
to sit down and like manually write down what it means when you cross a debugging tensor with a
diagonal tensor, you’re not compositional, right, you’re writing this monolithic thing, and you’ve
manually worked out what the interactions between these two things are supposed to be. If we want to
be compositional, this interaction has to be worked out automatically. But how could we actually do that?
Because if I am adding these two tensors together, I probably have an implementation of adding a logging
tensor to a normal tensor. And I probably also have an implementation of adding a diagonal tensor to a
normal tensor. But you know, that doesn’t give me an implementation of diagonal tensor added to a logging
tensor. And of course, LSP says that actually, I do have a way of getting an implementation of this,
right. So when I have a logging tensor, I also have a normal tensor. And so I could use that tensor in
place of the tensor in the implementation that takes a diagonal tensor and adds it to a normal tensor.
And similarly, when I have a diagonal tensor, I also have a normal tensor. And I could just use
that diagonal tensor as if it were a tensor into the implementation of a logging tensor plus a tensor.
And so via LSP, if you actually believe in it, which it’s not entirely clear that you should,
um, we can actually generate an implementation that works out of the box without having to like
deal with these cases individually. But there’s a problem, right, which is there’s two possible ways
I can implement it, and their behaviors are actually going to be divergent. And so in general,
this is kind of hard to resolve. And in fact, the only way to really resolve it, um, in a reasonable
way is to do the non-compositional thing, and just explicitly say what the interactions of these two
tensors should be, unless you’re functorch. The lesson of functorch is that if we define an ordering between
these two operations, and we say, we phrase each of these tensor subclasses as a way of, you know,
sort of desugaring a bunch of tensor operations into a bunch of lower level tensor operations that don’t
make reference to your tensor subclass, like, you know, this diagonal tensor turns into a bunch of
operations on not diagonal tensors. If you have the ordering, and you have the desugaring, then you can
decompose these, and it’s in a unique way, and it’s compositional. So I’m not really sure what the right
answer here is in general. But my hypothesis, and when I look at NumPy, I see that there are plenty of
ND array subclasses, but they mostly don’t interact with each other, is that people are going to write
tensor subclasses, they are generally not going to make them compositional. And if you do want them to
be compositional, well, you need to fit them into a framework, like functorch, like, you know,
JAX’s functional transformations. So that’s pretty interesting. And I hope we can develop it in more
detail and share it with you when we figure it all out. That’s everything I wanted to say for today.
Talk to you next time.
EP56 All-about-NVIDIA-GPUs
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I have a special guest with me, Natalia Gimelshein,
who’s going to be here to talk to us about all the various GPU architectures.
Natalia, do you want to introduce yourself real quick before we start?
Hi, I am Natalia Gimelshein, and I am considered a GPU expert around Facebook,
perhaps undeservedly, but anyway, GPUs are going to be the subject of our podcast today.
All right, so when I was thinking about topics that I wanted to bring in other folks to talk about,
one of the things was sort of just, hey, there’s a lot of different NVIDIA cards out there
that do all sorts of different things, and sometimes when you’re new,
you hear things like A100s and V100s and how they have different performance characteristics,
and I just want to talk about this a bit and sort of get a sense about what’s actually important,
because if you just like pull up, say, the Wikipedia page that says about all of the devices
that NVIDIA has, there’s a ton and ton of different cards.
Which ones matter?
Which ones don’t?
How do I actually understand them?
You know, so that’s what I kind of want to dig into today with Natalia.
So I guess we should first start off by talking about GPU architecture.
So Natalia, what exactly is a GPU architecture, at least in NVIDIA’s terms?
In NVIDIA’s terms, GPU architecture has pretty much the same meaning to it as a CPU architecture.
It’s just a number of capabilities that this particular generation of the GPUs has.
How fast can it process floating point numbers?
How fast can it process low precision numbers?
How fast can it do indexing computations?
How fast can it read and write memory?
How many registers and how much shared memory does it have?
So all those characteristics constitute a GPU architecture.
And each next one, of course, is considered to be better than the previous one.
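If you want to see those characteristics for whatever card you actually have, here is a small sketch (assuming a CUDA build of PyTorch and a visible GPU):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(props.name)                          # e.g. "A100-SXM4-40GB"
    print(props.major, props.minor)            # compute capability, e.g. 8, 0
    print(props.total_memory // 2**20, "MiB")  # device memory
    print(props.multi_processor_count, "SMs")  # number of streaming multiprocessors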
So what’s an example of one of these GPU architectures that we’re talking about?
The most recent GPU architecture is Ampere.
So those A100 cards, or if we are in consumer lands, then 30-something cards.
That is the latest architecture that boasts, obviously, the best performance known so far.
The previous architecture would be Volta and Turing cards that are still excellent cards and still used a lot around many places.
And we can go back to the previous generations, but I guess we’ll do it a bit later and in a forward order and not in a reverse.
What’s the difference between Volta and Turing?
The difference between Volta and Turing is that Volta is mostly a data center card, and it is the first card to introduce the tensor cores.
Tensor cores are the thing that allows NVIDIA to do very fast, low-precision matrix multiplications.
Turing is kind of a consumer brother of this server card, and an additional capability that Turing has is fast integer computations that can be used for very fast quantized inference.
Okay, so we’ve so far talked about three GPU architectures that NVIDIA has released, Ampere, Volta, and Turing.
So Ampere is the latest and greatest.
And you mentioned a bunch of different things that, you know, distinguish these characteristics.
So let’s talk about Ampere and Volta for now, staying in the data center.
So, like, what are the big differences between these architectures?
So, the big difference between Ampere and Volta, the one that’s probably most important for us, is that Ampere has introduced a couple of new data types for matrix computations.
One is BFloat16, the data type that has been used for a long time on TPUs.
That also occupies 16 bits in memory, but doesn’t suffer from the same problem as the older low-precision NVIDIA type, FP16, suffered from.
Because FP16 has a very small dynamic range and is prone to underflowing and overflowing.
So a lot of numerical tricks have to be applied to avoid this.
BFloat16 has fewer mantissa bits, but as many exponent bits as the regular FP32 type.
So, if you are not overflowing or underflowing with FP32, then BFloat16 will probably provide you with more stable numerical characteristics.
And the second type that I mentioned, TF32, TensorFloat32, is a weird thing that’s aimed at DL practitioners who can get the speedups right out of the box.
So, NVIDIA’s claim here is that if you are using TF32, then you don’t really have to do anything to your existing FP32 program.
If it works with FP32, it’s supposed to work with TF32, except it will be much faster.
And the reason it will be faster is that when the GPU is doing matrix multiplication, instead of reading all your mantissa bits, it will read just a few of them and perform lower-precision matrix multiplication.
That will be much faster, but you will still be left with your 32-bit container.
You will still be left with all your dynamic range, and generally, you’ll get the same results faster.
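As a rough reference for the formats being compared, and for the PyTorch-side TF32 switches (whose defaults have changed across releases, so treat this as a sketch rather than a recommendation):

import torch

# Approximate bit layouts (sign / exponent / mantissa):
#   FP32: 1 / 8 / 23      BF16: 1 / 8 / 7   (FP32's exponent range, fewer mantissa bits)
#   FP16: 1 / 5 / 10      TF32: 1 / 8 / 10  (32-bit container, reduced-precision matmul)
print(torch.finfo(torch.float16).max)    # ~65504: small dynamic range, easy to overflow
print(torch.finfo(torch.bfloat16).max)   # ~3.4e38: roughly the same range as FP32

# On Ampere, TF32 matmuls and cuDNN convolutions are gated by these flags:
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True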
So, if I don’t use any of these new features, and I upgrade from a V100 Volta to a A100 Ampere, do I expect my code to run faster?
Yes, you would expect your code to run faster because the peak performance for A100 is noticeably better than the peak performance of V100.
And that is both for bandwidth-bound code, because memory bandwidth for A100 is higher, and for compute-bound code, because peak compute performance for A100 is always higher.
And you should be using at least one of the low-precision data types if you are running on V100 and A100, because that is their main claim to fame.
If you are just doing your plain FP32 computations, you are throwing away a lot of compute power that V100 and A100 allow you.
All right, so that’s cool.
So, here’s a question for those of us who don’t work at Facebook.
If I wanted to play around with an A100 or V100, is there any way I could actually get my hands on these cards without having to buy it?
So, one thing is, you generally should not be buying data center cards, because they don’t have active cooling.
You cannot put them in your desktop rig, even if it’s a very good rig.
So, don’t buy them, please.
You won’t have any use for them at home.
But if you want to play around with them, then AWS has instances for both V100 and A100.
They are not the cheapest, but probably you can find someone who would help you with this.
Or, if you want to just play with the consumer equivalents of those cards, then, yes, you could buy 30 series of the GPUs.
Hopefully, you can buy them by now, because a few months ago, it was a big quest.
They were sold out the moment they appeared, and it was incredibly hard to buy them.
And, unfortunately, I don’t know the exact situation on the ground now, but hopefully it’s better.
Just to confirm, so, what we’ve been talking about, the A100 and V100s, those are the data center GPUs.
But, like, when I talk to, like, gamers or, like, you know, machine learning enthusiasts who just have a few GPUs in their basement,
they’re going to be buying different things.
They’ll still be Ampere and Volta, is that right?
They’ll still be Ampere and Turing.
Volta didn’t really have a consumer-grade card.
They had something, but it’s hard to find and not necessary.
Okay, so I want to change the topic a little.
So, when I look at the PyTorch codebase, typically I don’t see any references to Ampere and Volta specifically,
except maybe in comments here or there.
Instead, I see a lot of references to SMs.
So, like, for example, when we build PyTorch, we can specify what set of architectures we want to build for via, like, the TORCH_CUDA_ARCH_LIST environment variable.
And usually I have to list a bunch of, like, these SM51, blah, blah, blah, blah, things like that.
SM is the core part of a CUDA architecture.
So, GPU is a massively parallel processor.
And to enact this parallelism, each GPU consists of a few SMs.
For the first GPU generations, the number of SMs was on the order of 10, say 10 to 20.
For the recent generations that we were talking about, it’s closer to 100 SMs.
And each SM, in turn, is handling about 2,000 threads.
So, you can see the level of parallel execution that’s going on on the GPU.
And SM pretty much has everything that is needed for the GPU to process the data very fast.
It has a few compute cores that would be doing your integer low precision of floating-point computations.
It has on-chip memory that can be used to very quickly access and write some intermediate results.
And, of course, it has the schedulers that would tell the threads when it’s time to go execute something and when it’s time to wait.
So, if I actually want to talk about the nuts and bolts about, like, you know, what actually we’re targeting, I don’t talk about Ampere or Volta.
I just talk about, you know, what architecture these actual SM units support in the chip.
Is that right?
Yeah, but there is also a more or less one-to-one mapping between those SM61 or 70 or 75 architectures that you specify when you are compiling the code and the more human-readable Volta, Turing, Ampere names that we are talking about here.
What’s your favorite way to remember what the correspondence here is?
I don’t, unfortunately, have a favorite way.
It’s just, you know, when you see it enough times, then you remember.
Or look it up on Google, I suppose.
Yeah, and on the wiki pages, there is also usually something.
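For reference, here is a rough cheat sheet of that mapping and where the numbers show up in PyTorch; the exact list is worth double-checking against NVIDIA's documentation:

import torch

# Compute capability <-> architecture, with some representative cards in parentheses:
#   sm_60 / sm_61  Pascal  (P100, GTX 10xx)
#   sm_70          Volta   (V100)
#   sm_75          Turing  (T4, RTX 20xx)
#   sm_80 / sm_86  Ampere  (A100 is sm_80, RTX 30xx is sm_86)
if torch.cuda.is_available():
    print(torch.cuda.get_device_capability(0))   # e.g. (8, 0) on an A100

# When building PyTorch from source, the same numbers go into the
# TORCH_CUDA_ARCH_LIST environment variable, for example:
#   TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0;8.6" python setup.py develop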
So, there’s one more piece to the puzzle.
In talking about the hardware, like the actual, you know, silicon you get for any of these GPUs, but there’s also another part which is important, which is the CUDA version that, you know, you’re using to actually, you know, run the software stack on top of these GPUs.
How should I think about the relationship between CUDA versions and the various GPU hardware that I might be using?
Actually, it’s not just CUDA version.
There are two pieces of software that are required for you to use your GPUs for computation.
One is a CUDA-capable driver that comes with its own version and system.
And another one is a CUDA toolkit, which probably you refer to as a CUDA version.
And so, all these three components, that is, hardware, driver version, and CUDA version, have to be in sync for you to be able to use your GPU.
Exactly in sync?
No, not exactly in sync.
And this relationship has been relaxed recently, so it’s even easier to get to a working configuration than it used to be, say, a year or two ago.
But generally, if you have a card of some architecture, let’s say Ampere, there is the minimum version of the CUDA toolkit that you need to be able to compile code that will run on this architecture.
And for Ampere, that would be the A100 architecture, that would be the CUDA 11.0 release.
Then, your driver also should be at least the necessary version to run this hardware, and for Ampere cards, that again would be a driver corresponding to the 11.0 or 11.2 toolkit.
Now, I was saying that it doesn’t have to be exactly in sync.
Previously, your driver version had to be newer or the same as your toolkit.
Now, with enhanced driver compatibility, NVIDIA allows the older version of driver for the newer toolkit, just as long as the major version for both driver and toolkit match.
And there is also yet another way you can run your newer cards with older software; you still need the driver to be able to run the code.
But if the toolkit that you use to compile is too old and doesn’t support the newer hardware yet, you can still compile not to a binary, but to an intermediate thing called PTX.
And then this PTX would be compiled to a binary by the driver itself, and even if you have some pretty old code compiled with the old CUDA toolkit, you could rely on the driver to JIT compile it and be able to run it within your card.
This is, however, not recommended because, A, this JIT compilation will be pretty slow.
If you want to, for example, run PyTorch in this mode, you will have to wait for half an hour to an hour for all kernels to be compiled.
And then the performance will probably be pretty bad.
So you will be able to run, but you won’t be happy that you did.
Half an hour, that’s really long.
Yes, that’s really long, and we’re actually having big discussions whether we should disable this thing and error out altogether or whether we should allow people to do it.
And there are some companies that have a hard time redistributing newer versions of software that actually rely on this being able to run old software on the newer cards.
So that’s why we cannot disable it outright, even though, in most cases, I think people would be happier just erroring out and not seeing how long it takes to do something.
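Here is a quick sketch of checking those three pieces from Python; the nvidia-smi call is just one common way to read the driver version and assumes nvidia-smi is on your PATH:

import subprocess
import torch

print(torch.version.cuda)         # CUDA toolkit version PyTorch was compiled against
print(torch.cuda.is_available())  # False often points at a driver/toolkit mismatch
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))

# The driver version comes from the driver itself, e.g. via nvidia-smi:
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    print(out.stdout.strip())
except FileNotFoundError:
    print("nvidia-smi not found")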
All right, so changing topics again.
So we talked a bunch about A100s and V100s, the data center versions of the cards, because we work with, you know, a bunch of people who are running their deep learning models on big research clusters that, like, you know, have lots of GPUs of this kind.
But GPU usage outside in the wild is very wide and heterogeneous.
Are there other models that are worth knowing about in this market?
We already talked about the consumer-grade 30 series.
Anything else people should know about?
Well, if they cannot get the 30 series for some reason, or if they want something cheaper, then Turing cards, that is, the 20 series, are still an excellent thing.
And they are good for gaming, though not as good as the 30 series, as NVIDIA says.
But still, they were the first one to introduce the ray tracing technology, and they still provide a pretty good performance for the compute workloads.
I guess if someone wants to use GPU for their small projects, the biggest consideration is probably the amount of memory that the particular GPU has, because in most cases, your workloads would be limited by how much you can put on your GPU.
Just get some recent video card, or Ampere series, make sure that it has, I don’t know, 8 gigabytes memory, at least.
And at least for some small experiments, that should be enough.
Colab used to give you free Turing GPUs, but recently, all my Colab instances that I was able to get were just K80s, which is a Kepler GPU that’s pretty old; it was introduced in 2014, if I’m not mistaken.
Wow, that’s really old.
Yeah, that’s really old.
But I guess, as PyTorch developers, that means we do have to ship Kepler compatibility by default, don’t we?
Yes, I guess since both Colab and AWS still have those K80 instances, we do have to support them.
Okay, so that’s everything that I had on my topic list for today.
Natalia, are there any final closing thoughts you want to give us before we close out?
I do appreciate that we talked a lot about consumer-grade cards, because this is what most beginning researchers are working with, and that’s their introduction to CUDA.
And I’m very happy that when NVIDIA started CUDA, they made this decision that absolutely every GPU is going to support CUDA.
Not only, like, not only, like, higher-level models, but pretty much everything.
And fun fact, the first ImageNet competition that Alex Krizhevsky won with his AlexNet, it was trained on a couple of consumer-grade cards.
So it shows you that consumer-grade is all you need, basically.
That’s pretty cool.
All right, well, that’s all we had to say for today.
Talk to you all next time.
EP57 Torch-vs-ATen-APIs
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about the ATen and
the Torch APIs in the PyTorch library and how they affect how we think about API design as well as our
intermediate representations that we send to graph mode compilers. Now you may not realize it but
PyTorch actually has two APIs. The first API is the API that I’m going to call the Torch API and
it’s the one you know and love. It’s the Python API that you know you use when you interact with our
library as a normal PyTorch developer user. So it’s a documented Python API that uses all of the
regular idioms that you’d expect from Python. And in fact we have a limited amount of programmability
for this API as well via Torch function which if you don’t know what that is you should go
listen to my podcast about Torch function. But basically we can override the meanings of
these Python API functions including functions that are entirely written in Python that is to say
they’re just plain Python and they have a little check at the front that says, you know,
if any of the inputs are tensor subclasses then defer to them to figure out how to implement this
function. All of these things constitute what we call the PyTorch Torch API and it’s you can also
get a list of all of these methods via the torch.overrides module which gives you a bunch of overridable
functions and methods that you can actually change the behavior of when you subclass a tensor.
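As a toy sketch of this Torch-level programmability (not the full protocol, just its shape; exact spellings may differ slightly across PyTorch versions):

import torch
from torch.overrides import get_overridable_functions

# The overridable surface of the Torch API, keyed by namespace:
print(sum(len(v) for v in get_overridable_functions().values()))

# A minimal tensor subclass that intercepts Torch-API calls:
class NoisyTensor(torch.Tensor):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        print("torch-level call:", func.__name__)
        return super().__torch_function__(func, types, args, kwargs or {})

x = torch.ones(3).as_subclass(NoisyTensor)
y = torch.add(x, 1)   # prints "torch-level call: add"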
So this is the normal API that everyone knows and loves and you might be thinking hmm if I want to
work on PyTorch internals then clearly I’m going to expect to see a lot of functions that have that
reflect the Torch API. Well that’s not quite right. When you’re working inside PyTorch’s internals
you’re more likely to work with a different API that I’m going to call the ATen API. ATen in this case
stands for the A Tensor library, which is an internal C++ library that, you know, sort of PyTorch’s Python
frontend is built on top of. So the ATen API is a more limited API, in the sense that instead of supporting the
entirety of the Python language, where anything that is supported in Python is supported in the API,
the ATen API operates on a restricted set of types called the JIT schema. This restricted set of types
originated from the fact that we were working on a TorchScript compiler frontend and we didn’t want to
support every single type in Python. So the JIT schema says what types that the JIT API supports but
these types map both to Python as well as to C++ and they’re selected to be some limited subset that’s
tractable for us to map to all of these languages. So from the start, the functions that make up the
ATen API form a limited set, so you won’t see a function, for example, like map in the ATen API. Map in the
PyTorch API it’s a very obscure function but it takes a function a Python callable and runs it on
every element in your tensor slowly but you know it’s something you sometimes want to do. We can’t do
that in the ATen API because we don’t have a concept of a function that’s portable across languages. So
there are similar limitations like this. So the ATen API has a limited type system and in fact basically every
function that you can think of in the PyTorch API maps to one or more operations in the
ATen API. Sometimes this mapping is quite obscure for example prior to Joel Schlosser refactoring our
convolution implementation we had maybe 30 different internal ATen convolution operations whereas you know
in the public PyTorch API there was you know one or three depending on if you count conv 1d 2d and 3d
as being separate things. So the ATen API is exhaustively enumerated inside native_functions.yaml
and it’s not documented; like, if you squint most of these will be similar to the PyTorch Python API
but some of them will be different and you’ll sort of have to read the code to find out what the
difference is. But the upshot is that because the ATen API is what we actually operate on
in C++ most of our internal subsystems for example Autograd are written in terms of the ATen API.
So for example if you wanted to look up a derivative formula for something in PyTorch you wouldn’t find
a derivative formula for a function directly in the Torch API the Python API. Instead you would have to
find what ATen function it mapped to and then look up the derivative formula for that ATen function.
Hopefully pretty obvious most of the time, sometimes not so obvious. Now although we said that the
Python API is overridable via Torch Function, the ATen API is also overridable by tensor subclasses, but
you use a different API for doing this, namely Torch Dispatch, and Torch Dispatch sort of also interposes at
this lower level where all of the subsystems have already finished running. So it’s more appropriate for
that situation when, you know, you want PyTorch to have done most of the work and now you just want to
do a little bit of extra work in this case.
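As a toy sketch of what interposing at this lower level looks like; this uses the mode flavor of Torch Dispatch from the private torch.utils._python_dispatch module, so the exact spelling may vary across PyTorch versions:

import torch
from torch.utils._python_dispatch import TorchDispatchMode

class AtenLogger(TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        # By the time we get here, autograd etc. have already run, and `func`
        # is an ATen operator overload, e.g. torch.ops.aten.add.Tensor.
        print("aten-level call:", func)
        return func(*args, **(kwargs or {}))

x = torch.ones(3)
with AtenLogger():
    x + 1   # logs something like "aten-level call: aten.add.Tensor"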
Although the ATen API is primarily oriented at, you know, existing at the C++ level and being the, you know,
library implementation that the PyTorch Python API is implemented on top of, we also do expose ATen operations
directly via the torch.ops module. The torch.ops module essentially has a submodule for every
namespace of operators and the ATen operators are put in the aten namespace. So for example if you wanted to
call the native add you would say torch.ops.aten.add and that would go through a different code path than the
traditional PyTorch API. You usually don’t want to use this API directly. It’s mostly intended for people
who are, one, programming Torch Dispatch, where when you get called in Torch Dispatch you’re given one of
these torch.ops.aten functions to tell you, hey, you know, this is not a regular Python Torch API, this is an
ATen API; or, two, perhaps when you use our custom operator registration API, because torch.ops gives automatic Python
bindings, whereas most of the Python bindings in the traditional PyTorch API are generated ahead of time by our code generator.
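Concretely, here is the same addition through both surfaces; the aten entries are callable operator objects, though you normally would not call them by hand:

import torch

a, b = torch.ones(2), torch.ones(2)
print(torch.add(a, b))                  # Torch API: the public Python surface
print(torch.ops.aten.add(a, b))         # ATen operator exposed via torch.ops
print(torch.ops.aten.add.Tensor(a, b))  # ...or one specific overload of it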
So if you think about PyTorch as just an eager library it’s not too hard to understand torch versus
A10 so torch is the front end it’s the Python API and internally it backends to A10 which is a lower level
C++ API it’s a little more factored but it might have some more internal functions for various things we need
to do and depending on what level of interposition you want in PyTorch’s internals you might use Torch or
you might use A10. But there’s another way to think about these APIs and they are that way is to think of
them as intermediate representation dialects. When we have PyTorch you know eager mode matters a lot but
graph mode also matters and PyTorch also allows people to target you know take PyTorch programs turn them into
graphs of operations and then send them to various backends and now because we have these two APIs
you also have two ways you can end up with your IR: you can end up with an IR that targets the Torch API, or
you can end up with an IR that targets the ATen API, and depending on your trace acquisition mechanism
you’ll get one or the other. So how do you end up with the Torch API, aka the Python public API?
Well, if you use the Torch FX tracer: that tracer operates at the Python level, it actually doesn’t
even go and execute any of the internal operations, and it will collect up an FX graph that contains
all references to public Torch API functions. This is by the way one of the reasons why Fx is such a popular
graph representation for PyTorch. It’s because you know when you look at these graphs they look exactly like what you’d expect to
see you know based on what you know about PyTorch’s Python frontend. However there is a downside to this
because Fx tracing operates purely at the Python level without interacting with any of PyTorch’s internal
subsystems, there’s some basic functionality that you don’t get when you’re working with the FX tracer.
For example, if you want to take a graph and look at the backwards for it, there’s no easy way to do this
with the basic Fx tracer. And so there’s another tracer called the AOT autograd tracer which can take
a Fx graph and retrace it through the C++ implementation using Torch dispatch to get out a backwards graph. But
this backwards graph won’t be for the PyTorch Python API, it will instead be for the ATen API. And
it actually also uses FX, so FX IR can be thought of as a container format, which can have several dialects in it.
And so in this case, when you use AOT autograd, you get out an FX graph that contains ATen operations.
More concretely, when you look at the various, you know, function calls in the graph, instead of
being calls to torch.add and torch.sub, they’ll be calls to torch.ops.aten.add and torch.ops.aten.sub.
Actually, technically, you’ll also even know which overload you had.
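Here is a small sketch of the Torch-dialect side for contrast: symbolically tracing a function yields an FX graph whose nodes reference the public API, whereas an AOT-autograd-produced graph would instead reference torch.ops.aten overloads:

import torch
import torch.fx

def f(x, y):
    return torch.add(x, y).relu()

gm = torch.fx.symbolic_trace(f)
print(gm.graph)   # call_function nodes target torch.add (and the relu method);
                  # no real tensors were ever executed to build this graph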
This ATen-ops FX IR closely maps to TorchScript IR, which also operates on the level of ATen operations.
And then depending on your backend, you will have some backends that expect Fx IR in the torch form,
and some backends that expect FX IR in the ATen form. For example, if you have an FX graph mode pass,
for example, like the quantization pass, that pass is going to expect code that is
targeting the Torch IR. But if you have, for example, some pass that previously targeted TorchScript,
for example, nvFuser, it’ll be easier to get there using the ATen IR.
And so when we have these two IR dialects, we can start to think about, you know,
can we transition from one to the other? Now, clearly, we can go from Torch to ATen,
because that’s basically the process that happens when we execute our eager code. We take a bunch of
user calls to the Torch API, and then, you know, do some infrastructure to get it down into lower level
ATen calls. And we can trace through those using any trace mechanism that operates at the ATen operator
level, whether it’s torch dispatch tracing, or, you know, lazy tensor tracing, for example.
But what about going from ATen to Torch? Well, hypothetically, this should be possible because
the ATen API is a well-defined API, and the Torch API is a well-defined API. So you should be able to
implement the ATen API in terms of the Torch API. Unfortunately, no one has actually gone around
and done this, but we think it would be a useful capability and we want to add it to PyTorch at some
point in the near future. Another consideration is how dynamic or static the IR produced by the various
tracing mechanisms are. When you do FX tracing, the graphs you get are very, very symbolic.
For example, when you call dot size on a FX proxy, you don’t get back an actual tuple of numbers. You
get a symbolic proxy object that represents the sizes, but in fact, it’s just going to record your subsequent
uses of the sizes. And the fact that everything in FX tracing is symbolic is one of the reasons why
sometimes you can’t easily trace models, because they’re relying on actually knowing something
concrete, and FX isn’t willing to give that information to you. In contrast, essentially all APIs that go
through PyTorch’s internal subsystems in C++ all require very concrete values for all the sizes, strides,
dtypes that are involved. That’s because in our C++ implementation, we literally have, you know,
lists of int64s floating around, and how are you going to replace that with some sort of proxy object?
Nick Korovaiko is working on extending our internal representation to allow for symbolic integers
so that we can trace some level of dynamic shapes. This work is in early stages, but we’re hoping to get
it done this year. There’s one more teaser that I want to leave you with, which is that we are looking at adding
a third API. So you might be thinking, wow, why do you want so many APIs? So one reason is that the ATen API,
despite, you know, being lower level than the Torch API, is still essentially intended to be basically
the same thing as the Torch API. So for example, torch.ops.aten.add, that’s still a broadcasting, type-
promoting operation. And for some backends, that’s still a bit too implicit. You might want to have
your type promotion and broadcasting be explicit so that the backend can easily say, oh, I see this is a
non-broadcasting add or, oh, I see this is a broadcasting add. So the prim ops API is a concept
where we have an even smaller, even more simple layer of operations under ATen. Now, obviously decomposing
operations like add into their constituent type conversions and broadcasts is not good for eager
mode performance. So the prim ops formulation is not intended for regular usage in PyTorch,
but instead for use with compilers, which can recover performance, even if you’ve atomized
a, you know, point wise operation into a lot of itty bitty small parts. And also for symbolic analysis
applications where you would like to, you know, target a simpler, um, set of operations that is
more, um, that’s more factored, um, and easier to understand and then have it sort of, um, take your
complicated surface PyTorch program and de-sugar it into a bunch of small operations that are individually
easy to analyze. So that’s a lot of the stuff that’s going on right now. And that’s everything
I wanted to tell you about today.
Talk to you next time.
EP58 Python-exceptions
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about
exceptional handling in PyTorch, specifically how we handle the boundary between Python and C++
in PyTorch. So where to start? Well, let’s start off by talking a little bit about C++ exceptions.
Love them, hate them, they’re kind of a very interesting language feature. So C++ exceptions
are based off of the idea that, hey, we want a mechanism for doing error handling in the C++
language, which doesn’t cost anything when there is no exception. And as a result, exceptions have a
very interesting performance characteristic, which is that when your code goes well, the exception
handling logic doesn’t really cost you anything besides binary size. But when you do raise an
exception, then things go very, very slowly. There is a very slow stack unwinding process that uses some
look-aside tables to figure out how far you need to unwind; to actually use this table, you need to take
out a lock, it’s very, very slow. And because of this, and also because of the binary size bloat that’s
associated with exception handling, a lot of environments, e.g. mobile, don’t really want to
compile with exceptions turned on. And so, you know, you don’t really want to use exceptions most of the
time when you’re writing normal C++ code. But of course, there are some situations where exceptions
are appropriate. And I think PyTorch’s use of exceptions is quite appropriate. So PyTorch specifically
uses exceptions whenever there’s some sort of, I’d say user error. So you know, you add two tensors
together, but their shapes mismatch, we need to raise an error to the user, we do an exception in the
situation, it would be a really big pain to try to manually pipe back the error status through all of
our code in this sort of exceptional situation. Now, if you’re a Go language developer, that’s the sort
of thing that you’re used to doing, right? Like, hey, you know, explicit is better than implicit, but
these really are edge cases. And most of the time, you’re not going to hit them. And it wouldn’t be a good
thing in our code to actually have to explicitly deal with all the error handling all the time. And plus,
it wouldn’t look very Pythonic. And as I’ve mentioned in earlier podcasts, you know, we’re all about
writing C++ code, that looks a lot like the Python code you want to do. So these exceptions, they don’t
happen normally, please don’t write code that raises exceptions and expects to catch them, right? The
point of the exception is just so that we can bubble it up to Python, turn it into a regular old Python
exception. And you know, usually this will fail a user’s program. But if, you know, there’s something
that they actually want to do with the exception, like say, they’re in a REPL, and so you can just
bring back control to the user, well, we want to give the user the ability to do that in that situation.
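For example, a shape check that fails down in C++ surfaces as an ordinary Python exception that a user in a REPL can catch and recover from:

import torch

try:
    torch.ones(2, 3) + torch.ones(4, 5)   # shape mismatch detected in C++
except RuntimeError as e:
    print("caught:", e)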
This does sometimes cause some problems. So for example, we had a bunch of linear algebra operations
that when the matrices were ill conditioned, they raised an exception. And some people, you know,
caught those exceptions, because they knew that they could use some other algorithm in these situations.
And this was very, very slow. And we actually added extra API’s for getting back the error status in
those cases, as a Boolean, so not raising an exception in this case. So exceptions, therefore, are
really for exceptional things; don’t use them for, you know, things that you expect to happen when your
code is running normally. All right, so we’re using C++ exceptions to handle things inside, you know,
the bowels of the C++ in PyTorch. But remember that once we hit the Python/C++ language
boundary, we actually need them to be treated as Python exceptions. And now CPython, the Python
implementation that most people use, is not implemented in C++, it’s implemented in C. And as
such, it actually has no idea what is going on with C++ exceptions. So you actually have to do some
conversion. So the convention in Python for handling exceptions, and because it’s C, you do have to do it for
everything explicitly. And in CPython source code, it does handle everything explicitly, is you are obligated
to check the return types of all functions you call. And normally, these functions will return py object
pointers. But if an error was set, if some sort of Python exception was set, the object that will be
returned is, in fact, a null pointer. And there is some extra state, you know, off to the side, some
global state, which gets populated with the exception info in this situation. Global error reporting state is
very, you know, 90s errno-style reporting. But remember, Python has a global interpreter lock. So you’re not
really at risk of some other, you know, thread stomping over your exception state in this situation.
So if you return a null pointer, that means an error has happened. And there’s, you know, you’re supposed
to go ahead and propagate this null pointer up until some point where exception handling can actually
happen. So to interoperate between C++ exceptions and Python exceptions, it seems fairly simple, what we
need to do is we need to catch the C++ exception. Before we go to the Python boundary, then we need to go ahead
and, you know, take out this exception, look at it, convert it into a Python exception that we can also,
you know, save to the global state saying that there’s a Python exception. And then we just need to return
null pointer in that situation. Seems easy enough, right? Well, you have to actually remember to call the
macro that actually does this. So in a kind of poorly named set of macros, we probably should rename these
macros. They’re called HANDLE_TH_ERRORS and END_HANDLE_TH_ERRORS. So when you’re writing Python binding
code, you need to make sure that you, you know, start off with a HANDLE_TH_ERRORS, which will set up this
try-catch block, and then an END_HANDLE_TH_ERRORS, which will, you know, sort of handle the end of
the try-catch block, including catching the exception, turning it into a Python error, and then returning a
null pointer, so CPython knows what’s up. But wait, there’s more. So we also use pybind 11 to do some Python
binding inside of our source code. And pybind 11 has a different convention than CPython. CPython says
return a null pointer, and we’ll handle it. Pybind 11 says, oh, we’re a C++ library, we like exceptions,
too. And so in fact, pybind 11 knows how to deal with exceptions. And in fact, we install an exception
handler, thanks, Peter Bell, for adding this, which will know how to automatically convert exceptions
into the form that is expected by the CPython interpreter. So you don’t have to use HANDLE_TH_ERRORS
when you’re doing pybind bindings, question mark? Actually, the answer is no, you do, you still have to use
them. But that’s another story, which we will talk about in the second part of this podcast. But yeah,
so pybind 11 has a different convention. And if you’ve actually gone ahead and set the Python error state
already, there is a special exception in pybind 11 that says error already set. And that’s the one that
you can throw to have pybind 11 say, Oh, I see, you’ve already set the info. So I’m not going to do
anything, return a null pointer in that situation. Now, it’s not obvious which Python exceptions C++ exceptions should
map to. But we have a bunch of sort of pre-canned exceptions; they’re all defined in c10’s
Exception.h, like NotImplementedError, and similar things like TypeError. And so if you want
your C++ exception to turn into a particular Python error handling class, just make sure you use the
correct, you know, error class, or one of the other macros that also let you, you know, specialize what type
you get in that situation. All right, so if that was everything that HANDLE_TH_ERRORS did, I’d be done
with this podcast in eight minutes, but it’s not; there’s actually more. So, um, so exceptions are pretty
nice. And you know, we like using them a lot to handle error cases. And there’s something else that’s pretty
nice, which is warnings. We love warnings, uh, probably a little too much. We probably, uh,
PyTorch has just, you know, sort of grown warnings over time and like people have stopped reading them
and it’s bad and we should get the warnings to be less chatty. That’s a topic for another time.
So, uh, warnings are pretty useful because, hey, sometimes people are doing things that are kind of
bad and we don’t want to error on them, but we do want to let people know that, you know, something
bad is up. Like for example, using a function that we’ve deprecated and plan to remove in the future.
And a lot of this code only actually gets exercised in C++. So we want some way of reporting warnings.
Now it’s easy enough to, um, you know, write a C++ warning function that just prints some stuff out to
standard error, but similar to how exceptions have their own handling in Python, right? With the,
you know, good old fashioned Python exceptions, warnings also have handling in Python. There’s a
warnings module. There’s a concept of warnings filters and warnings handlers. And it would be
nice if the warnings raised by PyTorch interoperated with this framework. And they do.
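For instance, a C++-originated warning can be captured with the standard warnings machinery; torch.range is used here only because it has warned about its own deprecation for many releases, so substitute any other warning-producing call if it has finally been removed in your version:

import warnings
import torch

torch.set_warn_always(True)   # don't collapse TORCH_WARN_ONCE-style warnings
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    torch.range(0, 3)         # emits a deprecation warning from C++
print([str(w.message) for w in caught])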
So what we have is we have a way of mapping C++ warnings into Python warnings. So when you use the
TORCH_WARN macro, which is the way of, you know, basically raising a warning from C++ code, what it
will actually do is it will convert it into a Python warning and, you know, send it off so that you can,
for example, ignore it, uh, as, as these things typically do, um, when you are actually dealing
with it in your Python code. Now it used to be implemented such that, um, we would take out the
global interpreter lock because remember when we’re in C++ code, we’ve released the global interpreter
lock so that other threads can keep going. And, uh, so we would have to reacquire it and then, you know,
fiddle around with Python state to actually raise the warning, but this sometimes caused deadlocks.
So Albin, um, a few years ago submitted a patch to make this better. Um, and the idea is that,
well, there isn’t really any point in reporting the warnings to user land until we actually, you know,
get back to the Python interpreter. So we can basically defer all of the warnings we want to
raise until we, um, you know, go back to Python. In fact, the CPython API has a dedicated function
for doing this sort of thing. It basically adds a callback which, the next time the GIL is
acquired, uh, will be run. And this thing is protected by its own very tiny lock.
So you can take it out, uh, without fear of deadlocking on the GIL. But, um, we didn’t use that
for this particular mechanism. Instead, we have our own little buffer, um, that warnings get written into,
and then we have some way of propagating to Python when we return. And how does this work? Well,
we piggybacked on top of the existing handle th error macro. So how do you get some code to run when
you’re exiting a code block? Well, in C++, the way to do that is RAII. So you allocate an object on the
stack. And then when, you know, you’re exiting the scope by returning or by raising exception,
then the destructor for this object will get called all happy, right? Well, no. So I mentioned that, uh,
uh, when we have exceptions in C++, um, we turn them into Python exceptions. And so at the point in time,
when we’re handling the warnings, basically feeding them into the Python interpreter, uh, we might have an
active exception at this point in time. And now there’s a problem. When you print a warning to,
when you, when you put a warning into Python’s warning system, you actually might be running
arbitrary code. Why? Well, you need to actually construct the warning object. And there’s also
some handlers, which, you know, might actually just go ahead and process the warning immediately,
when you do it. And all of this code can raise errors. And so what do you do if you are raising
an exception and the unwinding code also tries to raise an exception at the same time? Well,
C++ has an answer for this. It’s, uh, you know, abort your program, terminate immediately, um, you know,
unceremoniously killing everything that’s going on. Well, that’s kind of bad. And we don’t really want
to do that, right? We want to make sure we always get to Python in this case. So if that happens,
then you have to basically not run the warning handlers, if there’s an exception being raised,
because, you know, you are not going to be able to deal with another exception being raised at that
point in time. And so the way we do this is just, if that happens, uh, we don’t actually give you the
warnings in Python. We’ll just print them to standard error in C++, and they vanish into the
ether. Well, you can still see them in standard error, but, um, they won’t be available in Python.
And that’s pretty reasonable because this only happens when you are raising an error anyway.
And remember those are exceptional situations. And so, you know, you really shouldn’t be doing that.
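When no exception is in flight, though, the deferred warnings do come back and behave like ordinary Python warnings, so the usual filtering machinery applies to them. A minimal, stdlib-only sketch (warnings.warn here just stands in for a warning that originated in a C++ kernel):

```python
import warnings

with warnings.catch_warnings():
    # Ignore UserWarnings inside this block, whether they were raised by Python
    # code or deferred from C++ and replayed when the GIL was reacquired.
    warnings.simplefilter("ignore", UserWarning)
    warnings.warn("pretend this came from a C++ kernel", UserWarning)  # silently dropped
```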
Well, there is one subtle point though, which is that, uh, remember how I said that, uh,
if you set a Python error, you know, the global, uh, flags inside, uh, um, Python, and then you return
null pointer, Python will know what to do with that. That technically worked, um, before, even if you were
using the HANDLE_TH_ERRORS macros, but now you’re not allowed to do that because, uh, if you, um,
if you are just returning a null pointer, then the warnings handler will run and it won’t know if there
is a Python error or not. And it might accidentally try to raise an error again. And that’s, that’s bad.
Okay. So that’s it for error handling. So if you don’t remember anything from this podcast,
remember to put your HANDLE_TH_ERRORS and END_HANDLE_TH_ERRORS around your bindings. Otherwise your
exceptions won’t work correctly, or use pybind11. But if you’re using pybind11, you still probably want
to use these macros, or the nifty wrapper function,
which I will post in the podcast liner notes if you need to look it up, which is just a nicer
way of doing the same thing without using macros. Make sure you do that because otherwise,
if you raise warnings in your code, um, those won’t work either. And yes, this is probably too hard to
remember and we probably should have a lint about this and we don’t really have a good linting framework.
That’s a good topic for another time. All right. That’s everything I wanted to say for today.
Uh, talk to you next time.
EP59 New-CI
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I have Eli Uriegas with me to talk about our new continuous integration system,
which we migrated over from CircleCI to GitHub Actions.
Eli, do you want to introduce yourself?
Hey, everybody. I’m Eli Uriegas.
I work on the PyTorch DevInfra team over here at Meta,
and I’ve been working on the team for probably about two years.
Excited about all the CI options that we’ve been able to provide
the PyTorch organization over the past couple of years.
All right, so let’s get started.
So the first question I have for you is,
so we used to have a CI system that was on CircleCI,
so why did we move to GitHub Actions?
Like, if it’s working, why break it?
Yeah, so that’s a great question.
So this project kind of started at mid-2020, I think.
We kind of identified, there’s a big cost motivation
in terms of moving from CircleCI to GitHub Actions.
One of the things that’s great about CircleCI is that it comes in,
it’s a fully featured kind of CI system,
but one of the negatives of that is that
there is kind of a cost implication that comes with that,
and kind of one that we foresaw growing
as the PyTorch organization grew.
But as well, outside of that, outside of the cost motivations,
we kind of felt that there was a platform out there,
like GitHub Actions, that provided a little bit more flexibility
and extensibility that CircleCI just didn’t have at the time.
And I think one of the biggest things for us,
GitHub provides, is the API and webhook architecture
that it has, allows us to do a lot of things with automation
that are just not possible with other CI vendors.
We talked about this with Jenkins,
we talked about this with CircleCI,
we had a lot of different vendors that we had talked to,
and outside of all of them,
GitHub Actions just hands down provided one of the best models
to be able to extend their product
and be able to provide automation and tooling
that no other provider had.
As well, one of the bigger things about GitHub Actions
is the shareability of Actions just in general.
We saw a great opportunity to be able to use off-the-shelf Actions
as well as be able to build our own Actions
and share them amongst different projects.
And we see that as a big opportunity for the future for ourselves.
If I recall correctly,
CircleCI did have this Orbs functionality for sharing Actions.
I guess it didn’t work too well for our use cases?
Yeah, I tried out the CircleCI Orbs feature.
I really wanted to be a big proponent
of what Orbs were trying to do.
Having that feature just seemed like a good idea.
But unfortunately, I just was never able to get it to work correctly.
I’m sure it’s a great feature for the people that use it,
but unfortunately for our use case,
it just didn’t provide the level of accessibility
that we needed in terms of our shareability.
All right.
So my familiarity with the CI system
is way back in the day
when we initially had it on Jenkins
and then when we ported it to CircleCI.
So I actually know very little
about how the GitHub Actions version of the CI system is set up.
Can you just briefly walk me through
what the major components of the CI is?
Yeah, so our CI is a little bit interesting.
If you worked with Jenkins,
it might be a little more familiar now
than it was with CircleCI,
at least for the infrastructure portion.
So right now, our CI is run on infrastructure
that runs on AWS.
It auto-scales to our needs.
So if a workflow is fired off,
it sends a request to our auto-scaling system,
which creates a node,
and then that node gets connected to GitHub,
and then their provisioner or whatever
actually schedules a job to be run on it.
So that part should feel familiar.
In terms of the way our CI is set up today
with workflows,
initially we used a Jinja templating type approach,
which is kind of similar to what we did
with the domains previously.
So if you are familiar with Torch Vision’s
CI configuration right now,
that’s kind of an approach that we took earlier.
The evolution of that is kind of related
to GitHub’s feature set for actions just in general.
So the reason why we chose Jinja templating
first and foremost
was an initial lack of reusable workflows.
So when we started the migration,
GitHub Actions lacked the idea of reusable workflows,
lacked the idea of being able to use
consolidated actions with regular actions,
meaning that when we tried to do the thing
about shareability,
we weren’t necessarily able to do that initially.
Also, there was initial lack of rerun workflow
from failed,
which is a feature from CircleCI
that a lot of people love to use.
And that led to the need
to generate multiple workflow files
just in case a singular workflow failed
and was flaky.
It didn’t affect all of the rest of the workflows
from actually passing and running
and reruns just in general.
So right now,
it should feel still a little bit familiar.
We tried to use a lot of the same scripts
that we had used before.
So everything in the .jenkins folder
is probably still used.
A lot of the scripts from the .circleci folder
are still used.
One of the things that we wanted to make sure
when we did this migration
was that we kept a lot of the things familiar
in terms of the test scripts.
We didn’t want to change those in particular
because we wanted to de-risk the migration
when we actually did it.
Well, we probably shouldn’t be storing
our build scripts in .jenkins anymore.
Yeah, that’s an item for us to fix later on.
It hasn’t been a higher priority for us.
We’ve been working on other features,
but it is actually one of the things
that we want to do work out later on.
And just in general,
we’re trying to make it so that our CI scripts
are not vendor specific.
Ideally, we’d want to move it to a .ci folder,
but that work hasn’t begun yet.
So I want to just make sure I understood correctly.
So basically, in the bad old days
when we were on CircleCI,
there wasn’t a way to create a parametrized job.
Actually, there was a way to create
a parametrized job, right?
But we needed to use templating
to basically instantiate all the different versions
that we wanted to do.
So we’re not using templating anymore?
We don’t need to do that?
Right.
So Michael Suo, who is an engineer
who helps us out with the PyTorch DevInfra work
every now and then,
has done a great effort
to consolidate a lot of our workflows.
So to give you more of a full story,
reusable workflows was introduced as a feature
a couple months ago,
and it’s finally matured to the point
where we can actually use it
inside of our CI workflows.
And so Michael Suo went through,
did a bunch of work to consolidate our workflows,
and now we’ve moved our pull request workflow
into a singular file,
our trunk workflow into a singular file,
our nightly workflow into a singular file.
So it’s easy to kind of know
where these things are coming from.
And the added benefit is that
if other people wanted to use the workflows
that we use on PyTorch PyTorch,
they actually can now.
Okay, so previously,
we generated a lot of little workflow files,
one for every job we were doing.
So these would be like things like,
you know, Xenial,
you know, Python 3.6,
CPU, something like that.
And so now they’re all put together
in one giant workflow.
Okay, well, that certainly reduces
the number of entries
in the GitHub CI status screen
that I have to scroll through.
It definitely does.
I think at one point,
we got all the way up to like 182 checks,
which is a little bit ridiculous.
And I think everybody kind of recognized
how ridiculous that was.
That’s a lot of checks.
One of the other pieces of infrastructure
that I noticed that changed is the HUD.
So I actually,
I wrote the first version of the HUD.
It was like this crappy React application
that we did in a few days.
And we have a new HUD.
What’s up with that?
So the new HUD is based around the idea
of being super fast,
having a data set
that we can make queries on really quickly.
It’s based on a technology called Rockset,
which we picked out.
The old HUD,
after you had got done with it,
was based on RDS,
which we identified as kind of
a slower version of Rockset,
so we moved on to Rockset.
Tell me more about what Rockset does.
So Rockset is just a database
where we can make queries on.
We send all of our webhook data
to Rockset
so that we have a set of GitHub data
that we can do a lot of different things on.
And so, yeah,
one of the cool things about HUD right now
is that it’s super fast.
It works really well.
Mini HUD is a new feature
that we’ve added
that allows you to be able to,
instead of having a timeline view,
you can have this mini view
that just tells you
what are the failing tests,
what are the failing jobs,
instead of having to go through a full view.
And we just wanted to make sure
that when we created the new HUD,
we wanted to improve
our error reporting experience.
We understood that
the GitHub Actions log view
wasn’t necessarily
the best way to view logs sometimes
because of how slow it loads.
So we moved logging into that,
into HUD,
and just made it a little bit easier
to kind of surface areas
in general,
or surface errors in general.
I heard we’re also using Rockset
to do some new features,
like apparently we can search
for flaky tests
in the corpus now?
Yes, actually.
So this is a really cool feature
that was recently added
by Carrie and Jane
on the PyTorch DevInfra team as well.
Basically, we have,
now have, thanks to Rockset,
a flaky test view.
So we can actually view
the history of a particular test
over the past 14 days,
and we can disable tests
from that view within HUD as well.
On top of that,
there’s a bot
that will actually go through
and automatically disable tests
if you reach a certain
flakiness threshold.
So we’re kind of doing
a lot of work
to kind of ensure that
PyTorch CI in general is green
without having to have
active thought be put into
what type of test to disable.
What happens to the tests
after they get disabled?
We’ll ping the on call,
and Jane did a lot of great work
putting together a list of POCs
for each of the test files.
So we’ll contact the POC
and the on call
for that particular test file
to notify them
that their test was disabled
and to hopefully,
ideally,
have them go through
and fix it.
Okay, so up until now,
we’ve spent a bunch of time
talking about
how the internal architecture
changed,
which it sounds to me like
it didn’t change too much,
but we now have Rockset,
which we’re using
to aggregate our data,
and we’ve also consolidated
our workflows
using GitHub’s actions.
So if I’m just a plain old
end user of the CI,
do I care?
Are there other things
that are nice about
being in this new CI universe?
I think one of the things
that is really nice
is just having
a singular view,
like having a singular place
that you go to
to make your PR,
have your PR tested,
and you don’t have to leave
GitHub in order to be able
to do your own work.
I think one of the biggest things
that I kind of disliked
about CircleCI
is that I had to click through
a bunch of different things
in order to be able to view
all of the CI
that I had
at a single point in time,
and having that all on GitHub
I think is a good experience
for all of the developers
out there
to be able to just have
a singular place to be at.
Now GitHub Actions
is a relatively new offering,
and I know that we went
through some growing pains
where they didn’t support
various features
like parameterized workflows.
Is there anything else
that we’re still waiting for
from GitHub?
I think a lot of the features
that we’ve asked for
have been completed and done.
One of the bigger features
that we’re asking for
from GitHub right now
is kind of having a view
of what our self-hosted runners
are doing at that moment.
That would be a really great feature
for us on the infrastructure side
so that we can understand
what our runners
are being used for
at what times,
what percentages,
to be able to make
better decisions on
maybe we need to increase
the amount of Linux runners
that we have.
Maybe we need to increase
the amount of Windows runners
that we have.
Maybe there’s a workflow
that is particularly greedy
that we need to disable,
and we don’t have
that data right now.
And having a webhook event
that would provide
that data for us
I think would be amazing.
I mean, but to speak about it,
our GitHub partnership
has actually been
very, very good.
We have regular meetings
with the GitHub team,
the GitHub Actions team,
just to make sure that
we can provide,
we have a forum
to provide feedback,
and they’ve been
really, really good.
And they’ve implemented
a lot of the features
that we’ve asked for
in the past.
For example,
parameterized builds,
anything else?
The rerun workflow
from failed feature
was a feature gap
that was from CircleCI
to GitHub Actions
that we put as one
of our highest priority items.
This is actually
what enabled
the consolidation
of workflows
to actually work correctly.
So remember
when we had talked
about the Jinja templating
that we did,
the little workflow files
that we did,
and a lot of that
was due to
having this system
where we couldn’t
rerun workflow
from failed.
So that was a feature
that we’ve been asking for
for a long time
that GitHub was able
to provide for us
recently,
which was a really
awesome feature
to see go GA.
Okay,
so if I want to make
a change to
something in the CI,
like I want to add
a new configuration,
where should I look
in that situation?
So right now
you can look in,
I think it’s
.github/workflows/pull.yml
that should provide
a really good
baseline view
of what our
CI workflows
and how our
CI workflows work
just in general.
Michael Suo
has done a really
great job,
again,
I want to give him
a shout out
of making
reusable workflows
kind of at the
forefront of our
CI offering.
So in essence,
it really should be
as simple as
copy-pasting
one of the
workflows that are
already inside of there
and just kind of
molding that
into the thing
that you want it to be.
In the old system,
I remember I had to
make new Docker
images sometimes
for configurations.
Is that still
necessary?
Yeah,
that’s one of the
things that’s still
a necessary evil
of our CI system.
I call it an evil
because it’s
one of those things
that not a lot of
people understand,
but yeah,
it is still one of
the things that you
do have to do,
unfortunately.
So what’s next
for the CI system?
What should I be
hoping to see
in the future?
So there is
a big effort
going on.
One of the things
that we’re looking
for in H2 is
we have a big
project called
Project Nova,
and it’s going to
be around the
idea of
standardizing our
tooling with
reusable workflows
and consolidated
actions and
rolling out that
tooling to all
of our domain
libraries and
ecosystem libraries
as well.
As part of
Project Nova as
well, we’re going
to be giving
more and better
access to GPU
runners across
PyTorch projects.
Right now we
identified a need
for PyTorch projects
to have GPU
runners just in
general.
This is like a
baseline requirement
and we understand
that our
infrastructure right
now doesn’t
provide that
great of an
experience.
So what we’re
trying to do is
we’re trying to
make it a little
bit easier to
maintain the
infrastructure and
then trying to
make it so that
low traffic
repositories still
get the same
level of access
that PyTorch
PyTorch gets to
GPU runners.
So does that
mean that if I
want to spin up
a little project
that I don’t
want in PyTorch
PyTorch, I can
easily get CI for
that project now?
Yeah, that’s part
of the idea.
Yeah, Project
Nova is about
that.
One of the key
results that we
want to see out
of the project is
a bootstrapping
process of a new
project that is
less than a week
of engineering time.
That’s the goal.
We want to be
able to create a
runbook, we want
to be able to
create the tooling
and give the
access to the
infrastructure to
kind of make it
simple.
A lot of
researchers aren’t
super competent
when it comes to
doing CI work,
and we want them
to be able to
focus on their
core competency
while we focus
on our core
competency.
And that’s kind
of what the
idea of providing
this tooling is
about.
Okay, well,
thanks a lot for
joining me today,
Eli, and thank
you for being
here.
Thank you for
having me.
EP60 Dispatcher-questions-with-Sherlock
All right, hello everyone. Today I’m here with Sherlock Huang, who has newly joined here at Meta, and this is an interesting new format that I wanted to do.
Basically, Sherlock is going to ask me questions about things in PyTorch from a sort of newcomer’s eye.
Although, Sherlock, you’re not really a newbie because you’ve been working on ONNX Runtime for quite some time before coming over here.
And I’m going to answer them, and we’ll see how this goes.
And today’s episode also has a video component with it because there are some diagrams that we’re going to reference as we’re going.
All right, Sherlock, so do you want to get us started?
Yeah, thank you, Edward, for inviting me here.
Yeah, so today’s probably going to focus on the dispatcher component.
So as I was reading your blog, I came to this impression that originally the dispatcher was designed
to handle mostly just, you know, the device type and data type dispatching.
But over time, it grew into this big, magnetic, central place for many, many features.
So can you tell me a little bit about how we came to this stage, and a little bit of the history of the dispatcher?
Sure.
So your guess about where the dispatcher used to be implemented is right.
So in the beginning, well, in the way, way, way beginning, we had Torch.
It was for Lua Torch.
It was written entirely in C.
And essentially, all we had was we had, like, separate copy-pasted files, one for CPU float tensor, one for CPU double tensor, one for CUDA float tensor, one for CUDA double tensor.
And then there was just some bindings to the Lua programming language that actually, like, figured out where you would go so you didn’t have to, like, write individually which operation you wanted to do.
So this got ported to PyTorch.
And so the first version in PyTorch, there was some binding layer in Python that, once again, basically knew how to get to the right implementation.
And when Zachary DeVito rewrote our bindings so that we had a C++ library intermediating between Python and the Torch C libraries, before it was, like, directly to C and was very hard to understand.
The original thing that we needed was simply, yeah, to dispatch on the device type, and then to dispatch on the D type.
So there was a virtual method that we used to do the device dispatch, and there were a bunch of macros for basically letting you stamp out multiple copies of each implementation for D types.
So since then, we’ve added tons and tons of more features to this dispatcher.
And I do have a, there’s a more recent diagram talking about dispatch keys.
And the model we have now in C++ is that there is an order of various operations that we can do in the dispatcher.
And we want to, we basically run things in order depending on whether or not they’re applicable to some computation or not.
So like, in this example, Autograd is in red.
And that’s because Autograd is a very common layer people want to do.
So you hit Autograd, and then you do CPU.
So Autograd was like one of the first layers to get added afterwards.
And then all of these other ones sort of got added over time.
Did that answer the question?
I guess I didn’t answer this question you had over here, which is how to like, think about VMAP.
But at least Autograd, that’s where it lives.
It lives here.
By the way, Torch Dispatch, it’s like a back-end.
So there’s just a Torch, a Python key over here.
And that’s how Torch Dispatch gets handled in the back-end section of the dispatch.
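To make the Python key concrete, here is a minimal sketch of a wrapper tensor subclass whose __torch_dispatch__ sees every ATen call that reaches the Python key, logs it, and re-dispatches on the unwrapped tensor. It leans on private helpers (torch.Tensor._make_wrapper_subclass, torch.utils._pytree.tree_map) that the subclass zoo examples also use, so treat it as illustrative rather than a stable recipe:

```python
import torch
from torch.utils._pytree import tree_map

class LoggingTensor(torch.Tensor):
    # Wrapper subclass: holds a plain tensor in self.elem and logs every ATen op.
    @staticmethod
    def __new__(cls, elem):
        return torch.Tensor._make_wrapper_subclass(
            cls, elem.size(), strides=elem.stride(), dtype=elem.dtype,
            device=elem.device, requires_grad=elem.requires_grad)

    def __init__(self, elem):
        self.elem = elem

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}

        def unwrap(t):
            return t.elem if isinstance(t, LoggingTensor) else t

        def wrap(t):
            return LoggingTensor(t) if isinstance(t, torch.Tensor) else t

        print(f"dispatch: {func}")
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))
        return tree_map(wrap, out)

x = LoggingTensor(torch.randn(3))
y = x + 1  # prints something like: dispatch: aten.add.Tensor
```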
Yeah, that answers the question.
So it seems to me that the order of the dispatch key is extremely important.
So as we add so many features, how do we determine the order of the dispatch key?
Is there any principle behind deciding the order?
This is a great question.
So the order is indeed important.
And in fact, there is not a single well-defined order necessarily in all cases.
For example, Functorch, which is the new library for doing JAX-style transforms on PyTorch,
it provides two levels of functionality, VMAP and Autograd.
And you can actually have them ordered one way or the other.
And these correspond to different but both useful operations.
One of them computes per sample gradients, whereas the other one is like normal.
You have a batch computation inside of Autograd.
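As a concrete sketch of those two orderings (using the import path from the standalone functorch package of that era; newer releases expose the same transforms under torch.func):

```python
import torch
from functorch import grad, vmap  # newer PyTorch: from torch.func import grad, vmap

def loss(x):
    return x.sin().sum()

xs = torch.randn(8, 3)  # a batch of 8 samples

# vmap outside grad: one per-sample gradient for each row of xs
per_sample_grads = vmap(grad(loss))(xs)                    # shape (8, 3)

# grad outside vmap: the gradient of a computation that is batched inside
grad_of_batched = grad(lambda x: vmap(loss)(x).sum())(xs)  # shape (8, 3)
```

For this simple per-row loss the two results happen to coincide numerically, but they are genuinely different compositions of the transforms.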
So that’s kind of troublesome.
And some of the more recent work has been about getting us away from this fixed set.
But the order that is in C++, and this order is kind of important
because it’s the one we can efficiently implement.
This order is basically sort of worked out based on what the average use case in PyTorch is.
So for example, there’s a question about tracing versus Autograd.
So why is the tracer before or after Autograd?
Well, actually, the tracer key is this interesting legacy concept for TorchScript tracing.
And we’ve been talking about this new thing called AOT Autograd, which knows how to trace Autograd.
And that’s implemented using the Python key.
The Python key is after Autograd.
So indeed, you get the traced forward and backwards in this situation.
Another example is Autocast in Autograd.
Autocast is before Autograd.
Why is that the case?
Well, it’s because if you do a bunch of casting, you need to also know how to differentiate through
a cast to lower or higher precision.
So Autocast doesn’t handle that.
It just, you know, inserts the new operations and then Autograd handles it.
So there’s a lot of thinking about like what you want the semantics to be.
And the ordering of the dispatch keys is like our best guess about what you want in a situation.
And hopefully it is useful, but sometimes it’s not.
Yeah, this is great.
So I also heard about like in terms of dispatch key, there are two categories.
There are like backend related dispatch key and then there’s another feature keys.
And backend keys is always the end destination of the dispatching, right?
That’s right.
Well, almost because the Python key, which we treat as a backend, can in fact start executing
other PyTorch code, which will go through the dispatcher again.
It’s a sort of re-entrant mode of execution.
But most normal backend keys don’t do that.
They just actually do the compute.
So let’s say like user want to specifically override the dispatching order.
Is there any way that a user currently can do that?
Uh, that’s also a good question.
So in the C++ dispatcher, there’s a fixed order and that’s it.
You, you’re, you’re out of luck if you want to reorder things.
So how does Functorch do it?
How does Functorch?
Let’s see.
Do I have batch?
Yeah.
I had in, in, in this picture batched is, um, before autograd.
So that’s the order you get, um, when you use PyTorch only.
So how does Functorch let users reorder it?
So the basic idea is that, um, you have an inner tensor and you have
an outer tensor, and each of these tensors gets its own copy of the dispatch tree.
And then what you do is you say, okay, on the outer tensor, skip batched and go to autograd.
Cool.
You do your autograd stuff.
And then you go to the backend key.
It’s going to be the Python key.
Or, um, in Functorch’s case, there’s a special key that they’ve got for, like, going
back to the front.
And this goes to the inner tensor.
It goes back to the beginning.
And, uh, then you, this time hit batched.
And then finally you get to the true backend.
So you basically like, if, if it’s not in the right order, you just stack as many of these
as you need, um, sort of nulling out all the things you don’t care about.
And all of these, you know, layers are optional.
You don’t have to do them if the functionality isn’t relevant.
Yeah.
This is the same idea, a similar idea, as the tensor subclass in Python.
So when you wrap tensor over tensor over tensor, you automatically get, uh, multiple, uh, stacks
of this dispatch key and then you can compose them in any way that you wish.
Uh, that’s right.
So with tensor subclasses, you, so Functorch’s implementation is done in C++, but with tensor
subclasses, you can do it entirely in Python simply by having a tensor subclass that contains
another tensor on the inside.
And we actually have an example of how to do it this way in subclass zoo.
Um, it’s in the functorch.py file.
Yeah.
So I want to dive a little bit into this backend, um, select key.
So, um, in the diagram, it seems that, okay, we do a bitwise OR
of, um, all the, uh, hardware keys.
Uh, but in the end, the one that ends up being selected is the leftmost of the, uh, dispatch
keys, right?
Uh, but in reality, like, a single operation can only run on a single device.
So, uh, basically the multi-dispatch behavior doesn’t apply to the backend select.
Is that the right understanding?
Uh, sort of.
Okay.
So there are a few things going on here.
So, so this multi-select is for handling, uh, a multiple dispatch where you like have a CPU
tensor added to a CUDA tensor.
And, um, this figures out, Hey, you want to go to the CUDA key, not the CPU key.
In that case, backend select is for a different situation, which is when you don’t have any
dispatch keys in the inputs in question.
So backend select is used primarily for factory functions, which don’t take any tensors as inputs.
And because they don’t take tensors as inputs, there’s no tensor input dispatch keys to get
you to the CPU or CUDA factory function that you need.
Indeed, you would instead have to look at the device argument to figure out which one
you want, but we didn’t write any like special logic, which is like, Oh, if your argument is
a device, then I know how to extract a dispatch key from it.
And so we just said, well, we’ll just put you in this backend select kernel.
It will go and look at the arguments, figure out what to do, and then eventually take you
to the correct kernel.
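From the user’s side that just looks like the device keyword doing the routing, since there is no tensor argument whose dispatch keys could be consulted:

```python
import torch

# A factory function has no tensor inputs, so BackendSelect inspects the
# device argument and routes to the matching backend kernel.
x = torch.empty(4, device="cpu")
# y = torch.empty(4, device="cuda")  # would route to the CUDA factory kernel instead
print(x.device)  # cpu
```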
I see.
So for, let’s say there is a binary op and one input has a set of dispatch keys say,
batching and tracing, but on the other input, it can have another set of dispatch behavior.
So when we plug these two inputs into this binary op, so you’re saying that both, it will take
a union of this dispatch key and invoke every single feature that both of them have, right?
Yeah, that’s right.
And then the implementations of the feature, like batched or autograd would be responsible
for knowing how to deal with a tensor input that quote unquote, wasn’t batched or wasn’t autograded.
In JAX, we call this lifting.
You have to lift a tensor that doesn’t have that functionality into the world of autograd or batching.
So it seems to me that like, okay, the destination is always on this backend device selection part.
So has there been any consideration on, for example, breaking this one joint dispatcher into multiple ones,
especially, for example, for the backend ones, because it’s always the destination.
It doesn’t seem to be mixing with other orders.
So Brian Hirsh recently landed a PR that gives us a lot more dispatch key space.
And the way that he does it is by treating backend specially.
So they don’t, you can sort of have something that is both autocast and autograd and XLA,
but you can’t have something that’s both XLA and CUDA.
So he encodes those differently.
It’s still one 64-bit integer.
So we didn’t actually separate them.
And the dispatch table is still sort of set up as one table.
But morally, now we are treating backends differently than the layers in question.
I see.
Is there any other category concept in this dispatch key?
For example, we have this multiple autograd dispatch key.
We have this view and conjugate and negative view.
So it seems to me like the dispatch key is not completely flattened.
There are structure within it, but somehow it’s all just appears to be flattened.
So is there any consideration to put them in a more structural way?
Oh, that’s a good question.
So for the layers, I actually don’t think there is more structure, except in the sense that you might want to reorder them arbitrarily, which is what Functorch is about.
So my general way of thinking about every layer key, so backend keys are terminal, right?
They don’t ever call anything else.
Layer keys can call into other layer keys.
And we sort of normally go down the dispatch key chain as we get there.
So the way I think about a layer key is it’s basically a rewrite of some Torch API operation, some ATen operation, into some smaller ATen operations, right?
And you continuously, like, desugar the ATen ops into more and more ATen ops until you finally get to the backend, and then those are the actual operations that you need to run.
So in that sense, there isn’t really any meaningful grouping, right?
And any transformation from a single ATen op into several ATen ops is a valid transformation.
And we might group them up because some of the transformations do similar things, are implemented in a similar way, like conjugate and negative.
Those are very similar.
But, you know, like, you don’t need to bunch them up from the perspective of dispatch because they do want to be ordered in this way.
Because that tells you what order the transformations happen.
I guess there is an interesting point here to be made, which is whether or not sometimes transformations are commutative with each other.
But we don’t encode this logic in any way right now.
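As a toy illustration of that "desugar one ATen op into smaller ATen ops" idea, here is what a decomposition looks like in spirit (a sketch, not one of PyTorch's actual registered decompositions):

```python
import torch

def addmm_decomposition(bias, mat1, mat2, *, beta=1, alpha=1):
    # Rewrite a single op (addmm) in terms of two smaller ops (mm and add).
    return beta * bias + alpha * torch.mm(mat1, mat2)

bias, mat1, mat2 = torch.randn(2, 4), torch.randn(2, 3), torch.randn(3, 4)
print(torch.allclose(torch.addmm(bias, mat1, mat2),
                     addmm_decomposition(bias, mat1, mat2)))  # True
```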
So another mystery that seems to me is, like, all this new feature end up landed in the dispatcher.
So is it by design or is it just part of the constraint that we end up in the dispatcher?
So, like, if you look back into all this feature that was added to the dispatcher, is there any particular ones that could have lived outside the dispatcher or done in a different way that eventually somehow still got into the dispatcher?
Oh, that’s a good question.
So let’s be a little concrete for a moment.
So let’s say, for example, let’s talk about Autocast for a moment.
So Autocast was an interesting feature.
It wasn’t even developed by folks at Facebook.
Michael Carilli over at NVIDIA had implemented Autocast as some, I don’t even remember how he, I think he was, like, monkey patching the PyTorch source code and basically doing a lot of work to, like, basically automatically insert casts to lower precision, you know, without having to modify your program.
And so I think we were at the PyTorch dev day and I heard what Michael was doing and I was like, hey, you know, you don’t have to do it this way.
We’ve got this thing called the dispatcher.
And in particular, something you can do in the dispatcher is you can write what’s called a fallback kernel.
So this is a single polymorphic kernel that, like, will operate on all of your operators.
So you don’t, you can just write one of these fallback kernels.
Actually, I think, I think for Autocast, it’s a fall through kernel.
It just ignores the execution if you, you know, if it’s not an operator that knows how to do casting.
And other than that, it was just a very convenient way to insert functionality into PyTorch, interpose it, you know, without having to, like, do stuff like monkey patching.
Now, today in 2022, we’ve been adding a lot of new features that let stuff happen in Python user land.
For example, I’ve got a PR, a Torch function mode, which I’ll be landing soon, which basically lets you do this kind of interposition at the Python API level.
And this can’t happen in C++, because once you’ve got into C++, we have only the, like, narrow C++ set of types.
So all Python objects have gone away.
We’ve translated them into C++.
But sometimes you want to, like, do some operations in Python.
And that’s why Torch function modes are kind of like dispatcher layers, except they’re living at the Python level.
And then you probably could have implemented autocast in Python, and you would only do it in C++ because you had a speed concern, or you needed to work with the C++ front end, or something like that.
So, yeah, it’s a really good question.
And so I guess historically the answer is there wasn’t a good way to do things other than in the dispatcher.
But we are now adding more hooks, like Python Dispatch and, like, Torch function mode, which lets you do these in user land.
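The Torch function mode API mentioned above has since landed; in later releases it is exposed as torch.overrides.TorchFunctionMode. Assuming that name, a minimal interposition sketch at the Python API level looks like this:

```python
import torch
from torch.overrides import TorchFunctionMode

class LogAPICalls(TorchFunctionMode):
    # Every torch.* API call inside the `with` block passes through here,
    # while the arguments are still ordinary Python objects.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"calling {func.__name__}")
        return func(*args, **kwargs)

a, b = torch.ones(2), torch.ones(2)
with LogAPICalls():
    torch.add(a, b)  # prints: calling add
```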
Yeah, yeah, thank you, Edward.
That’s all the questions I have for today.
Okay, thank you, Sherlock, for asking some great questions.
Talk to you next time.
Thank you. Bye-bye.
EP61 AOTAutograd
Hello everyone and welcome to the PyTorch Dev Podcast. Today I have Horace He with me, who is
going to come and talk about AOT Autograd, a system that is in Functorch which lets you capture both
the forward and backward traces of PyTorch operations. And then you can send them to a
compiler and then get back a compiled kernel and then stick them back in your PyTorch program just
like any other old function that’s available. So Horace, can you tell me a little more about what
AOT Autograd is? Yeah, so AOT Autograd is, kind of as you said, essentially a
compiler integration point for PyTorch, yet another one. So kind of the main premise behind AOT Autograd
is that we want something that makes integrating compilers into PyTorch training easy. So we have
other APIs like Torch.fx or TorchScript but one of the things that makes
integrating against training difficult is that they kind of have these very specific APIs in the case
of TorchScript or they just don’t support Autograd in the case of FX. So the premise behind AOT Autograd is
that we want to provide an integration point that makes integrating compilers into training seamless
and basically as easy as integrating compilers for the purposes of inference.
And the fundamental reason why this should be doable right is that the operations during training
are just tensor operations and these are exactly the same tensor operations that occur during inference.
The shape may be a little bit different but the actual operation should be the same.
So as long as we can represent the operations that occur during the backwards pass, we should be able to compile them just like we compile inference.
So to actually achieve this, AOT Autograd actually has to do a bunch of things,
which I guess normally we would think of as separable components but AOT Autograd just
puts them all together in one package. So for example, you mentioned the other tracing mechanisms
like TorchScript or FX. So AOT Autograd does come with a tracer. Is that right?
Correct. So one of the aspects of AOT Autograd is it uses this tracer built on top of a new
mechanism called Torch Dispatch. And Torch Dispatch is kind of this new extensibility point that
probably could do with its own podcast episode. But kind of what Torch Dispatch does is it’s a
multiple dispatch integration point that sits below the dispatcher. So unlike something like
Torch.fx that sits at the Python level or something like jit.trace which sits above Autograd,
Torch Dispatch sits below all of that and therefore allows you to capture Autograd.
So that’s kind of the tracing mechanism that AOT Autograd leverages to capture the
forwards pass and backwards pass. That’s right. So we are basically able to run both the forwards
and backwards and trace all of that, including the code that normally is running in C++. But once you’re
done tracing all of that, there’s still more stuff AOT Autograd does, is that right?
That’s correct. So one of the tricky things about, so one of the things that AOT Autograd,
or like one of the premises here, is that tracing is fundamentally a pretty good way of capturing
machine learning models. And the reason for this is that tracing, most users code ends up being fairly
lacking in dynamic control flow and things like that that make tracing difficult. And a lot of what you
need to do with tracing, a lot of what tracing is able to do is it’s able to eliminate the Python data
structures or like weird ways that users write code, like they might use lambdas, they might use other
data structures, and it captures that. So that’s kind of what tracing is good for. But the problem is
that oftentimes, there’s a lot of things that break tracing. So for example, users might want to log
their tensors, they might want to branch on user input, they might want to branch on the loss and
do different things. And so what we want to do with AOT Autograd is we want to allow you to apply
a compiler to an arbitrarily small subsection of your program. And this is not, like this is not
naturally fitting into the tracing paradigm. Because like, if you capture a sub part of your model,
the forwards pass and backwards pass do not actually run at the same time. So it’s not like there’s like
a single function that you can trace. And so what we need to do is we need to capture the forwards
and backwards pass simultaneously by pretending it’s a single function. And then we need to do something
else to be able to split the forwards and backwards pass into two separate graphs that we then run at
different times.
So just to emphasize here, normally, you think of tracing the entire model. But with AOT Autograd,
you’re just tracing a little piece of it, or the entire thing, if you can manage it. But more frequently,
it’s just going to be a little fragment of it. And that bit needs to be its own microcosm.
Getting it’s forward and backwards. And then I guess AOT Autograd, the name Autograd in it comes
because, in fact, the main thing it does is it creates a custom Autograd function that wraps up
the forward and backward that can interoperate with the rest of your ego code.
That’s correct. Yeah.
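Here is roughly what that looks like from the user’s side, with a "compiler" that just prints the FX graph it is handed. This is a sketch: the import path comes from the standalone functorch package of that era, and the exact signature may differ across releases:

```python
import torch
from functorch.compile import aot_function

def print_graph(fx_module, example_inputs):
    # A compiler here is anything that takes an FX GraphModule plus example
    # inputs and returns a callable; printing the generated code is enough
    # to see the captured trace.
    print(fx_module.code)
    return fx_module

def f(x):
    return torch.sin(x).cos()

compiled_f = aot_function(f, fw_compiler=print_graph, bw_compiler=print_graph)
x = torch.randn(4, requires_grad=True)
compiled_f(x).sum().backward()  # prints the forward graph, then the backward graph
```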
You mentioned this cut thing. What’s that?
So once you’ve traced your joined forwards and backwards graph, you now need to convert this
like single joint graph into two graphs, like one that runs in your forwards pass and one that runs
in your backwards pass. And it might be clear that actually there’s actually some leeway in how you’re
willing to do this. So there’s some strict dependencies such as operations that need to be in the forwards
pass or that need to be in the backwards pass. But there’s other operations where you have a choice of
of whether you want to put it in your forwards pass or backwards pass. And so this choice actually
ends up mattering in certain cases. So you might imagine that if you
if you put an operation in the forwards pass or backwards pass, this might expose more fusion
opportunities or other things like that. And so one of the things we’ve kind of figured out is that
one of the most important optimizations you can do here is something called rematerialization,
also often known as gradient checkpointing.
So we’ve kind of come up with an approach that minimizes the memory transfer
between your forwards pass and backwards pass. And this is kind of done using a mincut algorithm.
And kind of one of the neat things about this
approach is that in combination with a fusing compiler, this allows us to improve both the
runtime as well as the memory usage of the function.
So if I understand you correctly, what happens is we trace out the forwards, we trace out the
backwards. Actually, when I run traditional PyTorch Autograd, there is a choice made, right,
about what I compute in the forwards and what I compute in the backwards. But that choice is fixed.
It’s whatever my derivative formulas were implemented. But then with AOT Autograd,
once I’ve got this trace, I’ve got both the backwards and the forwards, I can, I can basically
renegotiate the boundary in that case. Is there a really good example of some place where this is
really profitable? Yeah. So I think a pretty natural example is let’s imagine you have a sequence of
operations like a cosine. So you’re just calling, you know, cosine on a tensor, you know, five or 10 times.
So if you think about the autograd formula for cosine, it requires saving the, saving like the
output tensor of your cosine operation. And so if you call cosine 10 times and you’re going to save
the output tensor, you’re going to save 10 different output tensors, right? Because like, you know,
the way autograd works is you apply the autograd formula to each operation individually. And then you,
you know, like multiply them together and apply the chain rule. So if you just run PyTorch Eager autograd,
you’ll end up saving 10 different tensors between the forwards pass and the backwards pass.
But often what you should instead do is just not save anything in your
forwards pass. And so therefore, like, you just get a straight-line graph in your forwards pass that
doesn’t save anything. And then in your backwards pass, you should simply recompute your forwards pass in your
backwards pass. And so this allows you to reduce: instead of saving 10 tensors for your forwards pass,
you only save one. And instead of reading 10 tensors for your backwards pass,
you only read two tensors. And because these cosine operations are what’s known as bandwidth-bound
operations, fusers can fuse them and make it so you’re kind of dominated by the memory you’re writing
and not the actual computation. So because we’re reducing the memory usage or like the memory reads and
writes, we can actually improve both the runtime as well as the memory usage.
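A rough eager-mode analogue of that trade-off is gradient checkpointing with torch.utils.checkpoint, which saves only the input and recomputes the chain in backward; AOT Autograd's min-cut pass makes the same kind of choice automatically inside the traced graph. A sketch:

```python
import torch
from torch.utils.checkpoint import checkpoint

def ten_cosines(x):
    for _ in range(10):
        x = torch.cos(x)
    return x

x = torch.randn(1024, requires_grad=True)

# Plain eager autograd: each cos saves a tensor for its backward formula.
y = ten_cosines(x)

# Rematerialization: save (roughly) just the input, redo the cosines in backward.
z = checkpoint(ten_cosines, x)
```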
And AOT Autograd does this today?
Yeah, that’s actually pretty surprising to me, because I remember when I first read about the
mincut algorithm, I just imagined this was, you know, sort of changing what tensors we save,
like, there would be some intermediates, and we would choose to save some, but not the other.
And then you just move the boundary around. But fundamentally, the computation wouldn’t be
changed. So I guess that’s not exactly what you’re doing here, right? There’s also rematerialization
going on. How does the algorithm figure out if something should be rematerialized?
Right. So I think the the way I think of this, the way I think of this is that basically,
the value we really care about is the gradient of the input, right? So like you’re the only reason
you’re doing your force pass is to compute the gradient. So perhaps like the right way to think
about this is that give it, let’s say, like, you know, you’re given both the inputs to your force
pass and the inputs to your backwards pass. And you’re allowed to save any arbitrary values,
such that computing, like your the gradient input is the easiest. So for example, you can sit,
you can, you know, compute it from your. Let me just cut in for a moment. You said gradient inputs
twice, but actually, you’re given the inputs and the grad outputs. And we want to compute the grad inputs.
All right. Yeah, sorry. That’s correct. And so like one strategy you might do, right,
is you might take both, like you might compute your the gradient of your inputs by taking the input to
your forward pass, as well as your grad outputs. And so this corresponds to basically rematerial or like
recomputing the entirety of your forwards pass during your backwards pass. But you can imagine that other
strategies, for example, perhaps you’re doing a matmul on your forwards pass, you might want to
start compute, you might want to start computing later down in your graph, and you know, skip the
extra matmul during your backwards pass. So like this is kind of, you’re doing like a min cut,
not exactly to partition the two graphs, but kind of to decide what computation you’re going to perform
in your backwards pass. In other words, the forwards computation actually always is the
same no matter what. So we’re not really partitioning the graph in that sense.
Or no, so the forwards pass, the way I think of it, is that it’s implicitly defined
by what you choose to save for your backwards pass. And because, like, the thing we’re trying to
minimize here is memory bandwidth costs. And each input you need to save in your forwards
pass corresponds to one input you need to read in your backwards pass. So luckily,
minimizing memory bandwidth actually ends up being symmetric for both your forwards pass and your
backwards pass. So that’s kind of like one non obvious thing that makes this easier.
Uh, so I misunderstood. So, so we are going to change what the forward passes, but we are going
to sort of move stuff over into the backwards pass if we think it will be profitable.
That’s correct. Yeah.
Okay. Um, so we’ve talked about how AOT Autograd, you know, traces through our code,
gets the backward traces. And then we’d also talked about how we split them up and then eventually put them
in a Autograd function so that they work with eager mode. But of course we need to actually run a
compiler on these traces to, um, you know, do something useful with them. So can you tell me
a little bit more about, um, you know, how, how AOT Autograd works with compilers?
Right. So, um, so what AOT Autograd does, right? So we trace things out and, uh, we actually trace
things out into this kind of, you know, like standard, uh, PyTorch graph format, uh, called like, uh,
called torch.fx. Um, and then we simply take this fx graph and we pass it to a, uh, arbitrary compiler.
Uh, so for example, one thing we might do is we might take this fx graph, we might torch script,
and then we might pass it to like a torch script, uh, fuser such as, um, um, mvfuser or NNC. Um,
so one of the like complexities here, uh, so, you know, like the kind of pitch here, right, is that
if you have a compiler that works for, uh, inference, um, you, this like AOT Autograd allows you to also
apply that compiler, uh, in your backwards pass. And one of the, uh, that’s right. Because we don’t
actually, um, pass it, we don’t expect the compiler to do differentiation. We just pass it the forward
and the backward separately. Right. Um, but like one of the, uh, one, so that’s kind of the pitch,
but one of the areas where that’s not quite true, right, is that there are certain operations, uh,
that only occur in your backwards pass, uh, that never occur in your forwards pass. So for example,
uh, PyTorch has operations like tanh_backward, um, that, uh, you know, are used to
compute, like, you know, the backwards formula for tanh, but oftentimes compilers won’t have support for this
operator because, uh, it never shows up in your forwards pass. Um, so, uh, one of the things we
kind of do as part of AOT Autograd is we’ve written a bunch of these, uh, decompositions, uh, to basically
rewrite these, uh, operators, like, you know, tanh_backward, in terms of other, uh, more common operators, uh, that compilers
can fuse. So now to sum it up, AOT Autograd is a tracing mechanism. It’s a min cut, uh, algorithm,
and it’s also a number of decompositions. That’s a lot of things packed into one box.
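For instance, the backward-only op tanh_backward can be rewritten in terms of ordinary ops on the saved output y = tanh(x). A sketch of the idea, not the exact registered decomposition:

```python
import torch

def tanh_backward_decomposition(grad_output, output):
    # d/dx tanh(x) = 1 - tanh(x)^2, expressed with ops any compiler already handles
    return grad_output * (1 - output * output)

x = torch.randn(5, requires_grad=True)
y = torch.tanh(x)
g = torch.randn_like(y)
auto, = torch.autograd.grad(y, x, g)
print(torch.allclose(auto, tanh_backward_decomposition(g, y)))  # True
```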
Right. So, I mean, like, one way to, you know, view AOT Autograd is like, it’s this specific kind
of product that, you know, we’re providing to try to improve performance. But I think another
way of viewing AOT Autograd is it’s kind of a, uh, meant to be a, like, extensibility point for PyTorch.
So there’s like many things that are, uh, there’s many things in PyTorch that like are hard to do
because we’re in eager mode. And AOT Autograd is basically like an easy way of, you know, getting
a graph, uh, for your forwards pass and your backwards pass that allows you to do arbitrary
things, you know, like rewrite them or, you know, reinterpret them in other ways. Um,
yeah. So for example, you know, like, it’s really easy to, like, replace, uh, this,
like, this min cut rematerialization approach with a different algorithm. Uh, so for example,
you might want to like save more memory at the cost of doing more compute or things like that.
And that’s kind of like one of the things that we’re trying to support.
So that’s a lot of cool stuff. And if I understand correctly, there’s a lot of things that, um,
we also want to do with AOT Autograd in the future. Can you tell me about some of them?
Um, right. So one of the, like the most, uh, I guess obvious things, right, is like,
we’ve kind of implemented a couple of, uh, optimizations such as rematerialization,
um, as well as, you know, hooking it up with the operator fusers. Um, but you know,
there’s way more optimizations, uh, like still left that we haven’t really even like
touched the surface on. So for example, uh, one of the things that you might want to do is kind
of layout planning. Um, like you, you might want to, you know, change the layout of your operators
so that, like, you know, the matmul or the conv has, like, a more favorable, uh, performance. Um,
and kind of what, uh, or, you know, other things you might do, you want to do like memory planning.
And one of the kind of interesting things here about AOT Autograd is that, uh,
uh, the setting that we’re operating in is kind of often fundamentally different from what, uh,
many, like what, uh, a lot of the people in, you know, the literature or have kind of, you know,
historically looked at, um, in that a lot of people usually kind of assume that they just
get like the entire graph, um, in a single, uh, or like they get the entire model, you know,
forwards and backwards in a single graph. Uh, but kind of what we believe here, uh,
you know, like what we kind of believe in PyTorch, right, is that, like,
this is not really true. And a lot of times our users want, you know, the flexibility of PyTorch and
they want control flow and things like that. Um, so a lot of, you know, things like layout planning or,
um, like, you know, memory planning become trickier, uh, when it comes to like operating in this setting.
So that’s kind of one of the things we’re thinking about. Um, another one of the things we’re thinking
about is that, uh, in some sense, like AOT Autograd, uh, is pretty inspired by, you know,
JAX’s, uh, like JIT and, you know, how it’s composable, uh, with, uh, JAX. So, you know,
you can apply JAX.JIT in an arbitrary location and, you know, it composes with Autograd or it composes
with VMAP and things like that. And, uh, you know, currently we also have things like VMAP and, uh,
uh, AOT Autograd, unfortunately, currently does not compose with those. So, um, although you can
use, uh, AOT Autograd to compile, uh, you know, things like VMAP and, you know, we’ve used that for,
you know, compiling things like Jacobians or Hessians, um, it does not allow you to compose in the other
direction. So you can do, uh, AOT Autograd of VMAP, but you can’t do VMAP over AOT Autograd. Uh,
so, you know, kind of figuring out composability, uh, in that manner is kind of one of the other
things we’re currently thinking about. The way I think about this problem of, um, you know,
running a VMAP over an AOT Autograd is, uh, in some sense, it’s not AOT Autograd anymore,
but it’s AOT everything that PyTorch supports. So, you know, that includes Autograd, which we do support
right now, but it also includes batching and functionalization and all of the other fun
user transforms going on. Horace, do you agree with this point of view? Uh, yeah. So AOT Autograd is
kind of like the current name, but, uh, in the future, yeah, perhaps they’ll need to be named
something more generic. All right. Well, um, that’s it for our time today. Uh, thank you very much for
joining. Uh, thanks for having me on. Cheers.
EP62 Strides
Hello everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about strides
in PyTorch. This topic I have blogged about before, and I’ve also written a little bit
about it, but given that Mike Ruberry has recently raised a proposal for stride-agnostic PrimTorch
semantics, I thought it would be a good time to talk about what is meant by strides and
some of the interesting characteristics that matter when you’re dealing with this concept.
Okay, so what is a stride? Well, a stride, as its name suggests, simply says how much
you need to go to find the next element in some memory. So remember, when we represent
tensor data, and these are these multidimensional arrays, it’s actually not fully specified how
exactly you map a coordinate, aka a set of indices, to an actual location in memory. Now, with
a one-dimensional array, you might imagine that, you know, you represent it in a very normal
way, which is that, well, you have your first element at the location of the start of your
array, and then you find the next element by going to the next slot, and so forth and so
forth. And we would call that having a stride of one, because you just go to the next position,
one over, in that situation. But you’re not limited to only being able to do that. For example,
let’s suppose we have a two-dimensional array, say it’s a five by five array. Well, to go to the next
element in a row, you would still go forward one in memory. But what if you want to go to the next
row? Well, in that case, you wouldn’t be able to find the next element one over, you would have to go
five elements over, skipping past all of the elements that were stored for the first row to get to the
first element of the second row. So going by column, it’s stride one, but going by row, it’s stride five.
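A quick way to see this in PyTorch, for a contiguous five by five tensor:

```python
import torch

x = torch.arange(25).reshape(5, 5)
print(x.stride())      # (5, 1): the next row is 5 elements away in memory, the next column is 1
print(x.t().stride())  # (1, 5): the transpose just swaps the strides, no data is copied
```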
This is a lot easier to visualize if you’ve actually got a diagram in front of you. So I highly recommend
checking out one of my blog posts, you can find it inside the podcast description to see a little bit
more about you know, how exactly this works. But essentially, all the stride is, is it’s a specification
for any given dimension, your tensor, how far you have to go in memory to find the next element there.
And so typically, you know, the innermost dimension, the one on the right, when you’re talking about like
the size, that’s going to have a very low stride like one. And then the outermost dimension, the one
on the leftmost side, like the first dimension, it will tend to have the biggest stride because well,
you’ve got to get past all of that other data over to the right hand side, before you can actually get
to the next, whatever it is in your dimension. Okay, so that’s what strides are. And mathematically,
when you think about how to index into tensor, it turns out your indexing formula is very simple,
you just take, you know, the index zero and multiply by the stride for zero, plus the index
for one multiplied by the stride for one, and so forth and so forth, so forth. So it’s a very simple
formula. And it’s pretty easy to implement, well, unless you have an arbitrary-dimensionality tensor.
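Concretely, element x[i, j] lives at storage_offset + i * stride(0) + j * stride(1), which you can check by hand on the contiguous five by five example from above:

```python
import torch

x = torch.arange(25).reshape(5, 5)
i, j = 3, 2
flat = x.storage_offset() + i * x.stride(0) + j * x.stride(1)
# x is contiguous here, so flattening exposes its storage order directly
print(x[i, j].item(), x.flatten()[flat].item())  # 17 17
```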
And, you know, that’s sort of where we come from with PyTorch. Strides have been with PyTorch,
since before PyTorch was even a thing. Torch, the library that PyTorch is derived from,
also had strides. Strides are pretty useful. And there are two primary reasons why PyTorch,
and also NumPy, NumPy also has a concept of strides, support it. And those two main reasons
are views. This is the original reason we had strides. And the second reason, memory formats,
which was added on at a later point in time during PyTorch’s history. Let’s unpack these two use cases.
So what do I mean by a view? Well, notice that I was talking about how to find the next
piece of data, right? And I said, well, you know, if I’m looking for the next element in the row,
I just go by one. And for the next element in the column, I go by five. And so when you are talking
about tensor data, sometimes you want to talk about a subset of the tensor data, and treat it as a tensor
in its own right. For example, you have a two by two, sorry, a 2d matrix, and you want to extract out
a row from that matrix. Well, extracting out rows is pretty easy, because all you need to do is you
just take whatever your offset is, where the row starts. So if it’s the zeroth row, you’ll start at
beginning. But if it’s the, you know, fifth row, you’ll start at, you know, memory location 25, say,
and you just, you know, adjust the length so that you only see that row. And so even if you don’t
have any concept of strides, it’s very easy to represent, you know, sub-rows of a
tensor in this way. And so if you were, like, doing stuff with C++ vectors, for example, there’s a very
handy utility class we have in PyTorch called ArrayRef, which is a non-owning view onto vectors. And
you can do this, you can take out an ArrayRef to an arbitrary row in a, you know, virtual
2d vector, where you just have everything stored contiguously, instead of having a vector of vectors.
Okay, so that’s very easy. But what if you want to say, for example, return a tensor that represents
a column of your tensor. Now that is not going to be so easy. Because if you look at each of
the individual elements, they’re going to be laid out in memory differently than in the row case. For the row,
all the elements were one after another, you know, move one, find the next one, move one, find the next
one. But for the column, we said, well, you have to move five to get to the next one. So they’re not
together, they’re so-called non-contiguous. And so in this situation, if you only had the ability to say,
well, here’s where you should start reading. And then you should just read out a contiguous chunk of
data that’s N long, you would have no way of representing a column without actually copying
out the data so that it’s contiguous. And so when you have a tensor representation that supports strides
natively, and that’s what PyTorch has, and that’s what NumPy has, we have the ability to represent
column tensors without doing any copies, because all we do is we say, okay, well, let’s have a one
dimensional tensor, we’ll start it at the beginning of the tensor. But instead of having the stride be
one, which would be the normal situation with a 1d tensor, we would have the stride be five, say,
so that I know, okay, to find the next one, I skip over five, and then go forth, and so on. And now,
of course, many operations in PyTorch only know how to handle contiguous inputs. So when you make a,
you know, strided tensor like this, sometimes there won’t be any profit, you’re simply just delaying
the inevitable. Once you actually do a computation on it, we’ll go ahead and, you know, zip through the
data, gathering it together into a contiguous tensor, and then running the operation. So it’s the same as
if you had done the copy ahead of time. But there are two important differences in PyTorch. So one is
that because the view of the tensor shares storage, right, we didn’t do a copy initially,
the copy only happens lazily, when an actual kernel needs to be run. If I mutate the original tensor,
or if I say mutate the view, it will show up in the other place, either the view or the original tensor,
depending on what you did. So that’s really handy. Because one of the things that you know,
is really nice about working with PyTorch is you can, you know, go and explicitly mutate tensors,
mutate views, and use that to sort of initialize your tensor, if you need to. Now, it’s not recommended
because it doesn’t work well with Autograd, but it works, and you can do it. And that’s very useful.
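Here is a small sketch of both points, the column view with its stride of five and the mutation showing up in the base tensor (standard PyTorch indexing and fill_):

import torch

base = torch.zeros(3, 5)
col = base[:, 0]          # a view: same storage, size (3,), stride (5,)
print(col.stride())       # (5,)
col.fill_(1.0)            # mutate the view...
print(base[:, 0])         # ...and the change is visible through the base tensor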
The other important thing about having views being represented in this way is that sometimes we are
able to handle a non-contiguous input without having to do a copy. So in that case,
we’ve saved ourself from having to do a bunch of data movement. And instead, we, you know,
we’re able to fuse the gather operation directly into the kernel instead. So one way that I like to
think about views is they’re a very limited form of lazy evaluation, right? Instead of doing the gather
immediately, instead of doing the collection of all the column’s elements into a contiguous tensor,
we defer it, and we wait until the actual kernel needs to get run. And then that’s when, you know,
we figure out, oh, okay, the kernel actually supports the situation. Hooray. Otherwise, oh, maybe the
kernel doesn’t support the situation. Oops. By the way, so this is a little side note. But when you’re
writing kernels in PyTorch, you do have to think about whether or not your kernel is going to know
how to deal with non-contiguous inputs or not. And this is actually kind of a pain because a lot of
kernels don’t really know how to deal with discontiguous inputs. It’s a lot of work to actually
handle strides. And it makes your indexing formulas more complicated. And it makes your kernel slower,
because if your kernel can assume that everything is contiguous, then it doesn’t need to do all of this
indexing arithmetic, you know, all the multiplying strides by sizes to actually figure out where the
location it’s going to get out data from is. So a lot of kernels just want to assume a, you know,
I’ve just got a contiguous thing. And so historically, if you wanted to write code like that by hand,
what you would have to do to be strictly correct is you would have to go and call
contiguous. So okay, so there’s two situations. So one is if you’re writing a functional kernel,
you would need to go and check if the input was contiguous. And if it was not, you would have to run
contiguous on it to get a contiguous input. There’s some helper functions like expect_contiguous,
which help you do this without incurring a ref count bump in the case where the input is already contiguous.
And the other situation is, um, let’s say that you have a
kernel that takes an out parameter. Well, if the out parameter is strided, you’re expected to be able to go
ahead and directly, um, you know, respect that striding because, you know, hey, maybe it’s some view and that
view is aliased with some other base tensor. And the user actually did want the output of the, uh,
computation to get scattered in this way. So to actually do this, we have to first allocate a,
uh, contiguous tensor, which is going to be our output. Um, go ahead and run our kernel, writing
the data into the contiguous tensor, and then finally scatter it out into the actual user-requested
output tensor in that situation. And as you can imagine, this is very easy to forget to do. And there’s a lot
of kernels that don’t do this correctly. Fortunately, um, if you’re writing structured kernels, um, there’s
a very nice, uh, new piece of functionality by Yukio Siraichi, where basically you can say, hey, um, my
kernels, uh, can’t handle, uh, strided, uh, outputs, they can only handle contiguous outputs, and you say
set output contiguous, and this will go ahead and handle all of the ensuring that the output is in fact
contiguous under the hood for you and do the copy out to the real output if necessary. So you can just
write your kernel without worrying about this stuff. So it’s pretty handy. You should use it if you’re in
that situation. All right. So that’s it about what, uh, strides are good for with a views. Now there’s
another thing that I said they’re good for, and that is a memory format. What do I mean by memory format?
Well, memory format refers to, um, some conventions about where exactly physically we put data when we are
talking about them. So for example, um, you may have heard of the terms channels first and channels
last. What exactly is meant by these terms? Well, they refer to a very common, uh, layout question
you have to decide when dealing with, um, image data, which is specifically: when you represent the image
data, do you represent it as, uh, a 2d matrix per channel, where, um, you have a
copy for the red values, a copy for the green values and a copy for the blue values. So just imagine,
you know, three distinct images, monochrome images, each one of them representing their respective color,
but the images themselves being contiguous, or do you represent them in a, um, sort of bundled manner
where you have the channels, uh, interleaved: you have the value RGB for the first pixel, then RGB for the
second pixel and so on and so on. And the difference between these two formats corresponds precisely to
channels first or channels last: channels first being CHW, where the height and the width, um, pixels have the
lowest stride, and then to get to the next channel, you have to do a big jump; and channels last, aka HWC,
where the channel has the smallest stride. So to get to the next channel, you only have to do a single
step. Now, depending on your situation, whether or not you’re on CPU or CUDA and so forth, um, whether
or not you want to lay out the memory in channels first or channels last order, um, differs. And it also
depends on what operations you’re doing. Sometimes operations are faster. You have channels first,
and sometimes they’re faster if you have channels last. Historically PyTorch’s memory
format convention is that we always do channels first. So whenever you have any APIs that take in
data that is supposed to represent images, um, we’ll always take them as NCHW, N being the batch
dimension, C being the channels, and then the height and the width. So what if you wanted to actually run
some code that actually handled them with a channels last memory format? Well, to do that strides come to
the rescue. So the NCHW layout is what we think of as logical layout. It just says, you know, when you’re
accessing the tensor from the user program, you know, the H and the W dimensions are the second and the third, uh,
well, the index-two and index-three dimensions, and the channels are the index-one dimension. So if you want to
actually change the physical layout in memory, so that the channels are physically the innermost dimension,
all you need to do is set your strides appropriately. So you can keep the same logical layout. And so even
when you’re doing channels last or channels first, um, you always see an NCHW tensor as far as you’re
concerned from a user, but by modifying the strides underneath the hood, we can change the physical
layout so that it’s actually laid out with channels first or channels last. So this is how we support
memory formats in PyTorch. We don’t have, um, the thing that TensorFlow does where there’s an extra
flag you have to pass to, say, convolution saying, oh, channels are, you know, in the, um, beginning
position or they’re in the end position. Instead, we always assume channels are in the channels first
position. And if you want a different memory format, you just modify the strides to get there. Cool.
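A quick sketch of what that looks like in practice (torch.channels_last and the memory_format argument are the real APIs; the strides in the comments are what a contiguous NCHW tensor of this shape typically has):

import torch

x = torch.randn(1, 3, 4, 4)                    # logical NCHW shape
print(x.stride())                              # (48, 16, 4, 1): channels first physically
y = x.to(memory_format=torch.channels_last)
print(y.shape)                                 # still torch.Size([1, 3, 4, 4]) logically
print(y.stride())                              # (48, 1, 12, 3): channels last physically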
Okay. So what is going on then given all of this information, what is going on with, uh, Mike’s
proposal for stride agnostic semantics? Well, this comes down to the fact that although strides are very
mathematically simple to express, that is to say, you just multiply the size by the stride, and then you
do that for every dimension in your tensor, this actually leads to a little bit of, um, uh, too much
leeway in the representation for strides. In particular, let’s suppose that I have a tensor and it has a, uh,
dimension whose size is zero or whose size is one. When a dimension’s size is zero, um, you can
see that the stride in this case doesn’t matter. Why doesn’t it matter? Well, you have no elements in the
tensor at all. So you’re never going to try to ask for the next element because there’s no elements at
all to ask for. So I can put whatever I want in the stride of a size zero tensor. In fact, I can do
whatever I want for any of the operations because I will never ever get called out in this situation.
Now, although you can put whatever you want, um, as far as the indexing is specified, you may not want
to do that for memory format, because remember our memory format is telling us whether or not this is
a channels last or a channels first tensor. So, you know, having the stride set up correctly
for this case, it does matter, sort of. I guess a similar situation shows up when you have a size
one tensor, right? So once again, there is an element this time, so that’s great, but you’re never going
to ask for the next element because there’s only one element. So you, you know, can have whatever you
want in the stride once again, because you’re never ever going to observe it in the situation
because the strides are over-specified in this way. Um, we do have a convention for what the
contiguous strides of a tensor are supposed to be, even in the zero and the one case. But when you ask if a
tensor is contiguous, we actually accept all of the possibilities for zero and one. So basically, um,
we know that there’s flexibility here and we don’t, um, hold it against you if you pick something else
in that situation. But that means that there’s a lot of flexibility for what kernels choose in this
case. And they often just do whatever the heck they want in these situations. So that’s kind of
annoying. And when we are doing stuff like prim torch, where we’re trying to re-implement all of
PyTorch directly in Python, um, this is a pain because the way we do testing is we go ahead and, you know,
run the original implementation and run our new implementation and check if the strides match up.
And, well, lo and behold, sometimes they don’t because, you know, there’s these degrees of freedom
and they let the strides, you know, wobble in a way that doesn’t actually matter.
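As a rough illustration of that degree of freedom (a sketch; the exact strides PyTorch prints may vary by version, which is precisely the point):

import torch

a = torch.empty(0, 3)
print(a.stride())          # conventionally (3, 1), but no choice of stride is observable here

b = torch.empty(4, 1, 3)
c = torch.as_strided(b, (4, 1, 3), (3, 999, 1))   # a nonsense stride for the size-1 dimension
print(c.is_contiguous())   # True: the contiguity check skips size-1 dimensions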
Okay. So I told you that the strides are, uh, over-specified in some situations because of these
degrees of freedom from size zero or size one, but wait, it gets worse. So remember what I told you
about memory format, right? So you have these NCHW and NHWC tensors and, um, you know, depending on having
one or the other, um, your kernels might run faster or not. So one of the things that we need in this
situation is if you want to actually be able to run your network in NHWC, for example, that’s the non-default
situation. You need operators that actually propagate this NHWC format through the entire network. So if I do an
addition on a tensor and it’s NHWC, it better stay NHWC because there might be a convolution coming
up afterwards that actually, you know, could have benefited from that other ordering. Now there’s a
problem though, which is that when we write up the rules for how strides should propagate, we have a
very complicated situation. What if a user passes us a tensor that is NCHW and another tensor that’s NHWC,
that is to say their memory formats disagree. What do you do in this case? Who knows? Um, we have some
algorithm for determining what exactly we should do in this situation, but it’s, you know, kind of
complicated and hard to describe. And most people just close their eyes and, you know, hope
something reasonable happens in this situation. And so once again, you know, in Mike’s proposal for
stride-agnostic semantics, he’s basically saying, hey, you know, it’s a lot of work to mimic this stride behavior.
And, you know, what are we even getting out of it? So since this is my podcast, I get to, you know,
harp on what I think we should do in this situation. So I agree that it’s probably not a great
idea to spend a lot of time worrying about what exactly happens when,
uh, you add, like, a channels first tensor and a channels last tensor together. That just means
you’ve done something wrong and you know, that’s fine. And we shouldn’t force ourselves, uh, to make
sure that the semantics exactly match in this case. But as I’ve mentioned before, we do use strides for
two very important use cases, views and memory formats. And so, although, you know, maybe strides in
their full glory are just too much for our puny, uh, little brains to deal with, we should make sure
that we do have a good story for at least the two use cases we care about. So that’s everything I
wanted to say about strides today. Um, see you all next time.
EP63 Weak-references
Hello everyone and welcome to the PyTorch Dev podcast. Today I want to talk about weak references.
Some useful background for today’s podcast. So we have a podcast about reference counting from way,
way back then. Still relevant. If you haven’t listened to it, give it a listen. I’m not going
to go over reference counting basics. And you might also be interested in the Python resurrection
podcast. Also just check the links in the podcast. That one’s not, that one’s optional. You don’t have
to listen to that one, but it’s some useful context as well for discussion about weak references.
Okay. So weak references, what are they and what are they good for? So a weak reference is a reference
to an object that doesn’t keep the object live. So let’s imagine that you’ve got a tensor and it’s
got a lot of data in it. And, you know, you want to be able to store a reference to it because you’re
keeping it in a cache or something like that, but you don’t actually want to hold on to it because
maybe, um, uh, you’re just caching it, right? So if everything is done using it, then your cache is,
you know, never going to actually let the tensor get freed in that situation. But the cache is purely
advisory. If no one’s using the object anymore, then you would like the cache to automatically free
it in that situation. You don’t want the cache’s reference to the object to be strong. You’d like it to be
weak. Another common situation that this sort of thing shows up in is let’s say that you have a
cache that is keyed by a tensor. So you’re mapping a tensor to another tensor. So let’s think about the
key tensor in this situation. This key is basically in the hash map so that we know how to correspond,
uh, the, um, you know, input tensor to whatever the cached output is. But once again, we don’t want to
keep this input live. If all the references to the input are dead, then there’s no way I can ever
actually pull out that tensor from my cache. So I really don’t want the cache to keep it live in
that situation. One last example for weak references. So, uh, in, uh, in Python, um, uh, object
manipulation is very flexible. So you’ve got all these objects lying around and you can basically
mutate them willy nilly, however you like, right? You can like add extra fields, do whatever you want.
Um, unless the object, you know, doesn’t support __dict__, uh, Python supports adding
arbitrary attributes to objects, but there’s a problem to this, right? The attributes on the
object form a sort of global namespace. So if you, you’re, you know, using, you know, one name,
like, uh, say name, for example, for your own nefarious private purposes, and someone else wants
to also put something else on, uh, the same field name, well, that’s going to be a conflict and your
code is just not going to work. And so because of this situation, uh, it’s not really safe to just
arbitrarily write random attributes onto tensors. You’d kind of like them to be, you know, some
private in some way. Now, of course you can mangle the name of the attribute to make them private,
but there’s another way you can also implement this. And that is once again, using a weak map,
you just have a weak map mapping, uh, you know, any given tensor to the attribute you want to store for
them. And as we, uh, said earlier, we do want the entries in this map to get garbage collected
if the tensor goes dead. And, you know, that’s what a weak map exactly would do. And similarly,
uh, because we have separate maps for all of our various users that want to store metadata,
then you actually, you know, don’t ever have a possibility of conflict because each map is its
own heap allocated object. And, you know, they’re not being addressed by some string name. Something
else that’s really good about doing it this way is that you can also just delete the entire weak
map when you’re done. And then all of those attributes go away. So you don’t have to like worry
about, well, you know, I’m done with all of my private attributes. How do I get rid of them at
some later point in time? You know, you just use a weak map to do that.
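Here is a minimal sketch of the cache use case using Python's standard weakref module (WeakValueDictionary and WeakKeyDictionary are the standard-library types; BigThing is just a stand-in for some expensive object like a tensor):

import weakref

class BigThing:
    pass                                  # stand-in for an expensive object

cache = weakref.WeakValueDictionary()     # the cache holds its values only weakly

obj = BigThing()
cache["key"] = obj
print("key" in cache)                     # True while a strong reference (obj) exists

del obj                                   # drop the last strong reference...
print("key" in cache)                     # ...and the cache entry goes away on its own

# weakref.WeakKeyDictionary is the analogous tool for the "private attributes" use case,
# mapping an object (weakly held) to whatever metadata you want to attach to it.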
Okay. So weak references, hey, they’re kind of useful. So we do support them in PyTorch, um, in several ways. One is, um,
in C++, obviously, if you use, um, the shared pointer, uh, smart pointer type, uh, shared pointer
comes with built-in support for weak references. Also our, uh, intrusive pointer, um, see our previous
podcast that also supports weak references. And of course, Python, um, with, uh, Python
objects, they also support weak references. So there’s actually two weak reference mechanisms,
um, either a C++ mechanism or the Python side mechanism. And you can use either one if you
have an object that’s bound from both places. So what I want to do is I want to explain a
little bit of how, how these are implemented and then some consequences of these implementation
decisions. Let’s get down to it. So how are C++ weak references implemented? Well, um, when we talked
about reference counting, we said reference counts were a field on an object saying how many references
there were to the object so that when the, you know, field was still, uh, you know, positive, that meant
the object was live. And when that count goes down to zero, now we know the object is dead. So weak references
are just a, um, you know, extension to this where not only do we keep a strong reference count,
we also keep a weak reference count on the object. So the weak reference count, as its name suggests,
counts how many weak references there are into the object. Now, uh, do note that, uh, the
weak reference count, um, actually doesn’t only count weak references. There’s actually one extra weak
reference and that’s, um, for the strong reference count on the object. So the invariant here is: as
long as the strong reference count is greater than zero, then my weak reference count is at
least one where that one is from the strong reference count. And then you can have as many extra weak
references to the object as you like. So how do these two reference count fields interplay with each
other? Well, the algorithm looks like this. So long as the strong reference count of the object
is greater than, uh, zero, um, the object is live. Um, and when the strong reference count goes to zero,
and, you know, this testing whether the strong reference count has gone to zero is an atomic
instruction, when it goes to zero, that is when the object becomes dead. So no matter how many weak
references you have to an object, it doesn’t matter, right? Weak references don’t keep an object live.
Only strong references keep an object live. So when all the strong
references are gone, then we kill the object and we say, okay, um, we are done with this object.
However, ordinarily, when we want to deallocate an object, we would just go ahead and free the memory
associated with this object, but that’s not okay. We’ve got a bunch of weak references to the object
that are pointing to this memory. And if I just go ahead and free that memory, there’s no way for those
weak references to know, hey, um, you know, there’s no object here anymore, I can’t actually, um, give you a
strong reference. By the way, um, when you have a weak reference to a still live object and you say,
hey, I would like a strong reference from this weak reference, I’d like to dereference the
weak reference, all we do is we attempt to atomically exchange the strong reference count with one greater
than the strong reference count. And, um, that will succeed. So long as the strong reference count
wasn’t zero. And if it was zero, then we’ll just say, Hey, there’s no element available in this
situation. So we’ve got these weak reference counts, but they need to be able to access the,
you know, reference count fields that are stored on the object. Remember, this is an intrusive, uh,
reference count in our case or the control block in the case of a shared pointer. And so if I just go
ahead and deallocate that, then that’s no good, right? I don’t actually have the data anymore. It
would just be an ASAN violation in that situation. So what I do is I actually keep the object live.
Now, wait, you might be saying that sounds very silly. If I keep the object live, then what’s the
point of having the weak reference distinction? Aren’t I supposed to deallocate the object in this
case? And indeed, uh, for, you know, objects where sort of all the data is stored inline,
weak references are kind of useless in this situation. And so with shared pointers, um, the way this is
dealt with is actually the reference counts are stored in an extra object called the control block.
And the control block is the only thing that stays live. You actually deallocate the
object in that case, as long as you didn’t use make_shared, that is to say, which allocates the
object and the control block together. But for an object like tensor, um, we have something else we
can do, right? Because the tensor object itself doesn’t contain all that much data. It is a kind
of fat object and it’s got a lot of fields on it, but really most of the data usage of a tensor
is coming from the, uh, the data, the actual tensor floating point data that is associated with the
array in question. So all I need to do is I just need to deallocate that. And then I’ll have a little
stubby, you know, tensor data structure left, which, you know, is taking up some space,
but it’s taking up far less space than the actual tensor data in question. And so we’ve got a method on our
tensor objects that does this. It’s called release resources. So just to go over the algorithm first,
while, uh, you know, the strong ref count is greater than zero, we do a bunch of stuff.
When the strong reference count goes to zero, we go ahead and release resources, right? Because those are the resources that are not
being used anymore. And then as soon as the weak reference count goes to zero, oh, by the way,
when the strong reference count goes to zero, we also decrement the weak reference count by one,
right? Because remember there was one weak reference count, uh, associated with the strong
reference count. So when the weak reference count goes to zero, then we know there really are no
pointers into the data, uh, into the object in question. And now we can actually free it from the
heap. All right. So that’s cool. So that’s how C++ side support for weak references is
implemented. You have to allocate an extra field for maintaining the weak reference count. And then
there’s a bunch of extra stuff that happens, um, at deallocation time. In the common case,
uh, when you, uh, deallocate a tensor, there aren’t any weak references to it. So the strong reference
count goes to zero, that causes the weak reference count to go to zero, and then we immediately delete
the object in that situation. We actually have an optimization for this, uh, courtesy of Scott
Wolchok, where we don’t have to do the, uh, atomic compare and exchange anymore. You just do a
relaxed, uh, load on the weak count and check if it’s one. And if it is, you just go ahead and delete it in that case.
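Here is a toy, single-threaded model of the two-counter scheme just described (a sketch only; the real implementation lives in c10's intrusive pointer and uses atomics, and all the names here are made up):

class RefCounted:
    def __init__(self, payload):
        self.payload = payload        # stands in for the expensive tensor data
        self.strong = 1
        self.weak = 1                 # the +1 weak count owed by the strong side

    def release_resources(self):
        self.payload = None           # free the big part, keep the small stub alive

    def decref_strong(self):
        self.strong -= 1
        if self.strong == 0:
            self.release_resources()
            self.decref_weak()        # give back the weak count held by the strong side

    def decref_weak(self):
        self.weak -= 1
        if self.weak == 0:
            pass                      # only now can the stub itself actually be freed

    def lock(self):
        # attempt to turn a weak reference into a strong one
        if self.strong == 0:
            return None               # object is already dead
        self.strong += 1
        return self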
Okay. So what about Python? So Python also implements weak references, but it actually
does them in a quite different manner. And, um, Python’s, uh, implementation works because, um, remember
Python has a global interpreter lock. So it actually doesn’t need to work in a multi-threaded setting.
I talked a lot about atomics in the, uh, in the C++ side of things. And really
C++’s implementation is, uh, by and large, uh, you know, sort of, it has to look this way
because it’s supposed to work in a multi-threaded setting. So how exactly do weak references work in
Python? Well, it’s pretty simple. Every object that is able to be, uh, referenced by a weak
reference has an extra field called the weak reference list. What exactly is the weak reference list?
Well, it’s literally a list of all the weak references that point to this object. So weak
references in Python are a special kind of object. And so whenever you create one of these to point to an
object, we actually just go ahead and put that weak reference on this list. And you know, that would be
hella unsafe in a multi-threaded environment, but in Python, it’s fine because there’s a global
interpreter lock. So whatever. And so now, um, these weak references don’t
actually, um, increase the true Python ref count. So the Python ref count does the normal thing: when it
goes to zero, as part of the deallocation process, we go ahead and go through all the weak references
pointing to this object and say, okay, well, you are no longer valid. So you can’t, uh, use
this weak reference to go ahead and, uh, reach this object. And because, um, you know, we know what all the
weak references to this object are. We can also go ahead and run finalizers. So that’s, that’s also when
finalizers get run in Python. We just iterate through all the weak references. Those weak
references can have finalizers attached to them. And that’s just some code we execute when we do it.
By the way, uh, the fact that finalization can resurrect an object, because, you know, finalization
is just arbitrary Python code, and maybe when you’re done finalizing, the reference count has gone
back to one or greater, that is exactly what we’re using to implement, uh, you know, uh, tensor
PyObject resurrection, which we talked about in a previous podcast. Okay. So that’s about how
Python side weak references work. So let’s talk a little bit about some consequences of these
implementations. So one thing to know about is that, um, when you use weak references in Python
with, specifically, tensors, then, uh, because of Python object resurrection,
there’s a little extra work you have to do. So the work you have to do is there’s a private method
on tensor called fix weak ref. And what it does is it, um, makes sure that the sort of ownership
pointer between the Python object and the tensor object looks the correct way. Let me explain why
this is needed. So I mentioned that we’ve got this thing called Python object resurrection,
which says that when a Python tensor object would have died, we check if the C++ object
for it is still live. If it is, we go ahead and flip the ownership pointer so that the C++ object
owns the Python pointer. And whenever we take out a new Python reference to the, uh, py object,
making it live again, we go ahead and flip the reference back. Well, the problem with weak references
in Python is they constitute another way of accessing the Python object, um, a way that isn’t the normal,
um, you know, sort of give-me-a-tensor-from-the-Python-API-bindings path. And most importantly,
this way of referencing the Python object is not interposable
by us. So we have no way of seeing when this sort of thing happens and then going ahead and flipping
the ownership pointer if it’s necessary. So you have to tell us, um, this yourself. So this is something
to be aware of if you’re working with weak references. And if you’re working with weak references in Python,
you probably want to do them with tensors. So this is something you need to know about. Like it’s very,
very important to do. Um, another consequence of, um, this design is, um, so I mentioned that release
resources is about releasing resources that, you know, sort of take up a lot of space when the strong
reference count goes to zero, but maybe there’s still a weak reference count. Release resources is
a virtual method because, um, there are maybe multiple tensor subclasses and they might have different
resources that need to be deallocated. So it’s actually, um, and this was discovered by Scott
Wolchok, um, it’s actually quite a performance, uh, problem to always be, uh, running
the release resources, uh, method, um, whenever a strong reference count goes to zero, because, um, most of
the time there aren’t any weak references. So you can just go ahead and delete the object entirely
and like, that would be fine, right? That would also, uh, do the same thing. And in particular,
the delete, uh, um, method would not actually, well, okay, it’s also virtual, but you save something.
You’re going from two virtual calls to one virtual call. So Scott has a patch that basically makes the
call to release resources optional. It only gets called if we’re actually in the situation where we’re
trying to keep the object live for weak references, but we know that all the strong references are dead
and we want to delete the data. Um, so, you know, there’s a lot of this kind of optimization that
goes into making a smart pointer, uh, implementation. And so it’s, it’s quite tricky actually. Like the
basic algorithm is not too hard, but then you want to like reduce the number of atomics and, uh, you know,
get it as efficient as possible. And that’s when things get pretty complicated. In fact, it’s,
it’s so complicated that, um, Scott’s original version of PR has a bug in it. And the bug in it is
essentially related to, um, how we maintain the reference counts. And when we have the, um,
when we’re running release resources, because release resources is actually, it’s a pretty
much an arbitrary piece of code that gets run at the end of the object’s lifetime. It’s basically like a finalizer.
And so release resources can trigger arbitrary other destructors to run. One of the
things that it can do is it can actually cause a weak reference to the tensor you’re currently
deallocating to be dead. So you need to make sure that while you’re running release resources,
you don’t accidentally deallocate the object you’re working on while you’re doing it, right? Because,
um, you’re ostensibly running release resources because it’s being kept live by a weak reference.
But if that weak reference dies while you’re releasing the resources, you need to keep the
object live until you’re done running release resources, and then you can delete it. So, you know,
just the kind of thing to be worried about. Okay, that’s everything I wanted to talk about today.
Uh, see you again next time.
EP64 Learning-rate-schedulers
Hello, everyone, and welcome to the PyTorchDev podcast.
Today, I want to talk about learning rate schedulers
on request of Nelson Alhaj.
What is a learning rate?
Well, remember, deep learning is all about optimization,
and optimization is all about starting off at some point
in your very hyperdimensional parameter space
and then slowly making your way to a set of parameters
which does better.
And so every step we do is based off of the gradient
we compute for computation.
And so the learning rate simply says,
once I’m at some point in my parameter space
and I figure out where I want to go, my gradient,
how far do I go in that direction before I stop
and reassess the landscape and compute my gradient again
and go further?
So that is the learning rate.
So why does the learning rate matter?
Well, you can think about the situation
and if you have a very, you know, spiky landscape
where there’s a lot of different changes to the gradient,
then if you do a very large learning rate
and you make a very large step when you’re doing an optimization,
you may have been locally improving the loss for a very small amount of the step,
but then the landscape changed and now you’re climbing back up the hill
and you just went too far and you overshot the place you wanted to go.
A very common diagram, and sorry, this is a podcast so I can’t show you a picture.
A very common diagram is imagine you have some sort of valley
where the valley sort of is slowly going down until you get to the global optimum.
In this case, we’ll have the global optimum be something that’s low
because this graph is representing our loss.
So the lower the loss is, the better.
So if you start your ball, the ball being, you know,
the point we are at on the parameter space on the side of the valley,
then if you go too far, you will bypass the sort of the bottom,
the deepest point of the valley and hop to the other side of the valley.
And then you’ll sort of zigzag back and forth
until eventually you get to the final destination.
But you’ll do a lot of wasted steps along the way.
So, you know, a lot of sort of optimization techniques
and a lot of playing around with learning rate,
it’s all about sort of trying to get to your final destination,
you know, more directly without, you know, overshooting every single time.
That being said, you don’t want your learning rate to be too small either
because, well, if it’s a really small learning rate,
then you’re just not making very much progress at any given step in time.
So, you know, if you don’t make very much progress,
you might just never actually get to convergence on your network in this situation.
Okay.
So learning rates are kind of important.
And, you know, certainly when you’re writing a Python model,
you’ll have some sort of optimizer.
And your optimizer is going to make some decisions
about how exactly it’s going to go about exploring the state space.
But most optimizers have a hyperparameter called the learning rate,
which is just a global number you can toggle to say
sort of how, you know, far or close you should go.
There are some optimizers that automatically determine a good learning rate,
but there are also optimizers which don’t.
And so that’s just a parameter you have to do.
So a learning rate scheduler is a way to sort of automatically modify
this parameter on your, this hyperparameter on your optimizers
in some way that’s sort of non-standard, right?
Because there’s a lot of things you might want to do.
Maybe while you are warming up,
you know, while you’re doing the initial,
you know, few steps of your computation,
you don’t want to, you know, go too far.
So you want to sort of just slowly explore your local space,
ramping up until you actually hit your final learning rate.
Or maybe, you know, as time goes on,
you want your learning rate to decay and get smaller and smaller
so that, you know, you’re, you know,
after you’ve done all the major learning at the beginning,
you’re going to finally get closer and closer to your,
the final optimum.
And now you want to make smaller and smaller steps
so you are careful not to overshoot in the situation.
So there are tons and tons of, you know,
different learning rate schedulers.
Honestly, there aren’t that many.
So if you like look at torch.optim,
that’s the directory that has all of our optimizers.
We have tons and tons of optimizers
because there are lots of, you know,
ways to go about doing optimizations.
Our learning rate schedulers,
they fit in a single Python file.
So, you know, it’s really not,
there’s not that much stuff going on there.
But it’s something that, you know,
people do care about.
And that’s what I want to talk about
in the podcast today.
So we have to,
so where I want to go next is
how exactly does the learning rate scheduler API
in PyTorch work?
It’s kind of surprising.
And we basically haven’t changed it
since, you know,
the very beginning.
I think it was like 0.1.
Someone submitted a pull request
to add learning rate schedulers
to PyTorch.
And we were like,
okay, we’ll add it.
And we have basically
not changed the API
since then.
A lot of APIs
in PyTorch’s
neural network,
you know,
library in Python
have not changed.
So this can result in some weirdness
in the API
that, you know,
things we learned over time
and we haven’t been able to fix them.
Well, let’s talk a little bit about
what this learning rate scheduler API looks like.
So the learning rate scheduler API
is sort of based off of two things.
The first thing it’s based off of
is it’s based off the optimizer API.
Why is this important?
Well, optimizers in PyTorch
are stateful.
So the standard model
for, you know,
a PyTorch program is,
you know,
you’re off doing your optimization
and the way things work
is you go ahead,
you run your computation,
you run your forwards and backwards,
you compute the gradients,
and then you ask the optimizer
to do a step.
And the step,
you know,
is a method on the optimizer.
It’s a stateful method.
It has side effects.
And what it does
is it goes ahead
and reads out all the gradients,
updates the optimizer’s internal state,
and updates all the parameters
to actually, you know,
make them all go well.
So, you know,
the model that people have
is, you know,
they’re looping through
their, you know,
batches of inputs
and every batch they do,
they call the optimizer step at the end.
So our intrepid contributor
back in the day
looked at this API
and they were like,
okay, well,
let’s do something similar
for learning rate scheduling.
So what they did
was they said,
okay,
we’ll have an API
for, you know,
modifying the learning rate.
It’s going to be
a learning rate scheduler object
and you will call step on it
to modify the learning rate.
Unlike optimizers,
you know,
optimization needs to happen
every mini batch, right?
Because, you know,
every batch you do,
you want to actually update
the parameters
with what you did.
Typically for a learning rate,
you only want to do that
for an entire epoch.
You don’t want to modify
the learning rate
until you’ve actually finished
processing the entirety
of your input data set.
So learning rate
has to have its own step function,
but okay, fine.
So, you know,
you have your optimizer step,
you have your learning rate step,
and, you know,
you just call them
when appropriate,
either at the end
of your training iteration
or at the end
of your training epoch.
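A minimal sketch of that calling pattern with the standard torch.optim APIs (the model and data here are placeholders):

import torch
from torch import nn, optim

model = nn.Linear(10, 1)
opt = optim.SGD(model.parameters(), lr=0.1)
sched = optim.lr_scheduler.ExponentialLR(opt, gamma=0.9)

for epoch in range(5):
    for x, y in [(torch.randn(4, 10), torch.randn(4, 1))]:   # stand-in for a real data loader
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()           # once per mini-batch
    sched.step()             # once per epoch, after the optimizer steps
    print(epoch, sched.get_last_lr())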
That being said,
actually,
in the beginning,
learning rate schedulers
were implemented
in a kind of funny way,
and whether or not
you call them
before the optimizer step
or after the optimizer step
was something
that sort of
wasn’t well specified.
So we had to make
a BC breaking change
to sort of fix it
so that, you know,
the behavior was uniform.
You always call it
after the optimizer step.
It was pretty confusing
because, you know,
stateful APIs are confusing.
It’s hard to, you know,
make sure that
they do exactly
the right thing.
And, you know,
you don’t even notice
half the time, right?
Because the learning rate
is just this hyperparameter
and obviously your optimization
is still going to work
even if you stay stuck
on your old learning rate,
you know,
one epoch more
than you expected.
So the kind of people
who notice this sort of thing
is like if they’re like real,
they have some network
that’s super sensitive
to the initial conditions
or maybe they’re trying
to reproduce a paper
and they’re like,
huh, how come the learning rate
is not the same thing
as, you know,
on what I saw in the paper?
And, well,
that’s because, you know,
we messed up the stateful API.
That simple.
Okay.
So I mentioned that
the learning rate API
was based off of this
optimizer stateful API, right?
So it’s like you say,
okay, you know,
when I’m done,
I run step
and that will update
the learning rates everywhere.
But the second thing,
and this is important,
is that PyTorch’s
learning rate schedulers
were essentially cribbed
from Keras.
So, you know,
Keras was, you know,
existed back then.
Keras has been around
for a while
and Keras had
a learning rate scheduler API
and basically,
besides, you know,
statefulizing up the API,
because Keras'
learning rate API
is sort of based
on a sort of callback model
where, you know,
the optimizer calls
into the learning rate
callback to figure out
what to do.
we basically grafted it
into the stateful API
but using the same algorithms
that Keras was using
to determine
learning rate.
And in particular,
the way Keras
computes learning rate
is you are
at some point
in your computation
and you are,
you know,
you basically are
at some epoch,
you know,
epoch 10,
epoch 20,
whatever,
and you have a formula
which says,
given this epoch,
what should my learning rate be?
So this is a closed form formula.
It, you know,
takes in the epoch,
produces the learning rate
and that’s what you set everything to.
So, okay,
so we’ve got the stateful API
but what the stateful API
is doing under the hood
is it’s just going ahead
and running this closed form compute
to figure out
what the next step should be.
So actually,
there is no,
there’s no statefulness
beyond the fact that,
you know,
you just call step
and this internal state mutates.
Actually,
this is why
the step function
for the longest time
accepted
an epoch parameter
and you could use this
to sort of,
you know,
time travel
your learning rate schedulers
into the future,
right?
Epoch 1,
Epoch 2,
Epoch 100,
whatever,
you know,
that’s fine.
It’s going to work.
Why does it work?
It’s because
there’s a closed form formula,
right?
And we can just zoom
straight to that spot
and,
you know,
that seemed reasonable-ish.
The problem with giving people
stateful APIs
is they start looking
at what the stateful APIs do
and they start expecting them
to actually be stateful.
So pretty early
in PyTorch’s life,
we got a feature request.
And the feature request
was,
I’d like to have
so-called chainable
or,
as I like to say,
composable
learning rate schedulers.
So the ask here
was,
you know,
sometimes people
want to combine
various learning rate strategies,
right?
They might have
a learning rate strategy
where they are doing,
you know,
they’re doing
some sort of decay
as the training run
goes on and on and on,
but they want
some special behavior
at the very beginning.
And so they’ll have
an extra learning rate scheduler
just for handling
that sort of situation.
And it’s not really obvious
how to mash together
two learning rate schedulers.
Certainly,
if they’re using
the closed form solutions,
they’re just not
compositional at all
because let’s say
that you have
one learning rate scheduler
and you call it
and it figures out,
oh, hey,
the learning rate
should be five
at this epoch
and it sets
all the learning rates
to five.
And then the next
learning rate scheduler,
you know,
says,
okay,
well,
this is the current epoch
and the learning schedule
should be eight.
And then it sets
eight to everyone.
And actually,
you know,
like they basically
don’t communicate
with each other at all,
right?
The closed form solutions
are actually not
compositional in this way.
So people were like,
hey,
you know,
it would be cool
if,
you know,
actually if I,
you know,
did a learning rate
schedule step
and then another
learning rate schedule step,
it actually did
what the API suggested.
That is to say,
you know,
we’ve got the stateful API,
a step should,
you know,
transform the learning rate
to the next learning rate.
And like,
you know,
that’s what I expected to do.
And through the efforts
of Chandler Zoho
and then later
Vincent Quenneville-Bélair,
we actually did exactly that.
We took all of the
closed form formulas
that were previously
in the learning rate scheduler
and we essentially
figured out
how to turn them
into the single step functions
that would give you
the same result
as the closed form solution.
So now you could actually
compose these things
because you would say,
okay,
well,
first I apply the step
implied by the first
learning rate schedule
and then I apply the step
implied by the second
learning rate schedule.
And now you actually
have compositional
learning rate schedulers.
Woohoo!
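In recent PyTorch versions this kind of composition is exposed directly; a rough sketch of what that can look like (ChainedScheduler, ConstantLR, and ExponentialLR are real torch.optim.lr_scheduler classes, though availability depends on your PyTorch version):

import torch
from torch import nn, optim
from torch.optim.lr_scheduler import ChainedScheduler, ConstantLR, ExponentialLR

opt = optim.SGD(nn.Linear(2, 2).parameters(), lr=0.1)

warmup = ConstantLR(opt, factor=0.1, total_iters=3)   # scale the lr down for the first 3 steps
decay = ExponentialLR(opt, gamma=0.9)                 # then a steady exponential decay
sched = ChainedScheduler([warmup, decay])             # each step() applies both schedulers

for epoch in range(6):
    opt.step()                                        # placeholder for a real training epoch
    sched.step()
    print(epoch, opt.param_groups[0]["lr"])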
Well,
actually the first time
we tried to land this,
it broke
because remember,
you’ve got time traveling epochs
and if you’re going
to do time traveling epochs,
I don’t know how
you’re going to do
the stepping thing
because like,
how does that even work?
You don’t have
a closed form solution anymore
and so basically
you only have a choice
of either going ahead
and playing out
the epochs one by one
if you have a time
traveling epoch
or you can do
what we actually did
which is we’ve got
a closed form solution
in our back pocket.
it’s like an underscore
method closed form
and we just call that
and we’re like,
it’s not going to be
compositional
if you’re time traveling.
All right.
So,
the reason
why
I did this podcast episode
is because
essentially
Nelson
came to me
and was like,
hey,
what the heck
is going on
with
these learning rate schedulers?
It feels like
someone
had a dare
to make it
as stateful
as possible
and they
followed through
with the dare
and the answer
to that is
yes,
that is basically
what happened,
right?
we started off
with a stateful
API
wrapping over
a functional
closed form
computation
of learning rate schedules
but people
were like,
hey,
you know,
stateful API,
I’m expecting it
to be stateful
and so
we turned
the insights
into the stateful
version.
Was this the right
decision?
I have no idea.
I managed
to trick
several people
into,
you know,
making this possible
so
if it was
a bad decision
I suppose
it wasn’t
obviously
a bad decision
but
with the benefit
of hindsight
I’m not
really sure
I would have
gone about
doing it
the same way.
Probably
the distinction
that we
probably should
have made
is
there are
some
learning rate
schedules
that are
compositional
and some
that are not,
right?
So like
if I’m
going to
do
a
exponential
learning rate
and then
I want
to compose
this
with something
that sort
of fiddles
around
with the
initial
conditions
of my
learning rate
what I’m
probably
expecting to
happen
is I
start off
with my
exponential
rate
exactly as
is
and then
I’m just
going to
do a
transformation
on that
learning rate
afterwards
on a
thing
and so
probably
people weren’t
expecting to
like arbitrarily
compose
an exponential
learning rate
with a
step
LR
learning rate
all sorts
of random
compositions
probably
that’s not
actually what
people want
to do
they probably
only want
a set
of
compositional
ones
but then
the basic
learning rate
schedules
those probably
just want
to be closed
form,
maybe
I have no
idea
one of the
things about
learning rate
schedules
in PyTorch
is as I
said it is
very simple
the API
is not so
simple
sorry
we’re kind
of stuck
with the
stateful
API
but it’s
very easy
to write
your own
learning rate
scheduler
and so
you know
with a lot
of things
in PyTorch’s
you know
core library
sometimes
they’re just
not very
well put
together
and
it’s been
okay
it’s because
people can
just write
their own
you know
schedulers
and do
their own
thing
and that’s
always been
you know
one of the
things about
PyTorch
it’s that
you know
hey
if there’s
some piece
you don’t
like
well this
is just
a library
you don’t
have to
use us
you can
you know
write your
own thing
and really
all the
learning rate
scheduler is
doing
is it’s
going into
the optimizers
and just
updating
their internal
learning rates
so you
can absolutely
as I said
it’s just
one file
you can
go ahead
and do
your own
thing
and people
have
gone ahead
and done
their own
thing
I know
for the
very least
like
ClassyVision
had their own
you know
implementation of
learning rate schedulers.
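As a rough sketch of what "doing your own thing" can look like, here is a toy scheduler that just rewrites the lr stored in each param group, which is ultimately all the built-in schedulers do as well (the class name and the halving policy are made up for illustration):

from torch import nn, optim

class HalveEveryNEpochs:
    def __init__(self, optimizer, n):
        self.optimizer = optimizer
        self.n = n
        self.epoch = 0

    def step(self):
        self.epoch += 1
        if self.epoch % self.n == 0:
            for group in self.optimizer.param_groups:
                group["lr"] *= 0.5

opt = optim.SGD(nn.Linear(2, 2).parameters(), lr=0.1)
sched = HalveEveryNEpochs(opt, n=2)
for epoch in range(4):
    # ... the actual training loop for this epoch would go here ...
    sched.step()
    print(epoch, opt.param_groups[0]["lr"])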
I mean
I talked
about what
learning rates
are
and
what
really
is going
on here
I think
is just
a question
about
PyTorch’s
API design
right
one of the
things that
made
PyTorch
really really
successful
was that
we let
people
work with
NN modules
in a
imperative
mutable
way
it’s just
very very
natural
for people
if you
look in
the Jax
world
people are
trying to
discover
how to
make
neural networks
work
with a
functional
API
where you
don’t have
stateful
operations
it’s
you know
pretty
interesting
I think
they’ve
come up
with some
pretty good
stuff
but it’s
also
non-obvious
what exactly
that API
should be
because it’s
just like
just less
natural
for people
and so
as a result
there’s lots
of libraries
exploring
different
corners of
the design
space
I actually
think they
will probably
figure out
a really
good design
in the end
but it’s
going to
take you
know a
dozen
libraries
to get
there
or I
guess we
can be
in the
PyTorch
world
where it’s
like hey
mutation
everywhere
hooray
and you
know it
also is
very very
complicated
it’s
probably
more
complicated
than the
functional
API
but I
mean people
seem to
like it
so who
am I
to quibble
with them
this is me
right a
formerly
rabid
purely
functional
programmer
I used
to work
on the
GHC
Haskell
compiler
and now
I’m like
hey you
know mutation
is great
I just use
all of my
functional
programming
tricks to
help reason
about what
the code
is supposed
to do
in the end
all right
that’s everything
I wanted to
talk about
today
talk to
you next time.
EP65 History-of-functorch
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I’m joined by Richard Zou, who is going to come talk to us about the history of functorch.
Hi, Richard.
Hi, Ed. Thanks for having me.
All right, Richard. So before we get started, let’s just briefly talk about what is functorch and what does it let you do?
Cool. Yeah, so functorch was inspired by Google’s JAX framework, which was released in either like 2018 or 2019.
I don’t remember at this point.
The novel thing that JAX brought to the community was the notion of composable function transforms, and that’s what functorch provides as well.
So let’s unpack those three words a bit. There’s a lot of meaning behind them.
So a function transform is an API that takes in a function and returns you a new function that does something else.
It transforms your function to do something else.
JAX offers a grad transform.
You pass it a function and it returns you a new function that computes gradients.
It offers a JIT function.
You pass a function.
It returns you a new function that makes your code run fast.
And they also provide this new thing called VMAP.
And you pass VMAP a function and it will return you a new function that accepts tensors or arrays with an additional dimension.
So VMAP is like doing a for loop of your function over your data.
And instead of actually doing a for loop, it’s making your code faster.
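A tiny sketch of the same idea as it eventually landed on the PyTorch side (torch.func.vmap in recent releases; the original functorch package exposed the same transform):

import torch
from torch.func import vmap

def dot(x, y):
    return (x * y).sum()

xs = torch.randn(16, 5)
ys = torch.randn(16, 5)

# Instead of torch.stack([dot(x, y) for x, y in zip(xs, ys)]) ...
batched_dot = vmap(dot)            # ... map dot over dimension 0 of both inputs
print(batched_dot(xs, ys).shape)   # torch.Size([16])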
So I remember when JAX came out, you know, way back in 2018.
And it was like this cool kid on the block.
And, you know, there’s a lot of, you know, buzz on Twitter about it.
And, you know, some of the things that they were doing seemed really legitimately like useful.
Like, for example, VMAP was just a total, like, game changer in, like, how, you know, you should go about doing batching in your computations.
How did that turn into, oh, I guess we should, you know, build our own thing, functorch, that was inspired by it?
Right. A lot of us on, like, the PyTorch team, we tried out JAX and we thought it was really cool.
JAX has its, like, core library of function transforms that can compose with each other to provide, like, various different other quantities.
And so people were actually asking us for similar features in PyTorch.
They’ve been wanting, like, efficient Jacobian computation.
They’ve been wanting fast per sample gradient computation, like, all of which is easy to do with JAX.
And JAX showed that we could get these through, like, the compositions of their transforms.
One of the motivations for Functorch was instead of, like, us designing a different subsystem to compute, like, per sample gradients, you could instead compose two things, like, vmap and grad to produce per sample gradients.
Instead of designing a separate subsystem to compute Jacobians, you can compose, like, vmap and grad in some order to also just give you Jacobians out of the box.
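For example, per-sample gradients can be written as roughly this composition (a sketch using torch.func, where the functorch transforms now live; the loss function and shapes are made up):

import torch
from torch.func import grad, vmap

def loss(w, x, y):
    return ((x @ w - y) ** 2).mean()

w = torch.randn(3)
xs = torch.randn(8, 3)
ys = torch.randn(8)

# grad(loss) differentiates with respect to w; vmap maps that over the sample dimension.
per_sample_grads = vmap(grad(loss), in_dims=(None, 0, 0))(w, xs, ys)
print(per_sample_grads.shape)      # torch.Size([8, 3])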
Furthermore, we thought this would be more future-proof.
We could see people wanting, like, other, like, crazy quantities in the future.
And if we provided these function transforms in PyTorch, then the rationale was people could just compose them to do whatever they needed in the future.
So now you’re making it sound like we just, you know, started off and thought, hey, we’re going to, like, you know, we want to build these new features and we’re going to do them with function transforms.
But that’s not actually how it worked out, right?
I’m jumping around a bit.
Because didn’t we do a version of per sample gradients entirely directly in PyTorch itself initially?
Yeah.
So initially, we actually did try to prototype vmap in PyTorch using the regular PyTorch dispatcher.
And what happened there was it could compute some quantities, but not, like, all quantities.
So you could compose, like, the vmap we had prototyped in PyTorch with PyTorch autograd to give you Jacobians.
But we had this problem that it couldn’t actually compute per sample gradients.
Oh, yes.
I remembered wrong.
It wasn’t that we implemented per sample gradients, but that we did vmap in PyTorch, but it had a fixed ordering.
So you could, yeah, you could, what is it, vmap over grad, but not grad over vmap?
You can grad over vmap.
You can grad over vmap, but you couldn’t have vmap over grad.
This is one of the mind-bending things about function composition: you can do the transforms in whatever order you want in JAX.
And PyTorch’s traditional API has always been about, you know, well, there’s a fixed order.
And it just turns out that’s not enough sometimes.
Yep.
So how did we actually decide that, no, we actually need the full generality of function transforms?
Yeah, so in addition to figuring out whether we needed the full generality of function transforms, we also took a deeper look at JAX.
And we were like, hey, JAX is very functional.
One of the things we could have done is just copy paste JAX into PyTorch.
But that’s not a good idea, even at first glance, because JAX puts all these limitations on your code.
Like, JAX is purely functional.
There are no mutations, but in PyTorch, people actually use a lot of mutations and views in their programs.
And that actually matters.
So at the beginning of around 2021, Horace He and I started to take a step back and take a more holistic view of this idea of bringing function transforms to PyTorch.
And so we wanted to answer questions like: do researchers actually want to use transforms in PyTorch?
Did this composability matter to them?
And were users willing to give away some flexibility in what they could do with PyTorch, to limit themselves to a world where composable function transforms actually work well?
And so Horace and I ended up talking to around 10 to 20 researchers. We did some user studies; we basically just cold emailed some of them and messaged some on Slack asking if they wanted to talk.
And we also asked publicly for interest.
And from those user conversations, we got a yes.
The answer was yes.
Like, people did want to use transforms.
Like, they wanted to use VMAP.
Um, they wanted, like, their code to go faster.
Who doesn’t want their code to go faster?
You ask people this, and they can’t say no.
And then we saw that users were indeed, as we thought before, wanting different compositions of vmap and grad.
So people wanted to compute per sample gradients in PyTorch, fast Jacobians.
We talked to some meta-learning researchers, and that’s a bit of a crazier composition of vmap and grad.
You can use that to easily express some models in model-agnostic meta-learning.
And we also saw, like, some scientific computing use cases where people just had really overhead bound code, and they just wanted their code to go faster.
That’s a lot of different users.
I suppose that if we were just reading the tweets about JAX, you could have figured all those out.
But I do remember, like, the fact that you guys did these user studies was really useful.
Do you, do you remember, um, you know, what were some of the outcomes from the user studies?
Yeah.
So we, we didn’t just talk to users while doing user studies.
So, before all of this, Zach DeVito, who you guys might know, sent us a nice slide deck on design thinking from the Stanford d.school, on just how to design things for users.
The TLDR is that there are four steps to designing something.
You want to do some brainstorming, some prototyping, some talking to users.
You give the user some bad prototypes, and then you iterate on those prototypes until they become less bad prototypes.
And you don’t have to do this in that exact order.
They can be done in any order.
After we started talking to users, we actually started giving people some of these prototypes.
And we worked with those users on the prototypes and just saw which ones they liked better, which ones they didn’t really like, and kept on iterating until we got something better.
If I recall correctly, you ended up with two prototypes in the end that you were considering what to do with.
Is that right?
Yes.
So, we had, like, two prototypes in the end.
One of them was based off of the PyTorch dispatcher.
And so, like, PyTorch Autograd is written in C++.
We had an old prototype of vmap in C++ as well.
And there was a layering problem, as we discussed before, where you could do grad of vmap, but you couldn’t do vmap of grad.
And to solve that layering problem, we just introduced a new mechanism inside the PyTorch dispatcher that lets you flip the layers.
We call that the dynamic layer stack.
What was missing from this prototype was a way to do compilation.
Um, and that’s prototype number two.
In prototype number two, given a user function, we just captured it using this mechanism called torch.fx that produces a graph.
And then, using that graph, we could lower it to a backend like NNC.
Now, the problem was we couldn’t actually do things like vmap over the function or call grad over the function, because our vmap and autograd were written in C++,
and this torch.fx tracing only worked in Python.
In JAX, all of these things really work together.
So our vmap and grad were written in C++ and torch.fx was written in Python, whereas in JAX you have all these things written in Python, and you just do the vmap and grad transformations and then you can trace underneath all of that.
But we found a solution to this problem, right?
Yeah, we ended up finding a solution to this problem.
It was very much inspired by things you told us, actually.
Using the JAX design as an inspiration, we were like, okay, why don’t we just somehow get torch.fx into the dispatcher and just trace out everything.
And so that’s what we ended up doing.
Horace ended up prototyping a basic version of this that he called Python key.
And then, Ed, you implemented a more general version of this called torch dispatch.
And so what that let us do was actually trace what the dispatcher was doing using torch.fx.
So, we ended up smashing all of these prototypes together.
We had vmap implemented in C++, Autograd implemented in C++, and this mechanism that let us switch which of those two comes first.
And then we had torch dispatch plus torch.fx, which actually let us trace out what we were doing and compile it.
And during this time you were giving these to users, was this the winner for the users you were working with?
Yeah.
So, the users actually liked both of the prototypes initially.
Um, in the Python-only torch.fx prototype, we actually re-implemented Autograd and vmap a little bit.
So it wasn’t that the users liked one prototype more than the other.
It was that the users liked both of the two existing prototypes.
And we wanted some technical way to not repeat ourselves.
Okay.
So, at this point in time, if I recall correctly, you guys had been greenlit to work on Functorch properly and make it into a real thing.
What happened at that point?
Yeah, so we took this final prototype, we created the Functorch repository on GitHub, and that’s where we placed the prototype.
And we were interested in getting even more users to actually install the prototype and use it.
Previously, the prototype was just on a branch of PyTorch, but by putting it into its own repository and separating it out from PyTorch, we were able to do separate releases of it and actually get it out to more users than we could before.
So, there were, um, two general work streams, um, that we were going for.
So, like, one was, uh, compilation.
And for compilation, what we had was an API equivalent to jax.jit.
We had this thing called NNC JIT, which basically traced your code and lowered it down to NNC.
But Horace was curious to see if, um, we could actually use that with regular PyTorch, existing PyTorch models without changing the existing PyTorch models too much.
And that’s how AOT Autograd was born.
AOT Autograd is this mechanism that lets you trace both the forward and backward passes of a function or a model, and then lower them to a compiler.
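For a feel of what AOT Autograd exposes, here is a hedged sketch using the functorch.compile.aot_function entry point; the exact module path and keyword names have moved around between releases, so treat this as illustrative rather than canonical.

import torch
from functorch.compile import aot_function  # location may differ by release

def print_compiler(fx_module, example_inputs):
    # called once with the forward graph and once with the backward graph
    print(fx_module.code)
    return fx_module  # return a callable that actually runs the graph

def f(x):
    return torch.sin(x).sum()

aot_f = aot_function(f, fw_compiler=print_compiler, bw_compiler=print_compiler)
x = torch.randn(4, requires_grad=True)
aot_f(x).backward()  # traces and prints both the forward and backward graphs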
On the transform side, the vmap, grad, et cetera side, we did some significant hardening work.
You’ll notice that I’ve been talking about vmap and grad, but there’s another transform called jvp for forward mode AD.
We didn’t have it at that point, JAX did, and some folks were asking us for it.
We ended up leveraging a bunch of existing work streams in PyTorch to really harden the function transforms.
On the forward mode AD side, Alvin and Jeffrey from the PyTorch core team were already working on a forward mode AD implementation in PyTorch, and we ended up reusing that for
Functorch’s jvp transform. We also wanted to do some significant testing of our operator coverage.
And one thing that was just beginning to brew in PyTorch back then was this thing called OpInfos.
So Mike Ruberry and Natalia made this database of operators with sample inputs.
And you could test an operator by querying the database for the operator and some sample inputs, and then just running the operator with those sample inputs.
And so we leveraged that to do full-on operator testing for vmap and for all the other Functorch transforms.
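A hedged sketch of what driving tests off the OpInfo database looks like; op_db lives in a private testing module, so the import path and attributes here are assumptions that may differ across PyTorch versions.

import torch
from torch.testing._internal.common_methods_invocations import op_db  # internal API

# pick one operator's OpInfo entry and run its sample inputs through the op
op = next(o for o in op_db if o.name == "sin")
for sample in op.sample_inputs("cpu", torch.float32):
    result = op.op(sample.input, *sample.args, **sample.kwargs)
    # a vmap test would compare a batched call against a per-sample loop here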
So we basically went along these two routes, and we kept on hardening Functorch until our beta release in March of 2022.
So it’s been, uh, more than half a year, uh, I guess half a year? More than half a year since then.
Yeah. What have you been up to since then?
Yeah, so the beta release was fairly recent.
There are two different stories of what’s happening to the transform workstream and to the compilation workstream.
But in general, our design philosophy over the past half year has been to try to make sure that Functorch fits well into PyTorch.
So in particular, we’ve been trying to move Functorch into PyTorch and just upstream everything.
In fact, we’ve done it, right? Functorch now lives in the PyTorch repository.
In fact, we have upstreamed everything, but there is still some API work left to be done.
On the eager transform side, we’ve just been trying to make sure that the Functorch transforms compose well with existing PyTorch constructs, and vice versa.
So that’s been number one. And number two is that a lot of the Functorch APIs sort of already exist in PyTorch,
but they just don’t work as well.
So there’s this older torch.vmap in PyTorch; Functorch’s vmap supersedes that.
Then PyTorch has an NN functional module API, and Functorch also has that.
And so we’re working on trying to pick one to go with in the future and deprecate the other one.
PyTorch already has ways to compute Jacobians and Hessians in the torch.autograd.functional library.
However, you cannot actually vmap over those. Or you can vmap in some cases, but it doesn’t
work in all cases. And users have tried this. And so, on the transform side, we have been
hardening Functorch in order to be able to wholesale replace parts of existing PyTorch APIs.
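To make the comparison concrete, here is a small sketch, under the assumption that the transforms live in torch.func, showing the existing torch.autograd.functional Jacobian next to the transform-style one that composes with vmap.

import torch
from torch.func import jacrev, vmap  # functorch.jacrev / functorch.vmap in older releases

def f(x):
    return x.sin()

x = torch.randn(3)
J_old = torch.autograd.functional.jacobian(f, x)  # existing PyTorch API
J_new = jacrev(f)(x)                              # transform-style Jacobian
assert torch.allclose(J_old, J_new)

# the transform composes, e.g. a batch of Jacobians in one call:
xs = torch.randn(10, 3)
Js = vmap(jacrev(f))(xs)  # shape (10, 3, 3)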
On the compilation side, um, this thing called Torch Dynamo came along. So, Torch Dynamo is this, uh,
Python bytecode tracing JIT. What it really means for users is that you can rely on Torch Dynamo to
capture PyTorch code and not have to, like, constrain yourself to limitations of tracing.
Like, one thing we did hear from users was that sometimes it was annoying to write code that
could be traced by jax.jit. Torch Dynamo sort of lets you completely avoid this. You can write
whatever you want and it will capture the pieces that are actually traceable. And so
this gives you a better UX. There’s no need to worry about the constraints of tracing.
So, we felt like Torch Dynamo was a better user story, um, for the PyTorch compilation story.
However, AOT Autograd is still alive somewhere. Torch Dynamo works at the Python
level. It gives you a trace of a Python
program. In order to do things like compile through model training, where you need to compute
gradients, you then need something that can read into what the C++ Autograd code is doing. And that’s
where AOT Autograd comes in. In fact, I wouldn’t say it’s just alive. I’d say it’s an integral
part of using Dynamo to optimize training code. It just would not work without AOT Autograd at all.
Right. All right. So, what’s coming next for us from Functorch?
Yeah. So, Horace and I started Functorch because we were in awe of what you
could do with JAX, and there’s still a long way to go. I’ll just give you two items. I don’t want to
promise too much. So, the first one is compilation and performance. We haven’t
really kept an eye on numbers. From working with our users, we found things like,
your computation is something like 5x faster than what it would
have been, for some use cases. But although we know that Functorch is faster at actually computing
these quantities than naive ways to do them, we don’t actually know what the best achievable baseline is.
So we definitely want to focus on finding out whether we can get additional performance out, and
we want to make sure that Functorch actually works well with the compilation pipeline.
So, that’s number one. Number two is we’d like to improve the set of PyTorch-like functions, or
functions written in PyTorch, that can be transformed over using Functorch. In particular, we don’t support
some existing PyTorch constructs yet. Users have really asked us for autograd.Function support.
So that’s one of them. Some other things that users have asked us for are things like
data-dependent control flow, where you do things like write an if statement that is conditional on
the data of a tensor. So, if x is greater than zero, do something; otherwise, do something else. Or while loops,
where the while loop condition is conditional on a tensor. And so, people want to actually write
data-dependent control flow and do things like vmap over them. And that’s something you can do in JAX.
You cannot do that in Functorch yet.
All right. Well, thanks a lot for coming to tell us about the history of Functorch, Richard.
Cool. Thanks for having me, Ed. Goodbye.
EP66 PyTorch-2.0
Hello, everyone, and welcome to the PyTorch Dev Podcast.
So you may have seen in the news that we have announced the release of PyTorch 2.0.
If you haven’t seen it already, Soumith has a keynote talk from the PyTorch Dev Conference,
which you can go check out to see a, you know, sort of very quick overview of all the concepts behind PyTorch 2.
Today’s podcast is going to be the beginning of a series of podcasts diving deep into all aspects of PyTorch 2.0.
Today’s podcast, I just want to talk a little bit about the high-level constraints behind PyTorch 2,
sort of just do a little bit of an elaboration over Soumith’s talk, you know,
go into a little bit more of the details about, you know, what we were thinking about
and, you know, what you should expect when you start digging into the components of PyTorch 2.
Accompanying the release of this podcast are two docs that we wrote about half a year ago,
sort of setting the goals for PyTorch 2.
It’s the PyTorch 2 manifesto and the PyTorch 2 architecture documents.
I went through them and didn’t have to edit them very much.
So we did a pretty good job of setting up what we wanted to do half a year ago.
And if you are more of a fan of the written text, you can go check those out.
And they’ll also talk about the things we’re going to talk about here.
Okay, so PyTorch 2, what is it?
Well, you know, if we think about the user experience, what we’ve got is we’ve got a new function called torch.compile.
And when you put it on your models, things go faster.
So that’s basically like at a very, very high level, what to expect.
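In code, the user-facing surface really is about this small; a minimal sketch (the function here is just a made-up example):

import torch

@torch.compile  # defaults to the Inductor backend
def f(x):
    return torch.sin(x) ** 2 + torch.cos(x) ** 2

x = torch.randn(1000)
f(x)  # the first call compiles; subsequent calls reuse the compiled code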
But of course, this is the PyTorch dev podcast.
So we want to look a little deeper.
So the question we have here is what exactly is torch.compile doing when you actually do this?
What the heck is going on with all the components?
Why is this different from the various compilation methods, like TorchScript and FX, that we’ve done before in PyTorch?
All right, well, let’s try to unpack this.
So at the top level, when you look at PyTorch 2, there are a few important components.
So first, there’s a graph acquisition mechanism.
That’s torch dynamo, where you essentially have a symbolic evaluator for Python bytecode.
It goes ahead and looks at your Python code.
It tries to understand as much as it can.
Whatever it can understand, it, you know, sort of steps through it, bytecode by bytecode,
and gives you a graph representing the tensor operations that happen during that segment.
If it doesn’t understand something, then it says, oh, well, whatever, and goes ahead and uses the Python interpreter,
the regular Python interpreter, as a backup mechanism.
So you have Dynamo.
When you have Dynamo, you’ve got a bunch of these graphs.
And what you need to do is you need to actually, you know, incorporate these graphs into a, you know,
Python program that might have a bunch of regular Eager kernels in it.
Because we said that this is not a full graph capture mechanism.
It’s a partial graph capture mechanism.
And so to do that, well, we need some sort of mechanism for integrating with the traditional eager mode automatic differentiation system.
And that mechanism is called AOT Autograd.
It takes a graph and turns it into a custom Autograd function that knows both how to run forwards and backwards.
And of course, these forwards and backwards are also represented as graphs.
And then what we do is we go ahead and send those on to a compiler.
And the compiler that we’ve been, you know, advertising the most with the most recent release is Torch Inductor,
which is what we call a define-by-run compiler built on top of Triton, which, you know, just actually knows how to go ahead and compile a bunch of code.
So three big components, right?
So you’ve got the graph acquisition, then you’ve got graph lowering, and then you’ve got actual backend compiler.
And, you know, if you aren’t paying too close attention, this might sound like, you know, the regular story that you’ve heard over and over again about all sorts of things, you know, when you want to compile deep learning models.
So what makes PyTorch 2 different?
Like, why did we not do this, you know, five years ago when we embarked on building TorchScript?
How come we couldn’t use TorchScript to do these things?
You know, what is peculiar about the system that we’ve set up here?
So there are a bunch of things that I want to call your attention to.
But the first and foremost one is that PyTorch 2.0 is a partial graph mechanism.
Now, I’ve already mentioned the word partial graphs.
And to just unpack the definition of partial graphs a moment, what I mean by partial graphs is that when I’m running my compiler, I don’t expect to actually necessarily be able to compile my entire program.
Now, if I can compile my entire program, that’s great.
I’m not going to, like, purposely stop myself from compiling the entire program.
But it’s a non-goal to get it all the time.
And this is, you know, in deep contrast with lots of other sort of mechanisms, like, you know, if you think about TensorFlow or you think about TorchScript, these are all, you know, predicated on sort of whole graph acquisition mechanisms where you want to get the entirety of your program into some format.
And indeed, there are some good reasons to want to get the entire graph.
For example, if you want to ship a model to mobile, something that PyTorch does support but, you know, is not as first class a citizen as, you know, if you were, for example, programming on top of TF Lite, to ship a model to mobile, you would need to actually have the entire model, right?
You couldn’t actually, you know, have an interspersed mix of, you know, a bunch of operators that you’ve compiled from partial graphs and then a bunch of Python code.
That wouldn’t work.
Well, unless you were, you know, going to ship a Python interpreter to your mobile phone, which, you know, maybe is a good idea.
But, you know, let’s set that aside for a moment.
So, you know, there are a bunch of use cases where you just don’t want to have a Python interpreter and so you naturally gravitate in towards, you know, having a, you know, full graph export mechanism or, you know, you might try and say, okay, well, I want my entire programming language to be differentiable and I’m going to build my deep learning compiler on top of an entire programming language that I can understand.
But hey, we’re PyTorch, we’re built on top of Python, we have a lot of users using Python, they don’t necessarily need to export their graphs to a runtime that doesn’t run Python at all.
And in return, what we get for saying, okay, well, sometimes we just don’t understand your Python code, and we’re going to fall back to the Python interpreter.
What we get in return for making this assumption is we don’t have to do the sort of mind-crushing coverage problem that is, well, now you need to understand the entirety of the Python ecosystem.
Whenever there is something in your program that we don’t understand, whether it’s a Python language feature, a call to an external library, or even an operator that, you know, is kind of weird and unconventional in a way we can’t reason about,
we can just say, well, okay, fine, we’re going to stop compiling here.
And then we’re going to go ahead and, you know, go back to the Python interpreter. And sure, you just got a partial graph, but that’s fine. As long as your partial graphs are big enough, you know, you’re going to get most of the benefits from compilation.
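Here is a hedged sketch of what a partial graph looks like in practice; it relies on the fact that torch.compile accepts a callable backend, and uses a print() call as a stand-in for something Dynamo does not trace into.

import torch

captured_graphs = []

def my_backend(gm, example_inputs):
    # torch.compile calls this once per captured (partial) graph
    captured_graphs.append(gm)
    return gm.forward  # just run the captured graph eagerly

@torch.compile(backend=my_backend)
def f(x):
    x = x * 2
    print("this falls back to the Python interpreter")  # causes a graph break
    return x + 1

f(torch.randn(4))
print(len(captured_graphs))  # expect 2: one graph before the break, one after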
Why is that? Well, you know, to think about this, we have to think about, you know, why was PyTorch eager mode viable at all in the first place?
And the reason why PyTorch eager mode was viable in the first place, because, you know, naively, you might expect that, hey, you know, you’re writing Python all the time, you know, isn’t that going to be really slow?
Aren’t you going to have a lot of framework overhead? The answer is yes, there is a lot of framework overhead in PyTorch.
And in fact, PyTorch is not a very good match today, well, prior to PyTorch 2, for handling overhead-bound models.
But what it turned out was that, you know, with lots of operations that people wanted to do, you do a single, say, matrix multiply call, and that actually needs to do a lot of flops.
And so actually, the operation takes a lot of time on the GPU.
And as long as you can keep the GPU busy, right, you don’t have to outrun the bear, you just have to outrun, you know, the other guy, in this case, the, you know, actual GPU processing.
So as long as you can run your Python code faster than the GPU can actually crunch the numbers, then you’re fine.
It doesn’t actually matter how long or how much overhead your framework has, because you can just go ahead and hide it, because you’re waiting on the GPU anyway.
And so this was true for PyTorch for a very long time.
And it turns out that GPUs get faster and faster over time.
And this is one of the reasons why, you know, we knew strategically it was really important.
It was an existential problem for PyTorch.
If we didn’t get our act together and figure out a way of running, you know, bigger chunks of code so that we weren’t overhead bound, whenever people upgraded to V100s and then to A100s and then to H100s, the GPUs get faster and faster.
And then suddenly, you know, you’re at this point where previously you could cover it up, waiting for the GPU to come back.
But now the GPU is so fast, you can’t cover up the framework latency at all.
So, you know, we’re saying, hey, okay, GPUs are getting faster and faster.
And so dispatching kernels one by one, as you wrote in eager mode, is just not cutting it anymore.
But if we can take a bunch of kernels, and it doesn’t have to be the entire program, right?
It just has to be enough kernels so that we can bundle them all up and run them all at once.
And now, once again, the GPU compute is now taking a long time.
If we can do that, then you’re fine.
And you don’t, you’re once again back in the regime where you’re, you know, bound by the GPU.
And you’re, you know, you’re happy because, you know, you didn’t require a whole graph export mechanism.
So, you know, we can fall back to Python, whenever there’s something that doesn’t work very well.
And you don’t rely on a whole graph mechanism, because, hey, you can fall back to Python whenever you need to.
But at the same time, you’re getting large enough partial graphs, so that you can cover up the overhead of actually dispatching to the GPU.
And that’s perfect, because we’re actually hitting this new sweet spot where we’ve pushed the Pareto frontier.
Previously, you had to, you know, make a trade-off between, oh, you know, a nice user-friendly Python-native experience, versus, you know, a not-so-user-friendly compiler experience.
And so now we have a new point in the trade-off space, where we can still, you know, get the nice eager mode UX that everyone knows and loves about PyTorch.
But at the same time, we’re actually compiling things.
Now, of course, we do have to give up some stuff to get here.
And, you know, one of the big things we have to give up here is the stack is kind of complicated.
And, you know, Dynamo, right, is a symbolic Python bytecode interpreter.
What does that mean?
It means that, you know, when you run Python programs, your Python interpreter turns your Python source code into a bunch of bytecodes.
And then there’s an interpreter that goes over the bytecodes one by one and actually executes them.
So we needed to reimplement this interpreter so that we could, you know, go ahead and look for tensor operations and handle them specially, right?
That’s basically the entirety of what Torch Dynamo does.
And we had to do it.
And, you know, that’s a new piece of code, which is sort of complicated and can have bugs in it.
And, oh, yes, we do have bugs in Torch Dynamo.
And then, of course, we need, you know, the rest of the stack, such as AOT Autograd for actually performing differentiation and then, you know, Inductor for actually compiling code.
So there’s a lot more stuff going on in PyTorch right now.
And so you might also have the question, which is, is it worth it, right?
Like when you write traditional eager mode programs, you know, it’s sort of very simple.
You know, you call a function, you execute the code in the function, and then you’re done.
And that’s it.
Nothing else to do.
Whereas in this new stack, you know, there’s all of these different moving parts, you know, like how can you even tell what’s going on?
And so this leads us to a second thing, which I think is really, really important for PyTorch 2, which is that all of the important code in PyTorch 2 all lives in Python.
So what do I mean by that?
Well, Torch Dynamo is a, you know, symbolic bytecode interpreter.
Traditionally in CPython, you would, of course, want a bytecode interpreter to live in C because, hey, it’s kind of important, right?
It needs to run fast.
Well, we have plenty of caching, right?
Once we have processed a given frame and, you know, evaluated all this bytecode, we don’t need to do this evaluation again.
We’re just going to, you know, jump straight to the actual, you know, graph that we extracted and compiled before.
So we can actually run Torch Dynamo in Python and it is implemented entirely in Python.
You can set PDB breaks in it.
You can, you know, single step through it.
It’s actually a really nice way of understanding, you know, what is going on.
And it’s fine.
Like, I actually was worried a lot about the performance overhead of, you know, running Torch Dynamo in Python.
But it turns out it doesn’t matter.
Like, there’s plenty of other parts of the system that are slow.
And similarly, Torch Inductor is a back-end compiler.
And, you know, traditionally, back-end compilers are written in C++ or some sort of similar compile language.
When we wrote the first version of TorchScript, we actually, we wrote it in C++ specifically because we wanted static types.
Knock on wood.
But Torch Inductor is written entirely in Python as well.
So, you know, once again, if you are so inclined, you can go and check out all the different pieces of it.
Now, it does backend to Triton, which is written in C++.
But there’s a sort of very clear abstraction boundary.
There’s, you know, a Triton front-end language, written in Python, that we generate.
And so, you know, sure, Triton can have bugs.
And, you know, Triton does have bugs.
But you don’t have to deal with it in this situation,
because the parts that are actually generating the Triton code, the parts in Torch Inductor,
those are entirely in Python.
And now, I lied a little because AOT Autograd isn’t entirely written in Python.
It’s got a lot of stuff in C++.
But the stuff that AOT Autograd runs in C++ is sort of just pre-existing components of PyTorch.
And this is another important constraint when we were thinking about what to do with PyTorch 2,
which is that, you know, we had this shiny new bytecode interpreter in Dynamo.
And if we wanted some sort of automatic differentiation system,
one way you could go about doing it is just by re-implementing our AD system in Python,
so you could Dynamo trace through it.
But we decided not to do that.
Now, whether or not this was the right call or not,
it certainly saved us a bunch of time in terms of implementation.
Our choice was to instead reuse the pre-existing C++ Autograd engine.
You know, and as a benefit from that, we get all of the edge case handling,
all of the sort of battle-tested work that we’ve put into the engine over the years.
All of that transfers over to PyTorch 2.
So you don’t have to worry about Autograd working differently when you run into PyTorch 2.
All we’re doing is we’re just going ahead and tracing the set of operations that the original Autograd engine would have done,
and then, you know, using that as the basis for a compiled program.
Now, one downside to that is we had to work pretty hard to get dynamic shapes to work in this situation.
So that’s why it’s not entirely clear to me if it was a win.
You know, we traded off, you know, having to do some fairly major surgery to the internals of PyTorch
to, like, make it support propagating dynamic shapes throughout.
But, you know, like, we have a system that, you know, just really is reusing most pre-existing components of PyTorch.
So, you know, in this sense, it really is additive.
The truly new parts, like Dynamo and Inductor,
have no pre-existing analogs in PyTorch.
And the parts that do overlap with PyTorch are actually using the same code in these cases.
This is not entirely true.
So in some cases, you know, we have operator implementations,
and we opted to just go ahead and re-implement them in Python.
But that’s a very small part of the system.
And sort of the core subsystems are all shared in this case.
Okay, so what have we talked about?
So we’ve talked about, you know, what is PyTorch 2, right?
So PyTorch 2 is a way to make your programs go faster.
And the way it does that is by, you know, allowing us to compile fragments of PyTorch code,
but without the constraint that you have to compile the entirety of your program.
And what that means is that, you know, unlike TorchScript,
where you have to actually go and, you know, modify your programs so that they are TorchScriptable,
in PyTorch 2, you know, you can generally just slap a Torch.compile on any function,
and it will generally work.
Now, you might not get good performance.
If there’s too many graph breaks, then, you know, you might not see any benefit at all.
But, you know, it’ll always work.
Or, you know, if it doesn’t work, you should send us a bug report.
And, you know, if there’s anything weird, you know,
we will be able to handle it without having to do special workaround code.
And it turns out that this is good enough.
We get speedups, pretty good speedups, in fact, without having to capture the entirety of the model
and without having to give up the nice eager mode UX.
And the rest is execution details.
Coming up in the future, what we’re going to try to do is I’m going to try to walk through
all of the components in PyTorch 2.
You know, if you’re wondering how it works or, you know, you’re just trying to get involved in the process.
You know, there’s a lot of different pieces.
And, yeah, I’m looking forward to sharing a lot more about PyTorch 2 with you in the future.
That’s all for today.
Talk to you next time.
EP67 torchdynamo
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about the very first part of the PyTorch 2 stack, namely Torch Dynamo.
This is the component that you interact with when you, for example, use Torch.compile.
That just means that you’re turning on Dynamo.
Dynamo is going to collect up graphs and then pass it on to a compiler.
So there’s a lot of things that go on, but the very first thing that we have to do is
actually get the graphs from your Eager program.
And this is where, you know, Dynamo does something a little different.
As we’ve mentioned in many different places, the idea behind Dynamo is we are going to go
ahead and take your Python program as is, do an analysis on the bytecodes of your Python
program, and use this to figure out what the actual tensor operations in a given piece of
Python code are.
So what I want to do in this podcast is I want to go in a little more detail about what exactly
that means, and also what the implications of setting up a graph capture mechanism in this
way are.
Because there are some questions that you might have about whether or not Torch Dynamo will
work on a given program or not.
And those questions can often easily be answered by just knowing a little bit about how Dynamo
is supposed to work.
And in particular, knowing how Dynamo is supposed to work can also help answer a question, which
is, you know, does this not work just because there’s a bug in PyTorch or does it not work
because there’s some deep fundamental reason?
And so I just want to pull back the covers a little in this podcast to help you, you know,
make assessments like that about whether or not Dynamo is correct for a given situation
or not.
All right.
So let’s talk a little bit about the high-level UX behind Dynamo, and then we’ll dive a little
bit into the big design concepts here.
So the UX behind Dynamo, right, is it’s the Torch.compile UI.
So you have this method called Torch.compile.
You can decorate a function with it.
And what Torch.compile does is somehow makes your program go faster.
And the way it makes your program go faster is you have a Python program.
This Python program normally does some stuff, right?
It does some tensor computation.
It might also, you know, print some lines out.
It might also, you know, go ahead and modify some Python data structures.
And Dynamo’s job is to take this Python program, this stream of Python bytecode instructions,
and turn it into two pieces.
One is a graph of tensor operations.
This graph of tensor operations is what we’ll actually pass on to the compiler
and hopefully compile into some form that can run more quickly.
The other thing Dynamo does is it takes your Python program
and rewrites it into what I call a residual Python bytecode program,
which simply goes ahead and calls that graph, that graph of tensor operations that you saw before,
as well as does whatever extra Python operations that were necessary
because, you know, those were the Python operations that your program did.
So somewhat unusually, for example, if you had a function and it, for example, added a number
to some global variable in your program, normally you’d think, well, that’s kind of weird, right?
Like, that’s not something that I want to show up in my tensor program.
It’s just good old Python code.
Surely Dynamo can’t handle that.
Well, the answer is no.
In fact, Dynamo can.
Dynamo sees that there is this operation going on when it’s analyzing the bytecode
and it makes sure to replicate it when it extracts out your program.
So just because you go ahead and increment a counter in the middle of your program
doesn’t mean that we do what’s called a graph break.
That is, say, Dynamo throws up its hands and is like, well, I don’t know what’s going on,
so I’m just going to bail out and ask Python to do the thing.
Dynamo actually understands a lot of operations in Python code.
And this is important because in order to get good compilation results,
we need to be able to capture enough of our program.
And to capture enough of our program, well, we need to not be breaking on every little thing.
I also want to point out that there’s an important philosophical consideration behind this,
which is that we always have the Python interpreter available.
So although Torch Dynamo does a lot of work to understand
as many constructs in the Python language as possible,
it also has permission to not understand things.
If something is too complicated or, you know, too annoying for us to implement,
maybe you’re calling some, you know, giant third-party library
and it’s, you know, doing web requests or something like that.
If there’s something too strange, too unusual, Dynamo has the ability to say,
okay, fine, I am just going to run your code as is, as in Python.
And, you know, we’re not going to actually see the rest of your program.
So we hope to capture as much as we can, but we are not forced to catch everything.
And this was really important, you know, when we were working on Dynamo,
well, when Jason Ansel was, you know, developing the very early versions of Dynamo,
because in fact, there were a lot of features in the Python language you need to implement
to get a lot of benchmarks going.
But he didn’t need all of them implemented all at once at the beginning.
He could start off with, you know, just a subset of the features he needed,
and some models would work well, and some models would have lots of graph breaks.
And then as we improved Dynamo, there would be less and less graph breaks in your programs.
So, you know, that’s also kind of the situation you should expect here,
which is that, hey, you know, maybe you run your program through Dynamo,
and you get a single graph.
Hooray, nothing left to do.
But maybe you run your program through Dynamo, and you get a lot of graph breaks.
Well, don’t despair, right?
Maybe in the next version of PyTorch, or, you know, maybe even before the stable release,
there might be work done to actually understand the things that are tripping you up.
And then, you know, you can figure it out that way.
There’s actually a configuration flag in Dynamo that you can turn on to give warnings
whenever there are graph breaks.
And, you know, if you think you’ve got a reasonable model that, you know,
should work and is graph breaking, send us a bug report.
And, you know, we’ll look into it, because, you know,
we’re definitely interested in helping Dynamo understand more things.
Okay, so what do we say so far?
So we’ve got Dynamo, right?
It understands your Python program and converts it into a series of tensor operations
and a series of residual Python operations.
And I also want to talk a little bit about what kind of graph you get from Dynamo, okay?
So, you know, if you have tried playing around with a custom backend,
Torch.compile makes it really easy to play around with a custom backend,
because you can just pass in a function to be your compiler,
and you’ll just get an FX graph, which represents the computation in question.
So, you know, FX graphs, and you can see my podcast on FX graphs if you’re curious more about them.
An FX graph is just this very simple, you know, data structure representing Python programs.
So, you know, it’s got a list of nodes, you can iterate through the nodes,
and there’s various calls on the nodes to various Python functions.
And it’s very, very flexible.
It’s really just a container format.
It’s not a true IR, because all of the function calls inside an FX graph
are just actual callables that, you know, are the ones that you actually call in Python.
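As a quick sketch of that, assuming nothing beyond what torch.compile documents about callable backends, you can print the FX graph you are handed:

import torch

def inspect_backend(gm, example_inputs):
    # gm is a torch.fx.GraphModule holding the captured tensor operations
    print(gm.code)       # the graph rendered as generated Python code
    return gm.forward    # run it as-is; no real compilation happens

@torch.compile(backend=inspect_backend)
def f(x, y):
    return torch.relu(x) @ y

f(torch.randn(4, 4), torch.randn(4, 4))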
So what exactly does an FX graph that Dynamo gives you look like?
Now, if we weren’t, you know, doing Dynamo at all, right,
I could just have my program be represented as a single function call in my Dynamo graph,
which called into whatever the original user code was.
But Dynamo doesn’t do that, right?
Because one of the things that Dynamo is doing is it is understanding what exactly your Python code is doing
and making sure that it produces a graph that is always valid to use in some later iteration.
So suppose I just had some black box Python function as the only thing in my IR.
Well, for one, how the heck am I going to compile that?
And the answer is, well, with tracing, but, you know, having a single opaque function is not all that useful.
But also it’s because Dynamo needs to keep track of, you know, whether or not
this graph is valid in the future.
And to do that, it actually needs to look into things.
So what you’ll actually expect to get is you’ll actually expect to get a bunch of PyTorch operations.
If you had a bunch of calls to user functions, you should expect those to be inlined into the Dynamo graph.
So you’re not going to see a bunch of recursive function calls.
You’re just going to see a straight line program that has all the operations you need.
You’re not going to expect to see loops in the graph because, in fact, you know, FX IR does not natively support loops.
All your loops will be unrolled.
All your conditionals will be flattened.
You won’t see conditionals in your Dynamo graph.
You’ll basically have a straight line program of a bunch of Python calls.
Now, this is very nice and normalized, but it’s actually not that normalized.
So here are some things that you’re not going to get directly from Dynamo.
So one thing you’re not going to get is you’re not going to get a backwards graph.
To get the backwards graph, you need another component, AOT Autograd, which I’ve had a podcast about with Horace before.
But we’re going to do another podcast about, you know, the PyTorch 2 specific implications of AOT Autograd.
Suffice to say, you aren’t going to get the backwards.
So you want to use AOT Autograd to do that.
In fact, there’s an API change winding its way through, where we’re probably going to change the default behavior of torch.compile.
If you feed it a function, we’re not going to give you all the Torch functions anymore.
We’re going to call you once for forwards and call you again for backwards.
That probably is more likely what users want to see.
So, you know, stay tuned for the API change.
This doesn’t affect you if you’re using just the inductor backend.
But for all you compiler backend writers out there, this is probably a change.
And before this change actually lands, you probably do want to be using AOT Autograd because, you know, you actually do want Autograd support for your compiler.
You also get some other things.
So some other things you don’t get from this graph.
So it’s going to be calls to the Torch API.
It’s going to be the calls to the Python API.
It’s going to look very, very similar to the actual function calls that were in your original program.
Now, we actually can normalize this IR a bit, right?
So these Torch function calls have all of the weirdnesses of the Python API.
For example, you can call reshape on a tensor and you can pass to reshape either a tuple of sizes you want, or you can just, you know, get rid of the tuple and just pass them in one by one as positional arguments.
This doesn’t get normalized at all.
You’ll see exactly what the user wrote in that situation.
To get this normalization, and to also, you know, tease apart some of these high-level operations, you might want to lower to ATen operations.
Once again, this is not something that Dynamo does built in.
This is something that AOT Autograd does; AOT Autograd is actually doing a lot of heavy lifting.
It’s not just doing autograd, it’s also lowering things to ATen.
AOT Autograd is responsible for lowering from the Torch ops API into the ATen API.
So, you know, you’re not going to get that by default.
You need to opt into AOT Autograd to get that.
One more thing, sorry, actually this is something you will see in the IR and maybe don’t want to see: if the Python program had views or it had mutation, all of that is also going to be captured faithfully.
So, really, all Dynamo is doing is, you know, it’s inlining away and removing all the Python constructs from your program, but you’re really just getting like a forward-only, you know, very idiomatic PyTorch program.
And that’s sort of easy to understand, but it’s actually not so easy for compilers to deal with.
In fact, compilers have a lot of headache dealing with mutation and views.
Just ask, for example, XLA, where, you know, their HLO IR does not actually have a concept of mutation or of views.
So, in order to also get rid of those, once again, you can probably guess where I’m going with this, AOT Autograd is responsible for what’s called functionalizing away those operations so that, you know, you get a very nice, functional, clean IR that’s good for compilers.
So, you know, what is Dynamo doing, right?
All Dynamo is doing is it’s understanding the Python code.
It’s figuring out how to remove Python constructs.
So, you’re never going to see a Python class or a Python or even a Python named tuple inside of the Dynamo graphs.
All of that gets flattened away.
You know, you’re just getting a bunch of tensors and doing operations on those tensors and then returning a bunch more tensors.
But beyond that, beyond what Dynamo can understand at a superficial level by just looking at the Python code, looking at the Python byte code, you don’t get any normalization beyond that.
That’s all AOT Autograd’s job.
Okay, so with this understanding about, you know, what Dynamo actually does and doesn’t do, we can also, you know, think about, you know, what kinds of problems are likely to show up due to Dynamo itself as opposed to other parts of the stack.
So, for example, if you are, you know, seeing that, you know, you’ve got a graph and it doesn’t look quite right, like, you know, maybe there are some operations in it that, you know, don’t look quite correct.
And this is before you’ve gone ahead and sent it to AOT Autograd.
So this is like, for example, if you just, you know, pass in a simple backend compiler that prints the FX graph in question, well, that means that it is a problem in Dynamo.
And this is one of the reasons why Torch.compile has a backend.
It’s called Eager.
It’s a very pointless backend.
All it does is it takes the FX graph and then runs it directly as is.
But it’s really useful for figuring out if you have a Dynamo bug at all, right?
So you’ve got your program, you’re trying to run it, it’s doing something weird.
So you replace the backend with Eager and now, you know, we are not doing anything interesting except running Dynamo.
And if it still fails in that case, well, you know, you found a Dynamo bug.
Similarly, if you are, you know, running Dynamo and you’re like, well, this is kind of weird.
Some of my Python state doesn’t look quite right after running Dynamo.
Well, that’s also likely to be a Dynamo bug.
And once again, you can figure out if that’s the case by switching Torch.compile to Eager.
So Torch.compile Eager says use Dynamo, but don’t actually run any of the compiler.
Don’t even run AOT Autograd.
Because AOT Autograd is its own sort of complicated component in its own right.
It also has bugs.
And so sometimes, you know, you want to like run AOT Autograd and Dynamo, but not anything else.
That’s the backend called AOT Eager.
And so by, you know, varying your backends, you can use this to figure out which part of the compiler stack is breaking.
And this is really useful.
I use this all the time when I’m working on PyTorch to figure things out.
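A small sketch of that debugging workflow; the torch._dynamo.reset() calls are there because, at least in some releases, you need to clear the compile caches before switching backends on the same function.

import torch
import torch._dynamo

def f(x):
    return torch.relu(x).sum()

x = torch.randn(8)

torch.compile(f, backend="eager")(x)      # Dynamo only: run the captured graphs as-is
torch._dynamo.reset()                     # clear caches before switching backends
torch.compile(f, backend="aot_eager")(x)  # Dynamo + AOT Autograd, but no real compiler
torch._dynamo.reset()
torch.compile(f)(x)                       # the full stack with the Inductor backend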
Okay.
So we’ve talked a little bit about Dynamo, right?
What is Dynamo?
It, you know, processes Python bytecode to get you the tensor graph and a bunch of residual Python operations.
What do you get as an output?
You get a graph.
The graph has a bunch of tensor operations in it.
It doesn’t have any Python types in it.
It doesn’t have any Python control flow or loops, but it isn’t lowered.
And so if you want to do the lowering, you have to go to AOT Autograd.
And so this, I think, is a pretty good black box description of what Dynamo does.
And so you should be able to think about this and think to yourself, you know, is Dynamo useful for my situation or is it not?
So to wrap up this podcast, I just want to compare Dynamo to a few of the other graph capture mechanisms we’ve built in PyTorch.
And we can just use this sort of bird’s eye view to like, you know, talk about the pros and cons of different approaches.
So one very obvious comparison point that people want to make with Dynamo is with TorchScript, right?
So TorchScript is the original PyTorch just-in-time compiler.
You know, what does it look like?
Well, you know, you’ve also got a decorator.
You can decorate your functions.
But unlike Dynamo, you have to, you know, make sure all of your program is what’s quote-unquote called TorchScriptable.
And what do we mean by TorchScriptable?
Well, because TorchScript is a subset of Python that our compiler understands.
And so there are some Python features you’re allowed to use, some features that you’re not allowed to use.
And so depending on whether or not you use those features or not, you know, your program may be TorchScriptable or not.
So let’s do a little bit of a comparison here.
So what does Dynamo do?
So I said Dynamo understands your PyTorch program at the bytecode level.
So Dynamo processes the, you know, bytecode stream that your Python interpreter compiled your program to.
TorchScript, on the other hand, processes Python ASTs.
So it actually takes your Python program, you know, produces an AST for it using, you know, for example, a standard Python AST parser and then attempts to map that into its own internal intermediate representation that can represent all the things that are in a normal Python program.
So this is where this is like a major philosophy difference, right?
When Dynamo gives you a graph, this graph is completely inline.
There are no loops.
There are no data structures.
In TorchScript, all of those constructs are preserved, right?
So if you have a loop that is TorchScriptable, then you will get that loop inside TorchScript.
And so that makes TorchScript really good for, well, okay, of debatable goodness.
But one of the things that TorchScript really got used for a lot early in its lifetime was for sort of export situations where, you know, you were doing a beam search and you wanted to loop over various elements.
And then you wanted to capture that loop as is and then ship it to some other environment.
TorchScript can do that for you because it understands loops.
It has an understanding of many different Python data types like mutable lists.
So, you know, if you stay in that subset, you know, it’s basically like a tiny scripting language that happens to be runnable in C++ without the GIL.
And, you know, that is beneficial in a lot of situations.
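For contrast with the straight-line graphs Dynamo produces, here is a small sketch (the function is made up) of how TorchScript keeps a loop in its IR:

import torch

@torch.jit.script
def repeat_add(x: torch.Tensor, k: int) -> torch.Tensor:
    # TorchScript keeps this loop and list in its IR instead of unrolling it
    results = []
    for i in range(k):
        results.append(x + i)
    return torch.stack(results)

print(repeat_add.graph)  # the loop shows up as a prim::Loop node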
The downside to doing it this way is that TorchScript programs are a lot more difficult to compile, right?
Because you’ve got these random Python lists running around.
You’ve got, you know, all sorts of weird data structures running around.
You basically, you know, can’t really compile a TorchScript program as is.
You have to sort of extract out the, you know, functional graph bits first, and then you can actually compile those.
And like, you know, that’s a bit of a step.
And, you know, like oftentimes, you know, maybe there is a list data structure, but it’s always static.
And so if you had just unrolled it, then you would have gotten a nice, easy to compile sequence of tensor operations.
But no, you know, you couldn’t, you couldn’t do that, right?
Because TorchScript didn’t know that that was the case.
Compare that with Dynamo, right?
Dynamo is operating on bytecodes.
And, you know, all it’s doing is it’s inlining and, you know, getting rid of all that stuff.
So the graph you get is a lot easier to compile because it’s basically straight line code and, you know, like just in time compilers really like compiling straight line code because it, you know, it’s a lot easier to not have to deal with control flow.
And, you know, the downside of that is, right, it’s less likely that your captured code will still be valid, because what if the, you know, number of loop iterations changes?
What if some conditional changes? And so Dynamo has a lot of machinery for making sure that, you know, it knows exactly what conditions have to be upheld in this situation.
And then you can actually, you know, specialize on all those things and breathe easy knowing that, hey, next time around, if a conditional had changed or if a loop counter had changed, I’m not going to attempt to reuse the stale graph.
By the way, that’s another one of the things that, you know, if you’re thinking about ways Dynamo can go wrong, the guard infrastructure, the infrastructure which tells us whether or not we can safely reuse a graph or not, that’s the other thing that can cause problems.
And I hope to talk a little bit about some of the debugging tools we have for diagnosing if that’s one of the situations or not.
Okay, so, you know: Dynamo, simple graphs, all inlined, cool; TorchScript, complicated graphs, lots of support for Python features, less easy to compile, but you can express more programs in it.
Another comparison people often want to ask us about is FX symbolic trace, right?
So FX was a new graph representation we wrote. We did it in Python, and doing it in Python, by the way, was a really good idea.
And, you know, Dynamo is written in Python, and that makes it a lot easier to debug and deal with, right?
TorchScript is written entirely in C++.
It’s very difficult for, you know, an external person to, you know, get their hands on it and make changes.
It’s very easy to tweak Dynamo, you know, change things around and see what happens.
So FX, you know, introduced the Python IR format that we still use in Dynamo, but it also introduced this thing called FX symbolic tracing.
And what symbolic tracing is basically is it’s a Python level tracer using, you know, Python’s ability to do operator overloading to capture the things that are going on.
So, like, say you have a model, and you want to figure out what operations are in it, then you pass in, you call it with symbolic trace.
Symbolic trace, instead of passing in tensors, passes in these things called proxies, and then, you know, it looks and sees what operations get called on these proxies and records that to the FX graph.
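A minimal sketch of symbolic tracing and the kind of thing it cannot see (the functions are just illustrative):

import torch
import torch.fx

def f(x):
    return torch.relu(x) + 1

gm = torch.fx.symbolic_trace(f)
print(gm.graph)  # placeholder -> call_function relu -> call_function add -> output

def g(x):
    if x.sum() > 0:  # needs a real value, but x is only a Proxy during tracing
        return x
    return -x
# torch.fx.symbolic_trace(g) would raise a tracing error on the data-dependent branch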
So, once again, what’s the difference between this and Dynamo?
Well, you know, Dynamo is sort of morally doing the same thing, but it’s operating at a different level.
FX has to operate at the level of whatever Python’s operator overloading supports.
So, for example, if there’s a conditional, you know, FX doesn’t actually have an opportunity to see what the conditional is or do anything special.
But because Dynamo is, like, executing bytecode by bytecode, it actually can see, oh, there’s a jump condition here and do all sorts of things.
So, Dynamo, you know, is sort of morally doing the same thing as FX tracer, but because it’s doing it at a lower level, it has a lot more flexibility and ability to put in safety guards that FX can’t do.
Actually, FX symbolic trace is very, very limited in some sense, because it doesn’t even support querying shapes on tensors: it just replaces these things with proxies and says, well, I don’t know what these are.
Now, this is not a fundamental limitation, and in fact there’s a different mechanism that AOT Autograd uses, called proxy tensor tracing, where we actually maintain fully fledged proxy tensors.
And, you know, this is also very similar to symbolic tracing, but now you can actually query for the size of a tensor and get that out.
But the fact remains, right, that, like, when you run Dynamo, if you, like, call into some external library, Dynamo can notice it because it’s processing each of the bytecode instructions and say, oh, I’m calling a function into matplotlib.
That doesn’t sound good.
I should graph break here.
Any sort of Python operator overloading mechanism cannot get that level of insight into what is executing in your program.
You’re just going to go ahead and execute, you know, operations.
And only if, you know, you’re dealing with your proxies, do you actually get the callback and get to record things.
So, if there’s other stuff going on in the Python program, you have no idea what’s going on.
So, Dynamo, by hooking into the bytecode, can get all that information.
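As a hedged illustration of that point, here is a sketch of the kind of code where Dynamo would insert a graph break around an opaque Python call (log_stats is a made-up helper):

import torch

def log_stats(x):
    # an ordinary Python side effect that can't live inside a Torch graph
    print("mean:", x.mean().item())

@torch.compile
def f(x):
    y = x * 2
    log_stats(y)   # Dynamo sees this call while stepping through the bytecode
    return y + 1   # and graph breaks, running it eagerly between two graphs

f(torch.randn(4))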
So, hopefully, I’ve given you a little bit more sort of the high-level information about, you know, what Dynamo does at a high level and how it compares to other systems.
There’s plenty of other things to talk about, and I will talk about them in later podcasts.
Thank you very much for your time.
See you next time.
EP68 Zero-one-specialization
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I’m here with Mikey Dagitses, who is going to help me sort of explain a little
bit more about, you know, the PyTorch 2 model.
And so our goal coming into this conversation, as Mikey was telling me, was we were talking
about 0/1 specialization at the most recent Composability meeting, which, by the way,
we have a recording for on Twitch and hopefully on YouTube soon if you want to go check that
out.
And so Mikey is sort of newer with the PyTorch 2 project.
And so he was wondering, well, you know, what exactly does this all mean?
Like, what’s going on here?
Mikey, did I describe that correctly?
Yeah, that sounds good.
Okay.
So to start off with, what we were discussing before I decided, hey, let’s record a podcast
for this conversation, was the very concept of, you know, why are we talking about 0/1
specialization?
Why is it a problem for experts?
Like, what does this all mean?
My response to that was to say, well, hey, to understand this, we first need to know a
little bit about PyTorch 2’s compilation model, you know, in its entirety, right?
Like, so we first need to know, like, what exactly is a guard?
Why does this matter?
And so the idea we’re talking about here is that when we are running models through PyTorch
2, we run them through the compilation stack, and we get out some compiled artifact, and that
artifact may or may not be valid for certain inputs, right?
We may have done sort of specializations for certain input sizes to, you know, allow us to
hard code in these constants and make things run faster.
And so when we want to actually run these on new user inputs, we need to check whether or
not all of those things are valid.
And so if you listen to my podcast about Torch Dynamo and guards, the way we find out whether
or not those things are valid are guards.
Okay.
So let me give you a chance, Mikey.
So that’s where we were so far.
And so what was the next question you had on your mind?
So let’s, let’s do something simple.
What is the advantage of specializing on, on input sizes, zero or one?
That’s a great question.
Okay.
So you, because this is a policy decision.
So, um, what, when we, yeah, so, okay.
Rewinding a sec.
So, so in general, you can end up with a compiled artifact and we’ll have some guards saying when
it’s valid, when it’s not.
And in fact, in PyTorch 2, we have some upfront decisions we make.
We say, if you have any input whose input size is zero or one, um, and by the way, um, this
all applies under dynamic shapes because under static shapes, we just specialize on all the shapes
and there’s, there’s nothing to be dynamically varying.
Like if it was 2048 on the original run, it has to be 2048 again. But with
dynamic shapes, we try to make things, uh, be able to vary, but we still specialize if we ever see any
input that is zero or one.
So the question here is why is that a good idea?
And so the reason why this is a good idea, um, has less to do with, uh, what’s the word... In principle, we could choose not to upfront specialize on things being
zero or one and just run all of our infrastructure in PyTorch, which is responsible for figuring out
whether or not, um, expressions, you know, need to be guarded on or not.
And we would get a better result, a better result in the sense that you would have less
guards than if we had eagerly specialized on zero or one.
So, you know, it’s a very valid question to ask, well, why do we eagerly specialize on zero
or one?
So there’s two answers to this question.
So the first answer is this is for performance reasons, because when you, uh, you know, make
something symbolic, um, we have to do a lot more reasoning about it.
It’s kept as a sympy variable.
It can build expressions.
These expressions might, you know, involve lots of additions and things like that.
And we can’t ever simplify it down to, oh, it’s five, um, or, oh, it’s eight because, you
know, we didn’t specialize.
We’re trying to find out.
And if it turns out later in your program that you end up specializing anyway, because there
is a guard, there’s a condition on it or something like that, and it actually is one or zero, then you’ve wasted all that time doing all the symbolic reasoning,
um, ahead of time when you could have just, like, specialized it to be one or zero.
And then your tracing would have gone a lot faster because you just do constant propagation.
You’re not doing any symbolic reasoning.
So we’ve, we’ve measured this and empirically zero one specialization buys you a lot in terms
of trace time because, well, for one, it’s not that common to show up in inputs.
And two, and this leads to the second reason is we do a lot of conditional testing on whether
or not sizes are one or zero.
For example, let’s suppose that you’re creating a new tensor,
and, um, okay,
you’re creating it with both sizes and strides explicitly.
And so the question at hand is, um, is this tensor being created with some set of sizes
and strides, is it contiguous?
And so there’s a very complicated algorithm you can do to like figure out if it’s contiguous,
which involves like looking at, you know, the ordering of the strides and then making sure
they like multiply together in the way you expect so that everything is densely packed together.
But there’s a simple way, um, for something to be contiguous.
And that is: if any of the input sizes is zero, then the tensor is contiguous.
Why?
Because there’s no elements in the tensor.
So like it’s, it’s contiguous because there’s just nothing to be discontiguous.
Similarly, if the number of total elements in the tensor is one, that’s also contiguous because
while, you know, like there’s only one element, you know, it can’t be discontiguous with anything
else.
So the code in our framework, which is generating guards, um, is doing all of these checks.
And so if you don’t actually, if you’re not able to do things like say, well, this is, uh,
you know, definitely not zero, then you end up having a lot more sort of reasoning to go
through, uh, where you could have just been like, oh yeah, definitely all the sizes are
not zero.
I’m not going to worry about the case when sizes could be zero and that’s fine.
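Here is a simplified sketch of the contiguity check being described, with the zero-size shortcut up front; this is an illustration, not the real ATen implementation:

def is_contiguous(sizes, strides):
    if any(s == 0 for s in sizes):
        return True                      # no elements, so trivially contiguous
    expected = 1
    for size, stride in reversed(list(zip(sizes, strides))):
        if size == 1:
            continue                     # size-1 dims don't constrain strides
        if stride != expected:
            return False
        expected *= size
    return True

print(is_contiguous((0, 7), (7, 1)))     # True: zero elements
print(is_contiguous((2, 3), (3, 1)))     # True: densely packed, row major
print(is_contiguous((2, 3), (6, 2)))     # False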
The other classical example of this is broadcasting.
So when you broadcast, when you add two tensors together, um, you, you normally need to check
that their sizes are equal at equivalent dimensions.
But if one of the dimensions is size one, we’re willing to broadcast it into, you know, the size
of the other dim.
How do we test that?
Well, we have to look and see if it’s actually equal to one.
So like, you know, we end up doing lots of guards on one-ness and zero-ness.
And so that’s why zero one specialization is useful.
It’s because we’re probably going to guard on it anyway.
So we can just get back the performance if we just assume, oh, if it’s zero, it’s going
to be zero and we’re not going to try to generalize.
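A sketch of the per-dimension broadcasting rule that generates those one-ness guards; again just an illustration of the shape logic, not PyTorch’s actual code:

def broadcast_dim(a, b):
    # under dynamic shapes, each of these "== 1" tests turns into a guard
    if a == 1:
        return b
    if b == 1:
        return a
    if a != b:
        raise RuntimeError(f"sizes {a} and {b} are not broadcastable")
    return a

print(broadcast_dim(1, 5))   # 5, after guarding that the left size is 1
print(broadcast_dim(4, 4))   # 4, after guarding that neither side is 1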
So where are these, uh, checks for zero or oneness applied?
Are they checked at all at layers all throughout a, uh, PyTorch program, or are they only checked
on the initial inputs?
Yeah, that’s a good question as well.
So, uh, so there’s two parts to this question.
So one is when we talk about guards, when are guards checked?
And the answer to that question is simple.
It’s, we only check guards at the very beginning of a compiled block, right?
Because the compiled block is this opaque blob of code.
Once we start executing it, that’s it.
We’ve got to execute it all the way to the end.
We’re not like other, like a JavaScript JIT where you have a bailout midway through and
then you, like, reconstruct your stack state and then go back to the slow path interpreter.
We can’t do that.
So we need to have everything in line when we go in.
So all the guards for everything that happened during the computation are there.
The other half of your question is, um, like when are, when are these zero one tests happening
most of the time?
And the answer to that is they happen, one, mostly when you’re constructing tensors, because
when you construct a tensor, we need to do a bunch of tests to figure out various things
about contiguity. Two, it happens when you do pointwise operations, because that’s when
you test for broadcasting.
And three, um, they happen, uh, uh, sort of on a very ad hoc basis on a lot of kernels,
um, that are complicated that involve algorithm selection.
So think convolution batch norm.
And this is for a sort of different reason.
It’s because a lot of libraries don’t handle, for example, zero size inputs.
So you have to check, oh, is the number of elements zero?
If it is, well then, you know, short circuit and don’t do anything because, because a zero
batch convolution is very easy to do because there’s no work to do.
You’ve got no data; otherwise, call into cuDNN for the convolution.
And this, this, this last class of things shouldn’t matter, but it does because, you know, which
algorithm you select changes what strides you actually end up with in the end.
And this is one of the reasons why like stride agnostic PyTorch is kind of relevant to like
the discussions we’ve been having about zero one specialization.
Okay.
So, um, to summarize part of that, every place we have a graph break, we are checking guards
because we have to know, um, which graph to enter.
To be clear when, when you, when you enter a graph, because you can enter a graph without
there being a graph break.
Like when, when you start a torch compile region, that’s not going to be a graph break.
And we do zero specialization, zero one specialization at every, um, at every graph entry.
Yeah.
So when you enter the graph, we allocate symbolic sizes for all of your inputs.
And if any of those inputs happen to be zero one, we say, okay, fine.
Uh, this is just literally zero, or this is just literally one.
And we specialize it on the, on the spot.
Okay.
And do we, and, um, you’ve, I think you made the case that this is generally a good idea.
Um, do you, would you say that this is always a good idea or are, do you think there are times
where you wouldn’t want to have this specialization?
No, it’s not always a good idea.
And so that leads us to the conversation we were having at composability sync, which is
for export zero one specialization is bad.
Well, zero specialization, uh, we, we had some examples.
So actually zero specialization is probably bad, but one specialization is really obvious.
If you’re tracing a program and you want a dynamically varying batch size, you really don’t want your
program to not work for batch size one.
That’s probably like the, like most likely batch size you’re going to run in any situation
where you can’t actually pull up inputs and then, you know, do batch inference over them.
Okay.
Um, so it’s not, it’s not, um, it’s not clear to me, like why having a specialization for,
for batch size one would be incorrect.
I mean, is, is my understanding was that, um, you would still do the specialization and you
would still produce a correct graph for an input of that size.
Um, under what circumstance would you be generating a graph that would be incorrect?
Um, that’s correct.
So in fact, we are generating a correct graph under certain conditions.
So if I trace my program with batch size equals one, I will get a program, which is correct.
Whenever batch size is equal to one.
However, if batch size is equal to four, this trace is not necessarily correct.
And one of the things that, you know, people want in export is they want to only have a
single graph, which handles all of the possible cases, right?
I don’t want to batch size one graph and a batch size two graph and a batch size three
graph and a batch size four graph.
Like that’s, that’s dumb, right?
Like probably they’re all the same graph.
And I would just want one graph in that case.
So when I zero one specialize, uh, even if I pass in a batch size two, I will get a graph,
but it is only valid when my size is not one and not zero.
So it’ll be valid for two, three, four, five, six, seven, so forth, but not for one and zero.
Because when we zero one specialize, that also means that if you do a test and you test, Hey,
is this equal to one?
I can say no, when the batch size is two.
Okay.
So, um, again, I think like what I’m, what I’m missing here is it’s like, why in export,
would you not want multiple graphs?
Um, like if, if, um, if you could have a graph that’s good for one, and you can have a graph
that’s good for numbers, uh, greater than one, um, and those graphs are distinct, like what
would be the harm of exporting two different graphs that can be used for different input
sizes?
Obviously you wouldn’t want to graph size two, size three, size four.
Like, um, there’s, uh, there, you know, if those graphs are the same, you would want to
have, you, you would want to collapse them.
But if you do have a distinct graph for size equals one, like why not, why not use it?
That’s a good question.
And indeed, uh, in regular eager mode, this is fine, right?
We’ll have a graph for one, we’ll have a graph for N, and we’ll switch between them depending
on what users give us.
And what, what’s the, what’s the cost?
The user just had to wait a little bit longer for two graphs to compile as opposed to one.
But on export, this is bad because, um, we’re talking about export to these like mobile
devices.
They have very little memory.
They like really, you know, want, uh, like a, a single model that has a small footprint
that they can put on some smartwatch.
And are you going to tell them, hey, actually, uh, you know, we need
to give you two graphs, one for the batch size one case, one for the batch size N case?
And, but wait, it gets worse because say you have two dynamic dimensions.
So you have sequence length and you have batch size.
I need four models this time: one for batch size one, sequence length one; one for batch size one, sequence length N; one for batch size N, sequence length one; and one for batch size N,
sequence length N, right?
It’s a combinatorial explosion of possibilities.
So like, yes, in principle, you could do this.
And in situations where it doesn’t cost a lot to have multiple graphs flying around, this
might be fine, but it just, it’s worse.
And like, it doesn’t surprise me that people have a preference for having only one graph
in this situation.
Okay.
That, that, that, that makes a ton of sense to me.
Um, I think the, the question that remains then is, is so if we’re checking for this on
export and we, we identify that there are multiple graphs and we want to flag that, what can we,
what, what can we really do there?
Because the, the graph that is not specialized for size one, um, may not be valid for size
one.
Is, is that, is that correct?
Uh, a graph that is, um, for a batch size greater than one is not necessarily valid for one.
Yes.
So what can we do other than saying, uh, sorry, you’re out of luck.
Um, like, do we have to go back and, um, just have like a, like rerun the trace with, uh,
just saying disable zero one specialization.
Like what’s the, what’s our recourse here?
Yeah.
So we went through a bunch of possible solutions in composability sync, and I believe the solution
the room converged on looks like this.
So first you turn off zero one specialization, right?
So I just spent a while saying, Hey, zero one specialization is great.
Um, you know, like we, we really like it for trace time performance and stuff.
And I’m like, no, no, no.
Like get rid of it.
Right.
Like we’re, we’re not going to eagerly specialize on things.
So under the assumption that you don’t generate any guards on the batch dimension, as long
as you don’t upfront zero one specialize, you will in fact, get a graph that works for any
selection of the sizing question.
Right.
That’s a big if though, if you don’t have any guards.
So in practice, um, you will have guards, right?
For all the reasons that I described so far.
And so now what do you do?
Well, you essentially, and this is what I’ve been doing when working on unbacked SymInts,
you basically rewrite all the library code in PyTorch to not unnecessarily guard on ones and zeros
if it doesn’t have to.
And it’s actually, you have to change a lot of spots in the code, but it is surprisingly
tractable.
For example, um, like let’s talk about tensor construction, right?
So I said, well, to figure out if something is contiguous, you have to, like, look at the
sizes, right?
Because if it’s zero, then it’s definitely contiguous.
Otherwise, you know, who knows. But if you call torch.empty directly, not empty_strided,
and you don’t pass in strides, you know that it’s contiguous; obviously it’s contiguous
because, like, you know, there’s no way to allocate a non-contiguous output from torch.empty.
So that’s fine.
There’s another annoying thing, which is that stock PyTorch will compute this thing
called channels-last contiguity, which essentially says, well, maybe it’s not contiguous, but if
you, if you move the channels dimension to the last position, would it be contiguous in
that case?
And this is, this is, this is not easy to answer because with torch.empty, the answer is no,
it’s not channels last contiguous unless the number of elements is one, in which case it
is contiguous because like, you know, everything is contiguous when you only have a one element
tensor.
But in this particular case, it doesn’t matter because no one actually ever like in, in models
we’re exporting, no one actually asks, Hey, is it channels last contiguous?
And so if you can just delay it, you can say, Oh, well, I don’t know if it’s going to be
channels last contiguous or not, but as long as you don’t call me out on it, then we can
avoid the guard and everything’s peachy.
And so there’s a lot of like near misses like this, where you like sort of fiddle around with
things and then you get it so that, Oh, actually we don’t have to do the guard.
Um, I recently got ResNet running without any, um, guards on, uh, batch size being one.
Uh, there were a lot of things, but I was able to get to the end and like, you know, I was able
to resolve all of them.
There’s one thing though, that I wasn’t able to resolve, which is that sometimes to get the
exactly correct output stride for a given operation, I actually really needed to do an equals-equals-one
test.
Um, but this is also something we discussed in the meeting.
And the thinking is that, well, you know, strides are this, like, you know, they’re this advisory
thing.
They’re not supposed to change the semantics of programs.
So it should be okay for us to, you know, slightly change what the stride output is if
we’re, you know, tracing one of these things versus not.
And that’s kind of like not entirely true, but like, you know, the whole point of the stride
agnostic PyTorch work stream, which Mikey is working on, um, is to like, make that more
true in your PyTorch.
Great.
Um, the, uh, so yeah, so that, that, um, that’s probably a topic for another day.
Um, cause I know that you, you do think that it may be controversial to, um, extend stride
agnosticism to the outputs of programs.
Right.
Um, so I’m curious, I’m curious, uh, uh, where we’re going to land on that controversy in the
longterm.
Yeah.
So just to, just to repeat on this question, right.
One of the things that originally, um, you know, spurred this question about stride agnostic
PyTorch was, you know, we’re working on this cool compiler.
And sometimes the compiler is like, Hey, like I see that you’re outputting a tensor
with this striding in the original eager program.
And I don’t like that.
I want to, I want to give you a different one.
Cause I can give that one to you faster.
I can give you a channels last tensor much faster than I can give you a contiguous tensor.
And we’re not allowed to do this today because it can break user code.
And so like, if we want to change this, if we want to be allowed to do this, we need to
make it so that user code can’t be broken in that case.
And that’s, that’s what stride agnostic PyTorch is about.
Or part of it, at least we can, I mean, we can break those assumptions in the middle of
the program.
Right.
But it’s the, uh, when it goes into eager mode.
And this is also why export is sort of like an easier version of stride agnostic PyTorch
because we do have the assumption that we can, you know, trace through the entirety of the
program.
And so I do think it is okay to assume PyTorch is already stride agnostic, um, when you’re
doing an export style workflow where you have the entire program and maybe PyTorch isn’t,
but you can still do easy tests.
Like you can just make sure the user model isn’t calling as strided or make sure the user
model isn’t like trying to mutate through a reshape call.
Um, these are all relatively simple things to test.
If you can assume you’ve got the entire model.
All right.
Well, um, I think you’ve answered all my questions for today.
Uh, thanks a lot, Ed.
Okay.
Thanks for agreeing to, uh, be recorded on the podcast.
I hope, uh, listeners out there also found that useful.
Okay.
Talk to you all next time.
Yeah.
I hope to see you again soon.
EP69 Unbacked-SymInts
Hello, everyone, and welcome to the PyTorch Dev Podcast.
This podcast is a little bit out of order
from the previous podcast about 0/1 specialization.
So if you haven’t listened to the 0/1 specialization podcast,
try listening to this one, which is going to be about
unbacked SymInts in general for PyTorch 2
in both eager mode and export.
So this podcast is coming because we’ve been talking more
about 01 specialization and also about the stack of PRs
that I’ve been working on regarding unbacked symbolic integers.
And there’s been a lot of questions about
what the heck are unbacked cements?
What exactly is going on with them?
You know, what are the consequences of adding this feature?
And so I wanted to record this podcast to talk a little bit about,
you know, what exactly is going on here
and answer some of these questions.
Gregory Chanan, who isn’t joining me,
but sent me a list of questions that he had regarding the feature.
And I’m going to use these to sort of drive
the discussion in this podcast.
Okay, so let’s start off with the basics.
So what is an unbacked SymInt?
So to answer that question,
I first need to mention what a backed SymInt is.
So a backed SymInt refers to our symbolic shapes
that we’re passing through our program.
You know, we have a bunch of input tensors.
Instead of statically specializing on these tensors,
we give them symbolic sizes, which just say,
hey, you’re going to do the symbolic execution
on these sizes and you’re not actually going to burn in any particular size.
So if you do a view operation based on the size of something else,
we’ll pull out the symbolic size for that particular tensor
that I’m reading out the shape from
and pass it on to the view without burning in whatever the actual value was.
So if that value changes in the future,
then I can actually, you know, just reuse the same graph in this situation.
Now, the thing about having symbolic integers like this
is if someone writes some Python code and they say,
if x is equal to two, then do something else, do something else.
There really isn’t any way to keep things symbolic in this case,
because, you know, we need to actually know which branch we’re actually going to go down.
Now, of course, there are some program analysis techniques that will allow you
to sort of keep those, keep, you know, trace through both branches
and do some sort of fancy stuff in that situation.
But we’re generally talking about straight line traces in PyTorch 2,
and we don’t have anything that fancy.
So we need to actually have an answer in this case.
And so when you have a condition on a symbolic integer,
we do what’s called a guard.
So we look at what the actual value is, the sort of backing value.
And this is where the term backed versus unbacked comes from.
We look at the backing value.
This is also referred to as a hint inside of our code base,
because the hint basically says what kind of size we might expect this tensor to be in practice.
We look at the backing value, the hint of the tensor,
and then we do the condition based on, you know,
the actual value that we have in the backing value.
And then we go ahead and we say, okay, well, if it’s true,
then I’m going to go down the true path.
Otherwise, I’m going to go down the false path.
And importantly, I will add a guard,
a guard that is executed at the beginning of the graph,
which just says whether or not I’ve actually fulfilled this condition.
So the next time that I run my graph,
will I actually go through the same conditional branch or not?
And these conditional branches can happen anywhere in PyTorch code.
It can happen in user code, where, you know, a user does some condition on,
you know, what the shape of a tensor is.
And it can also happen in library code, where inside of the PyTorch library,
you know, we’re looking at sizes and we’re making decisions based on,
you know, whether or not the sizes are big or not to do one thing or another.
For example, when you’re running convolution,
we will look at the size of your input tensor,
decide which particular convolution algorithm we’re going to do.
Okay, so to summarize, you know, we have symbolic integers,
but they have backing values, hints.
And if we do a condition on them, then we look,
we peek at the backing value and use that to resolve what the condition is,
inserting a guard in that situation.
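A small, hedged example of the kind of user code that creates such a guard:

import torch

@torch.compile(dynamic=True)
def f(x):
    if x.shape[0] > 3:    # condition on a symbolic size: Dynamo peeks at the hint
        return x.sum()    # and installs a guard (roughly s0 > 3) on this graph
    return x * 2

f(torch.randn(8))   # traces the "> 3" branch
f(torch.randn(2))   # the guard fails, so the other branch is traced and compiled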
So what is an unbacked SymInt?
Well, an unbacked SymInt is simply when you just don’t have a backing value.
And there are two reasons why you might not want to have a backing value.
So one is, you might just not have a backing value at all.
For example, say you have a tensor that was produced by a non-zero call.
What the actual value size of this tensor is going to be is not known to you
unless you actually, you know, run the operation because it’s data dependent.
So we don’t know what the value is.
We have no idea what it could be.
And so we have no choice but to give you an unbacked SymInt in this case
because we don’t have a backing value.
We don’t know what it is.
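For example, here is a sketch of capturing a data-dependent op under torch.compile; the capture_dynamic_output_shape_ops flag is my assumption about how to opt in, and its exact name may differ across versions:

import torch
import torch._dynamo as dynamo

dynamo.config.capture_dynamic_output_shape_ops = True  # assumed config name

@torch.compile(fullgraph=True)
def f(x):
    idx = torch.nonzero(x)   # output size is data dependent: an unbacked SymInt,
    return idx * 2           # there is no hint to peek at during tracing

f(torch.tensor([1.0, 0.0, 2.0]))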
The other example of when they might be useful
is when you want to intentionally prevent guards from occurring on a variable.
Let’s say you’re doing export.
And so with export, you might want to produce a graph that can work for any batch size.
So if you’re going to make a graph that works for any batch size,
then you would like to say, okay, well, I don’t want you to be able to guard
on a batch size being zero or one.
I just want to, you know, like say, hey, you know,
you did no conditional jumps on the value of batch.
So my entire program is indifferent to whatever the batch size was.
And so you might just intentionally feed in an unbacked SymInt
for the dimension for your batch dimension,
just so that you could make sure that you error out if, you know, some code,
either user code or library code,
attempts to actually do a guard on it in that situation.
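With today’s export API, that intent is spelled roughly as follows; this is a hedged sketch using torch.export.Dim, which postdates this recording, so take the exact spelling with a grain of salt:

import torch
from torch.export import export, Dim

class M(torch.nn.Module):
    def forward(self, x):
        return x.relu()

batch = Dim("batch")                      # a symbolic batch dimension
ep = export(M(), (torch.randn(4, 8),),
            dynamic_shapes={"x": {0: batch}})
# if user or library code branched on x.shape[0], export would raise a
# constraint violation instead of silently specializing on 4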
So one question that people often ask me,
because a lot of our discussion has been revolving around export,
because that’s sort of what’s been driving, you know,
working on unbacked SymInts, you know, recently, is,
are unbacked SymInts only for export?
And the answer is no, because you can also use them to, you know, implement other things.
You can use them for the non-zero case
if you are actually going to compile in that case.
And you might also use them to just like, you know, say,
hey, I want to compile this model for eager mode,
but I really, really don’t want to, you know,
have any guards on this value,
because I really only want to compile one graph in this case.
And unbacked SymInts would be useful in this case.
That being said, primarily, we are working on unbacked SymInts right now,
because we are trying to do something with export.
So most of the discussion that’s happening right now
is all about export,
because that’s what we’re spending most of the time thinking about.
I was in a discussion with Sam Gross,
and Sam was asking me,
well, you know, about this non-zero compilation case,
you know, is that a real use case?
Because you might want to just, you know, graph break,
and then, you know, you run the non-zero,
and then you run the graph afterwards,
and isn’t that good enough?
And the answer is, well, yes, that is mostly good enough.
But there are some situations
where you will miss optimization opportunities for this.
And in particular, if you have some sort of data-dependent operation,
say, non-zero, or more realistically, a packing operation,
where you have some padded tensors,
and you pack them into a, you know,
small tensor that doesn’t have any of the padding values.
And by the way,
the output of this packing operation is dynamic,
because, you know,
what you pack depends on how much padding there was
inside the original tensors,
and that’s a data-dependent concept.
So after you pack,
you might want to run some point-wise operations after it.
And here, it would be profitable
to fuse in those point-wise operations
into the packing operation,
which is getting the data in this place.
And this happens with jagged slash nested tensors,
where, you know,
often you have a bunch of input tensors,
and you want to pack them into, you know,
a smaller, you know,
with no padding tensor,
and then do the operation on it.
So this is a profitable optimization.
It’s something that I’ve been told
by the folks working with jagged tensors that they want.
And, you know,
it’s one of the reasons why you might want to support this.
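A hypothetical packing example to make the fusion opportunity concrete (pack_and_scale is made up for illustration):

import torch

def pack_and_scale(padded, lengths):
    # drop the padding, then apply a pointwise op; the packed length depends on
    # the data in `lengths`, so under torch.compile it becomes an unbacked SymInt,
    # and fusing the "* 2.0" into the packing step is the optimization described
    mask = torch.arange(padded.shape[1]) < lengths[:, None]
    packed = padded[mask]
    return packed * 2.0

out = pack_and_scale(torch.randn(3, 5), torch.tensor([2, 5, 1]))
print(out.shape)   # torch.Size([8]) for these lengths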
But as I said,
like most of the discussion that’s happening right now
in PyTorch development is all about export.
So, you know,
that’s what we’re doing.
So then,
okay,
so we got unbacked SymInts.
And so a lot of our discussion with unbacked SymInts
is that unbacked SymInts work a lot like SymInts,
but all your guards fail, right?
So when you try to actually use them,
you end up with a pretty common situation,
which is you try to feed in unbacked SymInts
into your model,
and they don’t work
because there’s a guard.
And now you’re like,
well, why is there a guard on my code?
You know,
and you look into the bunch of the cases,
and there are all sorts of different scenarios.
And I actually talked through a bunch of these scenarios
inside the Dynamic Shapes manual,
so you can check that out for more details.
But one of the examples
that has been causing folks quite a bit of trouble,
you know,
sort of like,
do we want to do unbacked SymInts in this way,
is the so-called broadcasting example.
So let’s unpack the broadcasting example for a moment.
The broadcasting example says,
hey,
you have got a tensor,
and let’s say it’s got an unbacked SymInt,
and you want to add some other tensor to it.
And let’s say maybe it’s also got an unbacked SymInt in it.
And it just so happens
that the sizes of the two tensors are equal,
so they will add together no problem.
So we happen to know out of band
that everything is going to be okay.
But when you run this code,
what PyTorch in the library code is going to do
is it’s going to attempt to test for broadcasting.
Namely,
it’s going to check and see if any given size
on the left-hand tensor is one,
because if so,
it can broadcast to the right-hand side.
And we’ll test if the right-hand side is one,
and if so,
it can broadcast to the left-hand side.
Broadcasting being,
you know,
just replicating the one size dim
as many times as necessary
to fill in the other size.
So,
if you just run the library code as is
without any changes,
what we will do is
we will test if
the input tensor size is one,
and then we will test
if the right-hand side tensor size is one,
and then we will test
if their sizes are equal.
But I just told you
that I was passing in a tensor
that was unbacked.
And so if I do a condition on it,
if I actually say,
hey,
tell me if the tensor size is one,
if that size is unbacked,
then that will just immediately fail,
saying,
hey,
you tried to guard on an unbacked SymInt.
But actually,
you know,
in this particular case,
the guard was completely unnecessary,
because the sizes would have ended up
being the same on both sides,
and you just would have been fine.
You didn’t need to broadcast
because they were just equal.
So,
like,
this is the sort of situation
where,
you know,
you end up with a,
hey,
unbacked SymInt caused a guard failure,
and now I need to go
modify PyTorch library code.
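One possible shape-rule rewrite that avoids the unnecessary guard is to check equality before asking about one-ness; this is an illustration of the idea, not necessarily how PyTorch’s own fix is written:

def broadcast_dim_unbacked_friendly(a, b):
    if a == b:        # when both sides carry the same unbacked SymInt, this is
        return a      # provably true and we never reach the "is it 1?" question
    if a == 1:
        return b
    if b == 1:
        return a
    raise RuntimeError("sizes are not broadcastable")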
Now,
when I told people this,
you know,
there were a few questions about,
like,
is this a real problem?
Because,
well,
like,
how,
this seems like a dumb issue to have,
because obviously,
the broadcasting code is,
you know,
going to be fine,
and,
you know,
like,
surely there’s some simple solution
to solve this problem.
And,
one question that people had was,
you know,
why am I looking
into the broadcasting code at all?
Naively,
I would expect the export graph
to just be a list
of eight and ops strung together.
So,
so why do I have to recurse
into the point-wise operation
to actually,
you know,
where,
you know,
I actually run all this
broadcasting logic,
right?
Because,
because when I look at my graph,
all I’m getting is an add operation.
And so,
you know,
like,
there’s no broadcasting to be seen.
So,
so why does this matter
for tracing?
And to answer this question,
I have to say,
well,
the reason why you’re,
you know,
going into this code
is because when you run
the add operation,
you get out some result tensor,
and that result tensor
has sizes on it.
What are those sizes
going to be?
Well,
to figure out
what those sizes
are going to be,
you have to run
the shape propagation rules
for addition.
And those shape propagation rules
are what actually
do the broadcasting.
So,
so,
you know,
to do the shape propagation,
that’s when you actually
do the broadcasting checks
and that’s when you do
the one check
and that triggers the guard.
So guards aren’t just,
you know,
remember,
executing on user code,
they’re also executing
on library code.
And in particular,
they’re executing
in the shape propagation code,
even if that shape propagation code
is completely invisible
in the final exported program
you get.
So then you might be like,
well,
okay,
Ed,
I can see that,
you know,
to compute what the output size
is going to be,
I have to run this operation.
But what if I was,
you know,
what if I said,
hey,
I just don’t want to actually,
you know,
like do any of this
because I don’t need to know
what the output shapes are.
Maybe I just,
I don’t care.
I’m going to,
you know,
sum over them
or do something very simple
to them in the end.
And I don’t need a very,
very fine grained,
you know,
expression that tells me
exactly how to compute
the size of this
in terms of the inputs.
And so for one,
yes,
this is a thing you could do.
Two,
you typically don’t want
to do this in eager mode
because if you were to guard
on the output size,
because remember,
the user can do
whatever they want.
And in particular,
they can pass it
to another operation
where,
you know,
that size needs
to be checked
its equality
against something else.
So if you want
to guard on it,
then you actually need
to be able
to express the guard
in terms of the input sizes.
So you need to know
how to actually do
the computation
from the graph inputs
to the end.
It’s not like
a traditional JIT system
where,
you know,
when you realize
that you violated
some constraint
for your trace,
you can bail out.
We have to like,
you know,
move all of these
bailout checks
to the beginning
of the graph
when we compile them.
But hey,
we’re export.
We’re not going to like,
you know,
really poke on these
with guards.
Would that be fine
as well?
And the answer is,
yeah,
sort of.
So what we can do
is we can say,
okay,
we don’t know anything
about the output sizes
of this tensor.
We just want to say,
hey,
it’s something.
And as long as you
don’t look at it too hard,
if you don’t try
to do any reasoning
about it,
it’s fine.
And we can do this.
And in fact,
I do this for a,
for the non-overlapping
and dense
check on tensors.
So when we,
when you make a tensor,
one of the Boolean fields
we pre-compute is,
is this tensor
non-overlapping and dense?
Sometimes this is obvious,
but if you pass
in a bunch of strides,
it’s very non-obvious.
You have to sort the strides
and then like,
look and make sure
they all like line up
exactly correctly.
And it’s very complicated
and causes a lot of guards.
So what I do instead
is I just return,
hey,
you know,
an opaque
is-non-overlapping-and-dense expression.
It takes in all the sizes
and strides for the tensor
and that’s it.
You don’t get to know
anything else about
what this quantity is.
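A sketch of that opaque-quantity idea using an uninterpreted sympy function; the symbol names here are made up:

import sympy

IsNonOverlappingAndDense = sympy.Function("IsNonOverlappingAndDense")
s0, s1, st0, st1 = sympy.symbols("s0 s1 st0 st1", integer=True, positive=True)

# nothing is known about this expression, which is fine as long as nobody
# tries to branch on it or compare it against anything
expr = IsNonOverlappingAndDense(s0, s1, st0, st1)
print(expr)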
And so the point is that
as long as you never
actually try to touch
this quantity
in any meaningful way,
like you never try
to condition on it,
you never try to test it
for equality
with anything else,
that’s fine.
And this works
perfectly okay.
and so it only blows up
if you actually,
if you actually try
to do something with it.
And it’ll probably blow up
if you actually try
to do something with it
because you said,
well,
I don’t know anything
about this,
so there’s no way
to do any reasoning
about this.
And this is one of the reasons
why,
you know,
when Horace
looked at this situation,
he’s like,
well,
this seems kind of bad
because you’re just,
you know,
pushing off the problem
until later.
And the answer is yes,
I’m pushing off the problem
to later.
It pays to be lazy
if you end up not having
to do the work at all.
Another question,
and we’re going to relate this
to the 0-1 specialization episode
is,
you know,
how does 0-1 specialization
fit into all of this?
You know,
we might want to 0-1 specialize
in a dynamic shape regime,
but like,
does that actually seem
to matter for export?
And the answer is yes.
so 0-1 specialization
is kind of,
kind of mixing up
a few topics here.
So one thing
that I mentioned
about 0-1 specialization,
it is a trace time optimization,
right?
You don’t have to up front
0-1 specialize tensors
when they feed into your program.
You can just say,
well,
I’m not going to assume
that,
you know,
this 0-size tensor
is always going to be 0.
I’m going to try
to run my program anyway.
The reason why
0-1 specialization
is so useful
for PyTorch though
is a sort of
empirical observation,
which is that
there’s a lot of code
in PyTorch
which does all sorts
of 0-1 tests.
So,
you know,
basically,
you’re going to specialize
on 0-1 anyway
when they do the test
and guard on the quantity
as well,
so might as well do it
earlier on
in the program.
But,
you know,
if you just say,
well,
I’m not going to do it up front,
well,
you’ll just collect up
a bunch of places
where you actually do
0-1 specialization later.
so it’s sort of irrelevant.
For export,
you just turn off
0-1 specialization
and you pass in
an unbacked SymInt
and then you just,
you know,
deal with the guards
one by one,
at least in,
you know,
my proposal
for how to do
unbacked SymInts.
Okay,
one last thing
that I want to talk
about here,
which is
why has the
unbacked SymInt
stack of PRs
been kind of
controversial?
So what you find
this stack
of PRs doing
is it’s saying,
hey,
you know,
I had some model.
I had like
ResNet
and I wanted to run it
with an unbacked
cement for batch size.
So I put in one of these
unbacked simmons,
I ran it,
and whenever there was
a guard failure,
I went and tweaked
PyTorch library code
until it no longer
had this problem.
And so people look
at these diffs
and they say,
hey,
well,
like,
does this mean
that I have to,
you know,
write my PyTorch library code
in this funny way
in the future?
That sure sounds like,
you know,
having to TorchScript my code
and,
well,
TorchScripting my code
was very painful
and I don’t want to have
to do this again
for another thing.
So I don’t know
exactly how
to argue
this one way
or another,
but my
general thinking
is that
yes,
you have to modify
your code,
but I don’t think
it is as bad
as TorchScript.
So there are a few reasons
I think this is not
as bad as TorchScript.
So one is that
really all of the
really complicated cases
have been inside,
you know,
PyTorch library code,
very low level operations
like empty,
like reshape,
and like is contiguous.
And so,
you know,
one of the ideas
that,
you know,
I was hoping would be true
with my patch set
is I fix these
like low level problems
and then,
you know,
most code is not
written in a branchy way,
right?
Like,
you know,
you don’t have people
re-implementing broadcast
everywhere.
They usually just
call an operation
that broadcasts
and,
you know,
if that broadcast
implementation
knows how to like,
you know,
tiptoe around
unbacked SymInts,
then that’s fine.
So the hope is that
like,
there’s a fat tail
of very complicated
operators that we have
to handle internally
and the rest kind of
will just work out
because most people
aren’t writing their
models trying to,
you know,
like condition on
what your batch size
is going to be.
The other thing
that I think
is a little different
is that in TorchScript,
you,
it was an all or nothing
deal,
right?
You had to get
all of your code
end-to-end
towards Scriptable
to actually get
something useful.
With unbacked SymInts,
you don’t have to
actually get
everything going,
right?
Like if you’re not
doing export
or you’re not like,
you know,
saying,
hey,
I must compile
all of my program
in a single
traced block
from head to toe,
then you’re allowed
to not,
you know,
not use
unbacked SymInts
all the way through.
In fact,
I would not
recommend using
an unbacked SymInt.
In this case,
you can just say,
okay,
well,
this is fine.
Like,
I’m going to make
sure that it works
for sizes that are
greater than two.
And if you happen
to send me a batch
size one,
I’m just going to go
ahead and recompile
my program for the
batch size equals one
case.
No problem.
What’s the big deal,
right?
Like it’s just a 2x
cost in number
of compiled graphs.
And I still have one
that,
you know,
can handle all
of the variable
cases.
So really,
the only time
you need to like
squeeze into this
regime is if you
are trying to
export and it
is a dynamically
sized model,
so you want the
varying batch size
and you’re in a
situation where you
can’t ship multiple
graphs,
you have to ship
one graph.
And to that,
I say,
well,
you know,
what did you expect,
right?
You’re going to have
to write your code
so that it doesn’t
actually like do
any branching on
the batch size.
And there’s sort
of just some sort
of irreducible
complexity,
at least in my
opinion.
Okay,
so this is an
ongoing conversation.
I recorded this to
help information
share.
We might have an
updated recording
later once we have
some more alignment.
So I’ll also link
that in the podcast
if that actually
happens.
All right,
thank you very much
for listening. See you all next time.
EP70 Dynamo—VariableTracker
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about variable trackers in Dynamo.
What is a variable tracker?
Well, to explain the concept, we first have to think about what it is that Dynamo is trying to do.
Dynamo is trying to take your Python program, and without actually running it,
it wants to simulate the execution of every single operation that happened in your program
so they can find where all the Torch operations happen, put them in a graph,
and then send them off to the rest of the PyTorch compiler so that we can compile them into efficient code.
So in order to do this, we need to run the code, but without actually running it.
And depending on how complicated your program is, that may require us to do a lot of stuff, right?
Let’s say that within your model, you’re creating a dictionary, you’re putting things into the dictionary,
you’re taking things out.
In order for us to step through every line of code in this situation,
we have to actually model this dictionary in some way.
But we can’t use an actual dictionary.
Well, actually, we can, but sometimes these data structures have side effects.
They actually do things.
Print to your terminal, write to other things.
So we can’t actually use the actual data structures in a lot of situations.
What we actually need to do is we need to be able to maintain some parallel universe in Dynamo,
which is sort of like the Dynamo-fied universe of all the state in your Python heap, which we can go ahead and do operations on, right?
For example, if you have a global dictionary and inside your model code, you’re writing, you’re incrementing a counter on it.
When we symbolically evaluate it to extract out a Dynamo graph, we can’t actually mutate that global dictionary.
We have to do it in our sort of local universe that is our simulation and only when we are done, have some actual code which replays this effect onto the real thing.
So variable trackers are essentially our way of representing the Python heap in a way that Dynamo can work with it, can do analysis on it without actually having to touch the real Python values.
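A small example of the kind of heap state this is about; Dynamo models the dict with a variable tracker while tracing and then replays the mutation for real:

import torch

stats = {"calls": 0}            # real Python state on the heap

@torch.compile
def f(x):
    stats["calls"] += 1         # simulated on a variable tracker during tracing
    return x.sin()              # and replayed onto the real dict afterwards

f(torch.randn(3))
print(stats["calls"])           # 1: the side effect really happened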
So if you’re working on the Dynamo code base and you’re thinking, hey, you know, how, where exactly should I implement some logic for, you know, how I should be able to update some state when I do some operation or, you know, how do I model some data structure that someone has written that I need to do some special handling for?
Chances are it’s going to live in the variable tracker one way or another.
So variable trackers have a lot of purposes, right?
So they encapsulate a state that we can’t actually get our hands at by just directly looking at bytecode.
So most frequently, that’s because the structure is implemented in C.
Like, you know, if you’re talking about something like a dictionary or a set from CPython, there is no quote unquote Python implementation, right?
It’s natively provided by CPython.
So anything that is natively provided is not actual Python bytecode.
We have to implement by hand inside Dynamo so that, you know, we can basically replicate the logic that is living inside the C code because we’re not tracing through the C code.
The C code is opaque.
There’s no way we can look at it.
If you have a user-defined class and that class is written entirely in Python, chances are we don’t have to write a variable tracker for that because in that case, the variable tracker, in that case, it’s built out of some simpler primitives, which we have written variable trackers for.
But then all the operations, the constructor, you know, accessors, those are all Python bytecode, and we can just step through them in a normal way to actually process them.
Okay, so that’s what variable trackers are in a nutshell.
And what are some things that are useful to know about them?
So we’ve actually made some changes recently to the variable tracker.
We’ve got the PyTorch 2 paper coming soon in ASPLOS, but it’s got the old version of how variable trackers work.
And so I think this podcast is going to be one of the first places where we actually say, besides in GitHub issues, what has changed.
So there are two big things that I want to convey.
So the first is that variable trackers are in charge of doing guards.
Remember, a guard is how we tell, hey, you know, this execution, this symbolic evaluation that we did, requires certain aspects of the Python heap to be some way, right?
Like if I do a conditional on a Boolean, and it goes down one path the first time, well, the next time I go, I need to make sure that Boolean is the same way.
Because if it changes, if it goes false, I’ll go down this different path.
But that’s not the path I trace down.
I’m not parsing the program.
I’m just looking at a particular sequence of execution.
So variable trackers are very important for keeping track of guards because we have all these values floating around.
If we actually poke at these values, it actually matters what the value is.
And so we typically need to do a guard.
But we don’t want to immediately say, well, the exact state of every single object in the Python heap has to be exactly this way when we, you know, start our analysis.
Because chances are, we’re not going to touch most of these things, we’re not going to touch most of the variables that we actually model in Dynamo.
So we only want to actually do guards on things when they actually matter for the execution in hand.
And so the old version of Dynamo, the way it worked was essentially any variable tracker had a set of guards on it, basically saying,
if you use this variable tracker in a non-trivial way, here are the guards that you need to use, you need to put into maybe the global guard state that’s actually getting installed,
or maybe some other variable tracker, which was derived off of the original variable tracker,
so that, you know, all the things you looked at on the variable tracker are valid in the same way.
It turns out there were two problems with this.
So one is that it was a lot of pain to, like, do all this propagation logic, because every time you did something with a variable tracker,
you needed to, you know, make sure you didn’t forget to collect off all the guards off each of them, you know,
plomb them together into one giant set, and then put that on your new variable tracker.
Very easy to forget, you know, very hard to test, because to actually test that, you know, you’ve actually done this right,
you have to, you know, set up some program, and then change the thing that would have been guarded,
and make sure things actually get rewritten. And, you know, most of the time, people are just writing tests that are just testing,
we can actually get through some code one way to another. So, like, writing very good tests,
that test that we are guarding enough, it actually takes a lot of care, and so, you know, it’s pretty difficult.
The second problem is that maintaining these sets of all these guards is actually really expensive.
Like, you know, Python is not a spring chicken, and then if you have these giant sets with tons and tons of objects,
that, you know, need to be hashed every time you’re putting them in the set,
it actually was materially making an impact on how quickly you could Dynamo trace through things.
And so, on our, like, open source benchmark suite, our, you know, tracing times are not so bad,
but we’ve been using PyTorch 2 on a lot of internal workloads, and these workloads have tons and tons of Python code,
and there’s, like, sometimes it would take hours for Dynamo, just Dynamo, not even the compiler,
not even Inductor or Trident, just Dynamo to get through all of that code.
And part of it was we were just, you know, shunting all these guards around, you know, kind of difficult to deal with.
So, Jason Ansell did a patch to make us not have to do this, and the new world order is this.
When you have a variable tracker, we have guarded on it.
That’s it, right?
So, if you have your variable tracker in your hands, there’s all sorts of things you can access on it,
and we’re just going to assume that we have already guarded on everything needed on the variable tracker in that case.
So, there’s no propagation needed, right?
Once you’ve got the variable tracker, we can assume that we already have the guards in question.
Now, sometimes this lazy behavior that we had before is good, right?
Like, say I have a bunch of arguments to your function.
I don’t want to actually guard on all of them exactly.
So, there’s some amount of laziness that we have for some variable trackers,
which is that you can have a variable tracker which doesn’t actually exist yet.
We haven’t actually populated it into an honest-to-goodness variable tracker.
The first time you poke at it and you’re like, hey, you know, tell me what this attribute is.
Tell me, you know, what the value of the boolean is.
Then we actually populated it into a real variable tracker and installed all the guards.
So, there’s, like, you know, specific laziness in various parts of the code base.
The new structure, I think, is very nice.
It reduced a lot of the administrative burden we had to do, and it made stuff a lot faster.
So, hooray.
So, variable trackers, right?
If you’ve got a variable tracker, the obvious thing, which is that you can access it however you like, that will work.
And when you create a new variable tracker, you’re responsible for making sure, at that point in time, you install all the guards you need.
Okay.
So, I talked about how guards work with variable trackers.
There’s actually another update, which is pretty nice, and this is from Michael Lazos, landed in December,
and this is what we’re calling mutable variable tracker.
So, another thing that you may not have realized about variable tracker in the old days is that variable trackers were actually implemented as immutable data structures.
The Haskeller in me is like, hooray!
Why were they implemented as immutable data structures?
Well, the motivating reason for making them immutable was to support this checkpointing thing that we do in Dynamo.
So, let me explain what’s going on with checkpointing.
So, with checkpointing, the reason why we need a checkpoint in Dynamo is that sometimes we will be symbolically executing some code,
and we will be like, oh, no, we messed up.
We need to rewind the state of our execution back to some point, some earlier point in time,
where we can actually go ahead and insert a graph break.
And the canonical example of this is if you’re inlining a function, right?
So, if I’ve got some code in Dynamo, and I’m tracing through it happily, I hit a function call, I start inlining the function call,
and then inside that function call, I have a graph break.
What do I do in this situation?
Well, if I had some sort of fancy multi-call frame reconstruction logic, the way I could deal with this is just by, like, doing a graph break right then and there.
But we actually don’t have this logic.
Someone should implement it, by the way.
This would be great.
So, because we don’t have this logic, what I have to do is I have to rewind execution back to when I was about to call into the function I inlined.
And at that point in time, I do the graph break.
So, how can I do this rewinding?
Well, if I have a checkpointing mechanism, whenever I start an inline function call, I can just checkpoint the state of all the variables in my Dynamo program,
and then, you know, just throw out anything else, throw out the new state, and reuse my checkpointing state.
And so, immutable variable trackers make this easier to do in this situation.
But there is a cost, right?
The cost of this is that, you know, we actually have to do these as immutable data structures,
and that means that simple operations, like, let’s say you have a list and you’re appending to it, normally these appends are O of 1.
But if you have an immutable variable tracker, then I have to create a new copy of the list every time,
and so this ends up being an N squared operation to insert N elements onto the list.
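As a concrete illustration of that cost, here is a plain-Python comparison (nothing Dynamo-specific about it):

# Appending to a mutable list is amortized O(1); "appending" to an immutable
# tuple copies every existing element, so N appends cost O(N^2) in total.
mutable, immutable = [], ()
for i in range(5000):
    mutable.append(i)              # O(1) amortized
    immutable = immutable + (i,)   # copies all i existing elements each time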
Now, of course, you know, once again, with my Haskell hat on, why don’t you just use a more efficient functional data structure?
And the answer is, yes, you could, but, you know, CPython doesn’t have very good support for this sort of thing,
because most people in Python are just doing mutable lists, like, whatever.
Like, that’s the normal thing to do.
So you would be in this situation where if you just wanted to make this go faster, you would have to write a big library full of all sorts of immutable data structures.
And also, it’s kind of like a bad idea in a reference counted language like Python, because, you know, every time you generate garbage,
you generate, you know, these new copies of nodes that you then throw out immediately,
because you’re, you know, just, you’re continuously revving this immutable data structure.
You have to spend all this time, you know, incrementing and decrementing the ref counts.
It’s not like in a garbage collected language where the more garbage you make, the faster your garbage collector runs,
because remember, a garbage collector only needs to traverse the live roots of your objects.
So what do we do?
So we said, okay, fine.
Checkpointing is cool, but we actually don’t need it.
And the reason we don’t need checkpointing is, remember this thing, right?
Dynamo is working in this alternate universe.
It is, you know, symbolically evaluating your program without modifying the original program state.
So we have an ultimate checkpoint, which is at the very beginning of your program.
That basically tells you what all the state is, and we haven’t touched that at all.
So we don’t have to, like, actually checkpoint midway through.
If we need to rerun Dynamo, we can just rewind all the way back to the beginning and then run again from the start.
And so we don’t need immutable variable trackers anymore.
So Michael Lazos got rid of immutable variable trackers.
They are now mutable.
You can mutate them in the normal way you expect.
And, you know, life is good.
And this also made some of our internal tracing a lot faster.
Okay.
So I told you about why variable trackers exist and some of the changes that went on.
One more thing I want to say is how to find your way around variable trackers in the Dynamo codebase.
So there are a lot of variable trackers.
And sometimes it can be a bit bewildering to try to figure out, like, which variable tracker should I use?
And we don’t have really that good of an organization for the variable trackers, but there is some logic to it, right?
So in particular, we’re trying to organize basically the chunks of C code that we are simulating in Dynamo into sort of, you know, various logical things instead of just blobbing them into one giant thing.
So if you think about it that way, this will tell you about sort of where things are.
So in particular, if you have a completely immutable state in Python, like, you know, if it’s a literal, like an integer or a float, we typically model these as constant variables, right?
There’s also an enum variable for doing enum specifically.
If you have something implemented in C that is stateless, so there’s no state you can actually modify, then we tend to organize the variable tracker subclass based on where it comes from, right?
If it comes from PyTorch, then it’s a torch variable.
If it comes from CPython, it’s a built-in variable.
If it comes from NumPy, it’s a NumPy variable, right?
Like, we basically say, where does the code live?
And then we just go ahead and put the code in those locations.
Now, these are giant variables, right?
Because, you know, like, think about torch variable, right?
Like, we have tons and tons of stateless C code because every single function in the PyTorch API counts as, you know, something we need to model in variables.
So these classes tend to be very big.
But, you know, at a high level, the organization is based on, you know, where you can find it.
Similarly, if you have something in C and it is stateful, then that’s the situation where you get the normal thing where you have a dedicated variable per object.
So we have a tensor variable, we have a list variable, we have a set variable, we have a dict variable.
If you need to introduce anything else like that, you’re probably going to make a new variable subclass.
Because state needs to be handled specially, you need to, you know, write logic for how to replay changes to the state back to the original variables, stuff like that.
And then finally, for things that are implemented in Python, and we can inline into them, we have a big pile of, you know, user blah variables, like user function variable, user defined class variable, user defined object variable.
These tend to be actually relatively simple, because we don’t need any special smarts, right, we just are going to plan to inline into the bytecode for them to actually implement them.
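To summarize the organization just described, here is a rough map; the names follow the ones mentioned above, but the exact class names and their locations under torch/_dynamo/variables/ shift between releases, so treat this as a mental model rather than documentation.

# Illustrative mapping only, not Dynamo's dispatch logic.
VARIABLE_TRACKER_TAXONOMY = {
    "literal int/float/str":            "ConstantVariable",
    "enum member":                      "EnumVariable",
    "stateless code from torch.*":      "TorchVariable",
    "stateless code from CPython":      "BuiltinVariable",
    "stateless code from NumPy":        "NumpyVariable",
    "stateful C object":                "TensorVariable / ListVariable / SetVariable / DictVariable",
    "plain Python function/class/obj":  "UserFunctionVariable / UserDefinedClassVariable / UserDefinedObjectVariable",
}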
So that’s everything I wanted to say about variable tracker today.
See you next time.
EP71 Inductor—IR
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about Inductor IR.
Inductor IR is an intermediate representation
that lies after the ATen graph,
but before the actual Triton code generation.
So if you think about the overall PyTorch 2 stack,
once we are done capturing the graph with Dynamo,
we have a bunch of FX nodes referring to ATen operations.
And then in order for Inductor to actually compile this code
into Triton code, it will take that ATen graph
and do a bunch of transformations on it,
first converting it into Inductor IR,
and then scheduling that IR,
and then actually finally generating the code
from the scheduled nodes.
And that’s how you actually get the good Triton goodness,
as well as the wrapper Python or C++ code
that strings it all together.
So as you might imagine, the Inductor IR
is a pretty important thing to know how to work with
if you are planning to work on the compiler in Inductor at all.
Now, I don’t have a good organization
for all of the things that I want to talk about here.
So this is going to be a bit of a grab bag of things.
But to start off, I first want to talk about
sort of some of the motivating design considerations
behind Inductor IR.
Also, a disclaimer, I did not write Inductor IR,
and I am still learning how to figure things out,
as most of us are over in PyTorch Core.
So another thing is that there are some things
that are not ideal about the current state of Inductor IR,
and probably could use some refactoring.
So I may make some errors in this podcast.
It’s also rapidly changing.
This podcast may become out of date.
I recommend sort of like, you know,
continuing to join the conversation on PyTorch GitHub
if you are interested in contributing.
Okay.
So let’s talk about some of the design considerations
behind Inductor IR.
So first off, we might ask the question,
why do we need an intermediate representation
between ATEN operators and the actual Triton code gen, right?
Why can’t we just go ahead and directly generate Triton code
from each ATEN operation?
So there are a few reasons why you don’t want to do this, right?
So one is that, you know,
we don’t actually want to generate a Triton kernel
per ATEN operation.
We want to do fusion, right?
So we need some way of saying,
hey, we have a bunch of ATEN operations
which can be fused together,
say a sequence of point-wise operations,
and we need to be able to represent
the result of doing this fusion.
And of course, ATEN graphs are pretty simple
in that they only call a sequence of operations
and are done,
which means that they don’t really have any concept of,
well, this is a fused node
that contains a bunch of operations
that are grouped together.
Now, if you’re familiar with old-school TorchScript,
we sort of just did the obvious thing in that case, right?
So a fusion group is simply a big operator
that inside itself contains another graph
which contains a bunch of little operators
that are expected to be fused together.
So you can certainly design your compiler this way,
but we did it a little differently
with Inductor IR for some other reasons.
What exactly did we do in Inductor IR?
There are a few ways to understand this.
I’m going to go through a few different ways
of thinking about it.
So one way to go about looking at this
is to just look at the code and see,
well, what exactly are the classes
and data structures that we define for IR?
So if you look in Inductor’s ir.py,
you’ll see that there is a class called IRNode
and there are a number of subclasses of it.
And these subclasses have names such as loops,
base view, layout, buffer, mutable box.
And you might be wondering,
well, what exactly is the relationship
for all of these things?
Actually, these are all different types of things.
IRNode is sort of just this grab bag class
that puts everything together.
In fact, each of these is its own distinct concept.
And so it’s best to not imagine
that all IRNodes are interchangeable.
It’s not an abstract data type
with these various things
that you can have as various options.
What’s another way to understand how IRNode works?
Well, another way is when we write lowerings.
So a lowering in Inductor is the code
that takes a particular ATen op
and then produces a bunch of IR nodes,
a bunch of Inductor IR representing that operation.
What exactly does that look like?
And so a lowering very closely follows the format
that you expect for the function signature in question.
So for example, if I’m lowering an addition
between two operations,
I’m going to get something that’s like a tensor,
but in Inductor IR,
and something that’s like a tensor in Inductor IR.
So those are my two arguments to my addition.
Now, what exactly do I get?
Well, I mentioned this thing called mutable box.
What I actually get is this thing called tensor box,
which represents a tensor in Inductor IR.
Now, what exactly is inside a tensor box?
Why do I have a box on its name?
Well, let’s think about it.
So when I do operations in PyTorch,
I can have mutating operations, right?
So for example, if I have an ATen graph
and I say add underscore,
I’m going to mutate a tensor in place.
Now, this makes sense and is fine
if I’m actually trying to do the operation for real
on actual data,
but what if I’m trying to actually go ahead
and do some sort of lowering?
What if I’m trying to generate some Inductor IR
for this situation?
Well, I can’t just do the mutation,
but if anyone else references my tensor
at some later point in time in some other lowering,
what I need to have happen
is I needed to reference the result
after having done the mutation,
not the result before having done it.
So the IR is immutable in some sense, right?
When I do a in-place addition,
I want to create a node that represents,
hey, I did this mutation.
But now when I do operations afterwards,
I want everyone to make reference
to the thing afterwards.
So TensorBox basically says,
okay, I’m going to contain some IR inside,
which represents whatever it is that produces the output
for the tensor in question.
But whenever I do a mutation on it,
I will mutably swap out the IR that it is pointing to
to whatever the new IR is that represents
the result after having done the mutation in question.
Now, that’s not to say that inductor IR doesn’t have mutation.
It does.
But, you know, we are sort of,
we’re not processing the IR in a traditional compiler sense
where, you know, you just have some sort of graph representation
and we’re writing things into the graph.
Instead, we’re sort of maintaining a big pile of tensors
which represent, you know, pointers into various parts of the IR.
In fact, the IR isn’t even ordered at this point.
We just have a bunch of IR fragments floating around
that have some dependencies between each other
because we’re going to actually figure out
what order we actually want to run them in
later when we do scheduling.
Okay, so a tensor box contains a, you know, something.
Actually, the tensor box doesn’t actually contain
the thing that you would mutate
if you’re doing a data mutation
because the tensor box contains a pointer to a storage box
and the storage box represents the actual backing data store.
And this is useful because we can have multiple tensors
referencing the same storage.
And so, you know, we may have multiple tensor boxes
referencing the same storage box.
And the storage box is what actually references a buffer
that actually, you know, represents the data
that is living in the tensor at that point in time.
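Here is a minimal toy sketch of that boxing scheme; the classes below are invented for illustration, and Inductor's real TensorBox and StorageBox in torch/_inductor/ir.py carry much more machinery.

# Toy sketch: an in-place op swaps what the storage box points to, so every
# alias sharing that storage sees the post-mutation IR, just like eager mode.
class ToyStorageBox:
    def __init__(self, buffer):
        self.data = buffer

class ToyTensorBox:
    def __init__(self, storage):
        self.storage = storage
    def apply_inplace(self, new_buffer):
        self.storage.data = new_buffer   # e.g. the lowering of add_ installs new IR

base = ToyTensorBox(ToyStorageBox("buf0 = pointwise(x + 1)"))
alias = ToyTensorBox(base.storage)                 # a view sharing the same storage
base.apply_inplace("buf1 = pointwise(buf0 * 2)")   # simulate an in-place mul_
print(alias.storage.data)                          # the alias now sees buf1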
So one thing to notice here is that, you know,
inductor IR actually faithfully models PyTorch semantics, right?
If you like think you understand how PyTorch eager mode works
and then you go look at how inductor IR works
and more importantly, how inductor IR evolves
while you’re doing the lowering,
it really matches what you’d expect to see
if you were just running a traditional PyTorch program.
So you have eager mutation,
you can have views.
In fact, not only can you have views,
you can have arbitrary indexing, you know, arithmetic,
depending on what the view in question is.
This is something that was very important to Jason Ansel
when he was designing Inductor
because a lot of compilers, you know,
don’t like, don’t buy into PyTorch’s idea of views and strides.
And as a result, they have to do a lot of work
to, you know, sort of deal with strange patterns that show up
when people write PyTorch programs in practice.
So Inductor is all about, you know,
being able to compile PyTorch as it is
and it builds in all of these concepts
that are very important to PyTorch
and so we’re willing to deal with them as well.
One consequence, for example,
of, you know, being able to do strided indexing
is we have this entire mechanism
for making complicated indexers
where, you know, I’m doing a kernel,
I’m accessing data on a tensor,
but the data may not be contiguous.
There might be some strange stride pattern
that I need to do.
Inductor can generate arbitrary indexing expressions
to fetch out the correct data in this case
and, you know, we need to be able to do simplifications
and things on this indexing
and this is one of the reasons
why we use SymPy in Inductor, for example.
Okay, so Inductor faithfully models PyTorch semantics.
We have this tensor storage distinction between the box.
Eventually, you get to a buffer
which actually represents the data in question.
You can have views on the buffers,
all that sort of stuff.
You can have mutation on the IR,
but when you do mutation,
all we’re doing is we’re swapping out the contents
of a storage box with a new buffer
that represents what happened after the mutation.
And finally, we get to the buffer itself,
which, you know, somehow represents
how we computed the data in question.
And the most interesting buffer
that, you know, you will usually see
when you’re looking at Inductor IR
is the so-called computed buffer,
which says, hey, we did some sort of computation
like a point-wise operation or reduction
that actually produced the data in question.
And inside these computed buffers,
you actually finally have the IR nodes
like point-wise and reduction
that represent the actual, you know,
computation that we’re doing in PyTorch.
Now, there’s something kind of interesting here,
which is that these nodes,
you know, you would expect them naively
to contain FX graphs representing the various operations
that are being fused together
in a point-wise operation or reduction.
But we don’t actually define them this way in Inductor.
They’re instead defined by this thing called define by run.
If you are a PL nerd,
this is actually another way of referring to
what we call higher order abstract syntax.
The main idea behind define by run is that instead of maintaining an explicit graph representation,
we instead maintain a graph as a function.
So it’s very high order.
You have a function,
which takes in a bunch of arguments representing all the arguments
that the actual, you know, IR graph would have represented.
And then on the inside calls all the operations
that represent the define by run operation in question.
So this can be conveniently done in Inductor
because our loops, our loop bodies are control flow free.
We don’t have any sort of control flow.
So we can just use a regular Python interpreter to step through them.
And the big consequence of doing it this way
is that you get to write really compact definitions.
For example, let’s say that you have two point-wise bodies
and you want to compose them together into a single point-wise body.
In a normal graph transformation,
you have to take the two graphs,
you know, sort of muck around with the inputs,
rename nodes so that you manage to get,
you know, the outputs lined up with the inputs,
so forth, and, you know,
do a lot of administrative work to get things together.
In a define by run IR,
you just have two functions, right?
One function takes in some inputs,
produces some outputs,
and the other function takes in some inputs
and produces some more outputs.
So what do you do?
You define a new function that calls the first one
and then calls the second one.
No problem.
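Here is a toy sketch of that composition; Inductor's real inner functions pull ops from the virtualized module rather than taking it as an argument, so this only shows the shape of the idea.

def body_a(ops, x):
    return ops.mul(x, x)

def body_b(ops, y):
    return ops.add(y, 1)

def fused_body(ops, x):
    # "fusing" the two point-wise bodies is just an ordinary function call chain;
    # no graph splicing, node renaming, or input/output bookkeeping required
    return body_b(ops, body_a(ops, x))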
So this lets you write really, really slick,
really, really short lowering code.
It’s actually really nice.
And of course, you don’t give up the ability
to access the structured graph representation
because all you need to do is run
this function where you’ve overridden the behavior
of the operations to basically mean,
please write this out into an FX graph.
So you have a way to reify
the higher order abstract syntax into your graph.
This is all done via this thing called virtualized,
which basically takes all of the inductor core IR operations.
These are things like add, sub, whatever that you have inside of loop bodies
and allows you to change what exactly they do depending on the situation.
So one common thing to do is I want this operation to write into an FX graph.
But we also do other things by changing the abstract interpretation of these operations.
For example, you can do read-write analysis to figure this out.
Okay.
So we’ve talked about what the actual inside of compute buffer
and the point-wise and reduction nodes look like.
And so essentially, you end up with this big pile of buffers and unfused computation.
And there’s a few things going on in this situation.
So one is that we have this notion of a buffer that has been realized
versus a buffer that is just computation.
So when you have a buffer that is realized, we are going to forbid fusing into it.
We basically said, we guarantee you that this data is going to exist in physical form
at this point in the IR.
And this is important because if you are, for example, calling an extern kernel,
which is expecting to see a tensor, or if you’re going to use this buffer a lot of times,
you really don’t want to be recomputing its quantity over and over again.
And so unfused compute, which, you know, hasn’t been realized,
is allowed to sort of go ahead and fuse or maybe even run multiple times
if, you know, that’s profitable for the situation.
And of course, the scheduler, which runs after we’ve lowered all of our ATen operations to Inductor IR,
is what’s actually responsible for deciding what order to run things.
You know, how exactly should we fuse things together?
What’s the most profitable fusion to do at any point in time?
Okay.
So I’ve talked a little bit about the high-level structure of an Inductor IR,
how it models PyTorch faithfully, going from a tensor box to a storage box to a buffer,
and then finally to the define-by-run IR that represents an operation in question.
That’s most of the high-level information you need to know about how to work with Inductor IR.
The most common things you have to do is you want to write a new lowering,
or perhaps you want to write a new IR node.
So let’s talk a little bit about some of the more practical nuts and bolts of working with Inductor IR.
So one thing that I found pretty confusing about Inductor IR when I first read it is there are a lot of IR nodes.
So I talked about the most basic ones, and Horace likes to tell me,
well, you know, Inductor IR isn’t that complicated.
There’s only, you know, point-wise and reduction that really matter.
But actually, if you look at IR, there’s lots and lots of other nodes doing all sorts of other things.
There’s nodes for collectives.
There’s a node for convolutions.
And this is where my warning comes in that, hey, you know, it’s not entirely clean, right?
Like, we probably don’t need this many IR nodes.
The reason why people write IR nodes is because, let’s say you have an ATen operation,
and you need to generate some code for it,
and none of the pre-existing IR nodes does exactly what it is you need for the code gen in this case.
The easy thing to do is to just write a new IR node, you know, write the lowering to that IR node,
and then write out all of the code gen that you need to do for that particular IR node.
And so this actually is often the path of least resistance, so people added a bunch of IR nodes.
But in actuality, it’s often the case that you can reuse some pre-existing IR node.
For example, we have an IR node that represents calling into an external kernel.
It’s called extern kernel, predictably speaking.
And this IR node does a lot, right?
Like, when you pass in inputs to an extern kernel,
we have to do a lot of stuff to make sure they all exist as actual tensors,
and there’s also a lot of logic for actually generating the code gen in the situation.
So in a lot of cases, it would have been better if we had generalized extern kernel
to work with more things and reused it for a bunch of IR nodes.
But we haven’t.
This is a good refactor if, you know, you’re interested in this sort of thing.
When you’re working with an IR node, there are a bunch of things you can customize.
And this is also one of the reasons people define an IR node.
For example, the scheduler needs to decide what order to run.
So you need to report what the read-write dependencies are.
This is something that you can customize on an IR node level basis when you’re writing a new IR node.
Similarly, we have a concept of side effects, right?
If an IR node has a side effect, we’re not allowed to dead code eliminate it.
So that’s also something you can change when you’re working with an IR node.
One other thing that’s really useful when you’re working with IR nodes is we do keep track of origins for them.
So the original ATEN graph has a bunch of ATEN FX nodes, and we keep track of which ATEN FX node produced a particular IR node.
Now, this is not a single one-to-one mapping because when we fuse things together, lots of ATEN operations might go into the same IR node.
Conversely, a single ATEN operation might get desugared into multiple IR nodes if it’s doing, say, a point-wise operation and then a reduction.
But this is really useful, and it’s how we generate meaningful kernel names, for example, if you have that enabled in Triton.
So, yeah, that’s a whirlwind tour to Inductor IR.
As I said, it’s highly in flux, and I don’t claim to be the world expert on Inductor IR, but hopefully that gives you an idea for how to look around this pretty important Inductor IR data structure.
Thanks for listening. See you next time.
EP72 Unsigned-integers
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about unsigned
integer support that we recently added to PyTorch. PyTorch has supported unsigned integers but
only for 8-bit integers. So you can do uint8, also known as byte, but you don’t get any of the
other unsigned integer types: uint16, uint32, uint64. The reason for this is mostly historical. The
torch, the TH library that PyTorch was originally built off of, didn’t have support for these dtypes.
And so we never really added them. Most people could deal with having only the signed integer
variants. That being said, it was kind of a pain not having them for several reasons. One is that
sometimes you want the little bit of extra range that you get from an unsigned integer, say a 16-bit
unsigned integer, that you’re losing half of the range with a signed integer if you’re only doing
it for indexing. And also unsigned integers are great for doing bit manipulation because most of
the bitwise operations are well-defined on them, as opposed to signed integers where if you overflow,
that’s undefined behavior. Who knows what’s going to happen? So I finally got fed up with this and on my
plane ride back from holidays, I decided to go ahead and implement it. So we now have unsigned integer
support in PyTorch. It’s a bit restricted. So one of the problems, and probably the reason why TH didn’t
have unsigned integer support to begin with, is you have to pay a cost whenever you add a new dtype to
PyTorch, right? For every kernel that you want to support your particular dtype, you actually need to
generate code for it. And so when you add a new dtype to PyTorch, that actually ends up being a lot of
extra binary size for all the new kernels you have to add. And, you know, we are already a very, very
large binary if you’ve ever had to download PyTorch. And, you know, adding some more binary size for some,
you know, dtypes that people mostly don’t use in deep learning just didn’t seem like a good trade-off
for us. It gets especially worse when you consider combinatorial explosion of operations. For example, let’s say that I want to do a
operation between a signed integer and an unsigned integer. Well, if I want to avoid doing a conversion, I have to actually generate a fused
kernel for this case to do all the operations together. And sometimes you can’t even do the operation conveniently without a fusion. Like if you want to do a
comparison, if you want to do an equality test between a signed int64 and an unsigned int64, well, how are you even going to do that? You can’t do the
conversion because the conversion will overflow. Well, if you’re okay with overflow semantics, then I suppose that’s fine. That’s another question that I’ll get to
really shortly. So with unsigned integer support in PyTorch, I made a compromise. And the compromise was this: we’ll add a few new
kernels, you know, a few extra kernels isn’t going to break the camel’s back. The main problem is, you know, when you take the entirety of
PyTorch’s operator space and multiply it by another, you know, three dtypes. So we’re going to take only the most important operations, you know,
construction, you know, filling it with some constant, equality, but, you know, not addition, not multiplication, not those types of things. And those are the only things we’re going to
implement. And so essentially, it’s enough to like get you a little bit of interoperability with, you know, say numpy, which also supports unsigned
integers, but not that much to do anything that useful. And then what we’re going to do is we’re going to do a twofold strategy. So one is
that if you, a user, come to us, and you’re like, hey, I’ve got this use case, and I would really like support for unsigned integers, then, you know,
well, okay, fine. If you ask us for it, we’ll add one more. It’s kind of like, we’ll add things if
they actually are useful and used by someone. And we’re not going to have them if they’re just sort of randomly like, oh, you
know, we have integer matrix multiply. So I guess we have to add, you know, 16 bit unsigned integer matrix multiply. No, I
probably don’t actually want to spend, you know, binary size on that. So if you’ve got a good use case for it, then just send in
the bug report. And chances are, it’s pretty easy, you just have to modify one of the macros that is going
ahead and, you know, iterating through all the dtypes and stamping out versions of the code for each
of them; just go ahead and add the unsigned types to that. And, you know, you’ll get a kernel for
that. So I think I expect to like, accept a trickle of operations like this, slowly through the future.
The other strategy we have, and this is not entirely implemented yet, some of it is implemented,
but not all of it, is we’re going to use PyTorch 2 to implement all the operations. Because hey, you know,
what is PyTorch 2? Well, PyTorch 2 lets us do code gen on the fly for integer types. So it doesn’t matter
that you don’t have, out of the box, you know, an equality test between int64 and uint64; you can just
generate it on the fly if you torch.compile the operation in question. So this is sort of leaning
into this idea that in general in PyTorch, you know, PyTorch 2 is this cool thing. It’s a compiler.
Oh, normally we tell people to use it on their end models, but there’s also bits of, you know, the
regular PyTorch library that we could implement with PyTorch 2. And a d type like, you know, the unsigned
integers from 16 to 64 is a good example where, you know, if we don’t want to actually add all
of the kernel support in eager mode, we can still, you know, get it cheaply by using torch.compile.
Okay, so please send us any contributions you might like. You know, this is the sort of thing where I’ve
gone ahead and put in the basic infrastructure. So basic testing things work, but you know, everything
else doesn’t. So if you’re willing to roll up your sleeves and make some changes to PyTorch, at the same
time you’re trying to apply unsigned integers for some sort of use case of yours, I think this is a great
way to, you know, do a contribution. In fact, Thomas Viehmann messaged me on Slack
and he was like, “Hey, you know, I’ve got a fix. Do you want to, you know, do you want me to send it in?”
I was like, “Yes, please. Absolutely.”
Okay, so I want to talk a little bit about a few things to know about the unsigned integer
implementation because I thought it was going to be trivial, right? We already support integers, we support
signed integers, and we support uint8. So surely it’s just doing the same old thing. Well, not quite.
So here are the main things that are problematic. So one, we need to decide what our semantics are
regarding signed unsigned overflow situations. So for example, if I have negative one, and I compare this
against, you know, hexadecimal 0xFFFFFF, well, you know, on a bit level, this is the same thing.
But if you like ask Python, Python’s like, “Well, no, these are not the same number. One of them is negative,
and one of them is a very large number.” So we need to decide whether or not we’re following C semantics
or Python semantics. Actually, before this podcast, I should have checked what numpy semantics here were,
but I didn’t. So we’ll need to check what numpy semantics are, we’ll need to check what the existing
semantics for uint8 and int8 are, and then make a call about what exactly we want to do.
In particular, we actually have a class in C++ called c10::Scalar, which represents essentially
any sort of scalar type that Python is able to represent. And I had a problem while I was implementing
this. I was like, “Okay, well, I can store signed integers in this, and I can store unsigned integers
in this, and I can also store booleans and floats and whatever.” And if someone asks me, “Hey, you know,
what’s the equality between these two ints? What should I do?” And not, not an easy answer to this
question. In the end, I believe I was like, “Okay, well, this semantically is representing a Python
big int, which can be arbitrary precision. So no, these should not be the same thing.” But I’m not,
I’m not convinced that the actual kernel should necessarily operate the same way. Of course,
in torch compile, you know, it doesn’t really matter. You can get whatever semantics you want,
we just need a way of actually spelling it out. And usually there is a way of spelling it out.
Some other things. So we have a promotion problem regarding our compatibility with numpy.
So let’s suppose that you have a uint8 tensor. So this existed in PyTorch before the new support
we added, and you do a sum on it. What type does it promote into? Well, you know, the dumb answer
is a uint8, which is not correct. It’s not what we do. And it’s also probably not what you want,
because if you’re, you know, if you’re using these integer tensors, you usually want them to denote
actual integers. So you probably don’t want them to overflow when you run out. But if you have a big
pile of, you know, uint8s, you’re definitely going to overflow your 8-bit integer.
So we actually promote this to int64. Now, why int64 and not uint64? Well, you know,
remember, we didn’t have support for uint64. So, you know, producing a uint64 tensor is not possible.
So we just gave it the next best type. And, you know, it’s not like you’re going to really miss
that, you know, last half of the range, you know, 2 to the 63. However, this is not
compatible with numpy’s behavior. When you do a numpy sum on a numpy uint8 nd array,
it’ll give you a uint64. So we’re inconsistent. And if you do ever want to add the sum operations
to the higher size ones, uint16, uint32, we have a choice to make, right?
We can be consistent with how we currently do it with uint8 and produce an int64 tensor.
Or we can be inconsistent, but match numpy semantics and have it be a uint64 tensor.
This is especially poignant for the uint64 tensor, which we probably definitely want to be inconsistent
with uint8. Because, you know, it would be extremely strange if you summed over a uint64 tensor
and you got a int64 tensor. You just lost the entire…
You just… There’s absolutely no reason to do it this way.
But this is something we have to figure out.
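For reference, this is what the current behavior described above looks like; NumPy's result dtype follows its platform-default integer rules, so the second line prints uint64 on typical 64-bit builds.

import numpy as np
import torch

print(torch.ones(4, dtype=torch.uint8).sum().dtype)   # torch.int64 today
print(np.ones(4, dtype=np.uint8).sum().dtype)          # uint64 on typical 64-bit platforms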
Another thing that’s a bit of a pain with PyTorch today is our handling for the very top range of uint64.
We have lots of places in the PyTorch codebase where we’ve hard-coded int64.
For example, when you do a randint call, the randint call takes an integer min and an integer max.
And those are represented in C++ as int64.
Well, you’re going to have a hard time actually representing a randint call on a uint64 dtype that covers the entire range of uint64.
Because you just can’t.
You don’t have enough space in your int64.
So we need to do something about that as well.
I think probably the right call is to add a new overload to randint that takes in a scalar.
Because scalars…
Scalars are this union type.
So I’ve got a tag and I can say, “Oh, this is big.
Need to store it in a uint64 instead of an int64.”
But it’s something that we’ll have to do.
The random number generation doesn’t really work with uint64.
Buyer beware.
Probably your best bet is to generate two uint32s and then, you know, use some bit twiddling to cat them together.
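A hedged sketch of that workaround: draw two 32-bit halves and pack them into one 64-bit pattern; whether you can then reinterpret the result as uint64 depends on how much eager support your build has.

import torch

n = 8
hi = torch.randint(0, 2**32, (n,), dtype=torch.int64)
lo = torch.randint(0, 2**32, (n,), dtype=torch.int64)
bits64 = (hi << 32) | lo   # 64 random bits; viewed as uint64 this covers the full range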
Finally, one last thing I want to say is that we’ve added all of the uint, you know, large uint types.
But we’re also considering adding some small sub-byte-size unsigned integers.
So that’s uint1 through uint7.
So these are kind of strange because they’re not byte size.
And in fact, we can’t implement them in C++ in the traditional way.
But remember, we’ve got this awesome compiler that we can use to do things.
So our plan on record with the sub-byte-size unsigned integer types is that we are going to implement them via Python.
So the idea is that, hey, you can’t actually directly do a uint1 operation typically, but you can reinterpret it as a uint8 tensor and then do whatever byte operations, bit-level operations you need to do to, you know, simulate the operation in question.
And sometimes this is not very convenient to do, like, you know, if you want to do addition, the carries are probably kind of a pain.
But, you know, you can do it, right?
Like, especially because you probably don’t actually have int1 hardware.
So you’re going to have to simulate it by doing a bunch of larger size operations anyway.
On CUDA, probably the performance won’t even be that bad, assuming you are bandwidth-bound rather than compute-bound.
And, you know, there are going to be a bunch of steps to getting all of this working.
But we definitely don’t expect any good C++ eager mode support.
So it’s going to all be via PyTorch 2.
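As a tiny illustration of the reinterpret-as-uint8 idea, bit-level operations on a uint8 carrier already work in eager mode; this is just packing bits by hand, not the actual sub-byte dtype support.

import torch

a = torch.tensor([0b10110010], dtype=torch.uint8)
b = torch.tensor([0b01110001], dtype=torch.uint8)
print(a & b)   # simulate element-wise "uint1" AND on eight packed bits
print(a ^ b)   # likewise for XOR; addition is more painful because of carries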
We do have some people working on this because, you know, sub-byte quantization is very popular.
So the quantization team is working on this.
That’s everything I wanted to say about unsigned integers.
See you next time.
EP73 Inductor—Define-by-run-IR
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk a little more in detail about the defined by run portion of Inductor’s IR.
What is the defined by run portion of Inductor’s IR? Well in our previous podcast episode about Inductor’s IR, we talked about the various IR nodes that are explicitly represented as an intermediate representation between ATen operations and when we actually do Triton or C++ code generation.
Well the defined by run portion of the IR is the specific part of the IR which is responsible for representing the element-wise operations that you might be interested in doing when you’re representing some operation.
So the canonical example of when we use defined by run IR in Inductor is for example when we’re representing a point-wise operation.
So to represent a point-wise operation we get a regular point-wise IR node that represents the entirety of the point-wise operation but then there is an inner function which represents the actual compute that is going to happen inside the point-wise operation.
So when you are thinking about where are the data structures for inductors IR, you’ll see all the top-level ones have classes that are subclasses of IR node, but then all the little actual compute, all of that stuff is going to be done via this defined by run IR.
So how can I go about and read about what exactly this defined by run IR is?
Well, if you were asking me a month ago, I’d say, well, you kind of have to figure it out by reading the code.
Fortunately, I recently added a pull request to Inductor to basically document the entirety of what I call the ops handler inside Inductor.
Because the way that the defined by run IR works is that we’re constructing functions that are calling other functions, in this case, operations in the ops namespace, which we have the ability to override the meaning of so that we can do different things depending on what we need to do.
So just to break it down, in inside Inductor, there’s a module called virtualized.
What this module does is it defines some dynamically scoped variables, which represent various things you might be interested in querying when you’re performing operations in Inductor.
And the one we’re particularly interested in is a global variable, well, not really global, it’s thread local, variable called ops, which represents all of the potential operations you can do inside the defined by run Inductor IR.
So ops has a method named add, it has a method named store, load, etc.
When we are defining, for example, a pointwise operation, and we want to define the inner function for that pointwise operation, what the inner function is going to look like, it’s going to say, well, I’m a function.
And once you pass me in some indexes, usually these functions take indexes as arguments saying, you know, hey, this is where you should actually get information from, what is typically going to happen is you’re going to go ahead and, you know, do a load to read out the information in question.
And then, with the result, actually call, you know, addition or multiplication or whatever, you know, actual operation that you want to do.
And this all gets packaged up into the inner function, which gets associated with, for example, a pointwise operation.
So when I do something like this, I have the ability to basically change the meaning of what calls to ops means, depending on what I need to do.
So the, like, very most simple example of what you might want to do is you might want to turn one of these inner functions into a string representing, well, what is the actual computation that you want to do?
When you print out a pointwise IR node, and you get out, you know, hey, the inner fun is this thing, we’re actually calling this function inner function string, which is doing this operation.
So what exactly does it do?
Well, it says, okay, let me go ahead and override the ops handler, the meaning of the ops object in virtualized.
So that points to a kernel printing handler, which basically says, okay, well, you know, whenever you call me, what I’m going to do is I’m going to turn your call into a string representing whatever it is that you call me with, and then, you know, return those strings.
And so when you’re done, you basically get a, you know, string representation of all the operations that happened in that case.
And so everything that you want to do, code generation, semantic analysis, they all operate by overriding the meaning of ops in the virtualized namespace, and then going ahead and running the inner function directly.
You can even reify this inner function into a plain FX IR.
It’s, in fact, very simple.
What you do is you just say, okay, what I’m going to do is instead of passing in regular index variables, I’m just going to pass in FX proxies.
And when I, you know, do calls on those FX proxies, I’m going to instead record, you know, what calls actually happened into an FX graph.
So, you know, very simple, very easy to write. Writing things in this way is also very convenient in Python because Python supports a lot of metaprogramming.
So if you’re writing one of these operator handlers, you’re often just like, well, you know, for most things I have a very generic formula that works for any of them, because there are tons of these operations, right?
Like, every primitive math operation, actually, the way to think about it is, every torch operator which we support point-wise compute on, you know, and that includes things like negate and sign.
Each of these has an ops definition.
Now, sometimes when we’re actually doing code gen, we can desugar these into more primitive operations and we often do, but just for ease of sort of wiring everything up, basically everything that is supported in the torch front end gets an ops operator inside of this define by run IR.
So there’s a lot of these that you have to handle, and people often, you know, don’t need to handle them all individually.
They can just write a generic get attribute that takes in some list of positional arguments, takes in some list of keyword arguments, and then does the operation on all of these things.
So that’s pretty nice.
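Here is a hedged sketch of that generic-handler trick; the handler name is invented, and the real machinery lives in torch/_inductor/virtualized.py with a swap-in/swap-out protocol rather than an explicit ops argument.

class PrintingOpsHandlerSketch:
    # one generic __getattr__ covers every ops.* method
    def __getattr__(self, name):
        def op(*args):
            return f"ops.{name}({', '.join(map(str, args))})"
        return op

def inner_fn(ops, index):
    return ops.add(ops.load("buf0", index), ops.load("buf1", index))

print(inner_fn(PrintingOpsHandlerSketch(), "i0"))
# prints: ops.add(ops.load(buf0, i0), ops.load(buf1, i0))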
So what exactly should I expect to see when I am looking at the supported operations in, you know, ops inside this define by run IR?
So as I’ve mentioned, there’s all of the regular, you know, arithmetic computation that you might be interested in.
Those are quite uniform, so I’m not going to talk about them too much.
There’s also operations for reading from memory and storing from memory.
So store and load.
There’s also operations for interacting with randomness.
Randomness is directly encoded in the define by run IR, you know, because they require special code generation, typically.
And there’s also some kind of really unusual things that are also supported in this IR.
So, for example, one of the very important things we need to do when we are generating code is we need to compute indexing expressions that say exactly where in memory we want to read from, right?
And so the normal situation when you’re doing indexing is you get a bunch of SymPy expressions representing, you know, some sort of indexing compute.
These are represented as SymPy expressions because we want to be able to simplify them, because in general the indexing is, you know, going to be very complicated.
You know, you need to multiply every index dimension by the stride and, you know, do all of that.
But sometimes it can be simplified quite a bit and then, you know, maybe you only need a single index variable at the end.
So you typically have these SimPy expressions floating around.
But of course, sometimes, you know, we want to do operations which, for example, depend on, you know, a indirect.
You want to do some indirect indexing where you have some computation that you did based on tensor data and then you want to actually do that to do an indexing expression.
So there’s an indirect indexing function which essentially takes a regular, you know, regular tensor compute value that you computed and then turns it into a SymPy expression so that you can use it in subsequent indexing operations.
So this one’s very unusual and typically needs special handling because most of the operations inside ops return, I want to say just some tensor value.
It’s actually not well defined what the ops handler returns because whenever you’re doing different analyses, we will override the return value to mean different things, right?
If I’m formatting my inner function to be a string, these functions are going to take in strings representing the various, you know, inputs and then return a string saying, hey, this is what the output, you know, string format is going to be.
And if I’m doing some sort of code generation, that’s typically what I’m typically passing around is not a string, but this thing called a CSE value, which is like a string, but also we’re doing some common sub expression elimination while we’re at it.
But indirect indexing is different: it takes in one of these, you know, unspecified values and produces a SymPy expression.
Now, unlike all of the, like, regular tensor compute point-wise operations, we don’t actually override the meaning of SymPy expressions.
So SymPy expressions are always done via SymPy.
They’re always represented explicitly as the SymPy, you know, abstract syntax tree.
So, you know, you actually do need to provide a SymPy expression, even if it’s just, like, a bogus one, when you’re implementing something like indirect indexing.
Some of the other unusual operations we support.
So, for example, the defined by run IR is also higher order in some cases.
For example, the masked operator handles a situation where you are performing, you know, some sort of set of operations, like, say, some loads and stores, but they may not always be valid.
For example, you are doing an indirect load, and sometimes the index is invalid.
And in fact, what is happening is that you had some condition which said, hey, should I do the load or not?
I can’t unconditionally do the load because if I unconditionally do the load, I’ll have an illegal memory access.
So I need to mask out the load only on the, you know, parallel compute where the index is valid should I actually do the load.
So the masked operator lets us do this by simply saying, okay, well, give me a mask saying whether or not the index is valid or not.
And then give me some, you know, function like an inner function inside of my defined by run IR, which actually has the stores and loads that I want to have run in a masked fashion.
And I actually, I checked the implementation while I was preparing this podcast, and all we do is we just override the meaning of store and load before we go ahead and execute the body of the masked load.
So, you know, it’s not only that, when you are at the top level and you’re trying to decide what to do with the define by run IR, you override the meaning of operations.
But also we can recursively override the meaning within these local scopes to make them do different things.
So, you know, the last set of operations that you’ll get are some weird, you know, sort of collective style aggregation things.
Like if you’re doing reductions or scans, we also have operations representing those in the IR because, well, you need a little more juice to actually represent that.
We do have dedicated top level IR nodes representing reductions and things like that.
And these special operations are typically not valid unless they’re run in a context like that.
But, you know, they’re also something interesting to know about.
Peter Bell is the expert on scan, having been the one who implemented it in the first place.
So we have talked about the defined by run IR in more detail.
We’ve talked about the operators inside it and the general way you work with this, namely by overriding virtualized.
That’s everything I wanted to talk about today.
Talk to you next time.
EP74 PT2-extension-points
Hello, everyone, and welcome to the PyTorch Developer Podcast.
Today, I want to talk about extension points to PyTorch 2.
A lot of the work we’re doing in PyTorch 2 involves adding new features,
which sometimes have implications all over the stack.
PyTorch 2’s stack has a lot of different layers,
and so sometimes planning out a change like this can be quite daunting
because it’s like, well, to add this feature, I need to understand Dynamo,
and I need to understand AOT autograd, and I need to understand Inductor.
Most people who work on PyTorch 2 full-time only really work on one layer of the stack at a time,
so asking someone to know about all of these things just so that they get a new feature,
that’s a bit of a lift.
Fortunately, there are a number of pre-existing extension points in PyTorch 2
which you can use to implement functionality that otherwise doesn’t exist right now.
And further than more than that, even we have some things that conceptually make sense
but are just not implemented yet, but they could be implemented if someone wanted to go out and do them.
So in today’s podcast, I want to walk us through some of the extension points in the PyTorch 2 stack
and tell you about how these work and how come they’re consistent with the overall architecture of PyTorch 2.
Because one of the main themes about these extension points is that the easy-to-implement extensions
involve only a change to one part of our stack without changing any of the global invariants throughout our stack.
And we have some limited cases where we have a way to customize the behavior of something all the way through,
but, you know, that tends to be a lot more work because you have to tell every subsystem how to deal with something in that question.
So to get started, let’s first quickly look at the topmost layer of the stack, namely Dynamo.
Dynamo is all about understanding any given piece of Python code:
what exactly is it doing, capturing it into a form that is an FX graph that is well-behaved enough
that we can run AOT Autograd on it to trace out an actual set of functions in the end.
So if we are thinking about what exactly, you know, we can do in the Dynamo frontend that isn’t too difficult to do,
one of the easiest and easiest-to-understand extensions is just adding support for other function calls.
So, you know, what is Dynamo’s job in life?
Dynamo’s job is to look at bytecode, figure out what it’s doing,
and then put an appropriate function call into the graph so that AOT Autograd handles it.
So you can change whether or not something is put into the graph simply by marking something as allowed in graph.
Now, there are restrictions.
When you mark something as allowed in graph, the function you place in the graph has to, quote-unquote,
work with AOT Autograd because what you are saying is that this function is well-behaved enough
so that AOT Autograd can trace through it.
And so what that means is that, you know, it has to support a fake implementation
where you can run it with fake tensors without actually having to have real data.
It needs to not have side effects, or it can only have side effects in limited situations
where it is only allowed to, you know, mutate tensors.
And if it mutates a tensor, it needs to be able to tell AOT Autograd that it’s doing it in this way.
Additionally, the function needs to only operate on basic types that are supported by FX.
Normally, these are the set of types that are supported by TorchScript.
So that’s tensor, list of tensors, int, you know, basic primitive types.
If you’ve got a custom data type and you want that custom data type to be preserved inside of the FX graph
that Dynamo is producing, that is much more of a lift.
But just putting another function and asking it to be directly traced through,
that’s something that you can do quite easily inside Dynamo itself.
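For example, something along these lines; torch._dynamo.allow_in_graph (also exposed as torch.compiler.allow_in_graph in newer releases) is the hook being described, and my_helper is just an illustrative function that meets the restrictions above.

import torch

@torch._dynamo.allow_in_graph
def my_helper(x):
    # must be traceable with fake tensors and free of unsupported side effects
    return x.sin() + 1

@torch.compile
def f(x):
    return my_helper(x) * 2   # my_helper shows up as a single call in the captured graph

print(f(torch.randn(4)))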
A step up from just putting in a, you know, function for regular tracing is the so-called higher order operators mechanism.
We call them higher order operators because typically the reason they exist is because they are operators
that take in not just regular arguments, but also arbitrary callables,
which themselves tend to contain more graph operations.
So typically, a higher order operation with one of these callables will call that callable maybe never or once or twice or whatever.
So a canonical example of a higher order operator is the cond operator,
which takes in two callables for the, you know, true side and the false side.
And, you know, at runtime only executes one of them.
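A hedged example of what that looks like from user code; the exact import location of cond has moved between releases (recent versions expose it as torch.cond), and it generally wants to run under torch.compile or export.

import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(pred, x):
    # only one branch actually executes at runtime
    return torch.cond(pred, true_fn, false_fn, (x,))

print(f(torch.tensor(True), torch.randn(4)))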
These higher order operators can be pretty restrictive.
They are typically not allowed to have side effects.
The bodies of these functions are typically not allowed to interact with the Python state in any non-trivial way.
And when you implement a new higher order operator, you know, one of the things is that most of our basic infrastructure doesn’t work on them.
So you have to say exactly, for example, how you want all of the AOT autograd passes to work on them.
But this is also a sort of well-known extension point.
And when people want to add, you know, new operations that are a bit more complicated,
usually you use the higher order operator mechanism.
So you can extend Dynamo by modifying what it is willing to output to give AOT autograd.
You can also extend Dynamo by making changes to how Dynamo processes the Python code that it is operating over.
For example, when you have some code in Dynamo that is calling some API, let’s say I have a NumPy call,
I can have Dynamo transparently translate this API call into an equivalent Torch function call.
And this is the mechanism by which we implemented our NumPy interoperability layer.
So if you have some code that does some operations on NumPy end arrays, we actually support transparently compiling this into PyTorch operations.
And you can often take a standard NumPy program and automatically get it running on CUDA without any modifications.
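A small hedged example of that interop layer in action:

import numpy as np
import torch

def numpy_fn(x, y):
    return np.sum(x * y, axis=1)    # plain NumPy code

compiled = torch.compile(numpy_fn)  # Dynamo translates the NumPy calls into torch ops
x = np.random.randn(8, 16)
y = np.random.randn(8, 16)
print(compiled(x, y).shape)         # the caller still gets a NumPy array back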
This is a very, you know, local change because all that’s going on is Dynamo is producing a new set of Torch operations
where previously it would have just graph-braked on the NumPy operations.
So this change only requires you to know about how to deal with Dynamo.
Similarly, if you have some sort of custom user library code or, you know, another C extension that you need to interoperate with,
one of the things that, you know, you could do is you could add support for it in Dynamo simply by teaching Dynamo what the semantics of these operations are.
We actually don’t have a public API extension mechanism for doing this right now because we just haven’t implemented it yet.
But, you know, in principle, Dynamo is unable to handle anything that goes into C extensions, and it often also can’t handle Python code that is too complicated or uses too many features.
But you can always teach Dynamo internally to have a special case for this sort of situation and handle it in some direct way.
We actually have had some discussions about what a good API for this might look like.
One really promising idea is the concept of polyfills.
A polyfill, a term from JavaScript, is when you have an implementation of some feature that is normally natively provided by your runtime, written in the plain scripting language,
in this case Python, or JavaScript in the case of the web.
So a polyfill would make sense in Dynamo because if you’ve got some code which doesn’t work with Dynamo because it’s implemented in C,
if you write an equivalent implementation of it in Python, then Dynamo can just transparently trace into the Python implementation
and understand what your program is doing.
So this is a really promising way for letting people who own C libraries and want to interoperate with Dynamo to let things work.
And finally, one really interesting possibility that Michael Suo has been investigating is the possibility of allowing Dynamo to trace non-standard tensor types into the graph entirely.
And so this is a good segue into the AOT Autograd segment of this podcast episode because to do this, Dynamo is actually the easy part, right?
So to handle an arbitrary class, in our particular case, we wanted to reuse the mechanism from TorchScript called TorchBind,
which lets you take arbitrary C++ classes and make them available in TorchScript programs.
And all you need to do in Dynamo is just say, okay, well, if I see some operations on one of these TorchScript classes, one of these TorchBind classes,
all I need to do is just go ahead and write these operations in the graph.
So this is actually the easy part as far as Dynamo is concerned.
You just need a way of, once again, dry running these operations without having real data.
The real problem is: once you have these operations in the graph, what exactly is AOT Autograd going to do with them?
So remember, AOT Autograd is the part of our stack which is responsible for taking the output Python graph that was produced by Dynamo
and then actually using all of the semantics, all the layers of PyTorch, including Autograd, including functionalization,
all of these things to trace out a low-level Aten representation, which is suitable for handing to the backend compiler.
So this is the part that actually knows all the smarts about how all of the various subsystems in traditional Eager PyTorch work.
And this is, for example, the place where when you add a new higher-order op,
you now have to specify how this higher-order op should interact with each of the various things,
like tracing or functionalization or fake tensors, because that’s what AOT Autograd is going to use.
So if we talk about something like TorchBind, then, you know, if you do add support for TorchBind,
which isn't complete yet, you have these weird objects which aren't actual tensor operations.
And so if you wanted AOT Autograd to work with them, you’d also have to teach AOT Autograd
how to either partition them away, which is a very valid thing to do, right?
Like before you go from Dynamo to AOT Autograd, you could partition your graph up into multiple pieces
and only feed in AOT Autograd the pieces that AOT Autograd actually understands.
In fact, this is what we do for DDP Optimize.
DDP Optimize is an option you can use when you are running PyTorch 2 with distributed data parallel.
And what it does is it manually chunks up our graph so that you get pipelining with DDP
where, you know, every chunk starts sending the gradients to the nodes before you finish running everything else.
So you’re not waiting for all the communications at the very end.
And that’s done by splitting up the graph before we pass it to AOT Autograd.
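For reference, the knob that controls this graph-splitting behavior is, at the time of this episode, a Dynamo config flag; a rough sketch of turning it on, assuming the usual DDP setup has already happened elsewhere, looks like this.

import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# optimize_ddp enables the bucket-aware graph splitting described above
# (it is on by default in recent releases; shown here only to make the knob explicit)
torch._dynamo.config.optimize_ddp = True

# assuming the process group has already been initialized elsewhere:
# model = DDP(nn.Linear(16, 16).cuda())
# compiled = torch.compile(model)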
So you can conceivably get rid of things AOT Autograd can't understand by partitioning them into their own subgraphs before AOT Autograd handles them.
You can also make AOT Autograd handle things directly.
And with higher order ops, you can just specify how exactly, you know, the various layers should happen.
Or, for example, Brian Hirsh recently added support for tensor subclasses.
So, in fact, tensor subclasses are a really nice extension point in PyTorch 2.
And the reason they’re so nice is because, you know, well, tensor subclasses act like normal tensors.
So they typically don't need that many changes on the Dynamo side.
That’s not entirely true.
For example, DTensor is built with tensor subclasses,
and it has some extra API on top.
And sometimes Dynamo needs to be taught how to understand that API and translate it into the graph.
And once you get to AOT Autograd, the real question is basically how to go ahead and de-sugar this tensor subclass into a simplified program that doesn't have any tensor subclasses in it.
So tensor subclasses maybe require some Dynamo work, have some support for it in AOT Autograd, but it evaporates by the time you get to the backend compiler.
So you don’t actually need to, you know, work on Inductor if you do something like this.
And, of course, AOT Autograd has a bunch of other knobs which you can use.
For example, we have decompositions, which are the entire way we, you know, break down operations into simpler forms for the compiler.
And you can do pre-autograd decompositions.
You can also do post-autograd decompositions.
These are all valid things to do, and you can customize them.
You can obviously implement custom operators, which are just, you know, just like regular operators that PyTorch has natively.
But, you know, if you go ahead and use this API and implement what all the various operations on them should be,
you can actually just preserve them all the way to Inductor.
And Inductor will just call you when you actually want to run the operation.
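As a sketch of what that looks like with the newer torch.library custom-op API (which postdates this episode, so treat the exact spelling as illustrative): the operator stays opaque to the compiler, you give it a fake implementation so tracing knows the output metadata, and the compiled program calls back into your implementation at runtime.

import torch

@torch.library.custom_op("mylib::scale_shift", mutates_args=())
def scale_shift(x: torch.Tensor, scale: float) -> torch.Tensor:
    # real implementation, treated as a black box by the compiler
    return x * scale + 1.0

@scale_shift.register_fake
def _(x, scale):
    # tells fake tensors / AOT Autograd what the output metadata looks like
    return torch.empty_like(x)

@torch.compile
def f(x):
    return scale_shift(x, 2.0).relu()

print(f(torch.randn(4)))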
And finally, once you get to Inductor, there’s a few more things you can do.
So, for example, at an Inductor level, you can introduce the concept of a new IR node,
which lets you control how exactly code generation works when you go ahead and generate the Python code or the C++ code that’s going to represent the operation.
This is, you know, usually you don’t need to because just being able to call some external function is usually good enough,
and we have built-in support for that, but, you know, it’s something you can do and people have added a lot of IR nodes to Inductor for better or for worse.
There's also the ability to take custom Triton code and send it all the way to Inductor.
This is some work by Oguz.
It's pretty nice because often people are writing these Triton kernels for, you know, the very most important pieces of their model,
and it's nice to have that interoperate with PyTorch 2.
And Inductor also has some facilities for doing code generation.
So, for example, let’s say that you are doing matrix multiplies.
We have the ability to generate epilogues and fuse them in.
And this capability basically says, hey, Inductor knows how to generate simple code for point-wise operations.
So if you've got some complicated CUDA kernel with a spot where you just want to paste in some arbitrary extra point-wise code that the user provided,
that's something you can do.
We also have some examples of people wanting to go ahead and add first-class concepts to the Inductor IR.
For example, when we were working on nested tensor, this is something that, you know,
you do need to generate different kernels that are pretty different from normal point-wise kernels when you want to do this generation.
This is probably the hardest thing to do because, obviously, to get this concept all the way down to Inductor,
you had to have, you know, made Dynamo and AOT Autograd play ball.
So definitely a choice of last resort.
So we've talked about a bunch of extension points which the PyTorch 2 stack provides.
Some of them have public APIs and you can use them directly.
Some of them are just, you know, ideas that are, you know, architecturally consistent with how PyTorch 2 works,
but just haven’t been implemented yet.
So someone has to, you know, roll their sleeves up and implement them.
So that’s it for our whirlwind tour of all the things you can extend PyTorch 2 with.
Hopefully in some later podcast episodes, we can dig into some of these things in more detail.
Thanks for listening.
EP75 Compiled-autograd
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about Compiled Autograd,
a feature implemented by Jason Ansel
that allows us to compile the entirety of a backward pass
in a PyTorch program in the same way
PyTorch 2 normally is able to compile pieces of your forward.
To understand why Compiled Autograd is its own thing
and isn’t related to the regular strategy
for handling forwards and backwards in PyTorch 2,
we first have to briefly go over
how does automatic differentiation work in PyTorch 2 normally?
If you’ve listened to the AOT Autograd podcast,
you may know that the way PyTorch 2 works
is that after we capture the forward steps only of a graph,
we pass this along to AOT Autograd,
which is responsible for tracing out a joint forward-backwards graph
that gets partitioned into a forward and a backwards
that then get compiled separately
and assembled together into a custom Autograd function,
which is what’s actually responsible for hooking up to the Autograd engine.
Now, it’s worth remembering that a custom Autograd function
has a property, which is that when you run it,
you run the forwards,
and then later when you actually call backward,
you are still running the normal Autograd engine
that PyTorch has,
and the normal Autograd engine is eventually going to call the big chunky compiled backward
from your custom Autograd function to actually run your backward instead.
So we’re just using the good old-fashioned custom Autograd function functionality
that Autograd already has,
but in this case, we're generating compiled regions
for the forward and the backward.
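To make the structure concrete, here is a minimal sketch of the shape of what gets generated: a custom autograd.Function whose forward and backward call separately compiled artifacts. compiled_fw and compiled_bw here are just stand-ins for those compiled regions, not real PyTorch APIs.

import torch

def compiled_fw(x):
    # stand-in for a compiled forward: outputs plus saved activations
    return x.sin(), x

def compiled_bw(saved_x, grad_out):
    # stand-in for a compiled backward
    return grad_out * saved_x.cos()

class CompiledFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        out, saved = compiled_fw(x)
        ctx.save_for_backward(saved)
        return out

    @staticmethod
    def backward(ctx, grad_out):
        (saved,) = ctx.saved_tensors
        return compiled_bw(saved, grad_out)

x = torch.randn(4, requires_grad=True)
# the backward is still routed through the ordinary eager autograd engine
CompiledFunction.apply(x).sum().backward()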
So this is how PyTorch 2 normally works.
This is a really nice model for PyTorch 2
because it means we can transparently work with graph breaks.
Remember, PyTorch 2 is all about
if you can’t compile the entirety of your model,
that’s fine.
You can just compile the parts of it that Dynamo understands
and fall back to eager mode.
And so if you have a compiled Autograd function,
this just smoothly works in this situation.
But sometimes this is not what you want.
For example, suppose you wanted to fuse
the forward and backwards of your program all in one go?
So you wanted a single compiled region that handled everything in this case.
Well, you don’t want to produce a custom Autograd function
because a custom Autograd function requires you to always have that gap
where after you’re done with the forwards,
you have to go back to regular PyTorch
and call the backward function
to actually kick off the Autograd engine
that’ll eventually call your backwards.
So to actually do it all in one go,
we need something else.
We need something that actually can understand the backwards pass
in the same level of fidelity
as we are able to understand forward passes.
So before we try to tackle this more complicated goal,
compiled Autograd tackles a simpler version of this,
which is can we just go ahead and compile the dot backward
call that you have in your PyTorch program all in one go?
And this is a little different from the case where you have just a single compiled region
and you immediately call backward right after it.
And the reason it’s different is because when you call backwards
on a compiled Autograd function in the traditional PyTorch 2 sense,
there are still things that we’re not able to fuse into the graph.
For example, when we are done calculating the gradients of all of our parameters
that were used by the forwards pass of our function,
we still have to go ahead and put these into the grad fields
of all of the parameters lying around.
And if those grad fields actually already have tensors,
we may actually have to do additions to sort of combine the gradients
because this is something that PyTorch supports.
You can run backward multiple times with the same parameters
and we will just accumulate the gradients into them.
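A tiny example of the accumulation behavior being described:

import torch

w = torch.randn(3, requires_grad=True)

(w * 2).sum().backward()
print(w.grad)   # gradient from the first backward: all 2s

(w * 3).sum().backward()
print(w.grad)   # accumulated into the existing .grad: now all 5s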
So this gradient accumulation necessarily must happen outside of the compiled graph,
because there may be other uses of the parameter
that are not part of the compiled region,
and we’re not supposed to actually go ahead and do the accumulation
until we’ve gathered up everything and want to actually put it in.
Because, you know, like if I’ve just finished a compiled Autograd function,
I have a gradient for this parameter,
but I don't know if that's the entire gradient for that parameter,
because if I've used this parameter somewhere else in a non-compiled region,
that usage will also contribute a gradient.
And really only the Autograd engine knows in this case.
So how are we actually going to go ahead and do this?
Well, another problem that is actually instructive to think about
on how we’re going to do this is we also have another feature in PyTorch,
which is backward hooks.
Backward hooks show up in a number of situations.
For example, if you write a custom Autograd function, you know, yourself.
So instead of, you know, us using custom Autograd to, you know,
feed in some compiled backwards, you know, you can write one of these yourself
in case you’re writing some function that isn’t normally differentiable by Autograd
and you want to manually specify what the backwards is.
Well, these backward functions can have arbitrary Python code in them.
And so while we support natively tracing custom Autograd functions that are not too complicated,
if they’re really complicated, then we need something else to handle this case.
Similarly, we support just directly specifying backward hooks on variables,
which are just arbitrary functions, which we will call when the gradient for some function is computed.
And when this occurs, we will just call into your function and you can do whatever you want.
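For example, a tensor backward hook is just an arbitrary Python callable:

import torch

x = torch.randn(3, requires_grad=True)
y = x * 2

# arbitrary Python runs when y's gradient is computed; hooks can also mutate
# global state, launch collectives, etc., which is what makes them hard to
# handle with a make_fx-style trace alone
y.register_hook(lambda grad: print("got grad for y:", grad))

y.sum().backward()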
And things like DDP and FSDP, because they need to do special behavior in the backwards,
are often implemented via some sort of complicated backwards function that often is actually, you know,
interacting with Python state, updating things, you know, triggering collectives, fairly complicated stuff.
And most of our standard playbook for handling backwards functions in PyTorch 2 doesn't work
because when we are doing the backwards handling, we're out of Dynamo, we're in AOT Autograd.
AOT Autograd is all about doing a make_fx trace of the graph in question.
make_fx traces cannot in general deal with arbitrary Python, right?
That’s what Dynamo’s job is for.
So you have to only have traceable code.
So if you have traceable hooks, traceable backwards, then you can, you know,
sort of do everything in the traditional framework without dealing with trouble.
But if you have like Python code that is like updating a global variable somewhere,
you need Dynamo to do that.
You cannot do that with make_fx.
But that gives us, you know, an idea for how we might want to go about handling this, right?
If I want to somehow compile Autograd,
then what I need to do is I need to somehow get my Autograd step,
my backwards pass, in a form so that I can just run Dynamo on it, right?
Like if I could somehow replace the backward call with a big pile of Python functions
called one after another, calling into various backward functions,
then this would give me all the juice I need to, for example, handle some arbitrary Python code,
or, for example, to handle a fusion with AccumulateGrad, because it would all be right there,
all the stuff we need.
But how are we going to do this?
Well, let’s think about how Autograd works in PyTorch normally.
In PyTorch eager mode to do Autograd, we construct an Autograd graph with every operation we do.
And this Autograd graph represents the backwards computation that I need to do.
So when I call backward, I traverse this graph that I’ve created while I was doing my forward operations,
and that specifies the sequence of operations I need to do in order to run backwards.
So in principle, this seems like something that I should be able to turn into some sort of Python code
so that Dynamo can do it, right?
I just need to take this graph, somehow do something to it so that I get some Python code.
And then once I have that Python code, I can just go ahead and Dynamo it recursively in the normal way.
And this is basically what Compile Autograd does.
So if you are sort of nodding off at this point, this is really the main idea, right?
The main idea is take the Autograd graph, turn it into a Python code, and then Dynamo through it.
And all of the things you expect work.
In fact, graph breaks work, right?
Because if you have some regular Python code, then if I have a graph break in the middle of it,
I just say, okay, well, I need to just call into this thing that, you know, is doing something complicated.
And then I can go on and keep compiling everything else.
The main thing to know about this strategy is that if we take the entire Autograd graph
and turn it into a Python program, which we are then going to trace, then this compiled Autograd region
only works if you have exactly the same Autograd graph that you had originally, right?
Because if I, the next iteration around, have a different graph, perhaps depending on different
values, then I will end up with a different Python, you know, unrolling of the Autograd graph.
And obviously, you know, in regular PyTorch Eager, if I have two different functions, then
I have to compile them separately.
And so that would be the case here.
So you have to make sure you actually do the same backwards every time.
But in general, this is not a big problem for people who are working with relatively static
computation patterns.
Compile Autograd is a bad idea if you’re actually relying on Autograd to take care of sort of some
sort of dynamic Autograd structure.
So just something to know about if you’re trying to turn on Compile Autograd.
Okay, let's dig a little bit more into how exactly we do this conversion, because there are
some things that are a little tricky about it, and they are worth
knowing if you actually need to dig in and work with the code in question.
Intuitively, we have an Autograd graph.
And so what we would like to do is we would like to traverse over the graph, you know, node
by node, and go ahead and convert each of these Autograd function nodes into a corresponding Python
code for the function node.
So it turns out there are a few immediate problems you run into when you’re doing this.
The first problem is that an Autograd function node, namely, you know, one of these things that
says how exactly we're going to compute the backwards, does not correspond to a callable or really
anything that I could put into the corresponding PyTorch graph.
And one of the reasons for this is how we implemented Autograd in eager mode.
We have this thing called derivatives.yaml, where for any given forward function, and these
forward functions are regular old operators, and you can refer to them via functions in the
torch namespace, or really torch.ops.aten, if you want to be really technical about it.
The derivatives in derivatives.yaml are, you know, allowed to be specified directly
as mathematical formulas, without having to write another operator for them.
So no, you're not going to find a foo_backward for arbitrary functions.
Sometimes we have an underscore-backward function, because something is really complicated.
But most of the mathematical formulas are just, you know, doing a few operations together.
What this means is that a typical backwards formula is anonymous.
There is no backward op that I can call when I want to actually put it into the graph.
So what do we do to handle this case in compiled Autograd?
Well, simple.
If we can’t directly put the entire Autograd function node in the graph, let’s just go ahead
and trace it.
And tracing is okay, because backward function implementations in the core library are typically
very regular.
They are traceable.
In the same way, composites are typically traceable.
So we go ahead and we take the Autograd function in question.
We trace it using make_fx into some actual sequence of ATen operations.
And that's what actually gets put into the compiled Autograd graph.
So one thing to know is that if you're looking at the output of compiled Autograd, namely, what
exactly is the Python program, the FX graph it produces, that I'm about to process with Dynamo,
you may see a lot of ATen calls in it, but don't be deceived.
Just because there's a bunch of ATen calls in it doesn't mean that you've actually run
AOT Autograd.
You, in fact, haven't.
You are going to run make_fx on it again later once Dynamo has finished processing everything.
Because remember, Dynamo is going to end up, you know, sometimes going into Python hooks.
Those Python hooks are going to result in Torch function calls.
And those do need to get decomposed.
And the way you decompose them is by calling make_fx, aka AOT Autograd.
This is one important thing to know, right?
So we’re doing this tracing step to actually get out a graph representation.
Another thing that is important to know is that when we want to handle dynamic shapes, or we
want to handle sort of variation in the backwards graph, we can’t actually trace out the compute
exactly as is.
In particular, let’s suppose that we have some variable stored inside of the Autograd function
node, which was saved for backwards.
And this is very common, because a lot of backward formulas need to reference the original forward
arguments to actually express the mathematical derivative.
Well, we don’t want to hard code that exact tensor the next time around, because the next
time around, I will get another Autograd graph, it will have exactly the same structure, but
all the same variables are going to be different, because hopefully, you know, your forward pass
actually computed different values the next time you run backwards in this case.
So to actually handle this, we need to actually make sure that our produced Python
FX thing that we're going to go ahead and Dynamo later is parametric over the saved tensors, and
in the case of dynamic shapes, also the saved integers.
So we need some way of actually swapping out the Autograd function with a new one that’s
generalized with, you know, our faked parameterized versions of all these things.
Now, one obvious way to do this is to just go ahead and clone the Autograd function into
a new one with, you know, all the parameters replaced with their parametric generic equivalents.
But this turned out to be like annoying to do for various reasons, for example, because we
don't actually have a, you know, polymorphic API for working with these data classes.
And, you know, it’s C++, and everything has different fields, which is kind of a pain.
So the way Jason decided to do it is instead we have a few functions for mutating the Autograd
function node in question.
So the basic model is that you can sort of save a bunch of new values into the record in question.
So you overwrite them with your placeholders, you go ahead and do an operation, and then you
restore it back to the original value.
This is not very thread safe, but because the Autograd engine always takes out a lock when
it's running, we don't really allow multiple concurrent copies of the
Autograd engine to run at the same time,
so you can't actually observe this mutation.
Okay, so that’s the, you know, how the sausage is made.
And it’s important to emphasize how important compiled Autograd is, right?
So it doesn’t seem like much.
It’s just saying, hey, you can compile the entirety of the backward call in PyTorch 2.
But we actually, we don’t turn this on by default.
It’s a context manager that you have to explicitly opt into.
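For reference, the opt-in looked roughly like this at the time of this episode; the exact spelling is a private API and has moved around across releases, so treat it as a sketch.

import torch

def compiler_fn(gm):
    # compile the FX graph that compiled autograd produced for the backward
    return torch.compile(gm, backend="inductor")

model = torch.nn.Linear(8, 8)
out = torch.compile(model)(torch.randn(4, 8))

# the context manager wraps just the backward call
with torch._dynamo.compiled_autograd.enable(compiler_fn):
    out.sum().backward()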
And, you know, there’s still kind of bugs, because Autograd is complicated.
And, you know, like getting exact parity for this conversion process is not an easy thing
to do.
But we are very committed to making compiled Autograd work, because it is an essential ingredient
for any sort of non-trivial distributed compute that you might want to do on PyTorch 2.
So our long-term plans for dealing with compiled DDP, as well as compiled FSDP, all rely on compiled
Autograd so that we can actually handle their complicated backwards behavior that cannot just
directly be traced with AOT Autograd.
So if you’re looking for someone to go bug about what’s going on with compiled Autograd,
Simon Fan has been working on improving coverage and enabling compiled Autograd on our benchmark
suite by default.
There is a very nice update post that he made in early February that I've linked inside of
my State of PT2 posts on dev-discuss, which I recommend checking out.
Another interesting work stream that’s going on at the same time is Jack Cao from Google has
been working on that original goal I told you about, which is, can we go ahead and compile
forward and backward all in one go?
So the PyTorch XLA integration is very interested in this because it’s very expensive in XLA to
do this, you know, swapping out of a graph back to Python and then back in again.
And so it’s a lot more expensive than in, you know, regular PyTorch CUDA.
And so what he's doing is he's saying, okay, well, compiled Autograd is this thing that lets
you go ahead and take a single backwards call and turn it into a Python graph that you can
then go ahead and compile with Dynamo.
Let me take that and embed it within a broader Torch compile call, which goes ahead and, you
know, runs the forward and then go straight into running the backward without doing a graph
break at all.
This is very interesting, there are a lot of technical challenges going on here, but
maybe we will talk about them some other time.
That’s everything I wanted to say about compiled Autograd today.
Talk to you next time.
EP76 Tensor-subclasses-and-PT2
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about tensor subclass support in PyTorch 2.
Before we talk about tensor subclass support in PyTorch 2 specifically,
let’s do a brief refresher on what exactly tensor subclasses are.
Tensor subclasses are a way to extend the functionality of the built-in tensor type in PyTorch
entirely in Python without having to, for example, write a complicated C++ extension.
There are a number of ways you can write tensor subclasses,
but the one we are particularly interested in today is the Torch Dispatch mechanism,
which allows you to write a magic method called Torch Dispatch
that handles all calls to the low-level ATen operators that are processed on a tensor
after all of PyTorch’s other subsystems have happened.
So notably, Torch Dispatch happens under autograd.
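Here is a minimal sketch of the wrapper-subclass pattern, modeled loosely on the LoggingTensor used in PyTorch's own tests, just to show where Torch Dispatch sits: every ATen call that reaches the subclass, after autograd and friends have run, lands in this one method.

import torch
from torch.utils._pytree import tree_map

class LoggingTensor(torch.Tensor):
    # wraps an inner tensor and logs every ATen op it sees
    @staticmethod
    def __new__(cls, elem):
        return torch.Tensor._make_wrapper_subclass(cls, elem.shape, dtype=elem.dtype)

    def __init__(self, elem):
        self.elem = elem

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print("dispatch:", func)
        unwrap = lambda t: t.elem if isinstance(t, LoggingTensor) else t
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))
        wrap = lambda t: LoggingTensor(t) if isinstance(t, torch.Tensor) else t
        return tree_map(wrap, out)

x = LoggingTensor(torch.randn(3))
(x + 1).sum()   # prints the low-level ATen ops that actually ran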
So this is a pretty nice piece of capability,
and we’ve used it to build a lot of things.
If you’re just thinking about PyTorch 2 itself,
Torch Dispatch is actually the key mechanism which we use to implement proxy tensor tracing.
But when we look more at user space,
people innovating with different and interesting new tensor data types,
we actually have a number of different examples using Torch Dispatch.
For example, D-Tensor is a tensor that has a built-in knowledge
about how its data is distributed over multiple nodes.
That’s something implemented with Torch Dispatch.
We're prototyping Float 8 support in PyTorch using a tensor subclass because, unlike other dtypes,
Float 8 is very strange.
You need a scaling factor.
There’s a lot of uncertainty about how exactly to put everything together.
So we’re prototyping Float 8 tensor inside Python.
There’s also nested jagged tensor, which is a non-uniform layout where you have a variable sequence dimension,
which we’ve packed all together instead of padding them all out to get you a regular dense tensor.
And there's also a fun example in the subclass zoo of re-implementing complex tensors,
which we actually do have a traditional C++ implementation of, using a tensor subclass.
It's actually a little different, though, because the tensor subclass version of complex tensor
keeps the real and the imaginary numbers in separate tensors instead of interleaving them together,
which is what the C++ version does.
The separate representation is actually better for a number of use cases,
most notably matrix multiply, where there is no built-in complex matrix multiply inside your NVIDIA hardware.
But if you actually do them separately, you can use the regular special matrix multiply instructions that are available.
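A quick sketch of why the split layout helps for matmul: the complex product decomposes into four real matrix multiplies, each of which can use the hardware's native matmul path.

import torch

# (A + iB) @ (C + iD) = (A@C - B@D) + i(A@D + B@C)
def complex_mm(a_re, a_im, b_re, b_im):
    out_re = a_re @ b_re - a_im @ b_im
    out_im = a_re @ b_im + a_im @ b_re
    return out_re, out_im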
So subclasses, basically, they’re not really for typical end users,
but if you’re a power user or you’re a developer on the PyTorch team,
it's a really nice way to go ahead and develop a new feature in Python without having to muck around with C++.
So how do subclasses work with PyTorch 2?
Well, there are a number of things that we did to make this work,
but let me first describe the overall implementation strategy.
So the most important thing about how we’ve implemented subclasses in PyTorch 2
is fundamentally it is a desugaring, right?
A tensor subclass is interposing at a very interesting point in the stack below Autograd,
but before, you know, the actual tensor compute.
But at the end of the day, a tensor subclass is just calling a bunch of other Torch operations.
So intuitively, it makes sense that if I have a program that’s using tensor subclasses,
in some sense, I could have manually written a big pile of Torch operations under the hood that does an equivalent thing, right?
Instead of doing detensor, I could have manually written out, you know,
the forwards and backwards of my program with manual collective calls put in.
This would be very, very annoying and very, very tedious to do by hand.
And so that's the value you get out of doing it with the subclass.
But if we have a compiler, we can just ask the compiler to do it for us,
desugar it into these basic parts.
And then hooray, you know, now we've got a simple representation of good old fashioned ATen operators
that we can just pass to Inductor to compile without any special support.
So in fact, tensor subclass support in PyTorch 2 was implemented primarily in AOT Autograd.
And this is all thanks to the work of Brian Hirsh, who is our, you know, AOT Autograd maintainer right now.
So let’s talk a little bit more about how exactly this works, right?
So one of the things that we need to be able to do is if we have a tensor subclass input to our program,
we need some way to flatten it into its constituent, you know, regular tensors,
which, you know, then you can do your regular operations on.
So when you write a tensor subclass that you want PyTorch 2 to support,
you need to write a tensor flatten magic method, __tensor_flatten__,
which says how to go about doing this flattening process.
Similarly, we need a way to unflatten tensors, right?
If we are done computing, and there are some subclass tensors that are flowing out of our graph,
we need to be able to reassemble them into good old fashioned subclass tensors,
because that is what the surrounding eager code is going to expect to see.
And so this flattened, unflattened process typically produces some extra metadata,
which is whatever extra stuff that a tensor subclass needs to sort of know what it’s doing.
Like a D tensor is going to, you know, have metadata talking about, you know, if it’s sharded or replicated,
whereas simple, you know, subclasses like complex tensors don’t need any extra metadata, right?
They’re uniquely determined by their real and imaginary components.
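As a rough sketch of the protocol being described (the exact signatures have shifted a bit across releases, so treat this as illustrative rather than authoritative):

import torch

class MyComplexTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, re, im):
        return torch.Tensor._make_wrapper_subclass(cls, re.shape, dtype=re.dtype)

    def __init__(self, re, im):
        self.re, self.im = re, im

    def __tensor_flatten__(self):
        # names of the inner dense tensors, plus any extra metadata (none needed here)
        return ["re", "im"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        # reassemble the subclass from its constituent dense tensors
        return MyComplexTensor(inner_tensors["re"], inner_tensors["im"])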
One other thing is because tensor subclasses are implemented in AOT Autograd,
our plan is going to be, we’re going to trace through them, right?
We’re going to trace through the subclass.
The subclass internally is going to call a bunch of torch functions,
and those torch functions eventually get traced using proxy tensors.
So this means that your tensor subclass has to be traceable in some sense.
It has to be okay to like just run it with, you know, fake tensors,
look at what operations happen,
and then have those be exactly the operations that show up in the very end.
So, you know, no dynamic control flow or anything like that.
This is different from torch function,
which is the other way you could have implemented tensor subclasses,
where torch function is a much more superficial thing
that just lets you modify, you know,
what happens right after you call a, you know,
torch function on some sort of subclass.
That can be handled directly in Dynamo.
Torch Dispatch cannot be conveniently handled in Dynamo.
So it is handled in AOT Autograd, you know,
after we’ve done all of the Autograd-y stuff,
which is also what AOT Autograd is doing.
So that’s basically all you really need to know about, you know,
tensor subclasses in PyTorch 2, right?
Like it is a desugaring process, AOT Autograd does it.
It is kind of a complicated implementation,
and there are still lots of bugs that, you know,
we’ve been working out, especially on the D tensor front.
But hey, it works.
And, you know, it is a really useful thing to be able to do for tensor subclasses,
especially because torch dispatch tensor subclasses are not very fast.
In fact, the overhead is pretty bad if you actually care about eager overhead.
So like the glory of PyTorch 2 is you can write your tensor subclasses and then it’ll run actually
really fast because you have the full weight of compilation behind you.
All right.
What are some other things that I want to tell you about?
Well, one thing I want to say is there are a few extra details about tensor subclasses that,
you know, are good to know, because they are something that you have to deal with.
One thing is that, you know, if you have a simple tensor subclass like complex tensors,
they’re pretty easy to deal with because they don’t define any extra operations, right?
What you can do with a complex tensor is just all the good old fashioned things that you can do
with regular tensors.
But if you have a DTensor, there are some other operations that you might need to be able to do,
right?
Like, you know, moving things to replicated, stuff like that.
The preferred way to go about doing this sort of thing is to write custom operators that represent
these things that you want to do.
But sometimes you might need to take in some special data types.
For example, DTensor needs to take in a device mesh on some of its operations.
So right now, we don’t really have a good way of adding extra support for like sort of primitive
types that are allowed in your IR.
So DTensor has a bunch of hacks in Dynamo to get around this.
But this is something that we do want to improve, especially because, and this was the, you know,
subject of the composability sync this week, we do want to support, you know, exporting
pre-dispatch IR.
Pre-dispatch IR necessarily contains subclasses in it because, you know, pre-dispatch IR is before
autograd.
And if we haven’t processed autograd, we definitely can’t process subclasses, which are after autograd.
So, you know, you need to make sure it’s still pretty normalized.
You have regular operators for all the stuff you want to do on the subclass directly because
in pre-dispatch export, you do need to export it.
Another thing that’s kind of complicated with subclasses is views.
Now, what do I mean by views?
Well, let’s say that I have a dense tensor and then I want to construct a tensor subclass
from that dense tensor.
For example, I have a real and an imaginary tensor and I want to put them together into a complex
tensor.
Or let’s say that I have a tensor that represents packed data and then I want to wrap it into
a nested jagged tensor.
So ideally, this wrapping operation is a view, right?
Like I share the storage between the original and what’s after.
But now that actually, you know, makes things kind of difficult, right?
Because now if we want to do autograd, we need to know how to see through these views
because, you know, I do want gradients to flow between this transition.
If I do mutation, I need to be able to functionalize in this case.
And, you know, I need to be able to reconstruct these views when I am fake-ifying these tensors
inside PyTorch 2.
So Joel Schlosser has been working on basically reconstructing views on subclasses, primarily
motivated by the nested tensor case.
But this infrastructure should be useful in a lot of other situations as well.
One other thing that like has shown up in this case that’s kind of difficult to deal with
is dynamic shapes.
What's difficult about dynamic shapes with tensor subclasses is that with a regular
tensor, we can just look at its size, and those are all the sizes that potentially can be
dynamic.
With a tensor subclass, there may be inner tensors which are dynamic.
This actually shows up with nested tensors, for example.
So for a nested jagged tensor, you’ve got some size at the top level, which sort of represents,
you know, how many batch elements you have, you know, what the embedding size is, and it
says, well, there’s some unspecified, you know, jagged dimension.
And you can’t really express what the jagged dimension is with just an integer.
But on the inside, inside the actual values tensor that contains the packed sequences, that
does have a length, right?
In fact, typically, that’s going to be some sort of dynamic size, which will vary.
And so that dynamic size is not determined by the shape of the outside tensor.
So, you know, this is the sort of thing that you’ve got to deal with.
There's also some funny interactions with views, because a view can have an integer in
it, and you need to make it dynamic, oh, you know, lots of stuff going on.
So that’s Joel’s neck of the woods.
One final thing that is pretty interesting is tensor subclasses are more likely to run into
one of our old sort of limitations in AOT Autograd.
And this limitation is that when we do AOT Autograd, we are actually compiling the backward ahead
of time before we even know what exactly the user is going to pass us for backwards.
And this means we have to make assumptions, right?
We don’t know what the tensor the user is going to give us in backwards is going to be.
So we typically assume that it is contiguous.
And if you give me a non-contiguous tensor, I just slap a contiguous call on it to make it
the same.
Well, with tensor subclasses, the situation can be even more complicated.
Like, let’s say that I have a D tensor.
And so, you know, when I produce an output D tensor, I have it at some replication.
But, you know, when my D tensor comes back in gradients, in general, I don’t necessarily
want the same replication pattern in backwards as in forwards.
So, you know, there can be a mismatch in this case.
So Brian was working on this bug.
And what he did was there are some more magic methods for basically testing if, you know,
the metadata agrees and coercing to, you know, a standard metadata if they don’t.
So, you know, that’s something you might need in some situations.
A better solution that would solve this once and for all is if we actually lazily compile
backwards, waiting until we actually know what exactly the input tensor is before actually
committing to some particular thing.
This would also improve performance with regular dense tensors because we would no longer have
to call a contiguous call before you get some output.
Okay.
So that’s tensor subclasses in PyTorch 2.
There are no docs about how to do this yet.
So, like, you know, if you really want docs soon, you should go bug Brian about it.
But, you know, I’m pretty excited about tensor subclasses because they are really driving a
lot of the, like, really key new features that are going on in PyTorch these days.
So we don’t write C++ subclasses these days.
Most of the new development is going on in Python subclasses.
Okay.
That’s everything I wanted to talk about today.
See you next time.
EP77 AOTInductor
Hello, everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about AOT Inductor.
I thought a bit about how to structure this podcast. Obviously, I can talk about what AOT
Inductor is and how it’s implemented. But actually, I think it’s important to actually split this
podcast into two parts. And the first is to sort of talk about what are the design goals of AOT
Inductor? Because actually, there are a lot of things AOT Inductor, short for ahead-of-time
Inductor, could mean. And it gets a little confusing. And I often see this on GitHub issues and forum posts
about AOT Inductor. People are like, oh, does it do X? And I’m like, well, no, it doesn’t really do
that, because that’s not its design goal. But at the same time, and this is part two, is there is,
you know, some common technical themes behind it. And so while AOT Inductor is specifically targeted
at one particular use case, there are a lot of useful pieces that make it up and could be used
in other contexts, where actually, you know, they do do the things that you might want them to do.
So it’s important to, you know, understand what AOT Inductor is, as it is. So you know,
what do you expect to get if you are looking to use it? And it’s also important to know, hey,
there’s a bunch of stuff here, it doesn’t have to be used in this particular way, it can be used in
other ways. And, you know, I feel a little bad these days, because I saw someone comment one day,
I don’t know if on Twitter or Reddit being like, hey, you know, the PyTorch guys,
they’re all working on internal stuff. And I was like, well, that’s kind of true. Like, you know,
there’s a lot of stuff that we’re working on. And, you know, it can be plausibly used by people in the
community. But, you know, a lot of it is being driven by well, you know, there’s this thing that
we want to use. And so we’re working very hard on it. And so I apologize, because, you know, PyTorch
would not be where it is without all our open source users. And, you know, I do feel very much
embarrassed when you know, we’re not necessarily doing the things that you guys exactly want us to
do. I think on the flip side, though, I do think like, fundamentally, we are working on a lot of
core infrastructure that can be used in a lot of situations. And you know, one of the reasons why
I do this podcast is to help, you know, the engagement of the open source community that
wants to help us work on things in PyTorch. Because if you do decide, hey, you know, this thing makes
sense. And you know, I say something in my podcast, and I’m like, yeah, you know, conceptually,
this makes sense. We’re just sort of not working on it right now. That’s an opportunity. That’s a
situation when you actually can actually come and, you know, contribute something to the project and,
you know, get your use case going in a way because, you know, we are all about open source. We’re all
about building things that can work in a lot of different situations. So let’s talk about AOT
Inductor. So what are the design goals of AOT Inductor? So the main primary design goal of AOT
Inductor is we want to produce some sort of export format for a PyTorch inference
program that can be represented as a self-contained distributable executable, specifically a dynamic
library, which you can load up in some other situation. And this dynamic library has no
dependency on the PyTorch runtime. Now, does this sound like a kind of strange thing to want? Well,
maybe. So, you know, like with the advent of frameworks like GGML, right, people very much like
having a bunch of source code, which, you know, represents the model in question. And then you
can go ahead and hack around it, embed it in whatever situation you want. So the reason for
the binary distribution format is actually because we have this, we have this different production
requirement, which is that we want to deploy models. And we want these models to be deployed
in some format which doesn't require us to rebuild them when the service that's
actually, you know, deploying, serving the model changes over time. And so we need to be able to
allow some sort of model/runtime version skew where, you know, we can be upgrading the runtime and old models
still work. And at the same time, we need a thing that, you know, is fast and we don’t have to actually
have some sort of, you know, recompilation process just when you want to load the model. We want to be
ready to go as is when you want to start, because, you know, a lot of these services, you know,
when you start them up, we don’t want to wait a long time to warm up in this situation.
So a self-contained dynamic library is perfect for this sort of situation, as long as you have some
sort of stable ABI on it, right? Because what you do then is you say, okay, I have a minimal stable ABI
that this this library depends on, I’m going to keep that stable across releases on my runtime system
that is actually loading this. And now I can update my runtime. And these, these binaries,
you know, keep only depend on the stable ABI, which keeps working. So I can keep using these binaries
without having to regenerate them. And, you know, assuming that I have some sort of freshness
requirement, I’ll probably eventually regenerate the model at some later time, but I’m not forced to
regenerate the model whenever I want to upgrade my runtime. So and you know, you wouldn’t get this
with a, you know, text format, because the binary format here already has all the Triton kernels
compiled into, you know, not CUDA source code, but your actual, you know, machine
code that can run on your GPUs. So you know, it's very fast, you load it up and it's ready to go, as is.
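Very roughly, the workflow described here looked something like the following at the time of this episode; these entry points were private and have since been reorganized, so treat the exact names as assumptions rather than a stable API.

import torch

class M(torch.nn.Module):
    def forward(self, x):
        return (x @ x.t()).relu()

example = (torch.randn(8, 8),)

# produce a self-contained shared library for the exported model
# (private API at the time of this episode; names may differ in your release)
so_path = torch._export.aot_compile(M(), example)

# later, possibly in a different process, load it back and run it
runner = torch._export.aot_load(so_path, device="cpu")
print(runner(*example))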
And also, when you have some sort of text format, like think TorchScript style, that’s a lot bigger
backwards compatibility surface, because you have to worry about, you know, oh, no, like, what if I change the
serialization format? What if I, you know, add remove operators, you know, you don’t really have
to worry about that. If you have binary, you just have to worry about the ABI it works with. And that’s
it. I did say operators; operators actually are kind of part of the BC surface, but we'll talk a little
bit more about that in a moment. So that's the, like, motivating concern, right, is that we want to be able to
deploy these optimized models. And we have this skew problem between the runtime and the actual model
itself. So if we talk about performance on AOT inductor, we do kind of care about performance,
but sort of in the generalized sense, in the sense that, you know, Torch compile cares about performance,
right? Most of the optimization juice we're getting in AOT Inductor is just from the regular Triton,
you know, code generation and optimization that we do, even if you're running PyTorch 2 in eager mode.
So for CUDA models, it's all about the portability
and the ahead-of-time-ness, as opposed to, you know, having some extra optimizations on top. That being
said, we also care about the CPU performance of these AOT Inductor blobs. And while codegen does also
matter in this case, we also care a bit about overhead reduction in this regime, because, you know,
CPU models are way more likely to be overhead bound. And you, you know, really don't want to be
spending a lot of time doing useless reference counting, and that sort of thing. CPU models tend to be
very small in our regime. So, you know, the overhead really shows up in this case. So what
other, you know, reasonable things might you want to do that AOT Inductor doesn't do? So one
reasonable thing you might want to do is you might want to do training, right? Like training is also a
situation where, you know, being able to compile something ahead of time, and then use it reliably
automatically on all your nodes, without having to recompile every time, you know, that’s a really
good use case. And AOT inductor is not there yet. In principle, it could be used for training. But
it’s just, it’s a bit more difficult to get this set up, because there’s a lot more stuff you need
to, you know, actually get training going, right? You need your data loading, you know, you need to
actually have the loss function. You know, if you’re doing distributed training, you need some sort of,
you know, actual distributed framework. So, you know, actually having an end-to-end
training artifact which you just run, and, you know, it does the training ahead of
time, that is, like, eventually where we want to get. But you know, AOT Inductor is not there yet. So
it's definitely way more focused on the inference use case. Similarly, AOT Inductor is all about the
export workflow, right? Like you only can do this on models that you can fully trace through 100% full
graph, and then actually export it into some graph that gets used in that situation. In principle,
the binary products produced by AOT inductor could be integrated into eager mode. And that would give
you a way of, you know, compiling something ahead of time, but then just going ahead and calling it
from Python. So you just don't have to, like, warm up PT2 beforehand. And you know, like, yes, this is a
reasonable thing to want to do. And hopefully, I think actually, this year, we do have some plans to
actually spend some effort on making this better. But once again, this is not sort of the like original
use case for AOT Inductor. AOT Inductor has this thing called C++ wrapper
codegen in Inductor, which basically says, hey, you know, when we generate
Inductor code, we generate a bunch of Triton kernels, but we also need a bunch of glue code that just goes
ahead and calls the Triton kernels, step by step. And normally, we generate Python code,
because it’s very easy to hack. But you know, CPP wrapper code gen says, okay, we’re actually going
to generate C++ code that, you know, calls into these one by one. So you know, this reduces
some overhead. But importantly, for AOT Inductor, this is needed, because we want to, like, put this all into
an executable with no Python dependency. So this thing, this CPP wrapper codegen, you know, this could
be very much useful even in an integration with Python; you know, you can actually ask for it
in PyTorch 2 eager. And mostly what it does is it reduces your overhead if you're not using CUDA graphs,
and it makes your compile time take a lot longer, because you know, compiling C++ code is a lot,
you know, slower than just interpreting Python bytecode. One final thing that AOT inductor is not is it’s not a
TorchScript style interpreted front end, right? TorchScript was our first attempt at bringing compilers to
pytorch. And the way to think about TorchScript is you have basically a Python file that is very
regular, it has only limited control flow data structures, calls to pytorch operators, and then
that’s your export product. And then you can load it into some other runtime, which needs basically a full
pytorch implementation, so that it actually can go ahead and run these operators. So that’s not what AOT
inductor is, it is very different from what AOT inductor is. And so, you know, those use cases,
that’s not really what AOT inductor is here for. Okay, so we talked a little bit about the AOT
inductor design goals. So what is actually going into the inside of AOT inductor to actually make it work?
And so like the main thing, right, about AOT inductor is it’s all about this ABI boundary,
right? We don’t want to depend on LibTorch directly, because LibTorch is this big and complicated C++
library that doesn’t really have any ABI compatibility guarantees. Instead, we want to
shrink the surface area for what we actually depend on to, you know, a small C-ABI-only set of
operations that give basic functionality that we need that normally you would want to defer to the
runtime. So examples of operations like this include allocating a tensor, or, you know, freeing a tensor
when it’s no longer needed, that sort of thing. Another thing that’s very important that we need
to support in the ABI is what I call fallback kernel delegation. So what I mean by this is, you know,
a lot of the code that we generate in AOT Inductor, in Inductor, really, involves just
generating some Triton code, which, you know, actually ends up being some stuff that
you can directly run on your GPU. And so this code can actually be distributed directly as
part of the dynamic library we're generating from AOT Inductor, and you don't need to do anything else.
Like that’s it, it’s self contained. But of course, there are a lot of complicated operations like
convolutions and matrix multiplies that we don't actually have codegen capabilities for. Well, we do have
some codegen capabilities for matrix multiply, but you know, sometimes we just use BLAS because that's the
best option. So in those situations, we don’t actually package these directly into the dynamic
library itself, because that would be very wasteful and increase the dependency surface in some cases.
Instead, we just have an ABI-compatible call, which says, hey, runtime, please go ahead and run this
operation for me. And the runtime does it and then returns control back to the dynamic library, which goes
its merry way with the result in question. And this applies also to the long tail of PyTorch
operators that we don't have direct codegen capabilities for, right? There are a lot of these,
because you know, weird stuff like sorting, you know, that sort of thing. That’s not something you
really can do in classic Triton. It's a little difficult. And so our compiler doesn't know how to
actually generate these things. There’s actually a two tier separation. There are some operators that
are so important, they have dedicated ABI for them, like a dedicated function that we call into for
them. And then there’s like this big, very polymorphic function that’s just like everything
else. And essentially, all you do is you say, okay, well, I want to call this function with these
arguments, and the arguments are boxed up in some very regular format, so that you know, we can
basically handle any arbitrary argument type. And this is very much similar to, like, a TorchScript-style
thing, where, you know, you get all these boxed arguments, and then you just do a polymorphic dispatch
to, you know, that particular operator, using the same dispatch mechanism that TorchScript does in this
situation. Another thing that I want to mention about the implementation of AOT Inductor is that it
has some implications for how you write code when you want to actually run them on AOT Inductor. So
the most obvious implication is that your code does need to be exportable, right? You need a way to get
your entire model into a single graph, because that’s the graph that we’re going to actually compile
AOT Inductor into. That being said, you know, there’s a lot of attention these days on, you know,
writing custom Triton kernels for, you know, doing all sorts of fancy attention variants
and that sort of thing. And we can actually deal with Triton kernels. So if you write a Triton kernel
as part of your model, in a traditional export, this might be a little difficult to export,
because, you know, what exactly is this Triton kernel? If you want to, you know, send it to
a mobile device that is running, you know, some sort of Qualcomm, you know, hardware thingy,
right? Like a Triton kernel is going to be useless for this case. But because AOT Inductor is specifically
all about, you know, like producing an artifact that can run on CUDA, all we need to do is bundle up
this Triton code with the rest of the Triton code that Inductor is generating. And, you know,
this all can be saved directly into the model. And, you know, you don’t, you don’t have any
runtime component. And so we have support for user-defined Triton kernels without actually
wrapping them in a custom operator. And these can actually go straight into your AOT Inductor
kernel. And, you know, similarly, if you have a custom op, that's a little different: we can't
necessarily embed that directly; that's just going to go through the normal TorchScript-style boxed, you know,
call back into the runtime to actually do things. So you can add new ops, but, you know,
you can’t remove ops from your runtime, because you might have saved models that depend on them.
Okay, so I talked a little bit about what’s inside AOT Inductor, right? So AOT Inductor is all about,
you know, generating just straight-line C++ code against a fixed ABI that, you know, strings all the
Triton kernels together. But that's basically it. That's the core of what it does. And like, you know,
you can imagine using this technology in other situations. One of the things that AOT Inductor
expressly does not solve is what I like to call the caching problem. The caching problem is essentially,
hey, I’ve got some, you know, code in eager PyTorch 2. And I think I want to reuse it, I don’t want to
like keep compiling every time, I want to compile it once and then reuse it subsequent times. And so I
want to cache it on a separate run. The caching model for AOT Inductor is an export style caching model,
which says: you exported this thing, and the thing you exported is the source of truth.
That's it, right? So if the thing you exported doesn't do what you expect, then, you know, well,
what did you expect, right? You exported this thing. But when you’re in a, you know, sort of more fluid
environment, like you’re running PyTorch 2 eager, it’s very tempting to be like, well, okay, I want to
cache this thing. And I also want to like change my model code, or whatever. Or, you know, maybe I
actually had a bunch of graphs, I maybe recompiled, you know, the same region multiple times under
different parameters. And, you know, which thing do I want to use, and this is one of the like big
problems we face when we were trying to figure out how to improve the warm start times of Torch
Compile, right? Like warm start is very important, right? If you’re running a big training job, and you
know, you have to restart from a checkpoint, because one of your nodes crashed, right? You really don’t
want to be waiting 20 minutes for Torch Compile to recompile everything. And we just don’t, we don’t have
a good solution for this. We have some patchwork solutions for when we know how to do the cache
accurately, because, you know, we're inside Inductor, say, and we know all of the relevant
input arguments that we get here. But at the very topmost level, at the dynamo level, it’s very hard
to tell, does this code object actually apply the next time around, unless you force the user to make
some assumptions. So AOT Inductor says nothing about this, right? It just makes the simple assumption,
which is you exported this thing, this thing you exported is what you asked for, that’s what you’re
going to get the next time you run it. And you know, we’re working on this. So I hope to do some
more podcasts about this particular problem, because it is a big problem, and it is
one of the top-line things we're working on coming this year. Okay, that's everything I want to talk
about in AOT Inductor today. See you all next time.
EP78 Min-cut-partitioner
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about the min-cut
partitioner. Actually, I'm not the best person to talk about the min-cut partitioner. I should
get Horace on the podcast at some point to talk about it, but I do want to mention something very
specific that Horace told me at the core offsite, which I thought was really interesting. It's
that when we talk about the min-cut partitioner, naively you think of the min-cut partitioner
splitting the graph into a forward graph and a backward graph, but actually that's inaccurate.
What actually happens is that the forward graph always is the forward graph, but for the backward graph,
what you're doing is carving the backward graph out of the joint graph, which means
that potentially you can be putting some bits of the forward graph into the backward graph,
and this is the sense in which the min-cut partitioner is also capable of doing rematerialization.
All right, so that's what I want to talk about in this podcast, but to get there I think I need to
first explain what the min-cut partitioner is and why it exists, and then I can say this again. That's
all I want to talk about today. So what exactly is the min-cut partitioner? So the min-cut
partitioner is an essential component of AOT Autograd and what it is is essentially the way that we compute
what the forward and the backward of a function should be before we wrap it up into a custom
autograd function. So some backstory here. So remember that in PyTorch 2 we’re all about graph
breaks. We’re all about being able to compile parts of your program while having other parts of your
program run in conventional PyTorch eager mode. And one of the things we need to do when we do this is
we need to be able to have the compiled pieces of your program interoperate with the rest of PyTorch’s
regular eager autograd system. So what this means is that typically when I compile something I need to
also manually specify and compile the backwards of it and wrap it all up in a custom autograd function
because that’s the normal way that I introduce new differentiable primitives when I’m working in
PyTorch eager mode. So AOT Autograd is the component that’s responsible for doing all of this
true to its name. The AOT in AOT Autograd is all about, you know, doing autograd ahead of time. By the
way I do have a podcast about AOT Autograd which is still pretty accurate so you might check that out
if you want more details. But let’s think about how exactly we would go about actually doing this
autograd ahead of time right. So what we’ll have is we have some forward graph which is precisely the
region of code that we’re compiling and we want to differentiate it somehow. So how are we going to go
about differentiating this? So we could imagine first you know taking out the forward graph and then somehow
starting up another trace when we do backwards and then tracing out what the backwards is in the situation
and that gives me the second backwards graph which is what I actually want to go ahead and put into my
program. But we don't do this in AOT Autograd. We do something a little different. What we do
is we trace something called the joint graph. So the joint graph is a single graph that has all of this
all together in one go. It has the forward computation, and then it has the backward computation, which produces the
backward gradients. And one very interesting thing about doing it this way is that the joint graph
not only has the inputs that you have available when you are doing forwards, it also has all the tangents
which are flowing in from the Autograd engine from the backwards. Because remember
unlike when you're differentiating an entire model, where you're returning a single scalar loss and so
you can just assume that the gradient on that loss is one, we are in general going to be outputting a
large number of tensors, and we need the autograd engine to tell us, you know, what the gradients on
those output tensors are, which will tell us what the gradients on the inputs should be via our computation
in this case.
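Just to sketch the shape of this, here is a conceptual joint function written in ordinary PyTorch. This is not AOT Autograd's actual tracing code, only the signature of the graph it traces: forward inputs plus tangents in, forward outputs plus input gradients out.

import torch

def f(x, w):
    return torch.tanh(x @ w)

def joint(primals, tangents):
    # Takes forward inputs (primals) and incoming gradients (tangents),
    # returns forward outputs together with the gradients on the inputs.
    x, w = primals
    out = f(x, w)
    grad_x, grad_w = torch.autograd.grad(out, (x, w), grad_outputs=tangents)
    return out, (grad_x, grad_w)

x = torch.randn(4, 3, requires_grad=True)
w = torch.randn(3, 2, requires_grad=True)
tangent = torch.randn(4, 2)   # supplied by the autograd engine at runtime
out, grads = joint((x, w), tangent)

So we've got this weird graph, right, this joint graph, which has both the forward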
and the backward inputs and this is not really that useful right because the way I normally want to run
my PyTorch program is I want to go ahead and actually run the forward compute first and only later
when I have my backward tangents available do I want to run the backwards. Of course
I could wait until the very very end you know all the way to when the backwards is run to actually go
ahead and run the forwards but this is actually you know not going to be very useful because I do need to
actually have the forward output so I can run the rest of my forward computation, and then if I go and run it all
again, you know, in the backwards, well, I'm doubling the work in question. Actually, this is not so strange,
right? There's a technique that people use which is called
activation checkpointing, and this thing where I just decide, hey, I'm just going to, you know, recompute
the entirety of my forwards pass when the backwards pass comes along is akin to just you know slapping
an activation checkpoint on your entire model saying hey I don’t want to save anything for backwards I just
want to recompute it all from scratch. If you slap one of these on the entirety of your model this doesn’t
actually help with your peak memory usage because you know why did your peak memory usage go up? Well
think of the typical memory usage of a deep learning model as looking like a mountain right which is
that as you’re executing your forwards pass your memory usage is going up it’s going up because you
are saving activations for the backwards pass. Then once you run the backwards pass, the
memory mountain starts going down because as we do computations that used the saved activations from
the forwards and these are going in reverse order because you know that’s the backwards is run in reverse
order to the forwards I can release those saved activations as I go so now the memory usage goes down
so the peak of the memory usage is at the very end of the forwards pass right as we’re about to do the
backwards pass so if you do something like well I’m just not going to save any activations for backwards when
I run my forwards sure the initial time you run the forwards is not going to have very much memory usage but then when
the backward rolls around and you’re like okay well I need to recompute the forwards to get all of the
things I needed for backwards well your memory usage is going to go up because what are you doing well
you’re computing a big pile of saved activations from the wrong direction right you’re computing them
forward from the front to the end whereas the backwards pass wants to use them to end to the front
so you end up having that same mountain of memory usage all over again. So it's pretty pointless if you do this
over your entire model. Of course, it's not pointless if you take a subset of your model
and do it only there, and that's the idea behind activation checkpointing.
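For reference, this is what applying checkpointing to just a sub-block looks like in eager PyTorch, using torch.utils.checkpoint; a minimal sketch with a made-up toy block.

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(128, 128),
    torch.nn.ReLU(),
    torch.nn.Linear(128, 128),
)

x = torch.randn(32, 128, requires_grad=True)

# Activations inside `block` are not saved; they get recomputed in backward.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()

Anyway, so we have this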
joint graph, right? It's got the forwards inputs, it's got the tangents inputs, and it needs to produce the
forward outputs and the final grad inputs for us. And so the partitioning process basically says:
okay, given this graph, which, you know, if we just took it at face value and didn't do anything to it, would
represent an activation checkpointing strategy (which is probably the wrong thing to do if you've got the
entirety of your graph in this case), how do I strike a balance
between recomputation, you know, the amount of compute I do, and the amount of memory that I need
to save for things in backward? And this is exactly
the job of the partitioner, right? The partitioner is going to make a decision about what exactly we are
going to save for backwards, and it's going to try to minimize the memory usage of the
things we need when we do this, subject to some other heuristics. And that is going to basically reduce
the amount of memory we actually need when we compute our
network overall, because the less memory I'm saving for backwards, the lower that mountain is when I'm
climbing it. Okay, so let's go back to the thing that I wanted to talk about in this podcast from the very
beginning, right? So I used to think of the min-cut partitioner as: well, I've got this joint graph, I'm
going to split it in two, the first half is the forward graph and the second half is the backward
graph, and, you know, that's what I end up with. But this is not accurate, and the way you can realize that
this is not accurate is that if you think about it, there's really, you know, nothing you want to do
to the forward graph in actuality. Okay, sure, there may be some, you know, little bit of
compute that depends only on forward inputs, doesn't depend on tangents, and that is not
used by the forward compute, and you can decide whether or not this compute should happen in the
forwards or backwards. This is not too difficult to figure out, right? You have to do this compute
at some point either way, so, you know, you'll just put it either in the forward or
backwards depending on, you know, what the maximum usage is. But you don't really get to
make any more changes to the forwards graph, right? You can't not compute things that are needed for
the forward output, because you're obligated to produce all the forward outputs when you're all
done, and you can't compute anything in the backwards that depends on the tangent, because you don't
have the tangent when you're running forwards. So, well, you know, nothing you can do there. But let's
talk about the backwards for a moment, right? So the backwards doesn't have this constraint, right?
The backwards, as I said, could in principle decide that it is going to do the entirety of the forward
computation over again, or, you know, more hopefully it doesn't actually do that, but it does some subset
of it. But essentially, when I'm looking at the backwards, I can actually decide to reincorporate
pieces of the forward computation, and that is okay, there is nothing wrong with that. So as I said, you know, the
forward graph doesn't really have very much I can do to it, but the backward graph can use as much or as
little of the forward pass as I want, and when I put things from the forward pass into the backward pass and
say, hey, go ahead and recompute this, I am essentially reintroducing recomputation into my program, and that can
be useful, for example, when I'm doing activation checkpointing. Sometimes activation
checkpointing style things are free. One particular case where it is free is when you're able to fuse all of
the recompute into some computation that you're already going to do in backwards, and the
reason for this is typically we are memory bound, so extra compute is free. So as long as Inductor is able
to do the fusion, then, well, you're not going to pay anything, right? You did a little bit of compute,
but it doesn't matter because you were paying the cost to go ahead and read memory. In fact, you know,
recomputation can actually make your program faster, because if you're reading less memory, then,
you know, you are reading less memory, and if the compute is still free in that case, then it doesn't
matter that you did more compute; you reduced the memory, and that was the thing you actually needed to
reduce in this case. Okay, so that's a really interesting insight that I got from Horace.
As I said, I should actually do a proper podcast with Horace sometime about the min-cut partitioner.
Probably we'll call that podcast selective activation checkpointing, because
that's what Horace has been working on, and it's some really interesting stuff, and I'm really looking
forward to, you know, being able to share it all with you in the future. One more thing that I want
to mention: so I talked a lot about how we have this constraint, which is that I can't do backwards
compute ahead of time because I don't know what the tangents are ahead of time, right? I only know what
the tangents are when they actually get run in backwards. Actually, we have a long-standing problem
in AOT Autograd that stems from a very similar problem. The problem is this: when I trace in AOT Autograd,
I am tracing the forwards and the backwards ahead of time, aka I am doing it before I actually know
what my tangents in question are. And so the thing is that while I do know some things about the tangents,
for example I know they have to have the same sizes as the output of my graph, because that's how automatic
differentiation works, I don't know, for example, whether or not the tangents are contiguous.
And in fact, the way AOT Autograd works today is we just assume that the tangents are contiguous
when we create the fake tangents to go ahead and do our tracing with, and sometimes this is not true.
Sometimes I can get tangents which are not contiguous: maybe they are transposed, maybe they
are channels-last, and now I have a graph that is slightly suboptimal for this case. What we do when
this happens is we just call .contiguous() on the tangents before feeding them into our compiled graph,
but this is a place where we're leaving performance on the table, and we actually have some internal
models where we've noticed that this is actually a problem.
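Here is a tiny illustration of how a non-contiguous tangent can show up; the exact handling inside AOT Autograd is as described above, this just demonstrates the situation.

import torch

@torch.compile
def f(x):
    return x * 2

x = torch.randn(4, 8, requires_grad=True)
y = f(x)

tangent = torch.randn(8, 4).t()   # a transposed view: not contiguous
print(tangent.is_contiguous())    # False
y.backward(tangent)               # the traced backward assumed a contiguous tangent

Another case where this is a problem is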
when tensor subclasses are involved. So let's say that I have a program and it has a tensor subclass
output, and I'm going to, you know, run backwards, and I don't know exactly what the backwards input is
going to be. Is it going to be a tensor subclass? Is it going to be a normal tensor? Tensor subclasses can have
metadata. For example, if I am running with DTensor, aka distributed tensor, distributed tensors know what
sharding they have. Do I expect the gradients to be sharded exactly the same way that the forward
outputs were? In fact, not necessarily, because the communication patterns you need when computing
backwards are actually quite different. In fact, you can also, in the worst-case scenario, have a plain tensor
output but the incoming tangent is actually going to be a subclass tensor. So we've spent a long time
arguing about what exactly we should do in this situation, and we actually have a new plan, once again
thanks to Horace's insights about how the min-cut partitioner works. So the key insight is that when we run
the forwards, we can never ever depend on things from the tangents, right? But that also means
that if I have a subclass, or I have a different contiguity in the backwards, it is only the parts
of the graph which depend on those tangents that can actually change depending on this. So when I'm making
decisions about min-cut partitioning, I actually, you know, have a lot of play, because I never could
have moved the things that could change into the forwards pass; I only could have ever moved the
things from the forwards pass, which I know exactly what
they are because I know what all the forwards inputs are, into the backwards pass. So in this way, what I need
to do, instead of going ahead and pre-committing to, oh, this is contiguous, oh, this is a particular
subclass with particular metadata, is I can go ahead and do the trace. I want the
forwards graph to be fully elaborated, and I want the backwards graph to be, you know, essentially pre-dispatch:
I want to avoid making any commitments to contiguity or subclassness, so that later, when I actually get
the tangents, I can actually go ahead and retrace it and lower it to be actually what I see. So this is
our current plan of record. I think Brian Hirsh is planning to work on this, and, you know, we've
got a design doc that I've linked in the PyTorch 2 weekly update, which you can check out for more
information. Okay, that's everything I wanted to talk about min-cut partitioning today. See you next time.
EP79 CUDA-graph-trees
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about CudaGraph
Trees, our CudaGraph integration with PyTorch 2. Most of this was work done by Elias Ellison,
so, you know, kudos to him for actually building all of this. So first off, let's remind ourselves
what CudaGraphs are. I do have a podcast about them, so if you want to know more details
about CudaGraphs themselves you can go there. But CudaGraphs are essentially a way to remove
overhead from applications that are calling Cuda kernels by saying hey instead of running all of
the possibly very expensive host code that glues a bunch of Cuda kernels together we just smash it
all into a recording that just runs the Cuda kernels one after another exactly the same way
that they were run before. So in PyTorch eager we have an API for using CudaGraphs called
make_graphed_callables, and it basically does exactly what you would expect. It will go ahead and CudaGraph
record your function and you will get exactly what you asked for. And so maybe this is what you want,
maybe it isn’t. It’s actually kind of hard to use CudaGraphs in a lot of situations, right? You have
to make sure that there’s no CPU compute in your program, there’s nothing that varies from run to run,
there’s no unsafe calls to unsafe operators, those will just cause CudaGraph recording to fail
because CudaGraphs will say no, no, no, you can’t read out things from CPU. When you are passing in the
inputs to CudaGraphs, they all actually have to be static addresses because those are being burned into
your CudaGraph. So, you know, if you have an input, you have to make sure you copy that into a fixed
buffer. All this needs to be handled by hand.
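For a flavor of the manual version, here is a sketch of eager CUDA graph capture along the lines of the PyTorch docs (requires a GPU); note the fixed input buffer you have to copy into before every replay.

import torch

model = torch.nn.Linear(64, 64).cuda()
static_in = torch.randn(32, 64, device="cuda")   # fixed-address input buffer

# Warm up on a side stream before capture, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

# Replay: copy fresh data into the static buffer, then rerun the recording.
static_in.copy_(torch.randn(32, 64, device="cuda"))
g.replay()
print(static_out[0, :4])

So you can do it if you're very motivated, and people are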
often very motivated and will manually CudaGraph their code. But one of the things that we wanted to do
with PyTorch 2 was to make it easier for people to get this overhead reduction without having to go
through all this rigmarole. And of course, you know, PyTorch 2 actually does help a lot with overhead
reduction intrinsically because we’re in the business of, you know, taking your models, you know, factoring
out all the Python code. So we don’t actually have to run any of your Python code. We only have to run the
residual bytecode afterwards that does exactly the Python state updates we need. And by fusing kernels
together, we reduce fixed costs because, well, you know, the less kernels you’re running, the less
overhead you have to do in this case. But it’s still the case that for a lot of really overhead bound
models with very, very small compute and lots and lots of operations, it turns out CudaGraphs still
gives you a pretty sizable efficiency improvement, even when you’re using the PyTorch 2 compiler.
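In PyTorch 2 terms, this is the reduce-overhead mode of torch.compile, which is what turns on the CUDA graph integration discussed in the rest of this episode; a minimal sketch (requires a GPU):

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 64),
).cuda()

compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(32, 64, device="cuda")
for _ in range(3):        # early iterations record; later ones replay
    out = compiled(x)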
And this is something that we could reduce, we could reduce the overhead of PyTorch execution,
even in PyTorch 2, like there are things that we could do. But CudaGraphs is just there’s nothing
faster than zero, right? When you run with CudaGraphs, there is no host site overhead by construction,
because you’re going straight to running the Cuda kernels one by one by one. So CudaGraphs is cool.
And I want to turn back the clock back to the eve of the PyTorch 2 release. And we’re having a call.
I remember I was driving home from, you know, doing some maintenance on my Tesla. So I was
on the highway, and I phoned in to a conference call we were having, where basically
the question was: what are we going to do about CudaGraphs? And so the problem was CudaGraphs,
we could tell from our benchmarks, made things a lot faster when we were running PyTorch 2. But they
used too much memory. Why did CudaGraphs in PyTorch 2 use too much memory? So the problem was related
to graph breaks. Specifically, let’s imagine that you’ve got your model and there’s some graph breaks.
So, you know, you’ve got graph one, graph two, graph three. Obviously, we can’t CudaGraph the
entire thing because we have no idea what's going on between the two graph breaks. So instead, we
CudaGraph each graph separately. Well, you know, how exactly does
CudaGraphs work? So normally, the way CudaGraphs works is you end up with an isolated CudaGraph block,
which contains enough memory to store all the inputs. Because remember, it's all static
addresses, right? So the next time you call this CudaGraph, you have to give it tensors in exactly
the same location they were last time. So to make sure you actually have those addresses available,
you have to actually keep that memory around. So for every CudaGraph, you have a big Cuda memory
allocation, which has enough space for all the inputs, modulo parameters, because parameters,
you can just assume have some static location. And they, you know, don’t change. So everyone can
reference those static addresses, it has all of your input space, and also enough space to do all the
intermediate compute you might do because obviously, in the middle of your graph, you may do allocations.
And those allocations also are going to have hard coded addresses, and you need to have them in your
CudaGraph. So when you have three CudaGraphs, what you end up having is 3x the amount of memory you need,
because each CudaGraph has its own pool of memory that’s disjoint from the other ones, being like,
hey, this is the memory that I need to actually do my compute, because I’ve burned in all these static
addresses. And so I need to reserve it for myself when I do it. And this is very memory expensive,
because when I ordinarily run my program in eager mode, I don’t have this hoarding behavior, right?
When I’m doing stuff with the Cuda caching allocator, I ask for some memory, I use it when I’m done,
I return it to the Cuda caching allocator, and it’s allowed to send that memory off to someone else so
that they can use it for something else. But these CudaGraphs can’t actually do that, they have to hold
on to the memory, because the next time you call it, they need to make sure that memory is actually
available for them to actually do things. Okay, so CudaGraphs was, you know, using up too much
memory, and we were like, oh my God, you know, what are we going to do about this? Like, how are we going
to launch PT2 with a version of CudaGraphs that takes up this much memory? And, you know, we were
thinking of ideas for how to do this. One of the ideas that we had was, hey, you know, when you,
you know, do normal eager mode, we’re willing to reuse memory allocations between CudaGraphs.
So there’s nothing stopping you from reusing the memory allocations between separate CudaGraphs,
right? So remember, all the CudaGraph is doing is saying, hey, you know, there's a static
address, and the memory at that static address needs to be available when I use it. So if, for
example, the CudaGraphs get called in exactly the same order, every single time, then what you can just
do is say, okay, well, this memory is no longer being used. You know, I needed it for the first
graph, but I’m no longer using it anymore at this point. Let me go ahead and use it for something else
when I’m running my second CudaGraph. And so I don’t need to actually do the sum of the intermediates
of all three graphs, I can do reuse. So my memory usage looks a lot more just like what the high watermark
memory usage used to be. But there’s a problem with this. And the problem with this is when you
have a graph break in PyTorch 2, you can’t actually guarantee that the same graph will be called next,
right? Because maybe the reason you did a graph break was because the user had a, you know,
dynamic conditional, which is going to shunt you between one graph or another graph. So if you do all
this memory reuse, and then suddenly, some other graph gets called, well, oh, you know,
maybe some memory that you were expecting to be available is no longer available, and you’re in
trouble. But there is a, you know, maybe obvious next step to do in this case, right? Which is,
what if when we, you know, diverge between the two Cuda graphs, we simply, you know,
imagine that, well, there’s two paths we can take. So at the time I do memory allocation,
and I’m done with the first graph, you know, the memory allocator is in some state. And then depending
if I go to graph two, then I will, you know, do some things based on graph two. But if I go to graph
three, instead, I'll do some other things. And sort of imagine, like, you know, in one of those,
like, time travel movies, where you make a decision, and depending on decision, you know, the future
branches off into two possible different futures, we just want to do the same thing for Cuda graphs.
And so this leads to this concept of Cuda graph trees. And this is what we actually implemented
in PyTorch 2. And Cuda graph trees sort of completely solve the problem of memory reuse
in Cuda graphs, because we simply say, well, it’s a choose your own adventure. The memory usage you’re
going to end up using for the Cuda graph recording is the maximum of the memory usage for all the
possible branches you would take. But because we are only allowed to evolve the Cuda graph in the sort
of paths on the tree, every path on the tree is going to have a consistent allocation deallocation
pattern. And so as long as I go down that same path, I can just simply reuse exactly the same memory
addresses as before. And if I take a different path, well, that path is on its own execution. And I’m
guaranteed that, you know, once I've made a choice, I can't, you know, change my mind and go down another
path of the tree. So each of these paths are self contained. And then eventually, I get to the end of my
training loop iteration, I go back to the beginning, and ostensibly, usually, you know, when you're done
with a single training step, all your memory is done, and so everything can be assumed
to be cleared, and you can start reusing things again. So this is the basic concept of Cuda graph
trees. So the main idea is we want to reuse memory across graphs. By reusing memory
across graphs, we get rid of the big memory usage used by, you know, Cuda graphs, and sort of the
tech you have to build to actually do this is some sort of, you know, ability to checkpoint the state of
the memory allocator. So that if you’re like, hey, you know, I’m running Cuda graphs, and I want to,
you know, record if it goes this way. And I also want to record if it goes some other way, I need a way to
reset the state of my allocator to what it was at that point in time, so that I can go ahead and then do
a bunch of other allocations and deallocations based on what I see in the next case.
Okay, so that’s the basic implementation idea behind Cuda graph trees. There are some operational
implications to how, you know, we’ve implemented Cuda graph trees. So when I, one of the discussions
that, you know, Lezcano opened up on GitHub is: hey, maybe we should turn on Cuda graphs by default,
that is, mode="reduce-overhead", when you're running torch.compile. And, you know, maybe this is a good
idea. We're a bit nervous about it. And so the reason we're nervous about it is that although
Cuda graph trees, you know, are pretty good at what they set out to be, which is
a way of using Cuda graphs where we can basically let you say, okay, just try reduce-overhead, and
PyTorch 2 is going to take care of, you know, dealing with all the safety conditions you need,
right? So we don’t have a problem. If you are, you know, doing CPU compute or unsafe operations,
because, hey, we’re just, you know, we’re PyTorch 2. So we’re actually getting a graph,
and then we can go look at it and say, are you doing any compute on CPU? Are you like calling non zero,
and then we can just disable Cuda graph if those things have happened. And because we’re PyTorch 2,
we also keep track of all the inputs. And so we know, oh, these inputs are parameters,
so we can statically bake them in. These inputs are just regular eager inputs. So we’re going to
allocate dedicated buffers for them in the Cuda graph memory pool, and copy them in. And we do all this
for you, because we have a pretty deep understanding of what is going on in your code, because hey, you
know, having graphs is great. And furthermore, because, you know, Cuda graph trees have this,
you know, sort of choose your own adventure style, you know, property to them, we can even do this in
the presence of graph breaks. So obviously, your code inside the graph breaks is going to run,
you know, slow, but all the stuff inside PyTorch 2 is actually going to run fast. But this safety,
this abstraction is not complete. So one of the like big things that you have to be aware of,
is that when eventually, when we’re doing Cuda graph trees, we want to sort of stop the tree,
and go back to the beginning of the tree, right? If we always keep, you know, appending more kernels
onto the tree. This is kind of pointless, because if you’re continuously recording new Cuda graphs,
you’re never getting the benefit of replaying the Cuda graph, right? You only get the benefit of
Cuda graphs, when you actually have a pre recorded Cuda graph, and you replay it again. So at some point,
we had to be like, okay, we’re done recording, we’re going to go back to the root of the tree.
And now we can follow a path. And hopefully that path has all Cuda graphs, we’ve already recorded,
so we can go zip zap, very fast. So when we restart the tree, when we go back to the root of the tree,
we now have the, you know, big constraint, which is that we actually need to have, you know, freed all
the memory associated with the Cuda graph memory pool, because we’re going to go stomp over it again,
in an unpredictable way, when we start using the memory again. And so I said, usually user code is
written so that this isn’t a problem. But you can get it wrong, right? If you like hold on to a tensor
that is an output of a Cuda graph tree, then, well, that tensor, if it stays live, you know, is going to
inhibit Cuda graph trees from actually, you know, being able to reuse that memory, because we don't
want to stomp over the data and then have you get a bunch of garbage in one of these tensors that's hanging
around. Another problem that, you know, is sort of non transparent with Cuda graph trees is what
happens when you have mutations on input tensors. So remember that I said, when you do Cuda graphs on
an input tensor that doesn’t have a static address, we go ahead and copy it into the Cuda graph. So once
you’ve copied it into the Cuda graph, that’s a separate, you know, allocation for the input in
in question. So if your program then goes ahead and tries
to mutate this, it will mutate CudaGraphs' internal representation of the memory in question, but it
won't mutate the actual, you know, original user input, which may have been allocated in eager mode.
So we don’t do an unsafe thing. In this case, we actually just, you know, cancel Cuda graph trees in
this situation. But, you know, if you’re just applying Cuda graph trees to some random code that
you haven’t actually looked at, it’s possible that, you know, it doesn’t actually work because
there are things that like look pretty benign. And, you know, we have like gotten past them with graph
breaks, but then they just inhibit Cuda graphs from working. So you kind of like if you’re like, oh,
I think my model actually should run with Cuda graphs, then you have to actually like look and see if
Cuda graphs is actually running when you turn it on with PyTorch 2, because the chances are,
we actually may have turned it off for any number of reasons, some of which are, you know, like just
fundamental framework limitations, but not limitations from you, the user, like it’s probably not too
difficult to adjust your code to handle this case. Finally, um, Cuda graph trees are not free, right?
They do change the cost model of your program. I already mentioned one of the things that changes,
right? When you have a Cuda graph tree, and you have a lot of, you know, branches,
ordinarily, you only use up the, you know, memory associated with a branch when you go
down that branch, but a Cuda graph tree is going to have a standing allocation, which represents the
maximum memory usage of all the branches you could possibly take. So you better not be, you know,
relying on the fact that, well, sometimes, uh, you know, my memory does go up, but you know,
it doesn’t happen all the time. And therefore, you know, uh, there, there’s something okay in this case,
right? Like you’re, you’re just always going to pay the worst case memory usage in this case.
Also, your memory usage is going to be worse than it would have been in eager mode, because when you
CUDA graph things, those CUDA graph allocations have to go in a separate memory pool from the
eager memory pool. So, you know, if you're running everything in eager,
the Cuda caching allocator may be able to like make better use of your memory by like serving
things from a shared pool. But when you separate the pools, your memory usage generally gets worse
because you know, like, uh, you’ve got two pools. So, you know, if something is free in one pool and
you need an allocation in the other pool, that doesn’t work. You, you have to just go ahead and
allocate in it. So there, there can be some memory inflation in this case. And, uh, finally, um, you know,
CUDA graphs, um, in the worst case scenario could make your model run slower. And that’s because for
all the inputs, which don’t have fixed memory addresses, we have to copy them into the CUDA graph
region. So this is a copy that you didn't have to do in normal eager mode. Now in a
situation where you don’t have lots of graph breaks, you have like one graph and your inputs are not too
big. This is usually not a big deal because the savings from CUDA graphs more than outweigh
the fixed one-time cost, but if you have a lot of graph breaks, then this can potentially be a
performance problem. So, um, maybe we can turn on CUDA graphs by default, but it’s, uh, you know,
uh, we’re probably going to work on it some more because definitely CUDA graphs is one of those
things where you need to like turn it on and then see if your model is doing what you expect to do
or not. Um, there are some ideas that we have for improvements to CUDA graphs.
Some of the limitations that I talked about, such as mutations to input tensors, in principle
can be fixed with just more engineering. Another tool that we have in our toolbox is re-recording
CUDA graphs. So we actually have already implemented this for dynamic shapes. The idea behind dynamic
shapes and CUDA graphs is that, um, you know, when I have dynamic shapes, normally this doesn’t work
with CUDA graphs because a CUDA graph has everything burned in, including the sizes of the tensors in
question. So, you know, you can't have a single CUDA graph that works for multiple dynamic sizes,
but what you can do in this case is for every dynamic size you see, you could re-record the CUDA graph,
um, you know, using the same dynamic kernels that inductor had generated, but just with a different,
you know, size in question. And this is profitable because it’s a lot cheaper to re-record a CUDA graph
whose cost is on the order of how long it takes to run the model, than it is to actually do the
entire PyTorch 2 recompilation again, which, you know, is pretty expensive, in part because
compile times are slow, but, you know, it's just going to be a lot more work. You're actually generating
kernels in that and stuff like that. So this is something that we can do, um, to like work around
problems that CUDA graphs have. And another case that Animesh and Laith have been looking
into is: hey, you know, we probably also want to do re-recording of CUDA graphs
if we have a CUDA graph that is referencing a lot of parameters, but actually we have a lot of
different parameters. And so a common situation this occurs is let’s say you have a bunch of transformer
blocks in your model and you’re only compiling the transformer block and you want to CUDA graph
the transformer block. So it would be nice if you could have a single compile product that works for
all of the transformer blocks in your program. But in this case, the parameters for these blocks
are different. And, um, if you know, naively CUDA graph it, then you would have to do a copy in on
the parameters, which is generally a terrible idea, unless it’s a diffusion model, because apparently,
according to, uh, Dima, who I was talking to about this, uh, diffusion models don’t have as much of a
problem with doing this copy in. So to deal with this problem, uh, what you can do is you just
re-record the CUDA graph for each individual block. So now the compilation cost is: you compile
once, with a generic version of your model that can work for arbitrary
parameters. And then for every particular transformer block, you re-record the CUDA graph with the new,
you know, static addresses for each parameter. And then, you know, once you've done,
however many, you know, dozens of transformer blocks, they can all be reused. And this doesn't blow up
memory usage because you're going to reuse the same memory for each of these recordings.
Okay. So I hope that told you a little bit about what to expect with CUDA graph trees. This is what
happens when you use mode="reduce-overhead" in torch.compile. That's everything I wanted to say today.
Talk to you next time.
EP80 Inductor—Post-grad-FX-passes
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about the post-grad FX passes in Inductor.
What do I mean by post-grad FX passes?
Well, let’s think about the entirety of the PyTorch 2 compilation stack, right?
So we’ve got our input graph that comes from Dynamo.
That graph has a bunch of torch operations in it.
It’s not normalized at all.
We feed it into AOT Autograd, and a bunch of stuff happens.
But what AOT Autograd is responsible for doing is functionalizing, desugaring, doing all
sorts of transformations so that we eventually end up with a pair of two graphs, one for forward,
one for backward, which are fully normalized.
They are ATen.
They are functional, except maybe for some mutations at the end, and those get sent to Inductor.
And so the Inductor is going to go ahead and lower these into Inductor-level IR.
But before it does that, it takes a whack at this graph, this normalized A10 graph, with
a few FX passes that operate at sort of more of a graph-level type of optimization.
And these are called the post-grad FX passes.
True to its name, there’s also a pre-grad FX pass.
There’s also a joint graph FX pass.
But we’re not going to talk about those.
We’re just talking about post-grad FX passes.
These passes are, in some sense, the easiest ones to write, because we give you the most
invariants in the IR, because you can assume that we’ve already gone ahead and done all
of the decompositions.
We can assume that the graph is functional.
A graph being functional is really helpful because it means a lot of transformations, like
moving nodes around, replacing aliases with non-aliases, duplicating uses of inputs.
All of this is safe because you don’t have to worry about maintaining some sort of aliasing
analysis to figure out, oh, am I allowed to move this, you know, read to some other spot?
What if someone is mutating it?
So you don’t have to worry about any of that stuff.
Except that we do, for efficiency reasons, permit the IR to do some mutations on inputs.
And the reason we allow the IR to do mutations on inputs is because it can be important to
make sure that we promptly do the mutations, you know, while we’re executing the graph, instead
of waiting until the very end, um, producing a pure, you know, output tensor and then doing
the mutation, which would increase the overall memory usage that you might need in the situation.
So that's the IR environment. And, you know, Brian has a nice document
he wrote recently this week: what mutations does AOT Autograd allow Inductor
to see?
Um, and so the general category of what things we are allowed to do, what things we’re allowed
to put in Inductor are essentially only input mutations.
And these mutations can only happen at the end of the graph.
And they look like a bunch of copy underscores into inputs.
So we never have mutations on intermediates.
We never have, uh, mutations on, uh, in randomly in the middle of the graph, they’re all at the
end.
And so you can sort of look for them when you’re writing a pass and make sure you sort of step
around them when you’re doing things on them.
And so there's also one other thing, which is that, with some of the recent
work going on with Dynamo capturing per-parameter FSDP, we are also allowing
a storage mutation, namely, you know, doing some sort of resize-to-zero on storages
inside of the graph.
This is mostly so that we can promptly cause a storage to get deallocated because we know
that the memory is definitely not going to be used, but there's a bunch of
live references to it from backwards, say, things saved for backwards.
And we will just refill the empty storages later when we're coming back in backwards.
But for now we want to deallocate it.
Okay.
That's a very special case, but post-grad FX passes largely operate on a functional
IR.
And, you know, actually Inductor itself is not functional.
Like, Inductor, when doing code generation, has the ability to do mutation.
It knows about control dependencies.
It is basically very mutation aware. But when you're in the
nice, happy FX graph universe, this is prior to us sort of going into the mutation world.
So you get to do, you know, sort of nice, easy optimizations.
And then once we're done with the post-grad FX passes, that's the point at which we re-inject
all the mutations, go to a lower-level IR, and now you've got to reason about mutation, because,
you know, at some point, even in a functional compiler, like one
for a language like Haskell, at some point you're going to actually start mutating
things, because that's what's going to need to happen, you know, at the hardware level.
So, you know, the whole point of the IR is, when you're at the higher level
of the FX nodes, that's when you can do sort of very, you know, loose reasoning; you can,
you know, move things around.
And then as things get successively more and more refined, it's harder to do these sorts
of higher-level abstract optimizations because, you know, you have more constraints. But at
the same time, it's now possible to express optimizations that weren't accessible before,
because we're moving to a lower level: things that, you know, previously you couldn't
really talk about now become expressible.
Okay.
So post-grad FX passes.
So what do I want to talk about in this, uh, podcast?
So one thing is I wanted to tell you that they exist because, um, it’s easy to, you know,
forget that you can go ahead and do an FX pass.
Most of the optimization work in inductor is not really an FX passes.
Like the way to think about the post-grad FX passes are there sort of very domain specific
optimizations for particular situations and the bulk of the smarts, you know, sort of the
bulk of, you know, when you’re doing just average optimization in PyTorch 2 is happening during
the lowering and the scheduling.
Like that’s actually the bread and butter.
This is very different from like a traditional optimizing compiler where like, you know, your
graph optimizations really are the, you know, name of the game.
Like you, you, you’re running very simple optimizations over and over again, you know, until you
quiesce and like by doing lots of small things over and over again, you eventually end up with
something that, you know, is very optimized.
So that’s not really what is going on in the PyTorch 2 compiler.
The graph passes are mostly like, oh, there’s a special thing and it’s very semantic.
It requires some high level understanding and that’s what we’re going to do.
So, okay.
So I want to tell you that post-grad FX passes do exist, even though it’s not the bread and
butter, because it is a really useful thing to be able to do in some situations.
And the other thing I want to talk about today is what exactly are some of the post-grad
FX passes.
So to prepare for this podcast, I just popped open the post-grad Python file in inductor and
just read through all the passes to get an idea for what’s going on.
So let’s see what’s in here.
So one of the first things we do is dead code elimination.
Actually, this one’s not on by default.
It’s controlled by a config flag, DCE in inductor config, and it’s off by default because apparently
there’s some problem with inference mode, mutations, I don’t know.
We do dead code elimination at a lot of points in the stack.
AOT Autograd does some dead code elimination after it’s done functionalization.
And the inductor scheduler also does dead code elimination.
So if something’s dead, something will get rid of it, especially the inductor scheduler.
That DCE pass has given me a lot of trouble in the past.
So it’s not that important for the post-grad FX pass to do the DCE.
Okay, what’s the next thing we do?
The next thing we do is a pass called reorder for locality.
So what’s the idea behind this pass?
The idea behind this pass is let’s say I’ve got a node and it depends on some arguments.
So let’s just look at these arguments.
So like, okay, argument one.
Okay, what’s going on with argument one?
If argument one is solely used by, you know, my node, the node that is using it and any nodes
that are after my node, then I know, well, you know, there’s no point in, you know, trying
to compute this early, right?
Because this node, this node is the first node, which actually makes use of this node.
And so I can actually just sync this node as late as possible.
I want to be lazy.
I want to, you know, only do the nodes producer right before I’m actually going to use it.
And that’s the sense in which this is a reorder for locality thing, right?
I want to, you know, do the thing right before I use it because that makes it easier for me,
for example, to do fusion.
This is not always a profitable optimization, because if you have an operation where, you
know, it depends on some big amount of data, and it’s like, say, a reduction, so then you
have a small amount of data coming out at the other end, then you might not want to push
this as late as possible, because you’d be keeping the input live until all the way in
the end, whereas previously, you may have been able to do the reduction.
And then now you only have to hold on to a scalar in that situation.
So we actually only turn this on at inference.
But, you know, it is a useful thing to do, you know, especially if you don’t really think
the user made good choices about when exactly to go ahead and run things.
The next pass that goes on is actually a custom hook.
So we offer in the Inductor config a way for you to plug in your own custom pre and post passes.
That is, we give you a hook into post-grad FX, and we actually give you two hooks: a pre hook,
which runs, you know, before all of the pattern matching we do in post-grad FX, and a post hook
that runs after all of it has run.
So you can just pass in a callable to the Inductor config and, you know, do whatever
optimization you want to do. Go wild; just, you know, remember
the IR invariants, right: we do allow mutation, but it's always at the end.
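A minimal sketch of what plugging into these hooks can look like; the config attribute names (post_grad_custom_pre_pass and post_grad_custom_post_pass) and the exact argument the hook receives are assumptions here, so check your Inductor version.

import torch
import torch._inductor.config as inductor_config

def my_post_grad_pass(graph):
    # Assumed to receive the post-grad FX graph: functional ATen IR, except
    # for the copy_ mutations at the very end of the graph.
    for node in graph.nodes:
        if node.op == "call_function":
            print("post-grad node:", node.target)

inductor_config.post_grad_custom_post_pass = my_post_grad_pass

@torch.compile
def f(x):
    return torch.relu(x) + 1

f(torch.randn(8))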
After the custom hook, we have the bulk of the passes that are going on here, these are the
pattern matching passes.
So as I said, you know, most of the graph passes we are doing are all about just, you
know, looking for particular patterns of operations in the graph, and going ahead and substituting
them with some other type of thing.
And so what are some of the patterns that are going on?
So I’m not going to actually talk about all of them.
But I am going to talk about the ones that are specifically named in the source code, and,
you know, some representative examples.
So one of the ones that we’ve got that actually is its own thing, because it’s fairly complicated,
is a pass called group batch fusion passes.
So there are a few ways to explain what this is.
But the probably the simplest is, let’s say you’re doing a matrix multiply, right?
So if you’re doing just one matrix multiply, you just go ahead and call mm, and you’re done.
But what if you're doing, say, five matrix multiplies, and they're all, you know, operating on
the same size of input? They have different inputs and different weights,
but their sizes are all the same, right?
So I could call the matrix multiply kernel five times.
But what I could do instead is I could stack all of the inputs and the weights together into
a single batched tensor, and then use a batched matrix multiply instead.
And this can be, you know, a lot better for efficiency, because you can get, you know, much
better occupancy now that you see all of the work as opposed to only some of it.
So group batch fusion does a bunch of optimizations along these lines.
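The rewrite is roughly the following, shown here by hand in eager PyTorch rather than as the actual Inductor pass:

import torch

xs = [torch.randn(32, 64) for _ in range(5)]
ws = [torch.randn(64, 16) for _ in range(5)]

# Before: five separate mm kernels.
outs = [x @ w for x, w in zip(xs, ws)]

# After: one batched matrix multiply over stacked inputs and weights.
batched = torch.bmm(torch.stack(xs), torch.stack(ws))

for o, b in zip(outs, batched.unbind(0)):
    assert torch.allclose(o, b, atol=1e-5)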
We also have a library called FBGEMM, which is used pervasively internally, that has all sorts of,
you know, fusion stuff.
And another thing it has is like a gmm op, which actually lets you fuse together matrix multiplies
that don’t necessarily have the same shape.
I don’t really know how this works.
But, you know, this is something you can do.
The batching matrix multiplies together is a very, very common optimization.
Actually, if you, you know, ever have looked at the transformer architecture, you know,
when you do the QKV matrix multiplies, they are typically batched together.
And yes, you know, hypothetically, you could not batch them, but, you know, they do want
to batch together.
Although, you know, you want to use SDPA anyway.
So, you know, that’s what you should use in that case.
But we do lots of matrix multiplies and, you know, fusing them together can be a useful
thing to do.
Another one of the passes we have is called remove no ops.
So this one’s pretty easy.
It says, hey, if you are doing an operator, but actually the operator does nothing, get
rid of it.
So, for example, if I have an int64 tensor and then I say convert it into an int64 tensor,
remove no ops says, okay, I can get rid of it.
Now, you might be wondering, well, hang on.
You know, that’s really trivial.
Why didn’t we just get rid of it in decompositions?
And the reason we didn’t get rid of it in decompositions is actually because it’s not, it’s
technically not a no op.
And the sense in which it's not a no-op is that, let's say you did one of these, you know,
conversions on an int64 tensor and it gave you an int64 tensor out.
That output tensor doesn't alias with the input.
It’s actually a fresh tensor.
We guarantee that it’s that case.
We do have operators like .to() and .contiguous(), which have the possibility of giving
you back, you know, the exact original tensor.
But these are actually all composites.
And as composites, they have to decompose, before we get to Inductor,
into an actual op that does the work, one that unconditionally does the conversion.
So in some circumstances, you know, you might have ended up with a call to the
underlying op that always does the copy.
And so now we need to get rid of it.
And remember, the fx graph in postgrad is purely functional.
It doesn’t have mutation.
And so we can just change the aliasing relationships of things willy nilly without worrying about,
you know, if that is going to change the observable side effects of mutations in the graph.
So that’s pretty cool.
Remove no op ops.
That’s another pass.
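You can see the difference between the composite and the decomposed op directly in eager mode; a small illustration of why the conversion discussed above survives decomposition as a real, copying op that this pass then removes:

import torch

x = torch.arange(4, dtype=torch.int64)

y = x.to(torch.int64)     # composite: may hand you back the very same tensor
print(y is x)             # True

z = torch.ops.aten._to_copy.default(x, dtype=torch.int64)   # always a fresh tensor
print(z is x, z.data_ptr() == x.data_ptr())                 # False False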
We also have a big pile of just graph patterns, which just say, hey, if you see this particular
pattern of nodes in the graph, replace it with, you know, this other pattern.
And this is, I mean, some of these are pretty good.
A lot of these are kind of like benchmark hacking.
It’s like, oh, you know, we were looking at some model.
Why is it slow?
Oh, because there’s this, you know, pattern of code and we can’t really generate good code
for it.
So let’s just rewrite it in something that’s a little better.
One that I noticed while I was preparing for this podcast is a cumsum optimization,
which is like, hey, if you're allocating a tensor of constants and then you're doing a cumulative
sum on them, we don't need to.
We don't need to actually allocate the tensor and do a cumulative sum on it.
I can just constant-fold that into an arange-style tensor where I just, you know,
go ahead and directly increment things, you know, stuff like that.
Like, yeah, in some sense, you know, maybe people should have written their models a little more
carefully and then we didn’t have to write these patterns.
But, you know, we put a bunch of patterns in here.
The other really big set of pattern matching passes we have are the split cat patterns.
So what is split cat?
So split cat is a situation that occurs a lot in the recommendation models we care a lot about
inside meta internally, where I’ve got a packed tensor that contains, you know, a lot of, you
know, typically it’s not a, it’s not a dense tensor.
It’s actually some sort of ragged tensor with a ragged dimension where I have a bunch of sparse
features and they’ve all been concatenated together.
And so one of the common things that I need to do is I need to do some processing, maybe
only on a few features, not all of them.
And the most convenient way to write this when you’re writing normal PyTorch code is to go
ahead and split the tensor into a bunch of itty-bitty tensors, do the operations on the, you know,
few tensors that you actually care about, and then cat them all together back into the fused
thing.
So, you know, this can, as you can imagine, is quite inefficient.
And so there’s a bunch of things that, you know, we can do to, like, make this more optimized.
And so this is the, like, the split cat category of optimizations.
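For concreteness, this is roughly the shape of code the split cat passes are aimed at; the feature sizes and the transform here are made up for illustration.

import torch

packed = torch.randn(32, 2 + 3 + 5)          # three sparse features packed along dim 1
f0, f1, f2 = packed.split([2, 3, 5], dim=1)  # split into lots of little tensors

f1 = f1 * 2.0                                # only process the feature we care about

out = torch.cat([f0, f1, f2], dim=1)         # cat everything back into the fused layout
# The split/cat traffic around the untouched features (f0 and f2) is what the
# split cat optimizations try to simplify away.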
So that’s all the pattern matching optimizations I want to talk about.
There’s still a few more optimizations that don’t fall into the category of pattern matching.
There’s an optimization for fusing DDP communication.
So what’s the idea behind that?
Well, the idea is that, you know, when you are doing distributed data parallel, one of the
things you’re doing is, you know, as you are doing your backwards, you want to do an all
reduce on gradients to, like, get them all together into all of the nodes.
And so if you have multiple gradients, then you don’t actually need to do them as separate all
reduces.
You can fuse them together into a single all reduce and do it all at once.
So: concatenate, and then all reduce.
When you’re writing eager mode PyTorch, we typically, you know, manually write this pattern and make sure we do the fusion ourselves, because there’s no compiler involved.
But you can write your, you know, distributed code a lot more simply if you’re just like, well, you know, just do the all reduce whenever it’s necessary.
And in compiler we trust, to go ahead and fuse the communications together.
So this is the pass that goes ahead and does that.
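A hedged sketch of the fusion being described, in plain PyTorch distributed terms (this assumes a process group is already initialized; the real pass and DDP's bucketing are more sophisticated than this):

import torch
import torch.distributed as dist

def all_reduce_gradients_fused(grads):
    # Instead of issuing one all-reduce per gradient...
    flat = torch.cat([g.reshape(-1) for g in grads])  # concatenate
    dist.all_reduce(flat)                             # ...do a single fused all-reduce
    # ...and then scatter the reduced values back into the original tensors.
    offset = 0
    for g in grads:
        n = g.numel()
        g.copy_(flat[offset:offset + n].reshape(g.shape))
        offset += n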
And we have some other ones like moving constructors to CUDA.
So the idea behind this is if you allocate a tensor, you know, a fresh new tensor, like a constant
or whatever, and then you go ahead and move it to CUDA, that’s pointless.
Just go ahead and allocate it on CUDA directly in this case.
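In eager terms, the rewrite is simply this (assuming a CUDA device is available):

import torch

# before: allocate on CPU, then copy the freshly constructed tensor to CUDA
x = torch.ones(1024).to("cuda")

# after: allocate directly on CUDA, skipping the pointless host allocation and copy
x = torch.ones(1024, device="cuda")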
So that’s a bunch of passes.
You know, that’s actually all of them.
Like, you know, every single pass, if you look at this file as of April 2024.
That being said, there are a few special passes, which are especially interesting.
And they happen at the very, very end.
And they have to happen at the very end because they do the thing that I was talking about,
which is reintroduce mutation back into the graph.
So all the other passes I talked about don’t, you know, do mutation.
They may need to be a little careful around the copy underscore nodes at the end of the graph.
But in general, they are, you know, written with the idea that you’re dealing with a purely functional graph.
And so the special passes, there are two of them.
So one of them is to re-inplace in-place-able ops.
So if I’ve got a tensor and I compute the add of it, like I compute add plus two on it and I have a new tensor, and now my old tensor is never used again, I can just turn that into an add underscore.
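Sketched in eager terms, the re-inplacing rewrite looks like this (the pass itself operates on FX nodes, not Python source):

import torch

def functional_version(x):
    y = torch.add(x, 2)   # out-of-place: allocates a fresh tensor for y
    return y              # x is dead after this point

def reinplaced_version(x):
    x.add_(2)             # safe to mutate x in place, since nothing uses it afterwards
    return x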
Similarly, we have a pass which decomposes so-called auto-functionalized operators.
What is an auto-functionalized operator?
So let’s say that you have a custom operator and your custom operator only has a mutable version.
So all it does is it takes in some tensors, mutates them, and then bam, you’re done.
So when we want to represent this in one of these purely functional graphs, we need a functional representation of this operator.
We don’t force users to go ahead and, you know, write the functional op representation because it’s very boilerplate-y.
It’s basically, okay, we’ll allocate the outputs as necessary and then pass the outputs into the mutating one so that, you know, they get mutated and then return them.
So we don’t force you to write that boilerplate.
We have an auto-functionalized higher-order op that takes one of these mutating ops and turns it into a functional op.
But of course, you know, this functional op is a lie, right?
All it’s really doing is, you know, allocating the outputs and then passing them in into the mutating kernel in the end.
So, you know, if you know what the outputs are, then there’s no need to actually go ahead and, you know, do this functional op.
You just want to turn it back into the original mutating op so there’s none of this overhead in this case.
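Roughly, the boilerplate that auto-functionalization writes for you looks like this; the op name here is hypothetical, and the real machinery works on registered custom operators rather than plain Python functions.

import torch

def my_exp_(out, x):
    # hypothetical mutating custom op: writes exp(x) into out
    out.copy_(torch.exp(x))

def my_exp_functional(x):
    # what the auto-functionalized wrapper effectively does: allocate the
    # output, call the mutating kernel, and return the result
    out = torch.empty_like(x)
    my_exp_(out, x)
    return out

# The special post-grad pass then turns calls to the functional wrapper back
# into the original mutating op once it knows exactly which buffer the output
# should land in.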
So when you are doing these special passes, now we’re actually reintroducing mutation to arbitrary points in the graph.
And so we do really need to know about aliasing information.
For example, I can’t re-inplace an in-place-able op if there is another alias to it, you know, somewhere else.
Because, you know, the semantics are that the alias still sees the original value, because that’s what my graph had.
My graph didn’t have any mutations in it.
And so to make sure we actually have this information, we actually, you know, need to look at the fake tensors that are stored as metadata on the FX nodes.
So inside the FX graph that we’re processing, every node is annotated with a fake tensor, not only saying what the shape and the dtype and, you know, regular metadata of the data at that node is,
but also recording accurate aliasing information by way of the storages associated with the fake tensors.
Fake tensors have storages, and you can actually ask, hey, do two fake tensors have the same storage or not?
And that tells you if two nodes alias or not.
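The storage check is the same one you can do on ordinary tensors; a small illustration:

import torch

a = torch.randn(4)
b = a.view(2, 2)   # b aliases a
c = a.clone()      # c is a fresh tensor

def same_storage(x, y):
    return x.untyped_storage().data_ptr() == y.untyped_storage().data_ptr()

print(same_storage(a, b))  # True: the two nodes alias
print(same_storage(a, c))  # False: independent storage, safe to treat separately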
This is like super, super simple alias analysis.
It bypasses a lot of the complexity that you would have to do in a normal compiler because we say you get one alias pattern.
It is exactly this alias pattern.
It cannot change.
We don’t have, you know, anything like, oh, you’ve got two inputs.
Do they alias?
Do they not alias?
Who knows?
We always specialize on the aliasing pattern of the inputs.
So, you know, we need accurate fake tensor information to actually do the re-inplacing, because we need to know what the storages are.
And in the old days, we used to just rerun fake tensor propagation on the FX graph before doing these passes.
But these days we have this thing called the fake tensor updater.
And all it does is it says, well, you know, most of the optimizations that are happening in the graph are local.
So, you know, you can go ahead and do your optimization and, you know, be a little sloppy about, you know, actually remembering to put in updated fake tensor metadata.
And then we’ll just go ahead and, you know, recompute only the fake tensors for the areas of the graph that changed until, you know, we’ve reached a fixed point in the graph.
And, you know, that’s pretty nice and saved us a bunch of compile time.
Okay, so that’s our whirlwind tour of post-grad FX passes.
One last thing I’ll note, most of the passes are kind of poorly documented.
Sorry, that’s what happens when you write a compiler really, really fast.
The commit messages tend to be pretty good, though.
So if you’re, like, looking at some code and you’re like, oh, what does this do?
There’s no comment.
Go look at the history, see who added it.
There might be a pretty good explanation of what the heck’s going on.
Okay, that’s everything I wanted to talk about today.
Talk to you next time.
EP81 Higher-order-operators
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about higher
order operators in PyTorch as well as in PyTorch 2. Now the name higher order operators is
a slight misnomer because technically you can use them to build any sort of operator,
not just higher order ones, but their primary use case is all about control flow, so it sort
of makes sense. What do I mean by higher order operator? Well in functional programming we
typically refer to higher order as some sort of function which itself takes a function as
an argument. And as you will see many higher order operators in PyTorch are in exactly this
way. They are operators that take other operators or even graphs of computations as arguments to
do various operations that you might want to do. So why do higher order operators exist? Well
there are a few reasons they exist. So one is that in PyTorch 2 our general mode of operation
is we only support straight line code. So you know you can write Python code, it can have
loops, it can have conditionals, all sorts of random stuff. But by the time it gets to our compiler,
we have flattened away and inlined all the control flow so that we have a single straight line graph,
one op, then the next op, then the next op. And this is really great for the compiler optimizations
we want to do because straight line code is very easy to optimize. You know, you don’t have to do any
sort of control flow analysis or anything like that. But it’s kind of annoying for users sometimes
because sometimes, you know, you really, you really, really need that control flow op. You know,
really, really, really. And maybe for some reason, you don’t want to just have that control flow be
implemented in host Python. Because remember, PyTorch 2 is integrated with regular PyTorch in Python. So you
can, you know, seamlessly transition in and out of compiled regions, and regular, you know, Python
control flow. So maybe you have some reason you want to have the control flow inside of your model
itself. For example, if you’re exporting a model, well, you know, you’re not going to bring Python
along for the ride in that case. So you need some sort of way to represent this sort of thing. And
higher order operators are a way of defining control flow operators, so that you can go ahead and
actually write an operation, which will be recognized by all the pieces of the stack. And then you can
actually use it. And then it turned out that we could also use higher order operators to do a bunch
of other things. And I’ll talk about each of the potential use cases. So what exactly is special
about a higher order operator? So there’s a few important things. So one is that when you traditionally
think of what we call an operator in the PyTorch 2 compile stack, that is to say some sort of thing in
torch.ops. Normally, it has a restriction on what arguments it’s allowed to take. Specifically,
this restriction is exactly the torch script schema restriction. So back in the day, when we had
torch script, our first compiler for PyTorch, there was some restriction on what kinds of arguments you
could take. So you could take tensors, you could take ints, you could take list of ints, but you
couldn’t take a dictionary or an arbitrary Python object or a callable. So one of the reasons why this
restriction existed is because operators need to interface into C++ code. And obviously, we can’t
have, you know, random Python objects leaking into C++ that just doesn’t work all that well. So higher
order operators are sort of an extension to the PyTorch operator mechanism. So they behave the same way as
regular operators, you could implement a regular operator as a higher order operator, there’s nothing
forcing you, for example, to take a callable as an argument. But, you know, it’s, it’s going to be
done entirely in Python. And because it’s done entirely in Python, you can pass higher operators,
operators, any arguments you want. So there are benefits and downsides, right? The benefit is you
have a lot more expressivity, you can now pass in, you know, an FX graph, you can pass in an operator,
all sorts of random stuff that you couldn’t do. And the downside is that, well, basically, with higher
operators, you have to do everything by scratch. And, you know, let me let me emphasize what I mean by
everything. So normally, when you define an operator, like a normal operator in PyTorch,
you get a bunch of things for free, you get serialization, you get autograd, especially if
you do it in core PyTorch, if you do it as a custom operator, you know, it’s a little more work,
you have to write an autograd function, but still, all this sort of stuff is available for you. We even have things like functionalization, which lets us, you know, convert a mutating operator into a non-mutating operator. So all this stuff you get for free when you write normal operators,
we sort of ask you for the minimum possible implementation needed. And that’s essentially
all the user stuff, you know, all the stuff that we can’t actually, you know, figure out automatically
for you, you need to give us. But if you write a higher order op, right, you’re saying, hey, you know,
I’m going to do some special stuff with some special arguments. And my behavior is very,
very custom. And so actually, I have to go and implement every single one of these
transformations from scratch. So I have to say how to do autograd, I have to say how to do fake tensor
propagation, I have to say how to do functionalization, I have to say how to actually run the
thing, well, you’d expect to have to do that. All of these things, everything that the PyTorch
dispatcher, you know, normally would handle when you’re doing an operator, you have to implement
when you’re doing a higher order operator. And this kind of makes sense, because, you know,
many of the sort of generic implementations we have in the dispatcher involves saying, hey,
there’s a fixed universe of types. And I know how to handle every single type when I want to do
something. For example, vmap. Vmap is our, you know, way of vectorizing operations. Vmap says, well,
all I need to do to vmapify an operator is look at all of the tensor arguments. And then, you know,
go ahead and extract out the batch dimensions from them. So I can’t do that if you’re giving me,
you know, some random Python objects, because I don’t know if there are tensor arguments lurking
inside them. So that’s why you have to re-implement everything. And so really, actually,
the implementation of higher order operators is very simple. Once you actually have done all the
hard work of defining what you’re supposed to do on every step of the dispatcher, then all the higher
order operator calling mechanism does is re-simulate the same, you know, sequence of dispatcher calls
you would have gotten on a regular PyTorch operator just doing it entirely in Python. In fact, the main
implementation mechanism for higher order operators, the Python dispatcher, was actually built for a
different reason, namely that we wanted a little more customizability on the dispatcher from Python. The
dispatcher is all in C++ for eager mode performance. And then, you know, it turned out, hey, we can actually
use this to implement higher order operators. If you’re interested in more about the dispatcher, I have a really
nice blog post on my blog about the dispatcher. And I think there’s also a podcast about it. So plenty of
material on this. Okay, so that’s the sort of high level on higher order operators, you know, they let
you put arbitrary arguments inside of your operations, including graphs and other operators, that’s why
they’re higher order, and they’re pretty difficult to implement. So, you know, what exactly in the PyTorch2
stack, you know, involves higher order operators? So let’s just go through them, one by one. So for the first
class is the class of control flow operators. And this is sort of the original raison d’etre for higher
order ops. So you’ve got things like cond, map, while loop; they do what you expect, right? Cond lets you do a runtime conditional on data that you don’t actually know ahead of time, and you’ll branch to one side or the other. Super useful when doing export, and also not too, you know, difficult to deal with from a static analysis perspective. We have the while loop, which lets you, you know, loop over and over again, doing an operation until some condition is true. This one, we have some restrictions, you know, they’re not unrestricted while loops: the return type of the while loop has to be exactly the same every single time, and there are restrictions on loop-carried dependencies. Actually, a lot of these are just sort of very closely modeled off of their equivalents in the TensorFlow slash JAX world, you know, where they’ve been ahead of the game on us, because, you know, we didn’t used to need this in eager mode; you could just write a regular while loop in Python.
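As a hedged sketch of what cond looks like to use (the control-flow ops are still prototype APIs, and the exact import location has moved between releases):

import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    # branch on a data-dependent predicate without graph breaking
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

print(f(torch.randn(4)))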
So that’s a bunch of host side control flow; it sort of works the way you would expect. We also have device side control flow. So for example, when you want to do a cumulative sum, you know, that’s just a built-in operator in PyTorch, torch.cumsum. But let’s say you want to do a custom reduction,
how would you go ahead and implement that? Well, you might want to specify some sort of reducer
function, maybe subject to some constraints like it having to be associative, because we’re doing some
sort of like tree reduction type scheme. And so after you specify this reduction, which, you know, is
probably a bunch of operators, you know, as I said, a bunch of operators, you want to wrap it up into a
thing that actually turns it into a, you know, reduction operation. So cumulative sum is an example of a,
you know, device side, you know, operation, we also have things that don’t really resemble control flow,
but still require that, you know, give me some sort of function, and I will do something with it,
inside the context of this function. So for example, something that we’re going to be releasing soon is
templated attention kernels. These are pretty cool. This work is from Horace and from Driss. What it
essentially does is, you know, there’s a standard attention kernel that we have written in Triton
that, you know, you can just use directly if you call regular attention. But there are various things that you
might want to customize, like the scoring and other things. And these things are embedded directly inside of the
attention kernel. So you can’t just, you know, go ahead and tweak some arguments to the call of the attention.
And what you actually probably want to do is pass in some custom functionality, you know, some operation on
scalars that specifies what you want it to do. So similar to the cumulative sum case, you know, what we do is we define a higher order op: you can pass in some, you know, callable, which specifies a bunch of scalar operations you want to do on the inside, and then we’ll, you know, bundle this up with our attention template. And then you get a custom attention kernel that does all this stuff for you. And yes, in principle, you could have, you know, copy-pasted out the Triton code and then, you know, made the modifications you need. And in some sense, it’s not that hard, like there’s not that many lines of code you need to change. But you know, you kind of need to know Triton to do this; it’s kind of not so easy. So having something like templated attention makes it a lot easier to just go ahead and do lots of variations on things you might want to do.
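For reference, this templated attention work later shipped publicly as FlexAttention; the sketch below assumes that API (torch.nn.attention.flex_attention in newer releases) rather than anything named in this episode, and the score modification is made up.

import torch
from torch.nn.attention.flex_attention import flex_attention

def score_mod(score, batch, head, q_idx, kv_idx):
    # a scalar-level tweak to each attention score, e.g. a relative position bias
    return score + (kv_idx - q_idx)

B, H, S, D = 2, 4, 128, 64
q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
# Under torch.compile this is lowered to a fused kernel; in eager mode it
# falls back to a reference implementation.
out = flex_attention(q, k, v, score_mod=score_mod)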
We also even use this higher-orderness, this ability to take in graphs, for things that aren’t even really control flow or kernels at all. So for example, activation checkpointing is done as a compiler pass inside
PyTorch 2. It’s not done by tracing out the eager implementation of activation checkpointing for
technical reasons. And so to do this, we need to say some way of what the region we want to
checkpoint is. And that’s a higher order operator, right? It’s a higher order operator that contains
a graph that is just, you know, the stuff you want to checkpoint. And this doesn’t really have any
runtime meaning. It’s just used by the compiler to control what we actually do. Okay, so that’s all of
the like sort of control flowy, higher order operators, but there’s more. So I mentioned that
higher order operators let you pass arbitrary crap as arguments to the Python function. And so this is actually useful in a bunch of situations. So for example, Ose recently added the ability to write custom user Triton kernels and have them directly embedded into the PyTorch 2 compiled graph. And so
what exactly does this look like from a compiler’s perspective? Well, you write some Triton code,
and then we need to somehow represent it inside of, you know, our compiled graph, which normally is just a
bunch of Aten ops. And what these like Triton call kernels need to do is they need to hold on to a
reference to the actual Triton code that you wrote ahead of time. Hmm, doesn’t sound like an int or a
float. It’s not really a callable of operators. It’s just some random, you know, Python object. That’s the
Triton representation. And so, you know, if you have a higher order operator, you can just go ahead and put
it in that operator. And then we can preserve it all the way to Inductor. And then Inductor can do smart
things with it because Inductor works at the Triton level and it can incorporate other Triton code into
its code. Another example of sort of non-standard type arguments is our Torchbind integration. So Torchbind
is an old, it’s from the TorchScript days. It’s a way of binding arbitrary user-defined classes so that you
can call them inside of the compiler. A useful thing to want to be able to do and also useful in
PyTorch 2 context. And so, you know, when you make references to these, you know, sort of user-defined objects, well, once again, these are strange arguments; they’re normally not handleable by the dispatcher. So that’s also done as a higher order op. The final class of
higher order ops, I would say, are what I call operator variants. That is to say, in principle, the higher
order op is not necessary. We could have just manually written out a bunch of extra operators
representing the thing you want to do, but you actually have to write a lot of custom operators.
So instead of having to write a lot of custom operators, the higher order op lets you take an
operator and turn it into some different variation, which, you know, wants to be treated as a single
operator, but, you know, has some different semantics. So let’s give some examples. So one of the higher
order ops that does this is out_dtype. What does out_dtype do? It’s very simple. It says, hey, do this operation, and the output dtype should be this, rather than the normal dtype in this case.
And so one of the, like, sort of primary reasons this is necessary is when you’re doing low precision
matrix multiply, you may want to control what the output precision is, and in particular, have it be
higher precision than the inputs, because that lets you do accumulations in higher precision. Maybe that’s just what you want in that case. So we could have just added another matrix multiply operator that has an out_dtype, you know, argument on it. But instead of doing that, and we actually argued about this a lot, because this was one of the serious proposals on the table, we instead introduced a higher order operator that controls out_dtype. And like, basically, now you can use this on any operator.
And actually, most operators don’t do anything special. But you know, if you need it for other
things, you can just use it in that case. Another example is auto functionalize. So I mentioned that
we have this thing called functionalization, which takes your graph with possibly mutating operations,
and turns it all into non mutating operations. And so the problem with functionalization is that
you need a pure version of the operator. So for everything built in in regular PyTorch, we have both a
mutating version, like add underscore as well as the pure version, add without the underscore. But if
you’re writing a custom operator, and you know, you just wanted to like, you know, mutate some of the
arguments and you know, return some other arguments, you probably don’t want to go through all the
boilerplate of writing the functionalized version. So auto functionalize just goes ahead and does it
automatically for you, right? It just looks at the schema of your operator and is like, okay, these
arguments are mutated. So let me go ahead and pre allocate, you know, buffers for them, so that they
can get mutated into and then return them. And now you have a functional version of the operator in
question, saving you from having to actually manually write these things out. And you know, auto
functionalization kicks in, you know, early in the compiler stack, we do a bunch of optimization passes
on the purely functional IR. And then what we actually do is we de-functionalize them, we replace
them with the original operators. And that, you know, means you don’t actually pay any runtime cost for
this. And finally, we have, and this is very new from Yidi and Richard, effects support. And basically, what effects support is, is: there are some operations we want to support in PyTorch IR that have side effects, like printing, you know, things that are for logging. And so in particular, we don’t want to reorder them when we’re doing optimizations. So there’s a bunch of ways you can prevent reordering, but our choice with the higher level FX IR, the functional IR, is that order doesn’t matter. Instead, we’re just going to manually insert fake control dependencies, modeled as data dependencies, between nodes. And if they’re just fake data dependencies modeled as regular data dependencies, then the normal, you know, respecting of data dependencies will make sure we don’t reorder things. So once again, when you have an operator, and you want it to, like, actually,
you know, have some more strict ordering requirement, well, now you need some new version of the operator
that takes in this, you know, control dependency node, this token, as we call it, and then produces a new
token that you can thread on to the next thing. So once again, a pain to actually write these
operators all from scratch. So there’s a higher order operator that does this wrapping and then also adds
the token to the input and output. So that’s a whirlwind tour of all the higher order operators in
PyTorch. There’s a bunch of them. They are sort of important because, you know, if you’re an export
backend, you kind of need to know how to deal with higher order operators. They’re very, very custom.
So we try not to add too many of them. But you know, they’re a bit too useful not to use. So we are
using them relatively frequently. That’s everything I wanted to talk about today. Talk to you next time.
EP82 TORCH_TRACE-and-tlparse
Hello, everyone, and welcome to the PyTorch Dev Podcast.
Today, I want to talk about Torch Trace and TL Parse,
our structured logging framework for PyTorch 2.
You may have already heard of Torch Logs,
which is a very nice developer-oriented feature
that gives you debug logging for the PyTorch 2 stack.
Traditionally, PyTorch didn’t have very much logging,
but we found it very useful when working on a compiler
because compilers are complicated,
and so we have a lot of logs
and you can use them to get useful information.
However, what we noticed was that for bigger jobs,
big and complicated models,
the amount of data that we got from Torch Logs
was actually too much.
It was very difficult for people to find the information they needed.
For example, if you’re running Dynamo in debug mode,
you get a line of debug logging
for every single bytecode you process.
So as you can imagine,
that’s pages and pages of bytecodes for large models,
and suddenly you just can’t figure out,
you know, did this graph break, whatever.
There was another problem,
which is that when we were running models on clusters,
sometimes there would be bugs,
and people would say,
hey, you know, I ran my PyTorch 2 model,
and it crashed or it had some problem,
you know, please take a look.
And we’d want to take a look,
but one of the things we’d want to look at in the situation
is some of the generated code that Inductor had
or some of the intermediate code,
and we had no way of getting at it
because even though we have things like Torch compiled debug,
which dumps all of the intermediate products to disk,
if you’re running on some cluster,
you know, it can often be inconvenient
to actually get those things,
because by the time the user has come to you with their problem,
all of the machines that the job is running on have already been released,
and their file system is scrubbed,
and you no longer have access to any of the logs.
So TL parse was born out of this problem.
Originally, my idea was,
hey, you know, what if we had a log parser for PyTorch 2 logs,
and, you know, we’ll just go ahead and, you know,
take all of the plaintext logs we’re generating
and then parse them into something useful.
But it turned out that we have lots and lots of logs.
And also, it’s not so easy to tell apart our logs from other logs.
And there were a bunch of other logs,
and we ended up having gigabytes of log files,
which didn’t even have the information we wanted.
So TL parse and Torch trace are now two pieces.
So part one is it is a structured logging mechanism.
So unlike the plaintext logs,
the logs that Torch trace emits are much more structured.
They are emitted as JSON.
And there’s only a few of them,
basically things that we thought would actually be useful for TL parse.
So when you run a PyTorch 2 program
with the environment variable TORCH_TRACE set,
you’ll get these trace files, which are structured logs.
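A minimal sketch of turning this on from Python (you can equally set the environment variable on the command line; the output directory here is made up):

import os
os.environ["TORCH_TRACE"] = "/tmp/my_traces"  # set before importing torch, to be safe

import torch

@torch.compile
def f(x):
    return torch.sin(x) + 1

f(torch.randn(8))
# The JSON trace files written under /tmp/my_traces can then be fed to the
# tlparse command line tool (pip install tlparse) to render the HTML report.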
And then you can feed them to part two of TL parse,
the log parser, which is written in Rust
and just lets you, you know, go ahead and take those JSON
and output a nice formatted HTML document
for this sort of thing.
Actually, there’s, you know, this is a very modular system.
You can use the structured logs for anything else you want.
So for example, I did an Easter hack called TorchDBG.
It’s just a little, you know, time travel debugger
that lets you get a trace of a model
and then look at things in a React UI
that lets you, you know, forward step, backward step.
And I actually use the structured logs mechanism.
So I didn’t use the regular torch trace logs.
I, you know, added my own custom structured tracing log
from a dispatch mode.
And I didn’t use the log parser, TL parse.
I actually made a custom React UI
that read in the structured log format.
And, you know, just, you know,
structured logs are useful, right?
They can be used for a lot of things.
And in this case, I just use this container format
for this other use case.
But we’re going to talk about torch trace and TL parse
because that’s actually useful.
And TorchDBG is just a fun side project
that I don’t really know what to do with.
Okay, so let’s talk a little bit about
what kind of structured logs, you know, torch trace emits.
So there are a few things that we emit.
So one is that every time we do a compilation,
we emit a compilation metrics,
which basically says what happened?
You know, how many operators do we compile?
Did the compilation succeed, fail?
Did we restart?
You know, basic metrics like this.
We actually also send these
to our internal structured logging system at meta
so that, you know, we can go ahead and query them.
But it’s pretty handy to just have these
available for a single log, for a single run,
so you can look at it all in one place.
The other thing in the log is all of the compilation artifacts.
So, you know, when Dynamo is done executing
and it’s generated in FX graph,
we dump that to the torch trace log.
Then for each of the intermediate FX passes,
AOT autograd, inductor passes,
we dump their FX graphs.
We also dump the final inductor generated code,
you know, the Python code that’s got Triton code in it.
And, you know, bundling all this together,
you know, you’re mostly interested in,
you know, any particular compilation.
So we have this thing called compilation IDs.
And these let you identify distinct compilations
that happen in the system.
They’re numerically ordered.
And so they usually come in the form of X slash Y,
where X is the particular frame we’re compiling.
So, you know, if you’re compiling function F,
maybe that will get the number zero.
Then when you compile G, you’ll get the number one
and so forth and so forth
until you have all the compiled frames.
And then the second number,
Y, tells us the number of recompiles
we’ve done on this frame.
because sometimes we’ll compile a frame
multiple times because of guard failures.
So then you’ll get 00, 01, 02, 03, and so forth.
Actually, there’s a third number.
It’s appended at the end, underscore blah.
And that happens when Torch Dynamo restarts.
So you also get to see restarts of analysis
in Torch Dynamo.
So that’s it.
That’s basically all the structured logs
we actually emit.
So these get put into a log file in JSON,
and then they get sent to TL parse,
which actually does some sort of visualization.
So what exactly does TL parse do?
Well, it’s also pretty simple, okay?
So one of the things it does
is it is an HTML file,
and HTML files mean we can, you know,
do things like hyperlinks.
So instead of, you know,
blatting all of the intermediate products
into one giant log file,
which is what you would normally get
if you were doing regular plain text logging,
I can just create separate files
for each of the compilation artifacts,
and then you can click links to get to them.
It’s not, it doesn’t sound like much,
but it’s a huge, huge difference for readability.
Another thing that we do,
and this one’s pretty neat,
and I like it a lot,
is we build a stack trie
of all of the compilations that we did.
So remember, a trie is a data structure
where, you know,
if you have a bunch of, you know,
strings,
strings in the computer science sense,
where, you know,
they have shared prefixes,
and then at some point they diverge,
a trie lets you, you know,
sort of put them all
into a tree-like data structure
where shared prefixes share,
you know, a path,
and then once the strings diverge,
then they go down different paths in the tree.
Well, a trie doesn’t have to operate only on strings,
you can have a generalized trie.
So our stack trie operates on frames
instead of characters.
So for every shared frame
in any given stack trace we have,
they get the same, you know,
node representation inside of the stack trie,
and then when things diverge,
we, you know, actually branch the tree.
So this gives you a really nice bird’s eye view
of all the compilations that happen.
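A minimal sketch of the stack trie idea, with frames represented as plain strings for illustration:

class StackTrieNode:
    def __init__(self):
        self.children = {}     # frame -> child node
        self.compile_ids = []  # compilations whose stack ends at this node

    def insert(self, frames, compile_id):
        node = self
        for frame in frames:
            node = node.children.setdefault(frame, StackTrieNode())
        node.compile_ids.append(compile_id)

    def dump(self, indent=0):
        for frame, child in self.children.items():
            print(" " * indent + frame, child.compile_ids)
            child.dump(indent + 2)

trie = StackTrieNode()
trie.insert(["train.py:10 in main", "model.py:42 in forward"], "0/0")
trie.insert(["train.py:10 in main", "model.py:80 in loss"], "1/0")
trie.dump()  # the shared "train.py:10 in main" prefix is printed only once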
I’ve definitely debugged problems involving,
you know, for example,
Dynamo trying to compile too many things.
I’m just looking through the stack trie
and being like,
well, do I expect these stack traces
and looking for something
that looks out of place?
Like, oh, somehow we’re compiling something
that was triggered from an import statement
inside of some random code.
And then, you know, I know,
oh, that’s the problem.
Normally, if I just have a big pile of stack traces,
one for every compilation,
that might be pretty hard
to find the needle in the haystack.
But the stack trie compresses away
all the redundant information,
so you only see the parts that differ.
There are lots of other possibilities
for the visualization.
And one of the reasons
why we have this two-step architecture
where you generate structured logs in JSON
and then you have a separate log parser
is so that we can iterate
on log parser separately.
So, you know, it’s kind of difficult
sometimes to update the trace generation code
because that’s often associated
with some deployed, you know,
version of PyTorch,
some packaged binary
that you can’t easily update.
But once you have one of these traces
and assuming it has
all the information you need,
you can just download it
to your local machine
and then keep iterating on TL parse
until you have some sort of format
that, you know,
looks like what you want.
Some quick brief things
about a log format design.
So as I said,
it’s a structured log
where the structure is just JSON
for human readability.
Really, the main idea
behind this log format
is it’s designed to interoperate
with your logging system.
because if you’re, you know,
doing any sort of, you know,
infrastructure work
with PyTorch,
you’re actually running jobs,
you probably have some way
of actually capturing logs already
and putting them somewhere.
So Torch Trace is designed
to piggyback on that.
Of course,
these log files are, you know,
put in separate files
away from the rest
of your regular logging.
But assuming you have some way
of sending things
to your logging store,
you just need to point those files
at your logging store
and you can store them.
So in particular, for example,
we don’t actually, you know,
allow for arbitrary sized,
sorry, we try not to generate lines
that are arbitrarily long
because often, you know,
regular logging systems
can’t handle that.
So that’s like one of the things.
Some other things are that,
you know, we do intern strings.
So we generate a string table
to reduce size.
This is mostly useful
for stack traces
which are very, very repetitive.
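The interning idea, sketched minimally (field names are made up):

strings = []    # the string table, emitted once into the log
index_of = {}

def intern(s):
    if s not in index_of:
        index_of[s] = len(strings)
        strings.append(s)
    return index_of[s]

# Frames then store a small integer instead of repeating the full filename.
frame = {"filename": intern("/data/users/me/model.py"), "line": 42}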
The traces are still
pretty repetitive though.
So, you know,
I do recommend gzipping them
if you can.
And finally,
how exactly did we design
the JSON format?
So it was mostly co-designed
with Rust’s serde JSON library,
which, you know,
basically is able
to conveniently deserialize
JSON objects
into Rust structs
as long as the structs
are set up in a particular way.
And the most important thing
to know is we do
protobuf style unions.
So whenever we have a message
where there are
multiple possibilities,
we just have fields
that optionally contain
all of them,
one per possibility.
This means that it’s possible
to have multiple fields set
even though this is
technically illegal
because it’s supposed
to be an enum.
But it’s really good
for backwards compatibility
because you can always add
new possibilities
by just adding new fields.
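A hedged illustration of the protobuf-style union idea: one envelope object with an optional field per kind of message, only one of which is expected to be set (the field names here are made up):

import json

record = {
    "compile_id": "0/0",
    "dynamo_output_graph": {"sizes": {"x": [8]}},  # the field that is set for this record
    "inductor_output_code": None,                  # other possibilities left unset
}

line = json.dumps(record)   # one JSON object per log line
parsed = json.loads(line)
kind = next(k for k, v in parsed.items() if k != "compile_id" and v is not None)
print(kind)  # -> "dynamo_output_graph"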
How exactly did we design
TL parse?
So one thing that I did
was I implemented it
in Rust.
I’m actually a little unsure
about whether or not,
you know,
Rust is the right
language for this program.
So originally,
when I didn’t have
a structured trace mechanism
in PyTorch itself,
I was planning to parse
regular logs.
And parsing regular logs
would have been a problem
because we tended to get
gigabytes and gigabytes of them.
So it would have been
prohibitively slow
to actually implement them
in Python.
So Rust is really good
at these like command line,
you know,
text processing applications.
The one thing that’s a little
awkward about using Rust
for this program is
there’s often a bit
of iteration needed
in, you know,
what exactly I want
for the design.
And, you know,
sometimes Rust,
you know,
Rust wants you to like
do a bunch of refactoring
to get all your lifetimes
right.
So that’s kind of
irritating sometimes,
but it’s not,
it’s not a huge deal.
And I’ve definitely
been able to add stuff
fairly quickly
when I needed to.
The other thing
is that,
you know,
James Wu,
who has also been helping
with the development
of TL parse,
we have a little bit
of structure
for artifact parsing.
So instead of just
having a giant
if statement of doom,
we have a trait
for artifact parsers.
So basically every time
you add a new artifact,
you just write a new
parser struct
and define a trait
for it that says
how to do the parsing.
And this is a little
bit of structure.
You basically can
cargo cult it
if you want to
write a new trait.
That being said,
I’ve noticed that,
so the original way
I designed the JSON
format was I like
basically did a
separate struct
for every single
message type
that I wanted to do.
But I think that’s
probably a mistake.
It’s probably better
to have a single
generic like artifact
text format.
And what that means
is that I can just
easily add new
structured traces
that generate more
artifacts without also
having to update
TL parse at the same
time, which is what
I currently have to do
because everything
gets their own
special snowflake
enum.
If you work at
Meta, we actually
have an internal
version of TL parse,
so you don’t have to
download and install
the regular one
from pip.
And it has some
niceties like it
knows how to talk
to our internal
job systems and
download the logs
directly from there
so you don’t have
to download it
yourself.
Very, very
convenient.
You just paste in
the URL and it
does everything for
you.
Okay, so what’s
next for TL parse?
So I think the
main thing I’ve
noticed is that as
I use it to debug
problems in
production, there’s
a lot of small
bugs that just
sort of become
obvious when you’re
dogfooding.
For example, one
recent site outage
I was helping
debug was we were
trying to figure out
what was wrong with
the stack trie.
And there’s
something wrong
with the stack
trie, which is
that the
stack trie is only
supposed to show
user frames, but
in some situations
it also shows
Dynamo compile
frames.
And this was very
confusing.
I thought, oh, are
we compiling these
things?
But actually the
answer was no, we’re
not actually compiling
those things.
You know, they’re just
showing up for some
reason.
So, you know, that’s
something to figure
out.
And there’s some
cases where we’re
just missing stack
traces where it’d be
helpful to have a
full stack trace.
So, there’s always
improvements to do for
the TL parse UI.
And if you like Rust
and, you know, you
like PyTorch, you
know, this might be a
fun little thing to
work on.
For example, right
now we do know
whether or not
compilations succeeded
or failed because you
can see them, you
know, by looking at
the compilation
metrics page.
But in the stack
trie, it doesn’t tell
you.
It just gives you a
bunch of links to
the various things.
So, you have to
click on the
particular compilation
ID you want and
then click on
compilation metrics to
find out if it
actually compiled or
not.
So, you know, just
inlining that
information in the
stack trie, that
seems like a useful
thing to do.
These are very easy
to work on because
once you have a
trace and it’s not
too hard to generate
a synthetic trace,
you can just, you
know, make some
changes, you know, run
the program and then
see what it looks
like.
It’s very pleasant.
We also, you know,
can add more
structured traces to
PyTorch.
This is also
something that, you
know, I often, you
know, think, oh, you
know, it would be
nice if I had this
information.
One thing to be
careful about is we
don’t want to add
too many things
because then the
log files get
very large and
because it takes
longer to parse
them and it’s more
load on the
storage system.
But there is a
bunch of stuff that
currently is only
available in text
logs and I think
would be pretty
useful to have
available.
One thing in
particular that I,
you know, care a lot
about is symbolic
shapes logging.
So I haven’t
figured out exactly
how I want to put
it into TL parse,
but there’s
definitely something
here that I want
to put in.
And finally, one
thing for like us
internal users is we
have a lot of
models that we
torch compile that
are actually
dynamically generated.
So what happened
is we used torch.fx
to sort of
generate an IR and
then we generated
Python code for it
and then we’re
dynamoing into that
Python code.
So this Python code
doesn’t exist anywhere
in the source, the
source file system,
you know, they’re just
completely generated
on the fly.
And so if you have
errors in this
source, you’re just
like, what the heck
is in this source
code?
I have no idea.
There is no access
to it.
So it would be really
nice to actually dump
that to TL parse.
So TL parse could
show it for you when
you have the stack
traces.
There’s some fiddly
bits in implementing
this, so I’m not
exactly sure.
So there you have it
TL parse.
So if you work at
meta and you’re
listening to this
podcast, TL parse
is really, really
useful.
So if you haven’t
tried it already, the
next time you have
some sort of problem,
even if you’re like
debugging unit tests,
you know, just say
torch trace, blah, and
you can get out a
trace and TL parse it
and look at it.
I promise you it
actually is really,
really useful.
And if you don’t
work at meta, you
know, I think TL parse
could still be useful.
You know, some of
the integrations don’t
exist yet, but you
know, as I said, you
can just run it, take
a look at things, you
know, you might be
surprised by what you
could find out about
your model.
That’s everything I
wanted to talk about
today.
Talk to you next
time.
EP83 Compiler-collectives
Hello everyone and welcome to the PyTorch Dev Podcast. Today I want to talk about compiler
collectives, a new feature in PyTorch 2 compilation which allows the compiler to communicate to other
instances of the compiler on other ranks in distributed training in order to communicate
information that may be useful to other nodes in the training. To explain why compiler collectives
are useful, I first need to recollect a particular problem that we encountered in our production
deployment of PyTorch 2 inside meta. The problem looks something like this. Occasionally we would
have jobs that were running with PyTorch 2 enabled and they would NCCL timeout. Now, NCCL timeouts occur whenever you have an NCCL collective and some of the collectives just have to wait too long for a result, and there’s a timeout because one of the common reasons why you, you know, wait too long is because there’s a deadlock or it’s never actually going to finish. So we have a timeout to make sure we actually kill the nodes and make sure we release resources in this situation. So in this particular case we were NCCL timeouting, and the first thing you do when you have an NCCL timeout is you go and look and you see what the heck all the jobs were doing at the time they crashed. And we noticed that in this particular case some of the jobs were doing compilation. Now, why were some of the jobs compiling code in Torch Compile while other ranks were just, you know, waiting in the network collective? So using TL parse, a log parser that we have for PyTorch 2 which can tell you what was going on on all the nodes (see a previous podcast for more information), we noticed that the ranks that
were compiling were actually doing an extra recompilation that the other ranks were not. So some of the ranks had just gone ahead and run the code and gotten all the way to the collective, and this poor unlucky rank was actually recompiling. And further inspection of the trace revealed that the reason why this rank had decided that it needed to recompile was that there was some particular input to one of the graphs that it had compiled, and that input had changed, and the graph that the node had previously compiled was static: it had thought that the size of the input at that location was static. And when you compare this to the other nodes, those other nodes had already compiled a dynamic node for it. So actually, what had happened here is a consequence of something that we call automatic dynamic. So automatic dynamic in PyTorch 2 says: hey, we don’t know whether or not your inputs are static or dynamic unless you explicitly tell us. So if you don’t tell us, we will assume that all of your inputs are static, and then, depending on what we see at runtime, if you pass us a tensor with size five and then you pass us a tensor of size seven on the second run, we will realize, oh, actually it looks like you want this to be dynamic, and we will recompile your graph so that it is dynamic in this case.
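A minimal sketch of the automatic dynamic behavior being described, with default torch.compile settings assumed:

import torch

@torch.compile
def f(x):
    return x * 2

f(torch.randn(5))   # first call: compiled assuming size 5 is static
f(torch.randn(7))   # size changed: recompiled with that dimension dynamic
f(torch.randn(9))   # covered by the dynamic graph, no further recompile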
So the problem is that most of the other nodes had gotten particular inputs that had varied between the first run and the second run, say going from five to seven, and they had all recompiled with that input being dynamic. But the unlucky node, the node that was actually recompiling at the time of the NCCL timeout, had actually, unluckily, gotten an input of exactly the same size both times, and so it had happily assumed, well, you know, let’s just keep it static, and only got caught out, not being prepared to deal with it, at the end of the next run when it suddenly had to recompile. So when we first ran into this problem, I thought, oh my god, automatic dynamic was a mistake. Except that it’s not really a mistake, because if we didn’t have automatic dynamic then, you know, this model would not have compiled at all. So it’s, like, you know, a useful mistake, but in some sense it’s architecturally a bit questionable, because the whole point of SPMD distributed training is you want all the nodes to be doing the same thing, and so one of the things you want is you want all the nodes to be compiling at the same time. It’s really bad if one of the nodes is recompiling. Even if we adjusted the NCCL timeout so that we didn’t time out (because if you wait long enough then the recompiling node will eventually get to the end and you will be able to make progress), it’s still not optimal for, you know, this divergence to happen, because all of the other ranks are waiting for this one straggler rank to finish compilation. Now, there was another ongoing
problem with our production deployment where things that were supposed to compile in 30 minutes were actually taking two hours to compile, so that really exacerbated the problem a lot in that particular case. But still, you know, when we noticed this problem, it was kind of an interesting issue, and it wasn’t entirely clear what we should do about it, right? Should we go ahead and, you know, force a bigger timeout in this situation? Should we do something else? And so the one solution that we settled on, which was a balance of sort of being easy to implement and not requiring too many extra constraints from the user, is this thing called compiler collectives. So compiler collectives are an abstract idea. The abstract idea is: hey, when I am doing compilation on my PyTorch process, let me actually assume, and this is a new assumption, that everyone in the group inside, you know, my training job is compiling at the same time. Now, if I can assume everyone is compiling at the same time, then what I can do is, during compilation, I can do a collective to all the other nodes to basically tell them, hey, what’s going on. So this is an abstract idea; you can use this for all sorts of things. But what we’re going to use it for, to solve this particular recompilation problem, is this: we are going to say, hey, have all the ranks talk to each other whenever you see an input, a tensor input, and when you see the tensor input, I want you to tell all the other ranks what size you saw for that tensor input. And so in this particular case, what happened was: the input that was dynamic actually varies across all of the ranks, because it’s some sort
of, like, data-dependent size: this is, like, a recommendation model, so there’s a sparse feature going on, and, you know, not all the ranks are getting the same sizes. So when you have this situation where it’s unbalanced across all the ranks, then if all the ranks talk to each other to try to figure out what’s going on, they’ll say, hey, actually everyone has a different size for this, so maybe, even though this is the first time I’ve run and I don’t necessarily know what the size on this rank should be, let me just go ahead and make it dynamic. And more importantly, because all the nodes are talking to each other, we can ensure that they consistently decide whether or not a particular input should be dynamic or static. So in this way we either never recompile, or, if every rank happens to be unlucky and sees the exact same size input on iteration one and iteration two, everyone recompiles at the next stage. So hey, like, you know, that’s not great, but at least you’re not going to NCCL time out, because everyone is still doing the same thing. Now, I slightly lied in this explanation: I suggested that, you know, we do a communication every time we see a tensor input, but you don’t really want to do that, right, because communications are expensive; you want to batch them together, typically. So what we actually do is we run the Dynamo tracing process to the very end of the region we want to compile, collecting up all the inputs we’ve seen along the way, and then at that point in time we go ahead and do the collective, have everyone talk, be like, hey, you know, here are the sizes of all of the inputs I’ve seen. And then, because, you know, dynamic tracing is something you sort of can’t do retroactively (you need to have made the decision to make something dynamic at the very beginning), we just tell everyone to go ahead and restart your Dynamo analysis, and this
time, you know, make decisions about whether or not inputs are dynamic based on this compiler collective. We actually already have this restart capability; we use it to deal with graph breaks, because when a graph break happens, if we’re in the middle of some inline call stack, we actually don’t have the ability to graph break inside a nested user frame, so we need to pop all the way back to when the first inline function call happened. But that involves, in full generality, rewinding back arbitrary changes to the mutable state, so instead of, you know, having to figure out how to reverse all that, we just say, okay, whatever, we’re going to start over again, but this time we’re just going to stop immediately when we get to the function call. So same idea: we’re going to restart and then use our new knowledge to, you know, make different decisions when we’re compiling. So compiler collectives are pretty cool. We actually, you know, I actually successfully ran them on the production model that, you know, sort of sent us down this goose chase in the first place, and there’s a really interesting consequence to it. So not only does it solve the recompile problem, which, you know, actually happens pretty rarely (like, I don’t even know that I’ve actually necessarily solved it; this is something that I’d have to actually, you know, run the real model on; I have a synthetic test case that shows that it works, but, you know, I don’t know definitively that it works for the real model), but the thing is that, because the compilers are talking to each other even on the first iteration, I actually can skip the stutter step that happens typically when you have automatic dynamic, the stutter step being: the very first time I compile it with static shapes, and then the second time I compile it with dynamic shapes. I don’t need to do that anymore, because I figure out immediately that the shapes are all dynamic, and this actually drops compile time for this model from 95 minutes to 63 minutes. So
that’s pretty cool. The problem with this approach is that it’s not universally applicable, right? I said that we’re going to assume that every rank compiles at the same time, but it’s really easy for me to have a valid SPMD program with torch.compile that doesn’t have this property. Like, just say that I have, you know, one rank doing one thing and another rank doing another thing, and it just so happens that the first rank has one graph to compile but the second rank has two graphs to compile. That’s a kind of strange architecture, and, you know, you’re definitely doing something unusual if this sort of thing happens, but, you know, it’s possible, and in this situation you can’t turn on compiler collectives, because you’re just going to deadlock when one of the compiled regions is trying to talk to the other ones. Fortunately the deadlock is pretty obvious, because, you know, when your job deadlocks and you go look at the stacks (I really hope you do have the ability to look at all the stacks when you’re debugging deadlocks; like, you know, a basic capability that you should have when doing distributed training), you just look at the stacks, you’ll see someone was blocked in a compiler collective, and you’ll be like, okay, yeah, I guess that’s what happened. But it does, you know, give me a little trouble figuring out how I’m going to roll this out, because right now in nightlies it’s a configuration option, it’s not on by default. I actually want this to be on for most of the jobs we’re running, but it’s going to be a little bit of work to figure out how to roll it out. Okay, so that’s
Okay, so that’s basically it behind compiler collectives. The original PR is actually very simple, and I had to fix some bugs because, you know, there are some funny interactions, but it basically worked pretty well. But I do kind of wonder, you know, if this is the right approach, and there are a bunch of other approaches that we thought about, which I just want to talk about briefly here, because they are kind of interesting alternate approaches.

So one of the other ideas that I had was, hey, you know, why don’t you just mark the input in question as dynamic, so I don’t have to go through all of this rigmarole of doing compiler collectives to talk to each other and figure it out? And sometimes I think this is exactly what you should do. But in this particular model, it’s not a single graph that’s getting compiled; it’s actually ten subgraphs, five of which have non-trivial graph content, and the particular graph that is getting recompiled is embedded in the middle of this, you know, opaque model that I don’t really know what it is. Actually, due to some vagaries in our environment, I can’t even edit it directly; it’s produced as a side effect of some other compilation process. Yeah, this is kind of a crazy thing to do, but that’s just the place we are for this particular model. So it’s not obvious where to put the mark_dynamic, because where to put it depends on, you know, where the graph breaks that Dynamo decided to put were, and in general that’s not well defined.
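For the simple case where you do know which input it is, marking the dimension dynamic up front looks something like this (torch._dynamo.mark_dynamic is a real API; the toy function is just for illustration):

import torch
from torch._dynamo import mark_dynamic

@torch.compile
def f(x):
    return x @ x.t()

x = torch.randn(32, 128)
# Declare dim 0 dynamic before the first compile, so there is no
# static-shape first compile and no shape-driven recompile later.
mark_dynamic(x, 0)
print(f(x).shape)

The catch described here is that in a model with many Dynamo-chosen graph breaks, it is not obvious which tensor, at which graph boundary, actually needs this annotation.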
Another idea I’ve had in the past is about automatic dynamic itself: this thing where we have to run it once and then run it again to figure things out has always kind of bugged me. Like, can’t we just record this somehow and then, the next time around, just do the right thing? Seems pretty reasonable, right? And if you imagine some sort of profile-guided optimization setup: the way profile-guided optimizers in traditional compilers work is, you run your program, you get a profile, you put it up somewhere, and then the compiler uses it to optimize your code. If the code changes, then the profile might be a little bit stale or out of date and your compilation might not be as good, but as long as you refresh the profile, things will be good again. So, yeah, it’s kind of operationally complicated to run, and maybe we still want to do this, but I got talked out of it, so I don’t know; it’s not really what I’m going to do.

Another idea is, hey, what’s the point of forcing every compiler to compile at the same time? Don’t you really want just one compiler to compile everything and then, you know, send it everywhere? If you’re really doing SPMD, then it’s really going to be the same thing everywhere. And it’s true, that would be pretty nice. There are some problems, though. One of the problems is that you don’t have an obvious artifact without running Dynamo, because when you have a bunch of graph breaks, once again, Dynamo is calling the shots about where the graph breaks are. So you can imagine some sort of record-replay setup, where you record a Dynamo execution and then replay it on subsequent runs, and that’s exactly what you want. That would certainly work, but it’s operationally awkward: you still have to run the entire training script to actually get the recording, which is your, you know, quote-unquote compile product; there’s no offline compilation process. Actually, one of the big problems with trying to make PyTorch 2 more ahead-of-time is that it really leans into eager mode. A lot of the time, we are solving a lot of problems by just being able to assume that we’re actually running the model with real data; this solves a lot of problems. So if you take that away, if you’re trying to do a full export workflow, things get a lot harder in a lot of aspects.

But, you know, this idea of compiling in only one place is kind of a good idea, and we’re kind of talking about it for some of the easier regimes. So, for example, once you get to Inductor, in some sense Inductor compilation is much more well behaved than Dynamo, because what’s Inductor? Well, it takes in an FX graph and a big pile of config options, and then produces a bunch of Triton kernels that you want to actually run. So this is actually good old-fashioned compiler input and output.
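To see the generated kernels being described here, you can run a compiled function with the Inductor backend and ask PyTorch to log the code it produces; the tiny function below is just a placeholder:

# Run as: TORCH_LOGS="output_code" python example.py
# Inductor logs the code it generates for the FX graph it receives:
# Triton kernels on GPU, C++ kernels on CPU.
import torch

@torch.compile(backend="inductor")
def f(x, y):
    return (x + y).relu().sum()

print(f(torch.randn(1024), torch.randn(1024)))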
And so you could very much imagine just going ahead and saying: do the compilation of this graph somewhere else, right, on some other service, where if all the ranks ask that service for the same compiler result, you can notice that they’re the same, batch them into a single request, compile it, and return the result to everyone. We call this remote execution for Inductor. We actually talked about this at the most recent composability sync, so, you know, go check that out if you’re more interested. So, yeah, maybe we’ll have that in the future. We’re still kind of fighting fires with our existing cache deployment, and caching is in some ways easier and in some ways harder; it shares a lot of similarities with remote execution, in that you have to solve a lot of the same problems. So we’re still working on getting caching under control. That’s kind of where we are right now.

All right, so that’s everything I wanted to talk about with compiler collectives. Talk to you next time.

Compiler-collectives
Ref
Running Whisper on an M1 Max to transcribe audio data — Dag-Inge Aas