I'd like to use multiple devices at the same time, not necessarily all
in the same context. What is the best way to do this? Is
it OK to use a Python thread for each device and handle queue and
kernel calls separately within each thread?
Any scheme you come up with should be fine. In CUDA, it's common to use
one thread per context, since the current context is per-thread state. CL
has no such restriction: the context and queue are passed explicitly to
each call, so even driving multiple devices from a single thread should
be A-OK.
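
As a rough illustration of the single-thread approach, here is a minimal sketch that gives every device its own context and queue, enqueues the same kernel on each, and only then collects the results. The kernel `twice`, the function name `run_on_all_devices`, and the one-context-per-device layout are assumptions made for the example, not a recommended structure; pyopencl is imported inside the function so the module still loads on a machine without OpenCL.

```python
import numpy as np

# Hypothetical example kernel: doubles every element in place.
KERNEL_SRC = """
__kernel void twice(__global float *a)
{
    int i = get_global_id(0);
    a[i] = 2.0f * a[i];
}
"""


def run_on_all_devices(data):
    """Double `data` on every available device from the calling thread.

    Returns a list of (device name, result array) pairs.
    """
    import pyopencl as cl  # lazy import: only needed when actually run

    data = np.ascontiguousarray(data, dtype=np.float32)
    mf = cl.mem_flags
    pending = []

    # Pass 1: set up each device and enqueue its work without waiting.
    for platform in cl.get_platforms():
        for device in platform.get_devices():
            ctx = cl.Context([device])      # one context per device
            queue = cl.CommandQueue(ctx)    # one queue per device
            prg = cl.Program(ctx, KERNEL_SRC).build()
            buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR,
                            hostbuf=data)
            prg.twice(queue, data.shape, None, buf)
            pending.append((device.name, queue, buf))

    # Pass 2: read back from each queue; the copy waits for the kernel.
    results = []
    for name, queue, buf in pending:
        out = np.empty_like(data)
        cl.enqueue_copy(queue, out, buf)
        queue.finish()
        results.append((name, out))
    return results
```

Calling `run_on_all_devices(np.arange(8))` should give one doubled array per device. A thread-per-device version would move the body of the inner loop into a worker function; nothing else needs to change, since no state here is tied to the calling thread.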