On Sun, 22 Jan 2012 22:48:17 -0800, Massimo Becker <mbecker16(a)gmail.com> wrote:
> Hi,
>
> I am trying to use PyCUDA on OSX 10.7. I have read the installation
> information on the wiki but I am a bit confused and any help with this
> is greatly appreciated.
>
> The documentation states that after version 0.94 the Boost C++
> libraries are no longer needed. Does this mean that the Boost.Python
> libraries are no longer needed as well?
Yes, no need for any of that.
> Under step 3: Build PyCUDA, the config file has the following lines
> for Snow Leopard. Can these be used with 10.7 Lion by changing lines
> 10 and 11 to reflect /Developer/SDKs/MacOSX10.7.sdk ?
I believe so, but the 10.6 SDK should also work on 10.7.
> 7 CXXFLAGS = ["-arch", "x86_64", "-arch", "i386"]
> 8 LDFLAGS = ["-arch", "x86_64", "-arch", "i386"]
> 9
> 10 CXXFLAGS.extend(['-isysroot', '/Developer/SDKs/MacOSX10.6.sdk'])
> 11 LDFLAGS.extend(['-isysroot', '/Developer/SDKs/MacOSX10.6.sdk'])
>
> OS X 10.7 ships with Python 2.7.1. Will this version work with the
> listed dependencies and PyCUDA?
Sure.
> OS X 10.7 uses the following compiler for gcc (this is the output
> when I type gcc):
> i686-apple-darwin11-llvm-gcc-4.2
> Is gcc needed for installation, and if so, will this compiler work?
I think that's fine.
Can you please update the wiki when you're done building?
Thanks,
Andreas
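[Editor's note: putting those answers together, here is a sketch of the relevant siteconf.py fragment for 10.7. The flags and SDK path are the ones from the question; per the answer above, the 10.6 sysroot should also work on Lion.]

# siteconf.py fragment (sketch) for building PyCUDA on OS X 10.7,
# reusing the Snow Leopard settings quoted above.
CXXFLAGS = ["-arch", "x86_64", "-arch", "i386"]
LDFLAGS = ["-arch", "x86_64", "-arch", "i386"]

# Point the compiler and linker at an installed SDK; the 10.6 SDK
# is reported to work on 10.7 as well.
CXXFLAGS.extend(["-isysroot", "/Developer/SDKs/MacOSX10.6.sdk"])
LDFLAGS.extend(["-isysroot", "/Developer/SDKs/MacOSX10.6.sdk"])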
Hi,
I am trying to use PyCUDA on OSX 10.7. I have read the installation
information on the wiki but I am a bit confused and any help with this
is greatly appreciated.
The documentation states that after version 0.94 the Boost C++
libraries are no longer needed. Does this mean that the Boost.Python
libraries are no longer needed as well?
Under step 3: Build PyCUDA, the config file has the following lines
for Snow Leopard. Can these be used with 10.7 Lion by changing lines
10 and 11 to reflect /Developer/SDKs/MacOSX10.7.sdk ?
7 CXXFLAGS = ["-arch", "x86_64", "-arch", "i386"]
8 LDFLAGS = ["-arch", "x86_64", "-arch", "i386"]
9
10 CXXFLAGS.extend(['-isysroot', '/Developer/SDKs/MacOSX10.6.sdk'])
11 LDFLAGS.extend(['-isysroot', '/Developer/SDKs/MacOSX10.6.sdk'])
OS X 10.7 ships with Python 2.7.1. Will this version work with the
listed dependencies and PyCUDA?
OS X 10.7 uses the following compiler for gcc (this is the output
when I type gcc):
i686-apple-darwin11-llvm-gcc-4.2
Is gcc needed for installation, and if so, will this compiler work?
If anyone can answer these questions, I would be happy to update the
wiki for OSX 10.7 installation steps.
Thank you,
Max Becker
--
Respectfully,
Massimo 'Max' J. Becker
Computer Scientist / Software Engineer
Commercial Pilot - SEL/MEL
(425)-239-1710
When you use the runtime API, you are using the C compiler to generate host code to populate the arrays. The compiler knows the sizeof() the array elements, and the individual offsets for each field within a struct, because it is responsible for laying out the data structure in memory. Part of what the compiler has to know is what the alignment restrictions are for the target architecture.

In the case of the runtime CUDA API and nvcc, the host and device compilers are in agreement about what these rules are, so when data is copied from host to device by cudaMemcpy, the layout does not need to be altered in any way. The compiler, on both sides, also ensures that pointer arithmetic uses the right offsets, which also requires knowing the sizeof() the data type the pointer points at. No explicit padding is ever required.
However, in PyCUDA, the Python side of things does not directly have access to the information in the C compiler, in particular the sizeof() operator or struct member offsets (which are computed at compile time). The pycuda.characterize module provides a reimplementation of sizeof(), so you can be sure to allocate the right amount of memory and know the memory interval between array elements. However, it does not compute the offsets of individual struct members, so you either have to compute those manually, the same way the compiler would, or follow the convention mentioned below of always ordering struct members by descending size. Inserting padding fields in a struct is just a helpful way to remind yourself where the gaps are in the struct layout, so that you mimic them in your Python code. They are not required if you are manually putting in correct offsets, using your knowledge of what the C compiler will do.
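For instance, here is a minimal sketch of querying the device-side size of a struct and mirroring its layout in numpy with the gap made explicit. The struct is the tutorial's; passing the struct definition through sizeof()'s preamble argument is my assumption about the call, and the exact numbers assume a 64-bit platform:

import numpy as np
import pycuda.autoinit
from pycuda.characterize import sizeof

# The tutorial's struct: a 4-byte int followed by an 8-byte pointer.
preamble = "struct DoubleOperation { int datalen; float *ptr; };"

# Size as the device compiler lays it out: 16 on a 64-bit system,
# not the 4 + 8 = 12 that a naive sum of member sizes suggests.
device_size = sizeof("DoubleOperation", preamble)

# Host-side mirror: make the compiler's gap explicit so the numpy
# layout matches the device layout byte for byte.
struct_dtype = np.dtype([
    ("datalen", np.int32),
    ("__padding", np.int32),  # fills the hole before the 8-byte pointer
    ("ptr", np.uintp),        # device pointer stored as an integer
])
assert struct_dtype.itemsize == device_size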
The two cases you compare (Vec and DoubleOperation) are actually structs with different sizes, and the only reason they differ is the alignment requirement of the most strictly aligned member. In one case, that member is a pointer, which needs 8-byte alignment; in the other, every member is a float, for which 4-byte alignment is sufficient. You can see how this works on the host using the following C++ program:
// alignment.cxx
#include <iostream>

struct threefloat {
    float x, y, z;
};

struct pointerint {
    float *a;
    int n;
};

int main() {
    std::cout << "sizeof(int) = " << sizeof(int) << std::endl;
    std::cout << "sizeof(float) = " << sizeof(float) << std::endl;
    std::cout << "sizeof(float*) = " << sizeof(float*) << std::endl;
    std::cout << "sizeof(threefloat) = " << sizeof(threefloat) << std::endl;
    std::cout << "sizeof(pointerint) = " << sizeof(pointerint) << std::endl;
    return 0;
}
On my 64-bit Linux system, the output of this program is:
sizeof(int) = 4
sizeof(float) = 4
sizeof(float*) = 8
sizeof(threefloat) = 12
sizeof(pointerint) = 16
Basically, I've found that I stay out of alignment trouble in PyCUDA by following these two rules (illustrated in the sketch below):
1. Arrange the members of a struct in descending size order.
2. Always use pycuda.characterize.sizeof() to compute the size of a struct.
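A short numpy illustration of both rules; the dtype is my own stand-in for the Vec struct from this thread, and the 12/16 numbers assume a 64-bit platform:

import numpy as np

# Rule 1 in numpy terms: the 8-byte pointer first, the 4-byte int
# second, so no interior padding is needed.
vec_dtype = np.dtype([
    ("data", np.uintp),  # float* on the device
    ("len", np.int32),
])

# Rule 2 still matters: itemsize is 12 here, but the device-side
# sizeof(Vec) is 16 because of tail padding (compare the
# sizeof(pointerint) output above), so per-element allocation should
# use pycuda.characterize.sizeof(), not this naive itemsize.
print(vec_dtype.itemsize)  # 12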
On Jan 21, 2012, at 2:06 PM, Yifei Li wrote:
> 1) The example fails to work without padding, the second block prints the wrong 'datalen'.
> However, if I use CUDA runtime API instead of pycuda, the result is correct even without the padding.
>
> The trick suggested by Stanley works. So let me answer this question myself: the C compiler automatically figures out that the size of the struct should be 16 instead of 12, which matches the size of the struct on the device.
>
> Actually, I found that if the program is written using CUDA runtime API, you don't need to worry about alignment at all. For example, I tried several structs with different sizes, and the values of the fields in a struct are always printed correctly.
>
> Why is that? This seems to contradict the following (quoted from Chapter 5 of the CUDA 4.0 programming guide):
>
> A typical case where this might be easily overlooked is when using some custom global memory allocation scheme, whereby the allocations of multiple arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by the allocation of a single large block of memory partitioned into multiple arrays, in which case the starting address of each array is offset from the block's starting address.
>
>
> 2) Since the size of the struct without padding is 12 bytes, I tried a different struct but of the same size:
> struct DoubleOperation {
>     float x;
>     float y;
>     float z;
> };
> And the kernel function is changed to
> __global__ void test(DoubleOperation *a) {
>     a = &a[blockIdx.x];
>     printf("block %d: %f %f %f\n", blockIdx.x, a->x, a->y, a->z);
> }
> But this time the values of x, y and z are printed correctly by both blocks. So why does it work even though the struct's size is the same as before?
>
> However, I still don't have an answer for this.
>> 0) what are you trying to do?
>>
> I'm trying to see if the struct example in the tutorial still works
> without padding. And following is the kernel function I use:
>
> __global__ void test(DoubleOperation *a) {
>     a = &a[blockIdx.x];
>     printf("block %d: %d\n", blockIdx.x, a->datalen);
> }
>
> The kernel is launched using 2 blocks, each of which has one thread.
>
>
>
>> a) what is the problem?
>>
> I actually have two questions.
>
> 1) The example fails to work without padding, the second block prints the
> wrong 'datalen'.
> However, if I use CUDA runtime API instead of pycuda, the result is
> correct even without the padding.
>
The trick suggested by Stanley works. So let me answer this question
myself: the C compiler automatically figures out that the size of the
struct should be 16 instead of 12, which matches the size of the struct
on the device.
Actually, I found that if the program is written using CUDA runtime API,
you don't need to worry about alignment at all. For example, I tried
several structs with different sizes, and the values of the fields in a
struct are always printed correctly.
Why is that? This seems to contradict the following (quoted from
Chapter 5 of the CUDA 4.0 programming guide):

A typical case where this might be easily overlooked is when using some
custom global memory allocation scheme, whereby the allocations of multiple
arrays (with multiple calls to cudaMalloc() or cuMemAlloc()) is replaced by
the allocation of a single large block of memory partitioned into multiple
arrays, in which case the starting address of each array is offset from the
block's starting address.
> 2) Since the size of the struct without padding is 12 bytes, I tried a
> different struct but of the same size:
> struct DoubleOperation {
>     float x;
>     float y;
>     float z;
> };
> And the kernel function is changed to
> __global__ void test(DoubleOperation *a) {
>     a = &a[blockIdx.x];
>     printf("block %d: %f %f %f\n", blockIdx.x, a->x, a->y, a->z);
> }
> But this time the values of x, y and z are printed correctly by both
> blocks. So why does it work even though the struct's size is the same as
> before?
>
However, I still don't have an answer for this.
On Sat, 21 Jan 2012 11:24:33 -0500, Yifei Li <yifli82(a)gmail.com> wrote:
> On Fri, Jan 20, 2012 at 9:58 PM, Anthony LaTorre <tlatorre9(a)gmail.com> wrote:
>
> > add the padding field!
>
> I know this will work. But without the padding field, the same program
> written in C does not have any issue.
I'm really not sure what you're trying to do here. Can you explain
properly:
0) what are you trying to do?
a) what is the problem?
b) what is this mysterious 'C program' you keep referring to? CUDA C?
c) what have you tried?
d) what were the outcomes?
Andreas
On Sat, Jan 21, 2012 at 11:29 AM, Stanley Seibert <stan(a)mtrr.org> wrote:
> I just replied to you over at the CUDA forum:
>
>
> http://forums.nvidia.com/index.php?showtopic=221189&st=0&gopid=1358769&#ent…
>
> The size of Vec is not 12 bytes, but 16 bytes.
> pycuda.characterize.sizeof() will let you compute this automatically.
>
> (Is there any documentation of the pycuda.characterize module? I can't
> find any...)
>
Thanks. I'll try your suggestion.
>
> On Jan 21, 2012, at 11:24 AM, Yifei Li wrote:
>
> >
> >
> > On Fri, Jan 20, 2012 at 9:58 PM, Anthony LaTorre <tlatorre9(a)gmail.com>
> wrote:
> > add the padding field!
> > I know this will work. But without the padding field, the same program
> written in C does not have any issue.
> >
> > or you can just flip the order, i.e.
> >
> > struct Vec
> > {
> >     float *data;
> >     int len;
> > };
> >
> > I tried this trick already.
> >
> > Any further ideas?
On Fri, Jan 20, 2012 at 9:58 PM, Anthony LaTorre <tlatorre9(a)gmail.com> wrote:
> add the padding field!
I know this will work. But without the padding field, the same program
written in C does not have any issue.
> or you can just flip the order, i.e.
>
> struct Vec
> {
>     float *data;
>     int len;
> };
>
I tried this trick already.
Any further ideas?
Hello, I am a bit confused about measuring time, so I need a little help.
I have code like:
....
Rs_gpu=gpuarray.to_gpu(np.random.rand(numPointsRs*3).astype(np.float32))
Rp_gpu=gpuarray.to_gpu(np.random.rand(3).astype(np.float32))
....
start = drv.Event()
end = drv.Event()
mod = SourceModule("""
__global__ void compute(float *Rs_mat, ...., float *Rp,.)
""")
# call the function (kernel)
func = mod.get_function("compute")
start.record() # start timing
func(Rs_gpu,..Rp_gpu...)
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
#----- get data back from GPU-----
Rs=Rs_gpu.get()
Rp=Rp_gpu.get()
print "%s, %fsec, %s" % ('Time for Rs = ',secs,str(Rs))
print "%s, %fsec, %s" % ('Time for Rp = ',secs,str(Rp)) //here i am
computing the same thing!
My questions are:
1) Is this the correct way to measure the GPU time?
2) How can I distinguish the results for Rs and for Rp (if that can be done)?
Thanks!
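[Editor's note: for reference, a minimal self-contained sketch of one way to get separate timings for the two results: give each launch its own event pair. The scale kernel, array sizes, and names below are stand-ins, not the code from the question.]

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
import pycuda.gpuarray as gpuarray
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void scale(float *a)
{
    a[threadIdx.x] *= 2.0f;
}
""")
scale = mod.get_function("scale")

Rs_gpu = gpuarray.to_gpu(np.random.rand(256).astype(np.float32))
Rp_gpu = gpuarray.to_gpu(np.random.rand(256).astype(np.float32))

def timed_launch(arg):
    start, end = drv.Event(), drv.Event()
    start.record()
    scale(arg, block=(256, 1, 1), grid=(1, 1))
    end.record()
    end.synchronize()  # wait for the kernel before reading the timing
    return start.time_till(end) * 1e-3  # time_till() reports milliseconds

print "Time for Rs = %fsec" % timed_launch(Rs_gpu)
print "Time for Rp = %fsec" % timed_launch(Rp_gpu)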
Add the padding field! Or you can just flip the order, i.e.
struct Vec
{
    float *data;
    int len;
};
On Fri, Jan 20, 2012 at 12:29 PM, Yifei Li <yifli82(a)gmail.com> wrote:
> Quoted from Chapter 5 of the CUDA 4.0 programming guide, which may be
> relevant.
>
> "Reading non-naturally aligned 8-byte or 16-byte words produces incorrect
> results"
>
> But I still don't know how to fix the problem
>
> On Wed, Jan 18, 2012 at 3:01 AM, Andreas Kloeckner <
> lists(a)informa.tiker.net> wrote:
>
>> On Tue, 17 Jan 2012 16:55:22 -0500, Yifei Li <yifli82(a)gmail.com> wrote:
>> > Hi all,
>> >
>> > I modified the example
>> > http://documen.tician.de/pycuda/tutorial.html#advanced-topics by
>> removing
>> > the '__padding' from the structure definition and got an incorrect result.
>> > The kernel is launched with 2 blocks and one thread in each block.
>> >
>> > Each thread prints the 'len' field in structure, which should be 3 for
>> > block 0 and 2 for block 1. However, the result I got is:
>> >
>> > block 1: 2097664
>> > block 0: 3
>> >
>> > No such problem if I write the following program using C. Any help is
>> > appreciated.
>>
>> It seems CUDA doesn't automatically align the pointer, without being
>> told to?
>>
>> https://en.wikipedia.org/wiki/Data_structure_alignment
>>
>> Andreas
On Tue, 17 Jan 2012 16:55:22 -0500, Yifei Li <yifli82(a)gmail.com> wrote:
> Hi all,
>
> I modified the example
> http://documen.tician.de/pycuda/tutorial.html#advanced-topics by removing
> the '__padding' from the structure definition and got an incorrect result.
> The kernel is launched with 2 blocks and one thread in each block.
>
> Each thread prints the 'len' field in structure, which should be 3 for
> block 0 and 2 for block 1. However, the result I got is:
>
> block 1: 2097664
> block 0: 3
>
> No such problem if I write the following program using C. Any help is
> appreciated.
It seems CUDA doesn't automatically align the pointer, without being
told to?
https://en.wikipedia.org/wiki/Data_structure_alignment
Andreas
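[Editor's note: to round the thread off, a minimal self-contained sketch of the tutorial example with the padding kept in, so the host-side numpy layout mirrors the device layout. The struct and kernel follow the tutorial; the array contents and block/grid shapes are illustrative.]

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

# Host-side mirror of the struct; the explicit padding keeps the
# pointer on an 8-byte boundary, matching the device layout.
struct_dtype = np.dtype([
    ("datalen", np.int32),
    ("__padding", np.int32),
    ("ptr", np.uintp),
])

mod = SourceModule("""
struct DoubleOperation {
    int datalen, __padding;  // keeps ptr 8-byte aligned
    float *ptr;
};

__global__ void test(DoubleOperation *a)
{
    a = &a[blockIdx.x];
    printf("block %d: %d\\n", blockIdx.x, a->datalen);
}
""")

# Two elements, as in the thread: datalen 3 for block 0, 2 for block 1.
arrays = [np.zeros(3, np.float32), np.zeros(2, np.float32)]
structs = np.zeros(len(arrays), dtype=struct_dtype)
bufs = []  # keep the device allocations alive until after the launch
for i, arr in enumerate(arrays):
    buf = drv.to_device(arr)
    bufs.append(buf)
    structs[i]["datalen"] = len(arr)
    structs[i]["ptr"] = int(buf)

structs_gpu = drv.to_device(structs)
mod.get_function("test")(structs_gpu, block=(1, 1, 1), grid=(2, 1))
drv.Context.synchronize()  # let the device-side printf output flush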