RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)
Hi,
When I want to do the fastest memcopy with MicroBlaze and the addresses are aligned, I use the offset as the loop index and move data backwards.
This will remove one instruction from the loop.
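loop:
lw r4,r6,r10        /* load word from source base r6 + offset r10 */
sw r4,r5,r10        /* store it to destination base r5 + offset r10 */
bneid r10,loop      /* branch while the offset is non-zero (tested before the delay slot runs) */
addik r10,r10,-4    /* delay slot: step the offset down one word */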
The above code does not take pipeline stalls into account (they differ between the area version and the performance version).
I guess the main target for Linux is the performance version with caches.
The above code would take (100% cache hit assumed) 1+4+2+1 = 8 clock cycles/loop.
The major drawback here is the lw-sw combination, since the sw stores the register that the lw has just loaded.
This value is forwarded from the WB pipe stage to the OF pipe stage, which means that we get 2 stall cycles while the lw passes through EX and MEM.
(sw takes 2 clock cycles when caches are used).
We can now rearrange the code a little to minimize the stalls in the pipeline:
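loop:
addik r10,r10,-4    /* step the offset down one word first */
lw r4,r6,r10        /* load word from source base r6 + offset r10 */
bneid r10,loop      /* branch while the offset is non-zero */
sw r4,r5,r10        /* delay slot: store to destination base r5 + offset r10 */
We now need to enter the loop with r10+4, since the first instruction in the loop subtracts 4 from r10.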
The above code should use 6 clock cycles/loop.
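For readers who prefer C, the offset-indexed backward copy described above can be sketched roughly as follows. This is only an illustration, not code from this thread: the name copy_words_backward is hypothetical, and it assumes word-aligned, non-overlapping buffers and a byte count that is a multiple of 4. Unlike the assembler loop, which always runs at least once, this version also accepts a count of 0.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (assumption, not from the thread).
 * Copies nbytes from src to dst one 32-bit word at a time, walking from
 * the highest offset down to 0 and using the byte offset as the counter.
 * Assumes aligned, non-overlapping buffers and nbytes % 4 == 0.
 */
static void copy_words_backward(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    size_t off = nbytes;              /* enter with offset + 4, as in the rearranged loop */

    while (off != 0) {
        off -= 4;                     /* addik r10,r10,-4 */
        dst[off / 4] = src[off / 4];  /* lw r4,r6,r10 then sw r4,r5,r10 */
    }
}
```

Whether a compiler turns this into the same four-instruction loop depends on the compiler and flags, so the hand-written assembler may still be needed to get the cycle counts discussed here.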
But maybe you could add another case in your memcopy to unroll the loop.
If you know that the number of bytes to move is a multiple of 8, you can do this instead.
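loop:
lw r4,r5,0          /* load first word (here r5 is the source pointer) */
lw r7,r5,4          /* load second word */
addik r5,r5,8       /* advance source pointer by two words */
sw r4,r6,0          /* store first word (r6 is the destination pointer) */
sw r7,r6,4          /* store second word */
addik r6,r6,8       /* advance destination pointer by two words */
bneid r10,loop      /* branch while the remaining count is non-zero */
addik r10,r10,-8    /* delay slot: two words fewer to go */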
This loop takes 1+1+1+2+2+1+2+1 = 11 clock cycles/loop which is 5.5 cycles/word.
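In C terms, the two-word unrolled loop above corresponds roughly to the sketch below. It is an illustration only: the name copy_words_unrolled2 is hypothetical, and it assumes word-aligned, non-overlapping buffers and a byte count that is a multiple of 8. Unlike the assembler loop, it tests the count before the first iteration, so a count of 0 is allowed.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (assumption, not from the thread).
 * Copies nbytes from src to dst, two 32-bit words per iteration.
 * Assumes aligned, non-overlapping buffers and nbytes % 8 == 0.
 */
static void copy_words_unrolled2(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    size_t remaining = nbytes;

    while (remaining != 0) {
        uint32_t a = src[0];   /* lw r4,r5,0 */
        uint32_t b = src[1];   /* lw r7,r5,4 */
        src += 2;              /* addik r5,r5,8 */
        dst[0] = a;            /* sw r4,r6,0 */
        dst[1] = b;            /* sw r7,r6,4 */
        dst += 2;              /* addik r6,r6,8 */
        remaining -= 8;        /* counter decrement in the branch delay slot */
    }
}
```

Issuing both loads before the stores is the point of the unrolling: it puts other instructions between each lw and the sw that uses its result, which is what avoids the lw-sw stall described earlier.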
You can unroll as many times as you want, but you can never get better than 4 cycles/word.
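The cycle counts quoted in this message can be tallied with a few lines of C. The per-instruction costs (lw 1, sw 2, bneid 2, addik 1, with caches and a 100% hit rate) are read off the sums given above; the helper names are hypothetical, and the 1+1+2+2 split for the rearranged loop is an inference, since only its total of 6 is stated.

```c
/* Per-instruction costs as quoted in this message (caches, 100% hit). */
enum { CYC_LW = 1, CYC_SW = 2, CYC_ADDIK = 1, CYC_BNEID = 2 };

/* Hypothetical helper: cycles for one loop iteration, plus explicit stalls. */
static int loop_cycles(int n_lw, int n_sw, int n_addik, int n_bneid, int stalls)
{
    return n_lw * CYC_LW + n_sw * CYC_SW + n_addik * CYC_ADDIK
         + n_bneid * CYC_BNEID + stalls;
}

/* Hypothetical helper: cycles per 32-bit word moved in one iteration. */
static double cycles_per_word(int cycles, int words)
{
    return (double)cycles / words;
}
```

With these costs, loop_cycles(1, 1, 1, 1, 2) gives the 8 cycles of the first loop (the 4 in 1+4+2+1 is the 2-cycle sw plus 2 stall cycles), loop_cycles(1, 1, 1, 1, 0) gives the 6 cycles of the rearranged loop, and loop_cycles(2, 2, 3, 1, 0) gives the 11 cycles of the unrolled loop, or 5.5 cycles/word.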
Göran Bilski
-----Original Message-----
From: owner-microblaze-uclinux@xxxxxxxxxxxxxx [mailto:owner-microblaze-uclinux@xxxxxxxxxxxxxx] On Behalf Of Jim Law
Sent: Tuesday, April 29, 2008 10:01 PM
To: microblaze-uclinux@xxxxxxxxxxxxxx
Subject: Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)
Hi again,
Just heading out the door here and might not be available for a couple of
days, so thought I'd pass this along, untested, for your consideration.
I wrote a move_halfword_higher() function in assembler, but didn't have the
time to test it. If my assumptions were correct about the intent of the
memmove() in adapter.c, then this should drop in and do a quicker move. It
assumes a bunch of stuff, so both the overhead and transfer loops should be
quicker than the memmove it replaces, but only for this specific case.
Inner loop is 12 cycles per word to move, only 11 cycles of overhead on top
of that.
If I didn't get it right, I'll test and fix it on my return.
Jim Law
Iris Power
----- Original Message -----
From: "Brettschneider Falk" <fbrettschneider@xxxxxxxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Tuesday, April 29, 2008 1:10 PM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff
*orig_skb,structnet_device*dev)
> Hi,
>
> Jim Law wrote:
>> Here is a new assembler memcpy for the MB.
> cool...trying that...stay tuned...
>
>> ... not sure what else I can add at this point. Is there any other
>> significant transfers or checksums that use other functions
>> that might benefit?
> drivers/net/xilinx_emac/adapter.c has that ugly memmove() in
> FifoRecvHandler(). Maybe memmove() can also be faster in assembler... ;-)
>
> Cheers, Falk
>
> ___________________________
> microblaze-uclinux mailing list
> microblaze-uclinux@xxxxxxxxxxxxxx
> Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
> Mailing List Archive :
> http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/
>