
RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)



Hi,

Actually the limit when unrolling the code is 3 clocks/word.

Moving 12 words (48 bytes), with the source pointer in r5 and the destination pointer in r6, can be done like this:

lw r4,r5,0
lw r7,r5,4
lw r8,r5,8
sw r4,r6,0
sw r7,r6,4
sw r8,r6,8
lw r4,r5,12
lw r7,r5,16
lw r8,r5,20
sw r4,r6,12
sw r7,r6,16
sw r8,r6,20
lw r4,r5,24
lw r7,r5,28
lw r8,r5,32
sw r4,r6,24
sw r7,r6,28
sw r8,r6,32
lw r4,r5,36
lw r7,r5,40
lw r8,r5,44
sw r4,r6,36
sw r7,r6,40
sw r8,r6,44

The above code takes 12*1 + 12*2 = 36 clock cycles (12 loads at 1 cycle each, 12 stores at 2 cycles each), which is 3 clocks/word.
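For readers who think in C, the grouping used above (three loads, then three stores) can be sketched as follows. This is only an illustration: the function name copy12_words is hypothetical, it assumes separate word-aligned source and destination buffers, and a C compiler is free to schedule the loads and stores differently from the hand-written assembler.

```c
#include <stdint.h>

/* Hypothetical helper, not from the original thread: copy 12 aligned
 * words in groups of three loads followed by three stores, so that no
 * store uses a value produced by the load immediately before it. */
void copy12_words(uint32_t *dst, const uint32_t *src)
{
    for (int i = 0; i < 12; i += 3) {
        uint32_t a = src[i];        /* three loads ...  */
        uint32_t b = src[i + 1];
        uint32_t c = src[i + 2];
        dst[i]     = a;             /* ... then three stores */
        dst[i + 1] = b;
        dst[i + 2] = c;
    }
}
```

The three temporaries play the role of r4, r7 and r8 in the listing.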

Göran

-----Original Message-----
From: Goran Bilski 
Sent: Wednesday, April 30, 2008 10:23 AM
To: 'microblaze-uclinux@xxxxxxxxxxxxxx'
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)

Hi,

When I want to do the fastest memcopy with MicroBlaze and the addresses are aligned, I use the offset as the loop index and move data backwards.
This will remove one instruction from the loop.

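loop:
   lw    r4,r6,r10      # load word from source (r6) at offset r10
   sw    r4,r5,r10      # store it to destination (r5) at the same offset
   bneid r10,loop       # branch while the offset just copied was non-zero
   addik r10,r10,-4     # delay slot: step the offset back one word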


The above code does not take pipeline stalls into account (they differ between the area version and the performance version of MicroBlaze).

I guess the main target for Linux is the performance version with caches.
The above code would take (assuming 100% cache hits) 1+4+2+1 = 8 clock cycles/loop.
The major drawback here is the lw-sw combination, since the sw stores the register that the lw has just loaded.
The forwarding of this value goes from the WB pipe stage to the OF pipe stage, which means that we will have 2 stall cycles in EX and MEM.
(sw takes 2 clock cycles when caches are used).

We can now rearrange the code a little to minimize the stalls in the pipeline:

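loop:
    addik r10,r10,-4    # step the offset back one word
    lw    r4,r6,r10     # load word from source (r6)
    bneid r10,loop      # branch unless this was offset 0
    sw    r4,r5,r10     # delay slot: store to destination (r5)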

We now need to enter the loop with r10 set 4 higher than before, since the first instruction in the loop subtracts 4 from r10.
The above code should use 6 clock cycles/loop.
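As a rough C equivalent of the idea (one byte offset serves as both loop counter and index, and the data moves from the highest word down), something like the sketch below could be used. The name copy_words_backward is made up for illustration; it assumes both buffers are word-aligned, do not overlap, and that nbytes is a multiple of 4. Unlike the assembler loops, this version also tolerates a length of zero.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not from the original thread: copy nbytes
 * (assumed to be a multiple of 4) between word-aligned buffers,
 * highest word first, using the byte offset as the only loop variable. */
void copy_words_backward(void *dst, const void *src, size_t nbytes)
{
    uint32_t *d = dst;
    const uint32_t *s = src;
    size_t off = nbytes;            /* corresponds to r10 */

    while (off != 0) {
        off -= 4;                   /* addik r10,r10,-4 */
        d[off / 4] = s[off / 4];    /* lw then sw at the same offset */
    }
}
```

Because the offset counts down to zero, the loop test needs no separate compare against an end address, which is where the saved instruction comes from.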

But maybe you could add another case in your memcopy to unroll the loop.
If you know that the number of bytes to move is a multiple of 8, you can do this instead.

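loop:
    lw    r4,r5,0       # load two words from source (r5)
    lw    r7,r5,4
    addik r5,r5,8       # advance source pointer
    sw    r4,r6,0       # store two words to destination (r6)
    sw    r7,r6,4
    addik r6,r6,8       # advance destination pointer
    bneid r10,loop      # branch while r10 is non-zero
    addik r10,r10,-8    # delay slot: count down 8 bytes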

This loop takes 1+1+1+2+2+1+2+1 = 11 clock cycles/loop which is 5.5 cycles/word.
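The cycle arithmetic in this thread is simple enough to check mechanically. The sketch below sums the per-instruction cycle counts quoted above and divides by the number of words moved per iteration. The helper names and the two arrays are illustrative only; the mapping of each number to an instruction (for example, 4 cycles for the stalled sw in the first loop) is an interpretation of the figures given, not something measured.

```c
#include <stddef.h>

/* Per-instruction cycle counts as quoted in the message (assumed order):
 * first loop:    lw, sw (including 2 stall cycles), bneid, addik
 * unrolled loop: lw, lw, addik, sw, sw, addik, bneid, addik */
const unsigned simple_loop[]   = {1, 4, 2, 1};
const unsigned unrolled_loop[] = {1, 1, 1, 2, 2, 1, 2, 1};

/* Sum the cycle counts of one loop iteration. */
unsigned loop_cycles(const unsigned *cost, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += cost[i];
    return total;
}

/* Cycles per word moved, given cycles and words per iteration. */
double cycles_per_word(unsigned cycles, unsigned words)
{
    return (double)cycles / (double)words;
}
```

With these figures, loop_cycles(simple_loop, 4) gives 8 and loop_cycles(unrolled_loop, 8) gives 11, so the two-word loop works out to cycles_per_word(11, 2) = 5.5.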

You can unroll as many times as you want, but you can never get better than 4 cycles/word.


Göran Bilski

-----Original Message-----
From: owner-microblaze-uclinux@xxxxxxxxxxxxxx [mailto:owner-microblaze-uclinux@xxxxxxxxxxxxxx] On Behalf Of Jim Law
Sent: Tuesday, April 29, 2008 10:01 PM
To: microblaze-uclinux@xxxxxxxxxxxxxx
Subject: Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)

Hi again,

Just heading out the door here and might not be available for a couple of 
days, so thought I'd pass this along, untested, for your consideration.

I wrote a move_halfword_higher() function in assembler, but didn't have the 
time to test it.  If my assumptions were correct about the intent of the 
memmove() in adapter.c, then this should drop in and do a quicker move.  It 
assumes a bunch of stuff, so both the overhead and transfer loops should be 
quicker than the memmove it replaces, but only for this specific case. 
Inner loop is 12 cycles per word to move, only 11 cycles of overhead on top 
of that.

If I didn't get it right, I'll test and fix it on my return.

Jim Law
Iris Power

----- Original Message ----- 
From: "Brettschneider Falk" <fbrettschneider@xxxxxxxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Tuesday, April 29, 2008 1:10 PM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff 
*orig_skb,structnet_device*dev)


> Hi,
>
> Jim Law wrote:
>> Here is a new assembler memcpy for the MB.
> cool...trying that...stay tuned...
>
>> ... not sure what else I can add at this point. Is there any other
>> significant transfers or checksums that use other functions
>> that might benefit?
> drivers/net/xilinx_emac/adapter.c has that ugly memmove() in 
> FifoRecvHandler(). Maybe memmove() can also be faster in assembler... ;-)
>
> Cheers, Falk
>
> ___________________________
> microblaze-uclinux mailing list
> microblaze-uclinux@xxxxxxxxxxxxxx
> Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
> Mailing List Archive : 
> http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/
> 

