
Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)



Hi Goran,

Thanks for the insight. I'm totally new to MB assembler, and was just going by the cycle values in the MB Ref Guide, so wasn't really aware of the pipe stall issue. The Guide lists both SW and LW as 2 cycles - I gather that is incorrect (if you can keep the pipe from stalling), according to your numbers below.

With memcpy, there is an underlying assumption that it will do an ascending copy - memmove uses it when the destination is lower than the source. We could re-jig the pair so that memcpy is optimized as a descending copy and memmove does the ascending move itself, so long as no-one else depends on memcpy being ascending.
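
To make the direction dependency concrete, here is a minimal C sketch of a memmove that picks its copy direction from the pointer order. The names `copy_ascending`, `copy_descending` and `sketch_memmove` are hypothetical and are not the actual uClinux library routines; the copy is byte-at-a-time purely for clarity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: copy from the lowest address upward. */
static void copy_ascending(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Hypothetical helper: copy from the highest address downward. */
static void copy_descending(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n-- > 0)
        dst[n] = src[n];
}

/* Sketch of an overlap-safe move that chooses the direction itself. */
static void *sketch_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s)
        copy_ascending(d, s, n);    /* destination lower: ascending is safe */
    else
        copy_descending(d, s, n);   /* destination higher: go descending */
    return dst;
}
```

If the ascending branch simply called memcpy, then changing memcpy to copy descending would break overlapping moves, which is why the pair would have to be changed together.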

Is it fair to say that, for "small" copies, the extra cycles spent finding out whether there is a block of 8 or 12 aligned words that can be copied intact are an acceptable tradeoff against the added efficiency of the block approach for "large" copies?
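
As a rough illustration of that tradeoff, the following C sketch shows a word copy with a block path and a plain one-word tail loop. `BLOCK_WORDS` and `copy_words` are made-up names, and the block size of 12 is only an assumption taken from the 12-word example further down; both pointers are assumed to be word aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_WORDS 12  /* assumed block size, matching the 12-word example */

/* Copy n 32-bit words; both pointers are assumed to be word aligned. */
static void copy_words(uint32_t *dst, const uint32_t *src, size_t n)
{
    /* Large copies: move whole blocks. A tuned routine would replace the
     * inner loop with straight-line loads and stores. */
    while (n >= BLOCK_WORDS) {
        for (size_t i = 0; i < BLOCK_WORDS; i++)
            dst[i] = src[i];
        dst += BLOCK_WORDS;
        src += BLOCK_WORDS;
        n -= BLOCK_WORDS;
    }
    /* Small copies and leftovers: simple one-word loop. */
    while (n-- > 0)
        *dst++ = *src++;
}
```

In this form a small copy pays for one failed `n >= BLOCK_WORDS` comparison before it reaches the tail loop; an alignment check, if needed, would add to that fixed cost.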

Jim Law
Iris Power LP


----- Original Message ----- From: "Goran Bilski" <Goran.Bilski@xxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Wednesday, April 30, 2008 4:31 AM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)


Hi,

Actually the limit when unrolling the code is 3 clocks/word.

Moving 12 words (48 bytes) can be done by this (r5 = source, r6 = destination):

lw r4,r5,0
lw r7,r5,4
lw r8,r5,8
sw r4,r6,0
sw r7,r6,4
sw r8,r6,8
lw r4,r5,12
lw r7,r5,16
lw r8,r5,20
sw r4,r6,12
sw r7,r6,16
sw r8,r6,20
lw r4,r5,24
lw r7,r5,28
lw r8,r5,32
sw r4,r6,24
sw r7,r6,28
sw r8,r6,32
lw r4,r5,36
lw r7,r5,40
lw r8,r5,44
sw r4,r6,36
sw r7,r6,40
sw r8,r6,44

The above code takes 12*1+12*2= 36 clock cycles which is 3 clocks/word.
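
The arithmetic can be written out as a tiny C helper. This is only a sketch of the cost model used in this thread (an unstalled lw at 1 clock, an sw at 2 clocks with caches, 100% cache hits); `unrolled_copy_cycles` is a hypothetical name.

```c
/* Cycle costs as used in this thread: an unstalled lw costs 1 clock and
 * an sw costs 2 clocks when caches are used. */
enum { LW_CYCLES = 1, SW_CYCLES = 2 };

/* Total clocks for a straight-line copy of `words` words: one lw and one
 * sw per word, no loop overhead, no stalls, 100% cache hits assumed. */
static unsigned unrolled_copy_cycles(unsigned words)
{
    return words * LW_CYCLES + words * SW_CYCLES;
}
```

For 12 words this gives 36 clocks, i.e. 3 clocks/word, and the ratio stays at 3 for any fully unrolled length under these assumptions.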

Göran

-----Original Message-----
From: Goran Bilski
Sent: Wednesday, April 30, 2008 10:23 AM
To: 'microblaze-uclinux@xxxxxxxxxxxxxx'
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)

Hi,

When I want to do the fastest memcopy with MicroBlaze and the addresses are aligned, I use the offset as the loop index and move data backwards.
This will remove one instruction from the loop.

loop:
  lw r4,r6,r10
  sw r4,r5,r10
  bneid r10,loop
  addik r10,r10,-4


Now the above code is without consideration of the pipe stalls (which are different between the area version and the performance version).

I guess the main target for Linux is the performance version with caches.
The above code would take (100% cache hit assumed) 1+4+2+1 = 8 clock cycles/loop. The major drawback here is the lw-sw combination, since the sw stores the register that the lw has just loaded. The forwarding of this value goes from the WB pipe stage to the OF pipe stage, which means that we will have 2 stall cycles in EX and MEM.
(sw takes 2 clock cycles when caches are used).

We can now rearrange the code a little to minimize the stalls in the pipeline:

loop:
   addik r10,r10,-4
   lw    r4,r6,r10
   bneid r10,loop
   sw    r4,r5,r10

We will now need to start the loop with r10+4, since the first instruction in the loop subtracts 4 from r10.
The above code should use 6 clock cycles/loop.
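
For readers more at home in C, the same idea (one down-counting offset that serves as both the loop counter and the index) might look like the sketch below. `copy_words_descending` is a hypothetical name, and a word index stands in for the byte offset held in r10.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nwords 32-bit words between word-aligned buffers, last word first.
 * The index is decremented at the top of the loop, so it starts one word
 * past the last element, mirroring the "start with r10+4" remark. */
static void copy_words_descending(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    size_t i = nwords;

    while (i != 0) {
        i--;                /* corresponds to addik r10,r10,-4 */
        dst[i] = src[i];    /* corresponds to the lw / sw pair */
    }
}
```

Unlike the assembler loop, this sketch tolerates a length of zero, because the test happens before the first decrement.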

But maybe you could add another case in your memcopy to unroll the loop.
If you know that the number of bytes to move is a multiple of 8, you can do this instead.

loop:
   lw r4,r5,0
   lw r7,r5,4
   addik r5,r5,8
   sw r4,r6,0
   sw r7,r6,4
   addik r6,r6,8
   bneid r10,loop
   addik r10,r10,-8

This loop takes 1+1+1+2+2+1+2+1 = 11 clock cycles/loop which is 5.5 cycles/word.
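
A C rendering of the two-words-per-pass structure, as a sketch only: `copy_unrolled_by_two` is a hypothetical name, the byte count is assumed to be a multiple of 8 as stated above, and a compiler is free to schedule the loads and stores differently from the order written here.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy nbytes between word-aligned buffers, two words per pass.
 * Assumption: nbytes is a multiple of 8. */
static void copy_unrolled_by_two(uint32_t *dst, const uint32_t *src, size_t nbytes)
{
    for (size_t left = nbytes; left != 0; left -= 8) {
        uint32_t w0 = src[0];   /* both loads first ... */
        uint32_t w1 = src[1];
        src += 2;
        dst[0] = w0;            /* ... then both stores */
        dst[1] = w1;
        dst += 2;
    }
}
```

Keeping each store away from the load that feeds it is the point of the grouping, since the back-to-back lw-sw pair is what stalled the simple loop.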

You can unroll as many times as you want but you can never get better than 4 cycles/word.


Göran Bilski

-----Original Message-----
From: owner-microblaze-uclinux@xxxxxxxxxxxxxx [mailto:owner-microblaze-uclinux@xxxxxxxxxxxxxx] On Behalf Of Jim Law
Sent: Tuesday, April 29, 2008 10:01 PM
To: microblaze-uclinux@xxxxxxxxxxxxxx
Subject: Re: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)

Hi again,

Just heading out the door here and might not be available for a couple of
days, so thought I'd pass this along, untested, for your consideration.

I wrote a move_halfword_higher() function in assembler, but didn't have the
time to test it.  If my assumptions were correct about the intent of the
memmove() in adapter.c, then this should drop in and do a quicker move. It assumes a bunch of stuff, so both the overhead and transfer loops should be
quicker than the memmove it replaces, but only for this specific case.
Inner loop is 12 cycles per word to move, only 11 cycles of overhead on top
of that.

If I didn't get it right, I'll test and fix it on my return.

Jim Law
Iris Power

----- Original Message ----- From: "Brettschneider Falk" <fbrettschneider@xxxxxxxxxxxxxxx>
To: <microblaze-uclinux@xxxxxxxxxxxxxx>
Sent: Tuesday, April 29, 2008 1:10 PM
Subject: RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb,structnet_device*dev)


Hi,

Jim Law wrote:
> Here is a new assembler memcpy for the MB.

cool...trying that...stay tuned...

> ... not sure what else I can add at this point. Is there any other
> significant transfers or checksums that use other functions
> that might benefit?

drivers/net/xilinx_emac/adapter.c has that ugly memmove() in
FifoRecvHandler(). Maybe memmove() can also be faster in assembler... ;-)

Cheers, Falk

___________________________
microblaze-uclinux mailing list
microblaze-uclinux@xxxxxxxxxxxxxx
Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
Mailing List Archive :
http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/


