
RE: [microblaze-uclinux] xenet_FifoSend(struct sk_buff *orig_skb, struct net_device *dev)



Hi,

John Williams wrote:
> Does it still work if CPU unaligned exceptions and handler 
> are not enabled?
Yes, it does. We never tried enabling it. Looking at the
TCP sources (e.g. do_checksum()), it looks to me as if the code works
16 bits at a time anyway, which would explain why it still works.
I also checked the Platform Studio settings and xmd says:

MicroBlaze Processor Configuration :
-------------------------------------
Version............................4.00.b
MMU Type...........................No_MMU
No of PC Breakpoints...............2
No of Read Addr/Data Watchpoints...1
No of Write Addr/Data Watchpoints..1
Instruction Cache Support..........on
Instruction Cache Base Address.....0x80000000
Instruction Cache High Address.....0x87ffffff
Data Cache Support.................off
Exceptions  Support................off
FPU  Support.......................off
Hard Divider Support...............on
Hard Multiplier Support............on - (Mul32)
Barrel Shifter Support.............on
MSR clr/set Instruction Support....off
Compare Instruction Support........off
Number of FSL ports................1
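
On the 16-bit-wise checksum remark above: the following is a minimal sketch of an Internet checksum (RFC 1071 style) that only ever loads single bytes, so it cannot trigger an unaligned access whatever the buffer address is. It is an illustration of the idea only, not the kernel's checksum code; the function name inet_csum_bytes() is made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): RFC 1071 style Internet
 * checksum. Each 16-bit word is assembled from two byte loads, so no
 * unaligned 16- or 32-bit access is ever performed.
 * The 32-bit accumulator is safe for buffers well below 128 KiB.
 */
uint16_t inet_csum_bytes(const uint8_t *buf, size_t len)
{
    uint32_t sum = 0;

    /* Add the data as big-endian 16-bit words. */
    while (len >= 2) {
        sum += ((uint32_t)buf[0] << 8) | buf[1];
        buf += 2;
        len -= 2;
    }

    /* An odd trailing byte is padded with a zero low byte. */
    if (len)
        sum += (uint32_t)buf[0] << 8;

    /* Fold the carries back into the low 16 bits (one's complement add). */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Calling it on a buffer that starts at an odd address gives the same result as on an aligned copy of the same bytes, which is the property that matters when exceptions for unaligned accesses are switched off.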

> > My user program was permanently sending data. Time 
> > measuring with an oscilloscope showed that most of the time 
> > the time spent in xenet_FifoSend() is reduced from about 
> > 600us to about 200us. The 200us consists of 100us checksum 
> > calculation plus 100us spent in XEmac_FifoSend().
> 
> confirms that memcpy is a killer!  Note there is an API to combine 
> memcpy and do_csum - maybe this can help?
The original version of xenet_FifoSend() did use that combination, via skb_copy_and_csum_dev(). In most cases the new version still needs only the checksum part.

> > Though I noticed that the call of xenet_FifoSend() is just 
> > every millisecond. That is why the data throughput (Megabytes 
> > per second) to my PC keeps the same. Is xenet_FifoSend() 
> > called by a timer with such frequency of 1/ms? If yes, can it 
> > be reduced?
> 
> FifoSend is called by the upper level network stack (as 
> dev->hard_start_xmit()). See linux-2.6.x/net/core/dev.c, around line 
> 1350.  This is called by dev_queue_xmit().
> 
> At a quick glance it's not immediately obvious to me how this gets 
> scheduled, whether outgoing packets are queued and dumped on 
> some kind 
> of timer, or how that works.  Just dig the source or hit the 
> google, see 
> if there's a nice explanation out there.
Currently, I'm analysing the TCP sources. A good helper for the send part is this link: http://book.chinaunix.net/special/ebook/oreilly/Understanding_Linux_Network_Internals/0596002556/understandlni-CHP-11-SECT-1.html.

A few days ago I saw gdb backtraces where sending was called from an IRQ handler function. At present I think those were retransmissions or TCP/IP protocol handshake packets. I'm still not sure, but they are not the usual case.
Instead, most send() calls of my user program end up as a direct call to xenet_FifoSend(), and xenet_FifoSend() never has to defer the packet: I have never seen a call of XEmac_FifoSend() fail, so netif_stop_queue() is never called in xenet_FifoSend(). This means the hardware is still faster than the software and kernel can provide new packets.

My user program sends data continuously in a loop, using the POSIX functions select() (to set up a timeout) and, of course, send(). Each send() call is given 328 bytes. The socket is in blocking mode, which means send() is a blocking call for the user program.
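
For readers who want to reproduce the setup, here is a minimal sketch of one iteration of such a sender. It assumes select() is used to wait for the socket to become writable within a timeout before the blocking send(); the helper name send_with_timeout() and the millisecond timeout parameter are made up for this example and are not taken from my actual program.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical helper: wait up to timeout_ms for fd to become writable,
 * then hand len bytes to a blocking send().
 * Returns the number of bytes sent, 0 on timeout, -1 on error.
 */
ssize_t send_with_timeout(int fd, const void *buf, size_t len, long timeout_ms)
{
    fd_set wfds;
    struct timeval tv;
    int ready;

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* select() returns 0 on timeout and -1 on error. */
    ready = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (ready <= 0)
        return ready;

    return send(fd, buf, len, 0);
}
```

A sender along these lines would call send_with_timeout(sock, chunk, 328, timeout) in an endless loop on a connected TCP socket.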
Then I measured the time at several places in net/ipv4/tcp.c:tcp_sendmsg(). Here is the log (I collected the values in internal variables and dumped them to the console with printk() only from time to time, so as not to disturb the sending):

                          diff (usec) to
tracepoint                previous tracepoint   note
----------                -------------------   ----
entered tcp_sendmsg():            175
entered inner while-loop:         40            seglen=328
before sk_stream_alloc_pskb       6
after sk_stream_alloc_pskb        57
before skb_add_data():            21            copy=328
after  skb_add_data():            57
exit tcp_sendmsg():
-
entered tcp_sendmsg():            148
entered inner while-loop:         37            seglen=328
before skb_add_data():            5             copy=328
after  skb_add_data():            56
exit tcp_sendmsg():
-

entered tcp_sendmsg():            148
entered inner while-loop:         36            seglen=328
before skb_add_data():            5             copy=328
after  skb_add_data():            56
exit tcp_sendmsg():
-
entered tcp_sendmsg():            148
entered inner while-loop:         36            seglen=328
before skb_add_data():            5             copy=328
after  skb_add_data():            56
exit tcp_sendmsg():
-
entered tcp_sendmsg():            148
entered inner while-loop:         36            seglen=328
before skb_add_data():            5             copy=148
after  skb_add_data():            35
before __tcp_push_pending_frames  10
after __tcp_push_pending_frames   614
entered inner while-loop:         3             seglen=180
before sk_stream_alloc_pskb       570
after sk_stream_alloc_pskb        50
before skb_add_data():            21            copy=180
after  skb_add_data():            41
exit tcp_sendmsg():
-
entered tcp_sendmsg():            174
entered inner while-loop:         40            seglen=328
before sk_stream_alloc_pskb       5
after sk_stream_alloc_pskb        57
before skb_add_data():            21            copy=328
after  skb_add_data():            57
exit tcp_sendmsg():
-
entered tcp_sendmsg():            156
entered inner while-loop:         37            seglen=328
before skb_add_data():            5             copy=328
after  skb_add_data():            56
exit tcp_sendmsg():
-
entered tcp_sendmsg():            148
entered inner while-loop:         36            seglen=328
before skb_add_data():            5             copy=328
after  skb_add_data():            56
exit tcp_sendmsg():
-
entered tcp_sendmsg():            148
entered inner while-loop:         36            seglen=328
before skb_add_data():            5             copy=328
after  skb_add_data():            56
exit tcp_sendmsg():
-
entered tcp_sendmsg():            148
entered inner while-loop:         36            seglen=328
before skb_add_data():            5             copy=148
after  skb_add_data():            34
before __tcp_push_pending_frames  10
after __tcp_push_pending_frames   614
entered inner while-loop:         3             seglen=180
before sk_stream_alloc_pskb       570
after sk_stream_alloc_pskb        50
before skb_add_data():            21            copy=180
after  skb_add_data():            41
exit tcp_sendmsg():

(You can try the same measurement yourself. It's attached to this mail.)
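
The idea behind the log, recording timestamps into memory and printing the differences only later, can be sketched as follows. This is not the attached code: struct trace_log, trace_mark() and trace_diff() are hypothetical names, and the caller is assumed to supply a free-running microsecond counter value from whatever timer the platform offers.

```c
#include <stddef.h>
#include <stdint.h>

#define TRACE_MAX 64

/* One recorded tracepoint: a label and a microsecond timestamp. */
struct trace_point {
    const char *name;
    uint32_t usec;
};

/* In-memory log, filled on the hot path and dumped only occasionally. */
struct trace_log {
    struct trace_point pt[TRACE_MAX];
    size_t count;
};

/* Record a tracepoint; entries beyond TRACE_MAX are silently dropped. */
void trace_mark(struct trace_log *log, const char *name, uint32_t now_usec)
{
    if (log->count < TRACE_MAX) {
        log->pt[log->count].name = name;
        log->pt[log->count].usec = now_usec;
        log->count++;
    }
}

/*
 * Microseconds between tracepoint i and the one before it.
 * Unsigned subtraction keeps the result correct across one counter wrap.
 */
uint32_t trace_diff(const struct trace_log *log, size_t i)
{
    if (i == 0 || i >= log->count)
        return 0;
    return log->pt[i].usec - log->pt[i - 1].usec;
}
```

Recording into memory costs only a few stores per tracepoint, so the expensive console output can be deferred until the log is full or the interesting code path has finished.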

In the end the call of skb_add_data() is just a memcpy() to an unaligned destination address (offset=2); 328 bytes take 56us. You can see 4 memcpy()s of 328 bytes and one memcpy() of 148 bytes, after which the packet is filled with 1460 bytes of user data (4 x 328 + 148 = 1460, the TCP MSS reported as cur_mss in the backtrace below, i.e. the 1500-byte Ethernet MTU minus 40 bytes of IP and TCP headers).
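
The copy sizes seen in the log follow directly from dividing the 1460-byte segment by the 328-byte send() size. A small sketch of that arithmetic (struct seg_fill and fill_segment() are made-up names for this example):

```c
/* How fixed-size send() chunks fill one TCP segment. */
struct seg_fill {
    unsigned full_copies; /* chunks that fit completely          */
    unsigned partial;     /* bytes of the next chunk that fit    */
    unsigned leftover;    /* rest of that chunk, copied later    */
};

/* Hypothetical helper: split a segment of mss bytes into chunk-sized copies. */
struct seg_fill fill_segment(unsigned mss, unsigned chunk)
{
    struct seg_fill f;

    f.full_copies = mss / chunk;
    f.partial = mss % chunk;
    f.leftover = f.partial ? chunk - f.partial : 0;
    return f;
}
```

For mss=1460 and chunk=328 this gives 4 full copies, a partial copy of 148 bytes and a leftover of 180 bytes, matching the copy=148 and seglen=180 lines in the log.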

The call of __tcp_push_pending_frames leads to this callstack:
    XEmac_FifoSend (...)
#0  xenet_FifoSend (orig_skb=0x8239d354, dev=0x87fcb000) at drivers/net/xilinx_emac/adapter.c:745
#1  0x820eae90 in dev_hard_start_xmit (skb=0x8239d354, dev=0x87fcb000) at net/core/dev.c:1356
#2  0x820f9478 in __qdisc_run (dev=0x87fcb000) at net/sched/sch_generic.c:139
#3  0x820eb2e0 in dev_queue_xmit (skb=0x8239d354) at include/net/pkt_sched.h:227
#4  0x8210938c in ip_output (skb=0x821764bc) at net/ipv4/ip_output.c:187
#5  0x821088c0 in ip_queue_xmit (skb=0x8239d354, ipfragok=0) at include/net/dst.h:228
#6  0x8211bc4c in tcp_transmit_skb (sk=0x87c1d028, skb=0x8239d354, clone_it=-17683, gfp_mask=2185054350) at net/ipv4/tcp_output.c:544
#7  0x8211d24c in __tcp_push_pending_frames (sk=0x87c1d028, tp=0x87c1d028, cur_mss=1460, nonagle=0) at net/ipv4/tcp_output.c:1420
#8  0x8211115c in tcp_sendmsg (iocb=0x8239d354, sk=0x87c1d028, msg=0x8239d2bc, size=2184827732) at net/ipv4/tcp.c:499

Function xenet_FifoSend() is using my latest patch (which I sent to the mailing list) and takes about 200us (with Jim's assembler checksum calculation). XEmac_FifoSend() never reports a failure, thus we're still too slow for the hardware.

A mysterious time gap of 570us in the log above lies between "entered inner while-loop" and "before sk_stream_alloc_pskb", shortly after __tcp_push_pending_frames has returned. I'm not sure, but I reckon it's a task switch to the receive part of the TCP stack (maybe because a TCP packet with an ACK has been received from my PC).

I'm continuing with my analysis... Any comments, hints or measurements are highly appreciated... I just want to speed up the TCP throughput.

Cheers, Falk

___________________________
microblaze-uclinux mailing list
microblaze-uclinux@xxxxxxxxxxxxxx
Project Home Page : http://www.itee.uq.edu.au/~jwilliams/mblaze-uclinux
Mailing List Archive : http://www.itee.uq.edu.au/~listarch/microblaze-uclinux/