The Icon Bar: Programming: Test your optimization skills pt.2
Jon Abbott
Message #122859, posted by sirbod at 15:21, 27/11/2013
Member
Posts: 563
The aim of this task is to convert a 4-bit mode to an 8-bit mode in as few cycles as possible using ARMv6 instructions.
Assumptions you can make:
  R0 - 4-bit mode buffer pointer
  R1 - 8-bit mode buffer pointer
  R2 - pixels remaining
  R3 - R12 are free to use
The screen mode you're writing to is indexed 256 colour and you can preset the palette entries from 16 upwards as you please.
If the word read from [R0] is &87654321, the output should be two words written to [R1]: &04030201 &08070605
You can process the data one byte / word or multiple words at a time. You may also interleave the instructions for optimization purposes where appropriate.
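As a reference point for checking any candidate routine, the required mapping can be modelled in a few lines of Python. This is only a sketch of the specification above, not of any particular ARM routine, and the function name expand_word is made up for illustration.

```python
def expand_word(word):
    """Expand eight 4-bit pixels in a 32-bit word into two words of 8-bit pixels.

    Pixel 0 is the lowest nibble of the source word and ends up in the
    lowest byte of the first output word.
    """
    low = 0
    high = 0
    for i in range(4):
        low |= ((word >> (4 * i)) & 0xF) << (8 * i)         # pixels 0-3
        high |= ((word >> (4 * (i + 4))) & 0xF) << (8 * i)  # pixels 4-7
    return low, high


low, high = expand_word(0x87654321)
print(f"{low:08X} {high:08X}")  # prints "04030201 08070605"
```

Comparing an assembler routine's output buffer against this model for a handful of source words is a cheap way to catch an off-by-one-nibble mistake before counting cycles.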
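Example 1: Rotate method
.L1
  LDR   R7, [R0], #4              ; fetch eight 4-bit pixels
  MOV   R9, #0
  AND   R10, R7, #&F0000000
  MOV   R12, #32 - 4              ; shift count for the top nibble
.L2
  MOV   R11, R7, LSR R12
  AND   R11, R11, #&F             ; isolate one pixel
  MOV   R10, R10, LSL #8
  ORR   R10, R10, R9, LSR #32 - 8 ; carry the top byte of R9 into R10
  ORR   R9, R11, R9, LSL #8
  SUBS  R12, R12, #4
  BPL   L2
  STMIA R1!, {R9-R10}             ; writeback so R1 advances each pass
  SUBS  R2, R2, #8
  BNE   L1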
Example 2: Full palette use
This assumes the palette entries &10 / &20 / &30 etc map to the logical colours 1 / 2 / 3 etc.
Where [R0] = &87654321 the output to [R1] will be &04300210 &08700650
  MOV   R12, #&FF00
  ORR   R12, R12, R12, LSL #16    ; R12 = &FF00FF00
.L1
  LDR   R4, [R0], #4
  AND   R3, R4, R12, LSR #8       ; bytes 0 and 2
  AND   R11, R4, #&FF00           ; byte 1
  AND   R4, R4, R12               ; bytes 1 and 3
  MOV   R3, R3, LSL #4
  MOV   R4, R4, LSR #4
  EOR   R4, R4, R3, LSR #16
  AND   R3, R3, #&FF0
  EOR   R4, R4, R11, LSR #4       ; cancel byte 1, leaving byte 2
  ORR   R3, R3, R11, LSL #12
  STMIA R1!, {R3-R4}
  SUBS  R2, R2, #8
  BNE   L1
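To see why the "full palette" output still displays correctly, here is a rough Python model that mirrors the loop body of Example 2 one instruction per line and then decodes each output byte back to a logical colour. It assumes, as stated above, that palette entry n*16 has been preset to match entry n; the names example2_step, logical_colour and decode_pixels are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def example2_step(word):
    """Mirror the Example 2 loop body for one source word."""
    r12 = 0xFF00FF00
    r4 = word & MASK32
    r3 = r4 & (r12 >> 8)          # AND R3, R4, R12, LSR #8
    r11 = r4 & 0xFF00             # AND R11, R4, #&FF00
    r4 = r4 & r12                 # AND R4, R4, R12
    r3 = (r3 << 4) & MASK32       # MOV R3, R3, LSL #4
    r4 = r4 >> 4                  # MOV R4, R4, LSR #4
    r4 ^= r3 >> 16                # EOR R4, R4, R3, LSR #16
    r3 &= 0xFF0                   # AND R3, R3, #&FF0
    r4 ^= r11 >> 4                # EOR R4, R4, R11, LSR #4
    r3 |= (r11 << 12) & MASK32    # ORR R3, R3, R11, LSL #12
    return r3, r4


def logical_colour(pixel_byte):
    """Colour shown for a display byte, assuming entry n*16 duplicates entry n."""
    return pixel_byte if pixel_byte < 16 else pixel_byte >> 4


def decode_pixels(low, high):
    """Logical colours of the eight output bytes, in memory order."""
    return [logical_colour((w >> (8 * i)) & 0xFF) for w in (low, high) for i in range(4)]


low, high = example2_step(0x87654321)
print(f"{low:08X} {high:08X}")    # prints "04300210 08700650"
print(decode_pixels(low, high))   # prints [1, 2, 3, 4, 5, 6, 7, 8]
```

Even-numbered pixels land in the high nibble of their byte and odd-numbered pixels in the low nibble, which is what lets the routine move two pixels with each mask and shift.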
The winner gets their code into ADFFS and a mention in the credits.

Andrew Rawnsley
Message #122860, posted by arawnsley at 19:19, 27/11/2013, in reply to message #122859
R-Comp chap
Posts: 598
However you end up doing this, can I suggest that it be submitted to ROOL, as 16 colour mode emulation would be a very worthwhile addition to the OS generally. Indeed, I suspect some of the other routines you're developing for ADFFS might make worthwhile additions to the operating system as a whole.
I appreciate the code also needs to be present in ADFFS for compatibility with older systems, but handling 16 colour screen modes is going to be an issue for pretty much every port of RISC OS going forwards.
Indeed, one suggestion would be to look into using the second core of a dual core CPU to sit there doing mode conversion code (one proposed solution to the infamous RGB<->BGR RISC OS "issue"), although I suspect it'd bottleneck because the conversion code cannot be executed until the 16 colour buffer has been calculated...
Jon Abbott
Message #122861, posted by sirbod at 19:43, 27/11/2013, in reply to message #122860
Member
Posts: 563
I was pondering that very question today. The solution is so deceptively simple it could easily be added into the core OS to provide legacy MODE support.
There's no hackery involved; it's all done using valid RISC OS SWIs and leaves the OS to handle just about everything, with the screen buffer being in DA2 instead of the GPU.
I may well knock up a stripped down stand-alone module at some point, once I've added 1 bpp and 2 bpp translation.
The only botch I had to do was to figure out the logical GPU screen buffer address. The OS could really do with an SWI to get that info... or OS_Memory extended to handle IO physical addresses, which seems the sensible thing to do.
Jeffrey Lee
Message #122863, posted by Phlamethrower at 20:54, 27/11/2013, in reply to message #122861
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
Unrolling the loop will make for a much faster rotate method:
.L1
  LDR   R3,[R0],#4
  MOV   R5,R3,LSR #28           ; nibble 7
  MOV   R3,R3,LSL #4
  MOV   R5,R5,LSL #8
  ORR   R5,R5,R3,LSR #28        ; nibble 6
  MOV   R3,R3,LSL #4
  MOV   R5,R5,LSL #8
  ORR   R5,R5,R3,LSR #28        ; nibble 5
  MOV   R3,R3,LSL #4
  MOV   R5,R5,LSL #8
  ORR   R5,R5,R3,LSR #28        ; nibble 4
  MOV   R3,R3,LSL #4            ; bring nibble 3 to the top before starting R4
  MOV   R4,R3,LSR #28           ; nibble 3
  MOV   R3,R3,LSL #4
  MOV   R4,R4,LSL #8
  ORR   R4,R4,R3,LSR #28        ; nibble 2
  MOV   R3,R3,LSL #4
  MOV   R4,R4,LSL #8
  ORR   R4,R4,R3,LSR #28        ; nibble 1
  MOV   R3,R3,LSL #4
  MOV   R4,R4,LSL #8
  ORR   R4,R4,R3,LSR #28        ; nibble 0
  SUBS  R2,R2,#8
  STMIA R1!,{R4-R5}
  BNE   L1
That's 8 pixels per 25 instructions, compared to 40+ for your version. But using AND to extract a pixel and then ORRing it in at the correct offset is faster, as it'll be two instructions per pixel instead of three:
  MOV   R9,#&FF0
.L1
  LDR   R3,[R0],#4
  AND   R4,R9,R3,LSL #4         ; nibbles 0&1
  AND   R5,R3,#&FF00
  ORR   R4,R4,R5,LSL #12        ; 2&3
  AND   R5,R9,R3,LSR #12        ; nibbles 4&5
  AND   R6,R3,#&FF000000
  ORR   R5,R5,R6,LSR #4         ; 6&7
  SUBS  R2,R2,#8
  STMIA R1!,{R4-R5}
  BNE   L1
8 pixels in 10 instructions, using the full palette hack, compared to the 13 instructions for your approach. But considering the inner portion is so short, you could easily boost it further by unrolling the loop a few times.
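One way to gain some confidence in a short mask-and-merge sequence like this before unrolling it is to model it in Python and compare it against a slow per-nibble version. The sketch below does that for the AND/ORR loop body; and_orr_step and per_nibble_reference are hypothetical names, and the model covers only the data path, not timing.

```python
import random

MASK32 = 0xFFFFFFFF


def and_orr_step(word):
    """Mirror the six data-processing instructions of the AND/ORR loop."""
    r9 = 0xFF0
    r4 = r9 & ((word << 4) & MASK32)   # AND R4,R9,R3,LSL #4   nibbles 0&1
    r5 = word & 0xFF00                 # AND R5,R3,#&FF00
    r4 |= (r5 << 12) & MASK32          # ORR R4,R4,R5,LSL #12  nibbles 2&3
    r5 = r9 & (word >> 12)             # AND R5,R9,R3,LSR #12  nibbles 4&5
    r6 = word & 0xFF000000             # AND R6,R3,#&FF000000
    r5 |= r6 >> 4                      # ORR R5,R5,R6,LSR #4   nibbles 6&7
    return r4, r5


def per_nibble_reference(word):
    """Slow model: even pixels go to the high nibble of their byte, odd to the low."""
    out = 0
    for i in range(8):
        n = (word >> (4 * i)) & 0xF
        out |= (n << 4 if i % 2 == 0 else n) << (8 * i)
    return out & MASK32, out >> 32


random.seed(0)
samples = [0x87654321] + [random.getrandbits(32) for _ in range(1000)]
print(all(and_orr_step(w) == per_nibble_reference(w) for w in samples))
```

A check like this says nothing about cycle counts, but it does make it safer to start reordering or unrolling the instructions.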
The PLD instruction should also come in useful. The cache line size on the Pi is 32 bytes, so I'd suggest unrolling the loop to the point where each iteration processes 32 source bytes, with a preload instruction somewhere to preload a future cacheline. I'm not sure off the top of my head how far ahead the data should be preloaded, but I'd say 128 bytes ahead should give the hardware plenty of time to fetch the data before you need it.
It's also worth noting that this research suggests that the optimum write size is 4 words.
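The arithmetic behind those suggestions can be spelled out as follows. This is just a back-of-envelope sketch using the figures quoted above (32-byte cache line, 128-byte preload distance, 4-word writes); unroll_plan and its parameter names are made up for illustration, and the right preload distance would still need measuring on real hardware.

```python
def unroll_plan(cache_line_bytes=32, preload_ahead_bytes=128, write_words=4):
    """Work out per-iteration quantities for a loop that consumes one cache line."""
    source_words = cache_line_bytes // 4   # 32-bit words read per iteration
    pixels = source_words * 8              # eight 4-bit pixels per source word
    output_words = source_words * 2        # each source word expands to two
    return {
        "source_words": source_words,
        "pixels": pixels,
        "output_bytes": output_words * 4,
        "stores": output_words // write_words,
        "cache_lines_ahead": preload_ahead_bytes // cache_line_bytes,
    }


print(unroll_plan())
```

With these figures each iteration would consume 8 source words (64 pixels), issue 4 four-word stores, and preload the line 4 iterations ahead. The loop counter in R2 would then drop by 64 per pass, so the pixel count would need to be a multiple of 64 or the tail handled separately.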
I'll leave the production of cycle timing optimised routines to someone with more spare time than myself.
[Edited by Phlamethrower at 21:02, 27/11/2013]