log in | register | forums
Show:
Go:
Forums
Username:

Password:

User accounts
Register new account
Forgot password
Forum stats
List of members
Search the forums

Advanced search
Recent discussions
- Drag'N'Drop Autumn edition now available (News:)
- !DualHead puts 2 screens in one (News:)
- RISC OS London Show 2017 - Notes from the talks (News:6)
- November News (News:)
- !Organizer 2.28 reviewed (News:2)
- !OBrowse reviewed (News:10)
- Aemulor (Gen:16)
- DDE reaches release 28 and above (News:)
- Elesar quicks dispels stormy clouds (News:2)
- RISC OS London Show 2017 (News:)
Latest postings RSS Feeds
RSS 2.0 | 1.0 | 0.9
Atom 0.3
Misc RDF | CDF
Site Search
 
Article archives
The Icon Bar: Programming: Code GCC produces that makes you cry #12684
 
  Code GCC produces that makes you cry #12684
  Phlamethrower (23:22 1/8/2010)
  nunfetishist (14:25 2/8/2010)
    Phlamethrower (09:47 11/5/2016)
      Phlamethrower (21:27 11/5/2016)
        Stoppers (13:50 31/10/2016)
          Phlamethrower (19:42 31/10/2016)
            nunfetishist (11:06 1/11/2016)
              adrianl (18:36 9/11/2016)
  tribbles (22:09 2/8/2010)
    Phlamethrower (22:55 2/8/2010)
      Phlamethrower (12:02 10/2/2011)
        nunfetishist (12:09 10/2/2011)
          Phlamethrower (12:24 10/2/2011)
            nunfetishist (14:52 10/2/2011)
              trevj (15:05 10/2/2011)
                flibble (21:51 10/2/2011)
                qUE (03:38 11/2/2011)
                  nunfetishist (11:20 11/2/2011)
                    qUE (16:42 11/2/2011)
                      nunfetishist (17:43 11/2/2011)
                    swirlythingy (23:53 24/1/2013)
 
Jeffrey Lee Message #114920, posted by Phlamethrower at 23:22, 1/8/2010
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
Exhibit A: Why have one branch when you can have three?

000032A4 : BHI &000032E4
000032A8 : MOV R12,R4,LSL #12
000032AC : MOV R1,R12,LSR #28
000032B0 : CMP R1,#&0F ; =15
....
000032E0 : LDMDB R11,{R4,R5,R11,R13,PC}
000032E4 : BL &00001E2C
000032E8 : B &000032A8


Exhibit B: Does 15*4=60?

00003B5C : CMP R0,#&0F ; =15
00003B60 : LDREQ R2,[R4,#60]
00003B64 : LDRNE R8,[R4,R0,LSL #2]
00003B68 : MOV R3,R1,LSR #7
00003B6C : BICEQ R8,R2,#&FC000003


Exhibit C: If A then B else B:

00003C10 : CMP R0,#&0F ; =15
00003C14 : LDREQ R2,[R4,#60]
00003C18 : LDRNE R2,[R4,#60]


Exhibit D: Ever head of AND? (Take a look at how R3 is used):

00003E14 : MOV R3,R1,LSL #16
00003E18 : AND R2,R8,#&1E ; =30
00003E1C : AND R1,R1,#&FF ; ="ÿ"
00003E20 : SUB R1,R12,R1,ROR R2
00003E24 : MOV R3,R3,LSR #28
00003E28 : CMP R3,#&0F ; =15
00003E2C : MOV R0,R4
00003E30 : BIC R2,R1,#&FC000003
00003E34 : BEQ &00003E40
00003E38 : STR R1,[R4,R3,LSL #2]


Maybe I should try compiling this with GCC 4.6 and see what that makes of it!

[Edited by Phlamethrower at 00:26, 2/8/2010]
  ^[ Log in to reply ]
 
Rob Kendrick Message #114921, posted by nunfetishist at 14:25, 2/8/2010, in reply to message #114920
nunfetishist
Exposing morons since 1981

Posts: 484
GCC for ARM has been getting progressively more awful over the past 5 years or so. Roll on LLVM and Clang.
  ^[ Log in to reply ]
 
Jason Tribbeck Message #114928, posted by tribbles at 22:09, 2/8/2010, in reply to message #114920
tribbles
Captain Helix

Posts: 929
Exhibit A is vaguely understandable if the code in &00001e2c is in a different .o file.

Exhibit B would be easier to understand if the original source is in place.

Exhibit C is inexcusable! (Original source might be good here to see why it decided to opt for that approach)

Exhibit D is taking bits 12 to 15 of R1, and using them as the pointer offset - for a few seconds, I couldn't think of a way of doing it which would be smaller, but then I thought of this:

AND R3, R1, #&F000
...
CMP R3, #&F000
...
STR R1, [R4, R3, LSR#10]

Unfortunately, there is the possibility that R3 is used after this fragment, but I'm guessing you've not been too clip happy smile

I have seen GCC produce some quite nice code, but I have also seen it produce drivel.

I've been meaning to write something on how to produce optimal ARM code, but haven't had the time...
  ^[ Log in to reply ]
 
Jeffrey Lee Message #114929, posted by Phlamethrower at 22:55, 2/8/2010, in reply to message #114928
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
Exhibit A is vaguely understandable if the code in &00001e2c is in a different .o file.
Nope, all part of the same function. Edit: Looks like I misread your statement. Although as it turns out, it is the same .o file (just not the same function)

The problem code is the result of a ternary operator, where one outcome (LS, not shown) is just a simple LDR while the other (HI) is the result of a function call.

Exhibit B would be easier to understand if the original source is in place.
Ternary, again... it's the result of the LHS macro, where:

#define LHS ((LHSReg == 15) ? R15PC : (state->Reg[LHSReg]))
#define R15PC (state->Reg[15] & R15PCBITS)
#define R15PCBITS (0x03fffffcL)


Exhibit C is inexcusable! (Original source might be good here to see why it decided to opt for that approach)
It looks like it's a slightly messed up combination of the LHS macro and the CFLAG macro (where CFLAG gets the carry flag from R15):

dest = LHS + DPImmRHS + CFLAG;

It performs the comparison for the LHS ternary operator, then either loads R15 and extracts the PC bits into R8 (not shown, obviously), or loads the general-purpose register R0 (not shown). But somewhere along the way the instruction scheduler has decided that once R15 has been loaded it's a good time to extract the carry flag, but for R15 to be loaded all the time it would need to insert an extra instruction to load it for the R0!=15 case... without spotting that it's an identical instruction to the one used to load it for R0==15.

Exhibit D is taking bits 12 to 15 of R1, and using them as the pointer offset - for a few seconds, I couldn't think of a way of doing it which would be smaller, but then I thought of this:

AND R3, R1, #&F000
...
CMP R3, #&F000
...
STR R1, [R4, R3, LSR#10]
Correct!

Unfortunately, there is the possibility that R3 is used after this fragment, but I'm guessing you've not been too clip happy smile
Yes, the final STR was the only place in which R3 would be used.

In case you're wondering, I found all these examples in the binary from the latest version of my ARM optimised version of ArcEm

[Edited by Phlamethrower at 00:08, 3/8/2010]
  ^[ Log in to reply ]
 
Jeffrey Lee Message #116553, posted by Phlamethrower at 12:02, 10/2/2011, in reply to message #114929
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
As many flaws as GCC seems to have with its optimiser, the compiler I'm currently using at work (which I probably can't name for legal reasons unhappy) seems to be about 10 million times worse.

One of these days I'm going to snap and just start writing everything in assembler.
  ^[ Log in to reply ]
 
Rob Kendrick Message #116554, posted by nunfetishist at 12:09, 10/2/2011, in reply to message #116553
nunfetishist
Exposing morons since 1981

Posts: 484
As many flaws as GCC seems to have with its optimiser, the compiler I'm currently using at work (which I probably can't name for legal reasons unhappy) seems to be about 10 million times worse.
Greenhills? smile
  ^[ Log in to reply ]
 
Jeffrey Lee Message #116555, posted by Phlamethrower at 12:24, 10/2/2011, in reply to message #116554
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
Never even heard of it. Should I be glad? smile
  ^[ Log in to reply ]
 
Rob Kendrick Message #116557, posted by nunfetishist at 14:52, 10/2/2011, in reply to message #116555
nunfetishist
Exposing morons since 1981

Posts: 484
Never even heard of it. Should I be glad? smile
Very.

People who in my experience have produced dreadful compilers include: GNU, Greenhills, Borland, IAR, and loads others.
  ^[ Log in to reply ]
 
Trevor Johnson Message #116558, posted by trevj at 15:05, 10/2/2011, in reply to message #116557
Member
Posts: 660
People who in my experience have produced dreadful compilers include: GNU, Greenhills, Borland, IAR, and loads others.
What's up with Borland? I thought it'd be a good idea for me to get to grips with something like that before venturing into GCC and the Acorn/Castle/ROOL tools under RISC OS (and also disregarding the small matter of not being a programmer).
  ^[ Log in to reply ]
 
Peter Howkins Message #116574, posted by flibble at 21:51, 10/2/2011, in reply to message #116558
flibble

Posts: 864
I thought it'd be a good idea for me to get to grips with something like that before venturing into GCC and the Acorn/Castle/ROOL tools under RISC OS (and also disregarding the small matter of not being a programmer).
Don't spoil yourself, things like IDEs and integrated debuggers will ruin you for the basicness of RISC OS dev.
  ^[ Log in to reply ]
 
qUE Message #116576, posted by qUE at 03:38, 11/2/2011, in reply to message #116558
qUE

Posts: 168
People who in my experience have produced dreadful compilers include: GNU, Greenhills, Borland, IAR, and loads others.
What's up with Borland? I thought it'd be a good idea for me to get to grips with something like that before venturing into GCC and the Acorn/Castle/ROOL tools under RISC OS (and also disregarding the small matter of not being a programmer).
I think rjek means the machine code output from the compiler. TBH you can't expect a compiler to weave any decent code anyway it's just impossible, it's limited by the rules and preset routines it's given, a compiler won't be able to read code and then understand what's going on with it like a human can. Of course the flip side of that is human error. But from personal experience a human will always produce faster, more optimal code than a compiler.
  ^[ Log in to reply ]
 
Rob Kendrick Message #116583, posted by nunfetishist at 11:20, 11/2/2011, in reply to message #116576
nunfetishist
Exposing morons since 1981

Posts: 484
I think rjek means the machine code output from the compiler.
No, I mean buggyness, painfulness to use, documentation, completeness of implementation *and* the performance of the code it outputs.
TBH you can't expect a compiler to weave any decent code anyway it's just impossible
Except this isn't true, of course. Compilers are able to produce excellent quality code (GCC on x86, LLVM etc), often better than humans can (see almost any compiler for IA64.)
But from personal experience a human will always produce faster, more optimal code than a compiler.
I bet a compiler will produce better code than you for all of Cortex-M4, IA64, HPPA, POWER, Alpha, SPARC, ARC, and MIPS smile

The quality of the instructions fed to the CPU is not the only performance issue. I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
  ^[ Log in to reply ]
 
qUE Message #116586, posted by qUE at 16:42, 11/2/2011, in reply to message #116583
qUE

Posts: 168
GCC on x86
Not entirely conviced GCC produces better code than other x86 compilers, but I'll take another look, long time since I've had a proper look at GCC.

I bet a compiler will produce better code than you for all of Cortex-M4, IA64, HPPA, POWER, Alpha, SPARC, ARC, and MIPS smile
All of these are on processors which are either long gone, or will be replaced by a newer range tomorrow, so why anyone would code for these is anyones guess tongue

The quality of the instructions fed to the CPU is not the only performance issue. I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
I was implying the coder actually knew what they were doing, not just anyone. I'm suprised anything ran slower than BBC BASIC, that is a feat. wink
  ^[ Log in to reply ]
 
Rob Kendrick Message #116587, posted by nunfetishist at 17:43, 11/2/2011, in reply to message #116586
nunfetishist
Exposing morons since 1981

Posts: 484
I bet a compiler will produce better code than you for all of Cortex-M4, IA64, HPPA, POWER, Alpha, SPARC, ARC, and MIPS smile
All of these are on processors which are either long gone, or will be replaced by a newer range tomorrow, so why anyone would code for these is anyones guess tongueErr... I can't tell if you're being ironic, or just painfully wrong.
The quality of the instructions fed to the CPU is not the only performance issue. I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
I was implying the coder actually knew what they were doing, not just anyone. I'm suprised anything ran slower than BBC BASIC, that is a feat. wink
I'd say two thirds of the people I know who insist on writing huge chunks of assembler by hand know how to write fast assembly code, but have no idea how to write a fast program.
  ^[ Log in to reply ]
 
Martin Bazley Message #121805, posted by swirlythingy at 23:53, 24/1/2013, in reply to message #116583

Posts: 460
I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
I didn't even know it was possible to unroll a logical shift.

        SUBS    R2,R0,#&1C
BPL |L00000420|
SUBS R2,R0,#8
BMI |L0000048C|
SUBS R2,R2,#4
MOVMI R1,#1
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#2
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#3
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#4
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#5
BMI |L00000464|
|L00000420|
SUBS R2,R2,#4
MOVMI R1,#6
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#7
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#8
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#9
BMI |L00000464|
SUBS R2,R2,#4
MOVMI R1,#&0A
BMI |L00000464|
SUB R2,R2,#4
MOV R1,#&0B
|L00000464|

I particularly like the little punchline at the beginning, where he pre-emptively jumps to the middle of the chain if necessary... one presumes to reduce the number of cycles.
  ^[ Log in to reply ]
 
Jeffrey Lee Message #123830, posted by Phlamethrower at 09:47, 11/5/2016, in reply to message #114921
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
GCC for ARM has been getting progressively more awful over the past 5 years or so. Roll on LLVM and Clang.
You say that, but...

#define LHS ((LHSReg == 15) ? R15PC : (state->Reg[LHSReg]))
#define R15PC (state->Reg[15] & R15PCBITS)
#define R15PCBITS (0x03fffffcL)

typedef struct {
int Reg[16];
} state_t;

int func(state_t *state, int LHSReg) {
return LHS;
}


->

clang -O3 -marm -S test.c --target=arm-linux-gnueabi -mcpu=cortex-a8


->

 cmp   r1, #15
ldrne r0, [r0, r1, lsl #2]
ldreq r0, [r0, #60]
biceq r0, r0, #-67108861
bx lr


LLVM/Clang (3.7) is almost certainly better than GCC at some things, but it seem that optimising ternary operators is not one of them.

I should probably check Norcroft when I get home, hopefully that can restore my faith in humanity.
  ^[ Log in to reply ]
 
Jeffrey Lee Message #123833, posted by Phlamethrower at 21:27, 11/5/2016, in reply to message #123830
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
I should probably check Norcroft when I get home, hopefully that can restore my faith in humanity.
Bah!

 CMP   a2,#&f
LDRNE a1,[a1,a2,LSL #2]
LDREQ a2,[a1,#&3c]!
BICEQ a1,a2,#&fc000003
MOV pc,lr


Another fun way I've discovered of generating horrible code is to have a switch statement where all the cases contain identical code. Sometimes you'll get lucky and find a compiler which will share the code between all the different case statements, but I'm yet to find a compiler which is smart enough to eliminate the jump table altogether.

[Edited by Phlamethrower at 22:28, 11/5/2016]
  ^[ Log in to reply ]
 
Simon Willcocks Message #123873, posted by Stoppers at 13:50, 31/10/2016, in reply to message #123833
Member
Posts: 278
Still a recent discussion!

int func2(state_t *state, int LHSReg) {
int reg = state->Reg[LHSReg];
if (LHSReg == 15)
reg = reg & R15PCBITS;
return reg;
}

arm-none-eabi-gcc -c x.c -O4

00000014 <func2>:
14: e7900101 ldr r0, [r0, r1, lsl #2]
18: e351000f cmp r1, #15
1c: 03c003ff biceq r0, r0, #-67108861 ; 0xfc000003
20: e12fff1e bx lr
  ^[ Log in to reply ]
 
Jeffrey Lee Message #123877, posted by Phlamethrower at 19:42, 31/10/2016, in reply to message #123873
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15057
Yes, if you restructure the code then you can get the compiler to produce something sensible. But the point is that the compiler should be able to spot a lot of these transformations itself. It's incredibly basic CSE which the compilers are failing at, for example:

#define THING (fudge*fudge)
#define R15PCBITS (0x03fffffcL)

int func(int LHSReg, int fudge) {
if (LHSReg == 15) {
return THING & R15PCBITS;
} else {
return THING;
}
}


Any sane person would calculate THING once and use the result in both halves of the if. But that appears to be beyond the capabilities of GCC.

func(int, int):
cmp r0, #15
muleq r1, r1, r1
biceq r0, r1, #-67108861
mulne r0, r1, r1
bx lr


Unless you change THING to something simpler like 'fudge+fudge', in which case you get:

func(int, int):
cmp r0, #15
lsl r0, r1, #1
biceq r0, r0, #-67108861
bx lr


Lord knows why it would decide to optimise one but not the other.

http://gcc.godbolt.org/ is a fun place to go if you feel like breaking compilers. Just remember -marm when testing ARM output otherwise it may be being crippled by thumb compatibility.
  ^[ Log in to reply ]
 
Rob Kendrick Message #123878, posted by nunfetishist at 11:06, 1/11/2016, in reply to message #123877
nunfetishist
Exposing morons since 1981

Posts: 484

Any sane person would calculate THING once and use the result in both halves of the if. But that appears to be beyond the capabilities of GCC.

func(int, int):
cmp r0, #15
muleq r1, r1, r1
biceq r0, r1, #-67108861
mulne r0, r1, r1
bx lr


To be fair, this code does only calculate it once smile Still, there's a cycle that could be saved.
  ^[ Log in to reply ]
 
Adrian Lees Message #123885, posted by adrianl at 18:36, 9/11/2016, in reply to message #123878
Member
Posts: 1565
To be fair, this code does only calculate it once smile Still, there's a cycle that could be saved.
<uber pedant>
Two cycles, but you have to transpose the input arguments such that you can use:

mul r0,r0,r0
cmp r1,#15
biceq r0,r0,#-67108861
bx lr

Promoting the mul then mitigates the result-use dependency for some implementations, although also note that the result is unpredictable for pre-v6 architectures, as is the 'muleq r1,r1,r1' since Rd == Rm.
</uber pedant>
  ^[ Log in to reply ]
 

The Icon Bar: Programming: Code GCC produces that makes you cry #12684