
ARM code optimization for the GP32

sang · Published in GP32 · 4 Mar 2024

Okay, so here goes:

The GP32 has a Samsung chip with, among other things, an ARM920T core. I suggest you check www.arm.com and download the proper docs there. The core is similar to the ARM9TDMI core (the GBA has an ARM7TDMI core, so GBA coders will feel right at home) but has some extra features which might be a bit confusing. I read some docs, figured some things out, and here are my findings.

WARNING!

I gathered this info from various docs, so it may well be erroneous. If you spot errors in this doc, don't mail me, because I never read mail, especially not from people I don't know. It's best to just bother me on #gp32dev over at efnet (which is a good place to hang out anyway). I haven't benchmarked any of this, so if anybody out there wants to verify this data then please do so; you'd make me very happy.

Thanks to exoticorn for teaching me all about ARM code and mr_spiv, darkfader, mp and others for helping me out here and there. I hope this doc will help people write fast code for the GP32. Maybe a decent megadrive/genesis emu?

and now for the goodies:

1) cycle timings

Most instructions take 1 cycle to complete with a few exceptions such as multiplication and memory access. I'm not gonna doc all timings here but I'll just tell you about the ones that matter for speed optimizing:

- MUL takes 1 cycle if bits [31:8] of the multiplier operand are all zero or all one, 2 if bits [31:16] are all zero or all one, and so on. So a mul (or mla, smull, smlal) takes 1-4 cycles depending on what you multiply with. According to the ARM docs, the operand that matters here is the last one (Rs in mul Rd,Rm,Rs). This means there can be a difference between

mul r0,r1,r2

and

mul r0,r2,r1

They produce the same result, but one can be faster than the other. This is important knowledge when you're doing stuff like bilinear filtering or coding a matrix mul etc.
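To make the rule above concrete, here is a minimal C sketch of the early-termination check. The function names `mul_cycles` and `should_swap` are made up for illustration, and the cycle numbers are just the 1-4 rule quoted above, not measured timings.

```c
#include <stdint.h>

/* Multiplier cycles (1-4) for a given multiplier operand, following the
   rule quoted in the text: the more leading bits that are all zero or
   all one, the earlier the multiply terminates. */
unsigned mul_cycles(uint32_t operand)
{
    if ((operand >> 8) == 0 || (operand >> 8) == 0xFFFFFFu)
        return 1;               /* bits [31:8] all zero or all one  */
    if ((operand >> 16) == 0 || (operand >> 16) == 0xFFFFu)
        return 2;               /* bits [31:16] all zero or all one */
    if ((operand >> 24) == 0 || (operand >> 24) == 0xFFu)
        return 3;               /* bits [31:24] all zero or all one */
    return 4;
}

/* Nonzero if swapping the two source operands of "mul rd,rm,rs" would
   put the cheaper value in the multiplier (last) position. */
int should_swap(uint32_t rm, uint32_t rs)
{
    return mul_cycles(rm) < mul_cycles(rs);
}
```

So if you know one factor is always small (a texture coordinate fraction, a filter weight), it probably pays to make sure it ends up as the last operand.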

- LDR is a shady little bugger. If you read the ARM920T manual it says the following:

  • 1 cycle - normal case
  • 2 cycles - next instruction uses result (load-use interlock)
  • 3 cycles - loaded byte, halfword or unaligned word used by next instruction (2 cycles load-use interlock)
  • 5 cycles - load into PC (r15)

However, this is not the entire story. These cycle timings apply only when the data is read from cache. (If you read from a memory address that doesn't have cache turned on, ARM9TDMI cycle timings apply. Don't worry about that though, since cache is turned on when you boot your GP32.) I'll explain more about cache, cache misses and such later on in the doc.

- STR is simple: it takes 1 cycle. But again there's a catch. I'll explain more about this later.

2) cache!

The ARM920T has two pieces of cache: data cache and instruction cache, which I shall call dcache and icache respectively. Each cache is 16 kilobytes, divided as follows:

8 segments * 64 cache lines * 8 words * 4 bytes = 16 kilobytes

When a piece of memory is read, a check is done whether or not this data is in cache. If the desired data is in cache we call it a 'cache hit' and the data is simply read from cache. If not, it's a cache miss: 8 words of data are read from the target address into a cache line (cache line fill), and the data is then read from cache. The problem is that a cache line fill takes a lot of time to complete. Right now, I don't know how many cycles it takes (mr_spiv should look it up, since he claims he has read about it somewhere) but I can guess. The databus is 16 bits wide, so a cache line fill of 8 words (32 bytes) needs 16 bus transfers and should take at least 16 cycles.
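The arithmetic above can be written down as a small C sketch. The constants come straight from the text (including the 16-bit databus); the function names are my own, and `line_fill_transfers` only counts bus transfers, which is a lower bound and not a measured cycle count.

```c
#include <stdint.h>

enum {
    SEGMENTS          = 8,
    LINES_PER_SEGMENT = 64,
    WORDS_PER_LINE    = 8,
    BYTES_PER_WORD    = 4,
    BUS_WIDTH_BYTES   = 2   /* 16-bit databus, as stated in the text */
};

/* Total size of one cache (icache or dcache) in bytes. */
unsigned cache_size_bytes(void)
{
    return SEGMENTS * LINES_PER_SEGMENT * WORDS_PER_LINE * BYTES_PER_WORD;
}

/* Size of one cache line in bytes. */
unsigned line_size_bytes(void)
{
    return WORDS_PER_LINE * BYTES_PER_WORD;
}

/* Minimum number of bus transfers needed to fill one cache line. */
unsigned line_fill_transfers(void)
{
    return line_size_bytes() / BUS_WIDTH_BYTES;
}

/* Start address of the cache line that contains addr. */
uint32_t line_base(uint32_t addr)
{
    return addr & ~(uint32_t)(line_size_bytes() - 1);
}

/* Nonzero if both addresses fall into the same cache line, i.e. the
   second access is a hit once the first one has filled the line. */
int same_line(uint32_t a, uint32_t b)
{
    return line_base(a) == line_base(b);
}
```

The practical upshot: data you touch together should sit in the same 32-byte line, so one line fill pays for as many reads as possible.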

Thankfully cache can be locked! You can actually load data into cache and lock it so that it won't be overwritten by any new cacheline fills. This guarantees 100% cache hits for the locked data. This can be handy for small textures and tables.

3) writeback

The writeback buffer is a very handy thing. If an str is executed, the data is not stored to memory directly. Instead it goes into the writeback buffer, and the writeback buffer writes it to memory when the databus is free or when it is forced to do so. In the meantime the processor can execute other instructions. If for instance you're writing a 32-bit word to the GP32's memory, it has to travel through a 16-bit databus. This means the first 16 bits are transferred in the first cycle and the second 16 bits in the second cycle. Without a writeback buffer the cpu has to wait 1 extra cycle for the second transfer to complete. With a writeback buffer the cpu can go on about its business and perform other instructions while the writeback buffer takes care of the transfer.

You have to be careful though. If you read from an address that you've just written to, and that write is still in the buffer, the writeback buffer is forced to clear, and the cpu has to wait until all data from the writeback buffer is written to memory, because otherwise you'd be reading old data. Also, if you put too much stuff in the writeback buffer it can get full, and you'll again have to wait for it to clear.
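Here is a toy C model of the "buffer gets full" case described above. Everything in it is an assumption for illustration: `WB_DEPTH` and `DRAIN_CYCLES` are made-up numbers (2 cycles per store matches the 32-bit word over a 16-bit bus example), and the real hardware surely behaves in more complicated ways. It only shows why back-to-back stores can stall while spaced-out stores don't.

```c
#define WB_DEPTH     4   /* ASSUMED number of stores the buffer can hold */
#define DRAIN_CYCLES 2   /* ASSUMED cycles to write one store to memory  */

typedef struct {
    unsigned cycle;       /* cpu cycles elapsed                         */
    unsigned queued;      /* stores currently waiting in the buffer     */
    unsigned drain_left;  /* cycles left for the store being written    */
} wb_state;

/* Advance one cycle: the buffer keeps draining in the background. */
static void wb_tick(wb_state *s)
{
    s->cycle++;
    if (s->queued > 0) {
        if (s->drain_left == 0)
            s->drain_left = DRAIN_CYCLES;   /* start the next store */
        if (--s->drain_left == 0)
            s->queued--;                    /* store reached memory */
    }
}

/* Cycles needed for `stores` STRs, each followed by `gap` cycles of
   other work. The cpu stalls only while the buffer is full. */
unsigned simulate_stores(unsigned stores, unsigned gap)
{
    wb_state s = {0, 0, 0};
    for (unsigned i = 0; i < stores; i++) {
        while (s.queued == WB_DEPTH)
            wb_tick(&s);                    /* stall: buffer is full */
        s.queued++;                         /* the str itself        */
        for (unsigned k = 0; k <= gap; k++)
            wb_tick(&s);                    /* 1 cycle str + gap     */
    }
    return s.cycle;
}
```

In this model, 8 stores with one instruction of other work between them never stall, while 8 stores in a row pick up a wait cycle once the assumed 4 entries are used up. The takeaway is to interleave stores with other instructions where you can.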

I don't know whether the writeback buffer will write to the databus in the same cycle as the str is performed, so any insights on this would be appreciated.

Optimizing for the GP32

(or any similar system that uses the arm920t core)

1) unrolling code

I've done some benchmarking with regard to locking code/data in cache, and as it turns out locking code/data in cache isn't a noticeable speedup, since the cache handles things pretty well on its own. So I suggest you don't bother with that stuff too much. I unroll most of my code to speed things up.
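As a minimal sketch of what unrolling looks like, here is a fill loop in C that writes four 16-bit pixels per iteration instead of one. The function name `fill_unrolled` and the unroll factor of 4 are arbitrary choices for the example, not something from the benchmarks mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Fill `count` 16-bit pixels with `value`, four per loop iteration.
   Fewer iterations means fewer compare-and-branch instructions. */
void fill_unrolled(uint16_t *dst, uint16_t value, size_t count)
{
    size_t blocks = count / 4;   /* full groups of four pixels */
    size_t rest   = count % 4;   /* leftover pixels at the end */

    while (blocks--) {
        dst[0] = value;
        dst[1] = value;
        dst[2] = value;
        dst[3] = value;
        dst += 4;
    }
    while (rest--)
        *dst++ = value;
}
```

Unrolling trades code size for fewer branches, so an extremely large unroll eats into the 16 kilobytes of icache; something in between usually makes sense.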

2) division

The ARM920T has no divide instruction, but there's a nice trick for this. Note that

y / x == y * (1 / x)

If x is always in a set range, a table with fixed point entries can be created for 1/x, and you can simply multiply. This means you can div in only a few cycles (read+mul). For instance, if you have a polygon routine and you need to calculate the edge deltas, you need to divide by the height of the polygon. Polygons are usually not higher than 1024 pixels so a table of 1k entries would suffice.
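A minimal C sketch of the table trick, assuming 16 fractional bits and divisors from 1 to 1024. The names `recip_init` and `div_by_table` and the choice of `RECIP_BITS` are mine. Note that the result is an approximation: because the table entries are truncated, the answer can come out 1 lower than a real division, which is usually fine for edge deltas kept in fixed point.

```c
#include <stdint.h>

#define RECIP_BITS 16     /* ASSUMED fixed-point precision of the table */
#define RECIP_MAX  1024   /* largest divisor, e.g. max polygon height   */

static uint32_t recip_table[RECIP_MAX + 1];

/* Fill the table once at startup: entry x holds (1 << 16) / x. */
void recip_init(void)
{
    recip_table[0] = 0;   /* dividing by zero is the caller's problem */
    for (uint32_t x = 1; x <= RECIP_MAX; x++)
        recip_table[x] = (1u << RECIP_BITS) / x;
}

/* Approximate y / x with one table read and one multiply.
   Requires 1 <= x <= RECIP_MAX and a prior call to recip_init(). */
uint32_t div_by_table(uint32_t y, uint32_t x)
{
    /* 64-bit product so large y values don't overflow. */
    return (uint32_t)(((uint64_t)y * recip_table[x]) >> RECIP_BITS);
}
```

For better accuracy, pass y in as a fixed-point value (for instance the x difference shifted left by 16), so the small error lands in the fractional bits instead of the integer part.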

3) use the writeback buffer wisely

Count the number of cycles the writeback buffer takes to complete a write and try to avoid forced clears. Easy enough :)
