Copy Link
Add to Bookmark
Report

Commodore Hacking Issue 21

eZine's profile picture
Published in 
Commodore Hacking
 · 21 Jun 2019

 

########
##################
###### ######
#####
##### #### #### ## ##### #### #### #### #### #### #####
##### ## ## #### ## ## ## ### ## #### ## ## ##
##### ######## ## ## ## ##### ## ## ## ## ##
##### ## ## ######## ## ## ## ### ## ## #### ## ##
##### #### #### #### #### ##### #### #### #### #### #### ######
##### ##
###### ###### Issue #21
################## Feb 5, 2002
########
Special Focus on Minigames

...............................................................................

"Do not meddle in the affairs of wizards, for they are subtle and
quick to anger."
-- Ancient Elvish saying

...............................................................................

BSOUT

(Why yes, I _am_ re-reading Lord Of The Rings, how did you know? Sure, the
movie is pretty good, not the same as the book, but probably the best you
could do, although why does every Hollywood evil creature essentially come
from Night of the Living Dead? But hey, the elves actually looked like elves.
Black Hawk Down is really good too, and... oh, sorry!)

Hoo-ah! Welcome to another issue of C=Hacking!

There's lots of nifty stuff to get to so this will be brief. First, the
"Hacking Exchange" is now up at the Official Unofficial C=Hacking Homepage:

http://www.ffd2.com/fridge/chacking/

It's just a simple message board, with the idea that C=Hacking types can
make comments, ask questions, and otherwise talk to one another. Check
it out!

Second, if you have any projects you're working on... please contact me,
and consider writing them up for C=Hacking!

And finally, something I think you'll enjoy:


http://groups.google.com/groups?hl=en&th=87c2a2ced7e32ce1&rnum=42

From: duck@clumsy.pembroke.edu (duck@clumsy.pembroke.edu)
Subject: C= Hacker's Net-Mag
Newsgroups: comp.sys.cbm
Date: 1992-02-08 14:52:20 PST


Due to the recent influx of "Tech-Know" (People who actually understand /
hack te C=64 into doing stuff that was previously unknown) I am going to
tentativly start a C= Hackers Net-Mag (Hacking in the 90's since of the word,
not the 80's).
So - please send me your netmail address if you're interested in receiving
it.
Or if you are interested in contatcting me concerning an article etc that
you'd like to see distributed please also contact me.

For the first issue I'm tentatively planning on:
- Line Drawing on 8563 VDC (640 x 200 hi-res graphics)
- Beginning ML column
- Raster Article
. . .

- Craig Taylor
duck@pembvax1.pembroke.edu

P.S. - Not sure when the first is gonna go out but hopefully soon.


From: duck@clumsy.pembroke.edu (duck@clumsy.pembroke.edu)
Subject: Issue 1 - C= Hacking Available
Newsgroups: comp.sys.cbm
Date: 1992-02-27 17:22:44 PST


Issue 1 of C= Hackers is now available via NETMAIL and is a compilation of
several articles on the tehnical side of the Commodore 64 and 128. For those
of you who missed the first posting, please reply via email and ask to be put
on the list. The first issue had programming in ml, documented and
undocumented 6502 opcodes, and a line-drawing package in machine language for
the C=128 hi-res screen.
Issue 2 will be coming out in a month or so. Many thanks for the bandwith.

- Craig Taylor
duck@pembvax1.pembroke.edu

.......
....
..
. C=H 20

::::::::::::::::::::::::::::::::::: Contents ::::::::::::::::::::::::::::::::::

BSOUT
o Voluminous ruminations from your unfettered editor.

Jiffies
o News, rumours, and stuff.


Side Hacking

o "Pulse Width Modulation, continued" by various.
Tying up some loose ends from last issue's digi article.

o "Introducing Full-Screen IFLI mode with a SuperCPU", by Todd
Elliot <eyethian@msn.com> (Hey, what's with all this MSN crapola? :)
Using a SuperCPU, it is possible to use the first three columns
of an (I)FLI picture, and Todd shows how.


Main Articles

o "VIC-20 Kernel ROM Disassembly Project, part IV", by Richard Cini
<rcini@msn.com>

And now it's time to start on that most frightening of creations:
the tape drive code!


o "The Art of the Minigame" -- an article in six parts:

Introduction, by the editor
Part 1: Codebreaker, by David Holz <whiteflame52@yahoo.com>
Part 2: TinyPlay, by S. Judd <sjudd@ffd2.com>
Part 3: MagerTris, by Per Olofsson <magervalp@cling.gu.se>
Part 4: Compressing Tiny Programs, by S. Judd <sjudd@ffd2.com>
Part 5: TinyRinth, by Mark Seelye <mseelye@yahoo.com>
Part 6: Tetrattack!, by Stephen Judd <sjudd@ffd2.com>


.................................. Credits ...................................

Editor, The Big Kahuna, The Car'a'carn..... Stephen L. Judd
C=Hacking logo by.......................... Mark Lawrence

Special thanks to the cbm-hackers for many otherwise unacknowledged
contributions.

Legal disclaimer:
1) If you screw it up it's your own fault!
2) If you use someone's stuff without permission you're a dork!

For information on the mailing list, ftp and web sites, send some email
to chacking-info@jbrain.com.


................................... Jiffies ..................................

$01 Jochen Adler has made a program that reads the second side of
a 1541 disk in a 1571 - without turning the disk over. It reads the
blocks from end to beginning. Because of the mechanical bump however,
it can only read tracks 5 to 35. If anybody wants this program please
e-mail Jochen (NLQ@gmx.de)

$02 Soci/Singular has been working on a commented C128 ROM listing. Check
out this great effort at http://singularcrew.hu/c128rom/

$03 64net/2 has been updated:

For those that do not know: 64net/2 is yet another PC-to-C64/128 User<->LPT
parallel cable software. It supports d64/d71/d81/t64/lnx/dhd disk images,
raw files and own Internet partition. It is possible to enter disk images
like any other directory. Client programs are provided for C64 and C128.
The next goal will be to patch ROM instead of loading client.

A small BASIC example program is included that is able to send e-mails
through 64net/2 host or any other machine if its IP is known. How many
e-mail agents are there for C=? How many of them are written in plain
BASIC 2.0? :)

http://sourceforge.net/projects/c64net/ contains the latest version, and

http://venus.wmid.amu.edu.pl/~ytm/64net2win.tar.gz contains a Windows
binary.

$04 CC65 is up to version 2.7.0, with many new improvements. Check it out
at http://www.cc65.org/

$05 Moreover, Ullrich has set up his C64 as a web server at

http://c64.cc65.org/

The web server runs on a stock 64, using a Swiftlink for communications,
and uses the uIP TCP/IP stack written (with cc65) by Adam Dunkels

http://dunkels.com/adam/uip

Pretty cool, eh?

$06 A new IDE interface has become available:

Elysium is proud to announce a new software and hardware solution for
your Commodore mass data storage needs.

The CIA-IDE is yet another approach to connect an IDE hard drive to
C64/128. It differs from previous similar projects in these areas:

- it is the simpliest one (only two chips required (or just one in case of
C128))
- it is free (documentation and software),
- the software is available in source code form under GNU GPL,
- there _exists_ a ready to use GEOS 2.0 driver.

Documentation, source codes, and binaries are at:

ftp://ftp.elysium.pl/groups/Elysium/Projects/ciaide/

$07 Marko Makela has developed a tape drive emulator with an RS-232 port,
allowing transfers between any computer with a tape port and any
computer with an RS-232 interface (e.g. a PC or Swiftlink). The hardware
and software is at

http://www.funet.fi/pub/cbm/crossplatform/transfer/C2N232/

$08 GoDot -- the C64 image processing program -- is now public domain. Arndt
will still be working on it, but it's now available at

http://members.aol.com/howtogodot/godnews.htm

$09 The C64 is now listed in the Guiness Book of World Records; a scan of
the page (from Robert Bernardo, posted by Frank Michlick) is at

www.cupid.de/upload/famous.jpg

(love that line about "16K sound").

Also, I highly recommend taking Robert's advice and checking out the
infinitely cool "Logo-Matic" on the main www.cupid.de site!

$0A JOS just keeps cruising along, with lots of changes. Among the biggest
changes, of course, is the announcement that JOS will be merged with
Clips, with the new system to be called "Wings" (with some of the letters
capitalized and some not; I never remember that stuff). For the latest
JOS news, check out

http://www.jolz64.cjb.net/

$0B Frank Kontros has made a commented disassembly of the C64 ROM, with the
BASIC ROM on the way:

http://c64.uz.ua/sources/C64_Kernal_Disassembly.zip


$0C And finally, Aleksi Eeben, author of two minigames, has now written a
VIC-20 game:

http://www.student.oulu.fi/~aeeben/download/dragonwing.zip

more screenshots:

http://www.student.oulu.fi/~aeeben/screen1.png - screen4.png

Aleksi's homepage is at http://www.cncd.fi/aeeben

Neat!

................................ Side Hacking ................................


Pulse Width Modulation, continued
--------------------------------- from various

The digi article in issue #20 of C=Hacking left a few loose ends, and
generated some followups.

First, Otto Jarvinen (sounddemon) emailed to say that the SID detection
routine occasionally reported incorrect results for him, and suggested that
a workaround was to do the detect several times. YMMV!

Second, a day or two after issue #20 was released, Levente discovered a
brilliant way to play 6-bit PWM digis on a stock machine:

--
I couldn't resist, and tried something out (see attachment). It works!!! :-)

In fact, when I wrote the last letter I didn't know that I found something
useable, just had some ideas - I felt that I'm at the right place. When I read
C=H 20 this morning and read your comment about the Test bit (from the PRG), I
knew that it must work. All I had to do is then to put this idea into code.

The whole idea is about starting the pulse by software, and then having the
SID turn it back to 0 after a time.

Is it possible? ...The keys are the Test bit (the SID wave counter can be
reseted anytime), the pulse width register, the wave counter and the SIDs way
of generating pulse wave. (Ie. the pulse wave is high, as long as the wave
counter is less than the value in the pulse width register).


Check this algorithm:

- Init: volume at max, voice 1 sustain level max, start attack. Freq is
selected well (=$4000), so the wave counter is incremented by 4 every
processor clock cycles.

Loop:
- load next sample value, and put it to the pulse width low register ($d402;
ensure that $d403 is 0).
- Set test bit, and clear test bit (counter reset).
- Increase sample pointer, some delay, then loop. The delay must be 64 clock
cycles + the time while the Test bit is kept set (4 cycles if using STA $d404
: STX $d404 immediately with pre-loaded values).

What will happen? The 8-bit sample value is put directly to the pulse width
register (MSBs of the pulse width register are cleared!...). The wave counter
is started (release test bit), and it increases 4 by every CPU cycles (=
counts 256 in 64 cycles). After some time, the counter will reach the value in
the pulse width register. This happens in exactly after (8-bit sample value /
4) cycles, because of the above. In this cycle (or the next?...) the SID turns
its pulse output to 0. Voil·!

One must just make sure that the loop length in cycles matches the above
conditions, and then it runs like hell... Since it does exactly the same on
the SID as the other (bit-banging) way, it just does it with some hardware
help, there's also no problem with the 4khz maximum barrier (since the
oscillator is reset every loop).

With little enhancement, it's possible to write an about 7.5 bits player for a
stock C64 by this method. This is what you find in the attachment... The idea
is using all the 3 channels simultaneously. A slightly increased sample value
is written to the three pulse width registers, so the oscillators will finish
the duty cycle one processor cycle later, when there's a carry between
bits(0,1) to the MSBs.

The replay freq is the CPU clk / 68 (~15khz). 64 cycles (variable duty cycle)
+ 4 cycles (constant duty cycle because of the reset time - no problems with
that, it doesn't change (just gives a small constant DC...)).

By similar methods, it should be possible to write a sample player with higher
PWM freq (with less resolution of course, but eliminating this still audible
whistling).

(I tried using the filter to reduce it, but it sounded so bad that I left it
out. It clicked like hell. The FETs got saturated.)

[Richard Atkinson suggested turning down the sustain volumes to avoid this]

See the attachment, and the binary. I think the sample sounds pretty good :-).
(The cut is from 'Greece 2000' by Three drives on a vinyl).

(Another idea that popped up in my mind: since the TED sound generator can
also be reset, I could probably translate this idea to the Plus/4 :-O ).

Best regards,

Levente

--

The binary is available at http://www.ffd2.com/fridge/chacking/ towards the
bottom of the page.

Third, I received a very interesting email from an Apple-II guy, which I'd
like to pass on:

--

Hi!

I found your page as I was searching for something else 6502-related,
and was very interested. Although I have always been aware of the
C64, I have never really been a user--I have used Apple II's since 1980.

I was particularly interested in the article on playing "digis" on the
C64. I became interested in playing digitized sounds on the Apple II
in 1993, after hearing a 3-bit, 11.025 KHz PWM player. At 3 bits, you
can imagine how noisy speech samples were, but the overall effect
for a 1 MHz machine with a 1-bit speaker "toggle" was amazing. It
made me wonder how far this PWM technique could be pushed on a
stock, 1 MHz Apple II (not the somewhat faster, 65816-based IIgs).

The short answer is, much farther than I expected! Robin and Stephen
accurately describe the theoretical PWM limit as 6 bit samples at
about 16 KHz for a stock 1 MHz machine, but, as they point out,
that is not practically realizable for a number of reasons, unless the
play loop is completely unrolled!

Furthermore, in the Apple II world, sampled sounds have acquired a
few standardized sampling rates--mostly as a result of Mac influence,
which was in turn influenced by CD's. The most common rate in the
Apple II world is 11.025 KHz, or one-fourth of the audio CD sampling
rate. This is commonly considered to be "AM radio quality", with a
Nyquist bandwidth of about 5.5 KHz and a practical bandwidth of
4+ KHz, given practical anti-aliasing filters (at the sampling end, not
the playback end).

A frequency of 11.025 KHz is, though high, still painfully audible to
people whose ears are not zonked--a piercing "squeal" running
through every sound. So even though it is possible to write a
practical 6-bit 11.025 KHz PWM player (usually called a SoftDAC
in the Apple II world), the resulting listening experience is disappointing.

So I went to work on a way to do 2x oversampling, and built a 5-bit
22.050 KHz PWM player. It was sad to lose a bit, but the absence
of any audible "carrier" more than compensated for it!

If you have access to an 8-bit Apple II (preferably with lower case,
like a //e), and also preferably with a way of attaching an external
speaker or headphones in place of the miserable 2.75" internal
speaker, then you can easily give it a try and judge for yourself.

I'm pretty proud of the novel design of the code, which I would
characterize as "
vectored" unrolled loops, one for every two
pulse duty cycles, which I wrote a BASIC program to write
for me--much less painful for counting cycles!

The package is available on the web at:

http://members.aol.com/MJMahon/index.html

and is called <A HREF="
http://members.aol.com/MJMahon/sound22.shk">Sound Editor v2.2</A>, since I had to "dress up" the player
into something fun to play with. ;-) An earlier version of Sound Editor
was published on SoftDisk in 1994, IIRC, but this one is a little more
evolved. It also introduced 2:1 ADPCM compression of 8-bit sampled
sounds, to save disk space. It is a lossy compression, but not very
noticeably. The editor package also includes those routines, in 6502
assembly code.

All of this should be trivially adaptable to the stock, 1 MHz C64, with
very good results. By using the filters, you could probably filter out
the 11.025 KHz carrier and return to 6-bit accuracy!

I should note that in the Apple world, sampled sounds are usually
represented as "
excess-128" codes, which means that the sign bit
is inverted. This actually simplifies things, since the sample value
is within a few shifts of being the pulse width in cycles.

Let me know what you think!

-michael

--

(Always great to hear from Atari and Apple ][ folks!)

And finally, I have a little mathematical analysis of PWM and how it compares
to a "
straight" digi. Basically, I found some of the PWM explanations a
little unconvincing in issue #20 (even though I wrote them!). For example,
the idea of "
average voltage" seems a little funny, since every two samples
has an "
average voltage", as does every four, etc. but that set of average
voltages would give a different sounding signal than the original (or
more dramatically, there is an average voltage over a full second of digi
playback, but that's not what you hear!). So I wanted to know how a
PWM signal _really_ compares to a straight digi playback.

Another issue is changing the amplitude of a PWM digi, i.e. using two
pulse waveforms, with one 1/16 the value of the other, to get higher
resolution. If you recall the discussion of digis, the resolution of a PWM
digi depends on the number of pulse widths available, not the amplitude.
Adding two PWM waveforms together does not change the number of pulse widths
available, so I wanted to figure out what changing the amplitude _really_
does to a PWM digi, and if it can really be exploited.

And finally, I wanted to know about the carrier wave (that is so piercing
at lower playback frequencies) -- and once again, how it compares with a
standard digi (which, after all, is stair-stepping the voltages at the
playback rate).

Since the rest of this article is some Fourier analysis that 99% of people
will have zero interest in, I'll put the conclusions here. The first is:
PWM digis and standard digis are essentially identical except at higher
frequencies (except for a phase shift, which doesn't make any difference to
your ear). The second is: changing the amplitude of a PWM changes the
resolution. More specifically, the amplitude of the pulse multiplies the
digi sample value. If two pulses can be synced close enough, it should
indeed be possible to use two pulses to get a higher resolution. Moreover,
by modulating the amplitude of a single PWM digi, using the $d418 volume
register -- that is, using PWM _and_ $d418 -- it should be possible to get a
higher dynamic range, something that should be a little more achievable using
SID (but maybe not that useful, so I didn't try it out). And finally, a
standard digi has zero amplitude at the carrier frequency.

In other words, after a lot of effort I was able to demonstrate what everyone
already knows.

The analysis doesn't change anything from the previous articles (except
possibly the idea for changing the PWM amplitude to get more dynamic range).

And now, some Fourier analysis. A standard digi just sets the voltage to
the sample value s_j, for a length of time dt (dt = 1/sample rate). The
Fourier transform of a single sample s_j (occuring at time t_j) is


s_j [e^(-iw dt) - 1] * [e^(-iw t_j) / -iw]


where w = angular frequency. Since the above is a little hard to read, I'll
say it in words. The first term is the sample value s_j, which scales
amplitudes at all frequencies. The second term is due to the finite length
of the pulse (evaluating the Fourier integral at the boundaries), and
basically changes the phase of the transform. The third term is like
sin(w)/w -- a sinusoid with decreasing amplitude as frequency increases.
So: the transform goes like sin(w)/w times the sample value, with some phase
effects thrown in (we'll get back to these in a moment).

A PWM digi sets the duty cycle of a pulse to the sample value s_j, giving
a Fourier transform of

[e^(-iw s_j dt) - 1] * [e^(-iw t_j) / -iw]

Compare this with the earlier expression, and you'll see that the sample
value s_j has moved up in to the exponent of the "
phase term" but that
they're otherwise the same.

The first thing to do is to show that both expressions, PWM and standard,
reduce to the same thing -- that is, that a PWM and a standard digi sound
the same! The expressions both decrease as 1/frequency, due to the
sin(w)/w term. This means that at large frequencies the values become
negligible. (How large? For example, if the sample frequency is just 1KHz,
then sin(w)/w is .001 times smaller near w=1KHz (i.e. the sample frequency,
which is twice the Nyquist limit) than it is near w=0).

So now consider the phase terms for small w. The Taylor expansion for e^x is

1 + x + x^2/2 + ...

We can therefore expand the "
phase terms" as

regular: e^(-iw dt) - 1 = (1 - iw*dt + w^2 dt^2/2 + ...) - 1
= -iw*dt + O(w^2 dt^2)

pwm: e^(-iw s_j dt) - 1 = -iw*s_j*dt + O(w^2 dt^2)

where O(w^2 dt^2) is considered very small since w and dt are both small.
Substituting the above into the original expressions gives

s_j*iw*dt [e^(-iw t_j) / iw]

in both cases. That is, we have shown that for "
small" frequencies -- more
specifically, for frequencies where (w^2*dt^2) is much smaller than (w*dt),
which is where w*dt<1, which is frequencies less than the sample frequency,
which is all frequencies of interest! -- PWM and standard digis are the same.

The explanation lies in the phase terms. Those "
phase terms"

[e^(iw dt) - 1] (regular)

and

[e^(iw s_j dt) - 1] (PWM)

do more than just change the phase. When they multiply the sin(w)/w signal,
they take the sin(w)/w signal, change the phase, and then subtract the
sin(w)/w signal again. It's this difference of signals that makes things
work out at the frequencies we care about. PWM and standard digis are _not_
the same, but the main differences are at higher frequencies, where the
amplitudes are in general much smaller.

But... but... what about the PWM carrier frequency? If we take a constant
digi, say with sample values = 1/2, the standard digi gives a constant
voltage, whereas a PWM digi gives a square wave at the sample frequency.
The answer comes from the "
phase terms" above. The sample frequency is

w = 2*pi/dt.

Substituting this into the phase terms gives

[e^(i*2*pi) - 1] (regular)

and

[e^(i s_j 2*pi) - 1] (PWM)

The regular expression is exactly zero -- there is _nothing_ at the
sample frequency of a regular digi. But that's not the case for the PWM
term, because of the s_j up in the exponent. PWM digis have a _finite_
amplitude at the carrier frequency. Note that because of the sin(w)/w
term it gets smaller as the sample frequency increases -- but it isn't zero.

Finally, the phase term expansions give some insight into what happens
when both the pulse width _and_ height are varied. If the pulse width
is s_j, and the height is set to h_j, then the Fourier transform becomes

h_j*s_j *iw*dt [e^(-iw t_j) / iw]

That is, the amplitude multiples the width. For the case of adding two
PWM waves together, then, the amplitude really does effectively scale the
sample value, and it should be possible to add one PWM value at 1/16 the
amplitude of another to get an effective 8-bit value.

What about _varying_ the amplitude of a single PWM sequence? For a 6-bit PWM
digi, say, the sample values s_j can go from 0 to 63. If this is then
multiplied by h_j=2 say, then the values become 0 2 4 ... 126 -- a 7-bit
number where the lowest bit is always 0. What use is that? Well, we still
have the h_j=1 values of 0..63, which do include the lowest bit. So we
can effectively change the dynamic range from 0..63 to 0..126 using just two
amplitude values.

As a practical matter, then, it might be possible to use all 15 $d018 values
available to get a big dynamic range, and hence a better sounding digi,
using fewer CPU cycles. Well, ok, we're only _sort of_ changing the dynamic
range, so I pretty much doubt the usefulness of it. But maybe someone out
there would like to give it a shot.

All right, let's hope this closes the book on pulse width modulation for
digi playback!
.......
....
..
. C=H 20

..............................................................................

Introducing Full-Screen FLI mode for the SuperCPU
Copyright (C) 2002 By Todd S. Elliott

The 'FLI Bug', where the first three columns of a FLI screen are essentially
unusable, can be squashed with the help of a SuperCPU. I won't go into great
detail on IFLI, as it has been well-documented elsewhere, but I'll begin with
a short summary to get us all up to speed. I refer you to Albert 'Pasi'
Ojala's excellent coverage of the FLI mode in C=Hacking #4. Pasi also
proofread this article.

A Three-Minute Summary of the FLI mode

The VIC-II chip asserts a badline when it needs to access the databus and
fetch character data or videomatrix data. It was discovered that the VIC-II
chip can be manipulated by its vertical scroll register at $d011 (SCROLY) to
induce a badline at any given rasterline. By having a badline at every visible
rasterline, the program can manipulate $d018 (VMCSB) to point at the right
videomatrix to achieve the maximum flexibility of colors given to a
multi-color screen.

Unfortunately, when a program forces a badline via SCROLY, the BA (Bus
Available) line in the computer goes high, and for three cycles the 6510/8510
processor has to finish its write operations or halt its read operations
before the BA line is released to the VIC-II chip. The maximum number of
successive write operations is three, hence the 3-cycle delay. It is in those
three cycles that the VIC-II does not fetch video matrix data to fill in the
first three columns and causes the 'FLI Bug'.

I wish to stress that in those first three cycles, when the BA line is high,
the 6510/8510 processor is still active and can complete write operations. It
isn't fully shut down. After the badline retrigger at STA SCROLY, the code
following it is fetched on the databus and is ready to be executed by the
6510/8510 processor. When BA is high, the VIC-II will reference the value
on the databus as videomatrix data and display it in the first three columns
of the screen. The actual instructions that follow the STA SCROLY in the FLI
loop constitutes the video matrix data for the first three columns of the
screen.

Enter the SuperCPU!

Normally, a VIC-II chip access is only possible every 4 cycles. The SuperCPU
can access the VIC-II chip in 1 cycle (1MHz) intervals, making cycle to cycle
changes possible within the VIC-II chip. More importantly, the SuperCPU
tristates the 6510/8510 processor inside the host Commodore computer (which
is a fancy way of saying that you can disconnect the processor from the
system without physically removing it).

When a forced badline retrigger occurs with a STA SCROLY in a FLI loop under
the SuperCPU, the BA signal inside the host Commodore computer goes high. But,
the SuperCPU runs asynchronously and really doesn't have to pay attention to
the host Commodore as it runs code after the STA SCROLY. In fact, the SuperCPU
will execute code even if the VIC-II badline is in full swing inside the host
Commodore computer.

I knew that the instruction opcodes left on the databus after the STA SCROLY
made up the video matrix data for the VIC-II chip for those first three
columns of the screen. But I wondered how this was possible in a SuperCPU
configuration because there would be no instruction opcodes left hanging on
the databus inside the host Commodore computer. After some discussions with
Per Olofsson ("
MagerValp"), he suggested that writes/reads to the i/o area
will force a value to be put on the databus.

This is where the magic begins, when the FLI loop forces the SuperCPU to write
to the i/o area of the host Commodore after the forced badline retrigger at
STA SCROLY. The SuperCPU will note that the BA signal is still high, so it can
still access the databus and stash values there via DMA. This BA high signal
will last for 3 cycles, enough for the SuperCPU to stash three values onto the
databus.

The 6510/8510 is still tristated by the SuperCPU, and there's nothing on the
databus after the forced badline retrigger at STA SCROLY. Normally, the
6510/8510 CPU shares the databus with the VIC-II for each machine cycle. With
the 6510/8510 CPU out of the equation, the SuperCPU can stash a value onto
this shared bus on the CPU half of this machine cycle and the VIC-II chip will
see it in its other half of the machine cycle.

However, the databus is only eight bits wide. The VIC-II chip fetches video
matrix data and color ram data 12 bits at a time. The SuperCPU can force
values onto the databus during the first three cycles after the forced badline
retrigger, but on each cycle the last four bits belonging to Color RAM would
not be fed to the VIC-II chip. Only pixel values of %10 and %01 can be
individually selected in multicolor FLI mode, while %11 pixel values cannot be
individually set for those first three columns of the screen. The high
resolution FLI mode does not suffer from this problem because it does not use
color RAM for color attribute information.

Full-Screen FLI in practice

Let's get down to the nitty gritty. The Write I/O Approach requires three
200-byte tables, corresponding to each column. Each value on those tables
correspond to each visible rasterline. For example, the first byte of each
table corresponds to rasterline 50, the second byte of each table corresponds
to rasterline 51, etc. The first table contains values needed for the first
column of the screen, the second table contains values needed for the second
column of the screen, and the third table values for the third column.

In the FLI display loop prior to the STA SCROLY command, the current
rasterline is used as an index to all three tables. The values are then
fetched from the tables and inserted into the code that follows the STA SCROLY
command using self-modifying code techniques. When the STA SCROLY happens, the
code that immediately follows it starts writing the values onto the databus,
all three in a row to complete the first three columns of the screen.

There is a disadvantage with this approach. It requires that three 200-byte
tables be specially constructed and stored somewhere in memory that is not
mirrored by the SuperCPU. A routine would have to read in a FLI graphics file,
extract information from the first three columns and store it into their
respective 200-byte tables.

Pasi Ojala came up with a graph depicting the SuperCPU interacting with the
VIC-II in action, showing what happens after the forced DMA retrigger at STA
SCROLY. The 'LDA #$xx' would have been modified earlier in the FLI routine
(before the STA SCROLY) using self-modifying code. Here is the relevant source
code which takes up 4 machine cycles inside the host Commodore computer.

sta scroly : abcd, d = write Y to SCROLY on 1MHz bus CPU half - Mach. Cycle #1
lda #$00 : ef
sta $d022 : ghij, j = write 1 to $D022 on 1MHz bus CPU half - Mach. Cycle #2
lda #$00 : kl
sta $d022 : mnop, p = write 2 to $D022 on 1MHz bus CPU half - Mach. Cycle #3
lda #$00 : qr
sta $d022 : stuv, v = write 3 to $D022 on 1MHz bus CPU half - Mach. Cycle #4

There are the two shared halves consisting of a machine cycle inside the host
Commodore bus, and by stashing values onto the databus, this value is carried
over to the VIC-II half and is read as videomatrix data during the first three
columns of the FLI screen.

________ = VIC-II half of the 1MHz cycle
. = SCPU synchronizes to 1MHz bus
Each char position is equivalent to a 1 (20MHz) cycle.

Mach. Cycle #1 Mach. Cycle #2 Mach. Cycle #3 Mach. Cycle #4
+-------------------+-------------------+-------------------+-------------------
__________YYYYYYYYYY___DMA____1111111111___col0___2222222222___col1___3333333333___col2___ 1MHz
abc.......ddddddddddefghi.....jjjjjjjjjjklmno.....ppppppppppqrstu.....vvvvvvvvvv SCPU

Values on the databus which is carried over onto the VIC-II half of the databus:
DMA: DMA condition detected by VIC-II
col0: colors for column 0 read, gets the value 1 put into the bus by SCPU
col1: colors for column 1 read, gets the value 2 put into the bus by SCPU
col2: colors for column 2 read, gets the value 3 put into the bus by SCPU

An alternative approach bites the dust

The SuperCPU can also fetch values onto the databus by reading from the I/O
region. If a coder were so inclined to use a 'Read I/O Approach', where is a
program going to find 600 free bytes in the i/o region at $d000-$dfff? The
idea is to force the SuperCPU to do a read on the databus via DMA and this
can't be done with mirrored locations similar to the ones used in those VIC
optimization modes. When a SuperCPU reads a value from mirrored memory, it
does so from its local RAM and not the RAM that is inside the host Commodore
computer. However, if the SuperCPU reads from the I/O block at $d000-$dfff, it
will read a value from inside the host Commodore computer using DMA.

Unfortunately, this approach did not work when the BA line went high inside
the host Commodore computer and is unworkable for a full-screen FLI mode. The
SuperCPU stops for reads if BA is high, just like its 1MHz 6510/8510
counterpart.

Other Considerations.

There were some interesting observations while debugging the full-screen FLI
routines. The full-screen FLI routines were originally inspired by Robin
Harbron's IRQ-based IFLI routines. Because they are driven by an IRQ, the
CPU is still available for normal computational tasks.

When all three videomatrix values are written to after the STA SCROLY in the
line-interruptible FLI routine, the IRQ must then exit quickly with the
restoration of the registers. It's a good idea to avoid writing to any
mirrored location or read/write to any I/O region ($D000-$DFFF), since the
SuperCPU will have to wait for VIC to finish with the data bus.

Using a raster IRQ will naturally lead to trouble, since cycle-exact timing
is needed, so a CIA timer is used. The timer may be set to synchronize a
PAL or NTSC machine. Then in the FLI routine the timer can be checked and
indexed into a table of preset timing values so that the forced badline
retrigger at STA SCROLY will always happen at the right time on the screen,
no matter what the SuperCPU is doing when the VIC-II interrupted it with an
raster IRQ. Thanks goes to Ninja/The-Dreams (aka Wolfram Sang) for tips on
how to create a stable line-interruptible FLI routine using timers.

Source code

Without further ado, here is the source code. This source code was used in the
Santa Claus FLI Demo for Wheels OS. This code will run in either Commodore 64
or 128 computers and in either PAL or NTSC systems. It did take a lot of
tweaking at Points #1, #2, #3, #4, & #5 as I tried to perfect the routines as
closely as possible. The full source code for the Santa Claus FLI demo can be
supplied via email upon request. It is in Concept+ (geoProgrammer) format.

; Wedges the full-screen FLI interrupt handler in Wheels systems.
; Thanks goes to Robin Harbron for the idea of a line-interruptible FLI routine.
InstallFLI: ; Installs the FLI routine
jsr ClearMouseMode ; Turn off the mouse.
sei
lda CPU_DATA ; Save 6510/8510 Location #$01.
sta r6510
lda screenMode ; Check computer.
bne 2$ ; Take branch in 128 mode.
lda #IO_IN
.byte $2c
2$: lda #%00110111 ; for 128 mode only
sta CPU_DATA
lda vmcsb ; Save original video bank info for Wheels.
sta oVMBmp
lda scroly ; save screen Y axis
sta yaxis
MoveW $fffe, oldVector ; Saves the old Wheels IRQ vector.
lda #<fli ; Sets up fli raster.
sta $fffe ; At the IRQ interrupt vector.
lda #>fli
sta $ffff
; Point #1
lda #$31 ; Trigger the IRQ request
sta raster ; At rasterline 49.
lda scroly
and #$7f ; Clear bit 7 of raster register.
sta scroly
lda #1
sta vicirq ; Ack raster ints.
cli
rts

RemoveFLI: ; Removes the FLI routine
sei
MoveW oldVector, $fffe ; Restores the old Wheels IRQ vector.
lda #$fb ; Trigger the IRQ request
sta raster ; At rasterline 251.
lda yaxis ; restore screen Y axis
and #$7f ; Clear bit 7 of raster register.
sta scroly
lda #1
sta vicirq ; Ack raster ints.
lda oVMBmp ; Restore original video bank info for Wheels.
sta vmcsb
lda r6510
sta CPU_DATA ; Restores 6510 Port #$01
cli
jmp StartMouseMode ; Start the mouse on.
fli: ; The actual FLI interrupt routine lies here.
pha
.byte $da ;phx
.byte $5a ;phy
php ; Save processor flags.
; Point #2
ldx #$03 ; #$0f for PAL SuperCPU systems.
3$ dex
bpl 3$
lda raster
tax
ldy colOneClrs,x ; Get colors for the first three columns.
sty mark4+1
ldy colTwoClrs,x
sty mark5+1
ldy colThreeClrs,x
sty mark6+1
; ldy backgndTable,x ; Get background color for scanline.
; stx $d021
inx
; Point #3
cpx #$f9 ; Have we reached scanline 249?
bne 1$
; Point #1
ldx #$31 ; Restart the IRQ at rasterline 49.
1$: stx raster ; By this time, the raster interrupt register is
; incremented by one, and will re-trigger the
; same fli routine.
; This way, it is fully line-interruptible
; and frees up SuperCPU time.
ldy #$01
sty vicirq ; Ack raster ints.
and #$07 ; Mask out lower three bits.
tax
ldy tabd018,x ; Use preset values for vmcsb.
lda d011tab,x ; Use preset values for scroly.
sty vmcsb ; Select video matrix.
sta scroly ; Forces the badline.
mark4: lda #$00 ; Stores a video matrix value onto the first column.
sta $d022
mark5: lda #$00 ; Stores a video matrix value onto the second column.
sta $d022
mark6: lda #$00 ; Stores a video matrix value onto the third column.
sta $d022
plp ; Restore processor flags.
.byte $7a ;ply
.byte $fa ;plx
pla ; Do NOT use any memory accesses to the host
rti ; Commodore databus in this part because it will
; be blocked by the VIC-II badline.

; Point #4
tabd018: ; Preset video matrix values.
.byte $78,$08,$18,$28,$38,$48,$58,$68; NTSC systems
;.byte $08,$18,$28,$38,$48,$58,$68,$78 - PAL systems
d011tab: ; Preset VIC DMA retrigger values.
.byte $38,$39,$3a,$3b,$3c,$3d,$3e,$3f

ChkAbortKey: ; Checks the RUN/STOP keypress.
LoadB $dc00, #$7f ; check for the STOP key
3$: asl $dc01 ; Check for the RUN/STOP keypress.
; This also synchronizes the line-interruptible FLI routine.
bcs 3$ ; Branch if it isn't pressed.
rts
prep3Cols: ; Prepares the first three columns of the FLI screen
; Ideally, a FLI file would be loaded in and this
; routine would then be called to set up the three
; 200-byte tables corresponding to each column,
; covering the first three columns of the screen.
lda #$40
sta mark1+2 ; Prepare the marks.
sta mark2+2
sta mark3+2
ldy #$00
sty mark1+1
iny
sty mark2+1
iny
sty mark3+1
php
sei
lda screenMode
beq 1$ ; take branch in 64 mode.
lda $ff00
pha ; save 128 configuration.
lda #%01111110 ; select RAM at $4000
sta $ff00
1$: ldy #$00
ldx #$07 ; Use self-modifying code to create three
; Point #5
mark1: lda $4000 ; 200-byte tables for each column of the
sta colOneClrs+49,y ; FLI screen and each value is indexed by
mark2: lda $4001 ; the scanline in the FLI routine.
sta colTwoClrs+49,y
mark3: lda $4002
sta colThreeClrs+49,y ; use +48 for the column offset in PAL
clc ; systems.
lda mark1+2
adc #$04
sta mark1+2
sta mark2+2
sta mark3+2
iny
dex
bpl mark1
sec
lda mark1+2
sbc #$20
sta mark1+2
clc
lda mark1+1
adc #$28
sta mark1+1
tax
inx
stx mark2+1
inx
stx mark3+1
lda mark1+2
adc #$00
sta mark1+2
sta mark2+2
sta mark3+2
cpy #200
bne mark1-2
lda screenMode
beq 2$ ; take branch in 64 mode.
pla
sta $ff00 ; restore 128 configuration.
2$: plp
rts

.ramsect $1000
; All column colors are referenced by scanline.
colOneClrs: ; Column one colors of the FLI screen.
.block 256
colTwoClrs: ; Column two colors of the FLI screen.
.block 256
colThreeClrs: ; Column three colors of the FLI screen.
.block 256

Hopefully the full-screen FLI possibilities that the SuperCPU can now unlock
will bring forth cool software for our SuperCPU's and tons of 'eye candy'.

Enjoy.
.......
....
..
. C=H 20

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Main Articles

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

VIC KERNAL Disassembly Project - Part V
Richard Cini
February, 2002

Introduction
============

In Part 4 of this series, we discovered that of the 39 ROM routines
available to be called by user code, 26 of them related to device I/O. The
path to using a device from machine code was to first "
open" it. So, we
looked at the routines required to open and begin using a logical device.

When we concluded the discussion on the OPEN command, we left out
seven routines dealing with the tape device. This is some of the most
complex code I've ever seen. Unlike the IEEE serial peripherals, the tape
deck is a "
dumb" device, meaning that the VIC Kernal has to do the work of
moving the data to/from the tape. With the serial peripherals, the Kernal
just sends a character to the device that then acts on it independently of
the VIC.

After opening the device, there are nine non-tape routines to deal
with getting data into and out of the VIC, including talking and untalking,
listening and unlistening, moving the characters in and out, and some
secondary address stuff.

In this installment, we'll conclude our discussion of the OPEN
procedure by looking at the tape routines that are called from OPEN. Later
articles will discuss the actual movement of bits to and from the cassette
and the remaining IEEE serial routines, as well as the loading and saving of
files.

Tape Routines
=============

In the last installment, we examined the beginnings of opening a
device for use by a program. The reader will find that the routines are
convoluted and hard to follow, with jumps and branches into apparently
unrelated subroutines. Overall it appears to be ugly but highly functional
code.

Here's a "
calling tree" outline to the routines called by OPEN:

IOPEN----
|-FIND (analyzed in Part IV of this series)
|-SENDSA (ultimately handles IEEE serial stuff)
|-SEROPN (handles RS232 stuff)
|
|(all tape-related from here down)
|-GETBFA (get address of tape buffer)
|-PLAYMS (prompt user to press PLAY on tape deck)
| |-TPSTAT (tape key status)
| |-TPSTOP (check for STOP keyboard key during tape ops)
|
|-SRCHMS (print "
Searching" or "Searching for..." messages)
|-LOSCPH (locate a tape header with filename matching one in
| OPEN)
|-LOCTPH (locate first/next header; no filename specified)
| |-SETBST (sets tape buffer start/end pointers)
| |-PLAYMS (see above)
| |-TPCODE (main tape code-moves bits in and out on IRQ)
| |-SBIDLE (serial buss idle check)
| |-STOIRQ1 (put key tape vectors into table)
| |-NCHAR (sets bit counter for char input operations)
| |-TPSTOP (see above)
| |-IUDTIM (update jiffy clock; previous installment)
|
|-RECDMS (prompt user to press PLAY & RECORD on tape deck)
|-WRTPHD (write a tape header to tape)
| |-SETSBT (see above)
| |-TPWRIT1 (precedes TPCODE by 12 bytes)


When dealing with the tape, it helps to understand that Commodore
built the tape protocol with an eye towards readability under adverse
conditions, including tape quality and motor speed. This made Commodore
tapes probably one of the most reliable data systems when compared to TI,
Apple, and Tandy, among others. Data headers and data blocks are repeated on
the tape and a comparison is made between the two reads to ensure integrity.
Additionally, the recording method is self-clocking so the effects of
varying tape speed are minimized.

One of the first routines used in opening the tape device is
determining the address of the tape buffer and making sure that it's in the
$02xx page (or higher) of the system RAM. A test at IOPEN_S5 bails out with
an "
illegal device" error if the tape buffer isn't just so.

F84D ;=============================================================
F84D ; GETBFA - Get start of tape buffer
F84D ; Returns buffer address in .X (LSB) and .Y (MSB)
F84D ;
F84D GETBFA
F84D A6 B2 LDX TAPE1
F84F A4 B3 LDY TAPE1+1
F851 C0 02 CPY #$02 ;is buffer at $02xx?
F853 60 RTS

PLAYMS is called by IOPEN at IOPEN_S6. A test there determines if
we're trying to read/load or write/save from/to a tape. Read mode results in
the "
Press Play..." message, the "Searching for..." message and two routines
that search for the appropriate tape header.

F894 ;===============================================================

F894 ; PLAYMS - Wait for tape key on read
F894 ;
F894 PLAYMS
F894 20 AB F8 JSR TPSTAT ;get tape key status
F897 F0 1C BEQ TPSTEX ;$F8B5 pressed? yes, exit.
F899
F899 A0 1B LDY #KIM_PLAY ;offset for "
Press Play..." message

F89B PLAYMS1
F89B 20 E6 F1 JSR MSG ;print it

F89E WTPLAY
F89E 20 4B F9 JSR TPSTOP ; check for STOP key
F8A1 20 AB F8 JSR TPSTAT ;get key status
F8A4 D0 F8 BNE WTPLAY ;$F89E wait for PLAY switch

F8A6 A0 6A LDY #KIM_OK ;offset for "
OK" message
F8A8 4C E6 F1 JMP MSG ;print it and return to caller

This simple routine prints the "
Searching for..." message only if in
direct mode, and if appropriate, the filename that is being searched for. If
no filename is present, the message is changed to "
Searching..." (without
the "
for") before it's output to the screen.

F647 ;==============================================================
F647 ; SRCHMS - Print "
Searching for [filename]"
F647 ;
F647 SRCHMS
F647 A5 9D LDA CMDMOD ;direct mode?
F649 10 1E BPL SRCHEX ;$F669 no (programmed mode), exit
F64B
F64B A0 0C LDY #KIM_SRCH ;"
Searching" message
F64D 20 E6 F1 JSR MSG ;output message
F650 A5 B7 LDA FNMLEN ;get filename length
F652 F0 15 BEQ SRCHEX ;$F669 no filename, skip "
for"
F654 A0 17 LDY #$17 ;point to "
FOR" in "Searching For"
F656 20 E6 F1 JSR MSG ;print it
F659 ;
F659 ; FLNMMS - Print filename
F659 ;
F659 FLNMMS
F659 A4 B7 LDY FNMLEN ;get filename length
F65B F0 0C BEQ SRCHEX ;$F669 no filename, exit
F65D A0 00 LDY #$00
F65F FLNMLP ; loop to print filename
F65F B1 BB LDA (FNPTR),Y ;get character
F661 20 D2 FF JSR CHROUT ; and print it
F664 C8 INY
F665 C4 B7 CPY FNMLEN
F667 D0 F6 BNE FLNMLP ;$F65F loop
F669 SRCHEX
F669 60 RTS ;exit


If the user is attempting to open a tape file with a specific
filename, the IOPEN code makes a call to LOCSPH in IOPEN_S6 to find the file
header associated with the filename. If the specific header is not found,
the routine emits the "
File not Found" error message. LOCSPH is a loop
wrapper around the LOCTPH routine that searches for the "
next" file header
regardless of name. The secondary address parameter of the OPEN command
defines whether the action is to (0) read a tape file and relocate it in
memory, (1) read a tape file without relocation (a machine program), or (2)
write a tape file and put both EOF an EOT markers after it. Once a header is
found, the filename from the header is compared to the filename in the OPEN
command to see if there's a match.

F867 ;============================================================
F867 ; LOCSPH- Find specific tape header
F867 ;
F867 LOCSPH
F867 20 AF F7 JSR LOCTPH ;search for next header
F86A B0 1D BCS LCSPEXC+1 ;$F889 returned EOT? Go to ready

F86C A0 05 LDY #$05 ;filename offset in header
F86E 84 9F STY TPTR2 ;save offset
F870 A0 00 LDY #$00 ;loop counter
F872 84 9E STY TPTR1 ;save it

F874 LCSPHLP
F874 C4 B7 CPY FNMLEN ;compare filename length
F876 F0 10 BEQ LCSPEXC ;$F888 counter 0, exit
F878
F878 B1 BB LDA (FNPTR),Y ;get filename letter
F87A A4 9F LDY TPTR2 ; offset to name in header
F87C D1 B2 CMP (TAPE1),Y ;compare to tape header
F87E D0 E7 BNE LOCSPH ;f867 different, get next header

F880 E6 9E INC TPTR1 ;increment counters
F882 E6 9F INC TPTR2
F884 A4 9E LDY TPTR1
F886 D0 EC BNE LCSPHLP ;$F874 compare next character
F888 LCSPEXC
F888 18 CLC ; exit success
F889 60 RTS


Tape searches are performed linearly, so the LOCTPH routine is used
to search for the next file header on the tape starting from the current
tape position. This routine is also called by LOAD through IOPEN if no
filename is provided as a parameter to the LOAD (or OPEN) call.
Additionally, it's called in a loop by the LOCSPH routine when searching for
a specific file.

LOCTPH reads a tape block to the tape buffer using TPREAD and
examines a few important fields in the file header. The fields examined
indicate the file type and the filename. If the file header indicates that
the file is anything other than a program file or a data file, the routine
looks for another file header until it reaches the end of the tape. If BASIC
is in "
direct mode", LOCTPH prints the "Found" message in addition to the
filename.


F7AF ;============================================================
F7AF ; LOCTPH - Read any tape header
F7AF ; Header type: 1=BASIC program, 2=data block, 3=machine program,
F7AF ; 4=data header, 5=end-of-tape marker
F7AF ;
F7AF LOCTPH
F7AF A5 93 LDA IOFLG2
F7B1 48 PHA ;save load/verify flag
F7B2 20 C0 F8 JSR TPREAD ;read tape block to buffer
F7B5 68 PLA
F7B6 85 93 STA IOFLG2 ;restore flag
F7B8 B0 2C BCS LOCTPEX ;F7E6 error, end search

F7BA A0 00 LDY #$00 ;index reg
F7BC B1 B2 LDA (TAPE1),Y ;get header type
F7BE C9 05 CMP #$05 ;EOT?
F7C0 F0 24 BEQ LOCTPEX ;$F7E6 yes, exit

F7C2 C9 01 CMP #$01 ;BASIC program?
F7C4 F0 08 BEQ LOCTP1 ;$F7CE yes, branch

F7C6 C9 03 CMP #$03 ;ML program?
F7C8 F0 04 BEQ LOCTP1 ;$F7CE yes, branch

F7CA C9 04 CMP #$04 ;data header?
F7CC D0 E1 BNE LOCTPH ;must be data block-skip it

F7CE LOCTP1 ;program or data header comes here
F7CE AA TAX
F7CF 24 9D BIT CMDMOD ;direct mode?
F7D1 10 11 BPL LOCTPEX-2 ;$F7E4 no, continue

F7D3 A0 63 LDY #KIM_FOUN ;setup for "
Found" msg
F7D5 20 E6 F1 JSR MSG ;print it
F7D8 A0 05 LDY #$05 ;offset to file name

F7DA LOCLOOP ; loop to print filename
F7DA B1 B2 LDA (TAPE1),Y ;print loop
F7DC 20 D2 FF JSR CHROUT
F7DF C8 INY
F7E0 C0 15 CPY #$15 ;21 characters max
F7E2 D0 F6 BNE LOCLOOP ;$F7DA loop
F7E4 18 CLC
F7E5 88 DEY

F7E6 LOCTPEX
F7E6 60 RTS


The TPREAD routine is responsible for the setup required to read or
verify a block from the tape. It prompts the user to press the PLAY button
on the tape deck, disables system interrupts and sets some important
parameters before execution falls through to the code responsible for
starting the tape IRQ routines. In many Commodore machines, tape operations
are performed under the operation of a routine installed as a temporary IRQ
handler -- sort of a cheap multitasking so that the system doesn't appear to
be hung while tape operations are occurring. Execution ultimately comes to
code at $F8EF (TPCODE) which is responsible for installing and starting the
tape IRQ routine.

All of this and we haven't yet reached the bits on the tape :-)

F8C0 ;==========================================================
F8C0 ; TPREAD - Read tape block
F8C0 ;
F8C0 TPREAD
F8C0 A9 00 LDA #$00
F8C2 85 90 STA CSTAT ;clear status variable...
F8C4 85 93 STA IOFLG2 ;and read/verif flag

F8C6 TPREAD1
F8C6 20 54 F8 JSR SETBST ;set tape buffer pointers

F8C9 ;
F8C9 ; load program
F8C9 ;
F8C9 TPREAD2
F8C9 20 94 F8 JSR PLAYMS ;wait for Play key
F8CC B0 1F BCS TPCODE-2 ;$F8ED (in TPWRIT1)
F8CE
F8CE 78 SEI ;disable interrupts
F8CF A9 00 LDA #$00 ;clear work memory for IRQ routines
F8D1 85 AA STA RIDATA
F8D3 85 B4 STA BITTS
F8D5 85 B0 STA TPCON1
F8D7 85 9E STA TPTR1
F8D9 85 9F STA TPTR2
F8DB 85 9C STA BYTINF
F8DD A9 82 LDA #$82 ;Timer H constant
F8DF A2 0E LDX #$0E ;number for IRQ vector
F8E1 D0 11 BNE TPCODE1 ;$F8F4 (TPCODE1 in TPWRIT below)
; falls through to TPWRIT below)


SRCHMS also calls this small routine to determine if a key is
pressed on the tape deck. TPSTAT looks at PA6 on VIA1 to determine the key
state and sets up the Z flag for a compare to be performed in SRCHMS.

F8AB ;============================================================
F8AB ; TPSTAT - Check tape key status
F8AB ;
F8AB TPSTAT
F8AB A9 40 LDA #%01000000 ;$40
F8AD 2C 1F 91 BIT D1ORAH ;switch sense
F8B0 D0 03 BNE TPSTEX ;$F8B5 not pressed, exit
F8B2 2C 1F 91 BIT D1ORAH ;button pressed. Setup for another
F8B5 ;compare later. Z=1 if pressed
F8B5 TPSTEX
F8B5 18 CLC
F8B6 60 RTS ;return clear


One of the tests performed in IOPEN (also at IOPEN_S6) is to
determine if the tape operation is a read or a write. If we're in the write
mode, the code jumps to IOPEN2. At IOPEN2, the Kernal prompts for the user
to press play and record on the tape deck and then writes a file header by
calling WRTPHD with a control ID of $04 (a data header). Then, IOPEN writes
a control byte ID of $02 (block is a data block) to the tape buffer and
returns.

RECDMS is called by IOPEN to determine if a key is pressed on the
tape deck and if not, sets the message flag to the "
Press Play & Record"
message and prints the message by calling into the PLAYMS routine. PLAYMS
prints the message and then checks for the key press.

F8B7 ;===========================================================
F8B7 ; RECDMS - Wait for tape key on write
F8B7 ;
F8B7 RECDMS
F8B7 20 AB F8 JSR TPSTAT ;get button status
F8BA F0 F9 BEQ TPSTEX ;$F8B5 pressed? Yes, continue

F8BC A0 2E LDY #KIM_RECP ;no, remind to "
Press Play & Record"
F8BE D0 DB BNE PLAYMS1 ;exit through $F89B


IOPEN calls WRTPHD at IOPEN2 with $04 as the control byte (following
block is a data header) to be written into the header. WRTPHD then writes
some critical information into zero-page locations in advance of filling the
tape buffer with the same information. At the end of the routine, the data
is written to the tape in WRTPH1.

F7E7 ;==========================================================
F7E7 ; WRTPHD - Write tape header
F7E7 ; On entry, .A is the header type: 2=data blk; 4=data hdr
F7E7 ;
F7E7 WRTPHD
F7E7 85 9E STA TPTR1 ;save header type
F7E9 20 4D F8 JSR GETBFA ;get tape buffer address
F7EC 90 5E BCC WRTPEX ;$F84C exit if not right

F7EE A5 C2 LDA STAL+1 ; save some program info
F7F0 48 PHA ;save...start H
F7F1 A5 C1 LDA STAL
F7F3 48 PHA ;...start L
F7F4 A5 AF LDA EAL+1
F7F6 48 PHA ;...end H
F7F7 A5 AE LDA EAL
F7F9 48 PHA ;...end L
F7FA A0 BF LDY #$BF ;buffer length-1 (191)
F7FC A9 20 LDA #$20 ; {space}

F7FE WRTPLP1 ; write program data to tape buffer

F7FE 91 B2 STA (TAPE1),Y ;clear buffer
F800 88 DEY
F801 D0 FB BNE WRTPLP1 ;$F7FE

F803 A5 9E LDA TPTR1 ;get header type
F805 91 B2 STA (TAPE1),Y ;write it
F807 C8 INY
F808 A5 C1 LDA STAL ;get start L
F80A 91 B2 STA (TAPE1),Y ;write it
F80C C8 INY
F80D A5 C2 LDA STAL+1 ;get start H
F80F 91 B2 STA (TAPE1),Y ;write it
F811 C8 INY
F812 A5 AE LDA EAL ;get end L
F814 91 B2 STA (TAPE1),Y ;write it
F816 C8 INY
F817 A5 AF LDA EAL+1 ;get end H
F819 91 B2 STA (TAPE1),Y ;write it
F81B C8 INY
F81C 84 9F STY TPTR2 ;save buffer offset
F81E A0 00 LDY #$00 ;filename loop counter
F820 84 9E STY TPTR1 ;save loop counter

F822 WRTPLP2 ; write filename to buffer
F822 A4 9E LDY TPTR1 ;get loop counter
F824 C4 B7 CPY FNMLEN ;compare filename length
F826 F0 0C BEQ WRTPH1 ;$F834 done

F828 B1 BB LDA (FNPTR),Y ;get filename char
F82A A4 9F LDY TPTR2 ;get tape buffer pointer
F82C 91 B2 STA (TAPE1),Y ;write char to buffer
F82E E6 9E INC TPTR1 ;increment loop counters
F830 E6 9F INC TPTR2
F832 D0 EE BNE WRTPLP2 ;$F822 loop

F834 WRTPH1 ; flush buffer to tape
F834 20 54 F8 JSR SETBST ;set start and end address pointers
F837 A9 69 LDA #$69 ;checksum block ID
F839 85 AB STA RIPRTY ;save parity character
F83B 20 EA F8 JSR TPWRIT1 ;$F8EA write block
F83E A8 TAY ;restore start and end addresses
F83F 68 PLA
F840 85 AE STA EAL
F842 68 PLA
F843 85 AF STA EAL+1
F845 68 PLA
F846 85 C1 STA STAL
F848 68 PLA
F849 85 C2 STA STAL+1
F84B 98 TYA
F84C
F84C WRTPEX
F84C 60 RTS ;exit


SETBST is a helper routine called by LOCTPH and WRTPHD to setup the
tape buffer before a tape operation takes place. It sets the starting
address of the buffer to the first address of the assigned buffer range and
sets the end of the buffer to start + 192 bytes.

F854 ;==========================================================
F854 ; SETBST - Set buffer start/end pointers
F854 ;
F854 SETBST
F854 20 4D F8 JSR GETBFA ;get buffer address
F857 8A TXA
F858 85 C1 STA STAL ;start=start of buffer
F85A 18 CLC
F85B 69 C0 ADC #$C0 ;end=start+192
F85D 85 AE STA EAL
F85F 98 TYA
F860 85 C2 STA STAL+1
F862 69 00 ADC #$00
F864 85 AF STA EAL+1
F866 60 RTS


TPWRIT performs the nuts and bolts of moving cassette data in and
out of the VIC. The VIC moves tape data by using a series of interrupt
routines that are swapped into the IRQ vector as needed. The benefit here is
that the tape IRQ code is then executed 60 times per second, along with
normal processing, until the operation is complete, resulting in a cheap
form of multitasking.

TPWRIT performs some setup chores before changing the IRQ vector, including
clearing interrupts, ensuring that the IEEE serial bus is idle, saving the
old vector, assigning the new vector and setting-up the variables used by
the tape IRQ routine to actually move bits. Finally, interrupts are enabled
at $F92E which starts the whole process. When the tape IRQ routine is
finished it restores the IRQ vector to the standard $EABF which is detected
by TPWRIT at TPCDLP2 (an I/O loop). When completed, TPWRIT exits through the
TPSTOP routine. This loop also updates the jiffy clock.

F8E3 ;==========================================================
F8E3 ; TPWRIT - Initiate tape write
F8E3 ;
F8E3 TPWRIT
F8E3 20 54 F8 JSR SETBST ;get buffer pointers
F8E6 A9 14 LDA #$14 ;checksum
F8E8 85 AB STA RIPRTY ;save it

F8EA ;
F8EA ; write buffer to tape
F8EA ;
F8EA TPWRIT1
F8EA 20 B7 F8 JSR RECDMS ;wait for Record+Play keys
F8ED B0 68 BCS TPSTPX1 ;$F957 exit
F8EF ;
F8EF ; TPCODE - Common tape code
F8EF ;
F8EF TPCODE
F8EF 78 SEI ;disable interrupts
F8F0 A9 A0 LDA #%10100000 ;$A0 Timer H constant
F8F2 A2 08 LDX #%00001000 ;$08 offset for tape IRQ vector

F8F4 TPCODE1
F8F4 A0 7F LDY #%01111111 ;$7F disable interrupts
F8F6 8C 2E 91 STY D2IER ;save to interrupt enable reg
F8F9 8D 2E 91 STA D2IER ; on VIAs
F8FC 20 60 F1 JSR SBIDLE ;wait for timeout
F8FF AD 14 03 LDA IRQVP ;save current IRQ Vector
F902 8D 9F 02 STA TAPIRQ
F905 AD 15 03 LDA IRQVP+1
F908 8D A0 02 STA TAPIRQ+1
F90B 20 FB FC JSR STOIRQ1 ;$FCFB .X=8 set tape IRQ vectors
F90E A9 02 LDA #$02 ;read # of blocks
F910 85 BE STA FSBLK
F912 20 DB FB JSR NCHAR ;set bit counter for serial shifts
F915 AD 1C 91 LDA D1PCR
F918 29 FD AND #%11111101 ;$FD CA2 manual low
F91A 09 0C ORA #%00001100 ;$0C force bits 2,3 high
F91C 8D 1C 91 STA D1PCR ;switch on tape drive
F91F 85 C0 STA CAS1 ;flag as on
F921 A2 FF LDX #$FF ;delay loop for high (outer)
F923 A0 FF LDY #$FF ;inner loop

F925 TPCDLP1
F925 88 DEY
F926 D0 FD BNE TPCDLP1 ;$F925

F928 CA DEX
F929 D0 F8 BNE TPCDLP1-2 ;$F923 outside loop
F92B 8D 29 91 STA D2TM2H
F92E 58 CLI ;allow tape interrupts
F92F ;
F92F ; wait for I/O-end
F92F ;
F92F TPCDLP2
F92F AD A0 02 LDA TAPIRQ+1 ;check IRQ direction
F932 CD 15 03 CMP IRQVP+1 ;standard vector?
F935 18 CLC
F936 F0 1F BEQ TPSTPX1 ;$F957 yes, ready

F938 20 4B F9 JSR TPSTOP ;no, check STOP key
F93B AD 2D 91 LDA D2IFR ;timer 1 IF clear?
F93E 29 40 AND #%01000000 ;$40
F940 F0 ED BEQ TPCDLP2 ;$F92F continue

F942 AD 14 91 LDA D1TM1L ; get timer 1 loword
F945 20 34 F7 JSR IUDTIM ;update RTC
F948 4C 2F F9 JMP TPCDLP2 ;$F92F loop

 

TPSTOP is another helper routine. It scans for the keyboard STOP
key, and if detected, restores the standard IRQ vector and returns to the
caller's caller (the caller to TPWRIT).

F94B ;===========================================================
F94B ; TPSTOP - Check for tape stop
F94B ;
F94B TPSTOP
F94B 20 E1 FF JSR STOP ;scan STOP key
F94E 18 CLC
F94F D0 0B BNE TPSTPX ;$F95C not pressed, return

F951 20 CF FC JSR RESIRQ ;pressed, turn drive off and restore IRQ
F954 38 SEC ;set error flag
F955 68 PLA ;pop return address
F956 68 PLA

F957 TPSTPX1
F957 A9 00 LDA #$00 ;flag normal IRQ vector
F959 8D A0 02 STA TAPIRQ+1

F95C TPSTPX
F95C 60 RTS ; return to caller's caller


The SBIDLE routine is used in TPWRIT to detect if the RS-232 serial
bus is active and if so, will wait until it's idle before returning to
continue the tape code.

F160 ;===========================================================
F160 ; SBIDLE - Set timer for serial bus timeout
F160 ;
F160 SBIDLE
F160 48 PHA ;save .A
F161 AD 1E 91 LDA D1IER ;get IER
F164 F0 0C BEQ SBIDLEX ;$F172 no interrupts pending, exit

F166 SBIDLLP
F166 AD 1E 91 LDA D1IER ;get IER
F169 29 60 AND #%01100000 ;$60 T1 & T2
F16B D0 F9 BNE SBIDLLP ;$F166
F16D
F16D A9 10 LDA #%00010000 ;$10 kill CB1 RS232
F16F 8D 1E 91 STA D1IER

F172 SBIDLEX
F172 68 PLA
F173 60 RTS


RATS3 is at the tail end of the RAMTAS routine - the subroutine that
precedes the tape vectors.

FCF6 ;===========================================================
FCF6 ; STOIRQ - Set IRQ vector
FCF6 ; usually called with .x=$08 or $0e. $08 points to the first
FCF6 ; entry in TAPVEC while $0e points to the last entry
FCF6 STOIRQ
FCF6 20 CF FC JSR RESIRQ ;restore std IRQs
FCF9 F0 97 BEQ TPEOI ;$FC92

FCFB STOIRQ1
FCFB BD E9 FD LDA RATS3,X ;$FDE9,X set vectors from table
FCFE 8D 14 03 STA IRQVP
FD01 BD EA FD LDA RATS3+1,X ;$FDEA,X
FD04 8D 15 03 STA IRQVP+1
FD07 60 RTS

These are the actual routines responsible for writing bits to the
tape. Various calling routines place these vectors into the IRQ vector that
then gets executed 60 times per second. In order, these routines are for
writing a leader block to the tape, the routine to write data to the tape,
the standard IRQ vector, and the routine to read bits from the tape.

FDF1 ;===========================================================
FDF1 ; TAPEVC - Tape IRQ vectors
FDF1 ;
FDF1 TAPEVC
FDF1 ; .dw $FCA8, $FC0B, $EABF, $F98E
FDF1 A8FC0BFCBFEA .dw WRLDR2, TWRD7, IRQVEC, RDTPBT
FDF7 8EF9


The NCHAR routine resets the internal bit counters to their initial
state before a calling routine begins to shift the bits over a serial or
tape channel. It sets the bit length to 8 and resets intermediate variables
to 0.

FBDB ;===========================================================
FBDB ; NCHAR - Set bit counter for new character (serial output)
FBDB ;
FBDB NCHAR
FBDB A9 08 LDA #$08
FBDD 85 A3 STA SBITCF
FBDF A9 00 LDA #$00
FBE1 85 A4 STA CYCLE
FBE3 85 A8 STA BITCI
FBE5 85 9B STA TPRTY
FBE7 85 A9 STA RINONE
FBE9 60 RTS


That's all of the room we have for this part of the series. All of
this code and we still haven't moved the bits off the tape. So, next time
we'll look at the actual routines responsible for moving bits on and off of
the tape and we'll begin to look at the IEEE serial routines.
.......
....
..
. C=H 20

::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

-----------------------
The Art Of The Minigame
-----------------------

Introduction
------------

In summer 2001, an 8-bit minigame contest was held. Voter turnout may have
been low, but author turnout was high, with a total of thirty entries for
the C64, Spectrum, Amstrad, and Atari 8-bit. The entries and the results
are available at http://demo.raww.net/minigame/, and what follows are a
series of articles by several minigame authors. The articles are for
enjoyment, to stimulate thought, and, hopefully, to motivate people for
next year's contest. (Everyone had a great time, btw.)

Minigames -- writing tiny programs in general -- present a set of very
unique challenges. Whereas many of us are used to optimizing programs for
cycle-efficiency, optimizing programs for byte-efficiency turns out to be
very different -- challenging, at times aggrivating, but very rewarding. I
think you will find the challenges and solutions ahead to be very interesting.

The game authors were pioneers, exploring pretty new territory, and I
salute all that entered the contest (especially for making such a nice C64
showing). A special mention should be made of MagerValp ("Skinny Puppy" in
Swedish, in case you've been wondering), who motivated a number of lazy
people (e.g. myself) to get involved with the contest.

Before diving in to the articles ahead, which contain lots of tricks to
save memory, I thought it might get us in the mood to go over some general
ideas and concerns for saving bytes on a C64.

It should be obvious that some balance has to be found between cycle-
efficiency and memory-efficiency. Routines should be reasonably fast
in most cases, but the major portion of that balance is -- has to be --
memory efficiency.

Obviously, everything that can be put into zero page should be put there,
since zero page instructions are two bytes instead of three. Equally obvious
is that every list-type fetch should use the .X register, since there is
no zero-page lda xx,y instruction.

Moreover, memory is initialized to various values when the computer is
reset. Many zero-page locations have specific values -- by careful code
design, these can be taken advantage of to avoid any initialization code.
For example, many times the code needs a zero-page pointer with the low
byte zero (i.e. sta (point),y). If that variable is chosen as a location
which is already zero, the code doesn't have to waste four bytes on
lda #00 sta point. One-time counters can also be used in this way.
Finally, certain locations can be manipulated to have certain values;
for example, the Tetrattack load address forces the end address+1 to
be $10f0, because $10 and $f0 are the z-coordinates of two object vertices.

There is also a fairly substantial kernal to take advantage of -- routines
to clear the screen, perform memory manipulations, and so on. Knowing what
is available, and what the routines do, is handy.

Self-modifying code comes in handy when large portions of code can be
used for similar things with just a few changes. For example, the line
routine in Tetrattack uses a single routine for lines in both the x- and
y-directions, by just interchanging a few INX/INY instructions.

And finally, there are things which Mark Seelye terms "injections", which
seems like a good term to me. The idea is to reuse known register values
whenever possible. That is, instead of having LDA #00 STA zp in a subroutine,
you might put an "STX zp" in an earlier subroutine, where you know the value
of .X is zero. In other words, the instruction to clear the zp location is
"injected" into a different subroutine to save a couple of bytes. Every
program below uses this trick in one form or another.

You'll see lots of neat tricks in the articles ahead, but the basic design
framework is: put stuff in zero-page, take advantage of default zero-page
values, take advantage of the kernal, always know the register values, and
reuse as much code as possible.

With that, let's check out the programs.

.......
....
..
. C=H 20

=====================================Part 1===================================

Postmortem: Codebreaker

by White Flame (aka David Holz)
http://white-flame.com

The most difficult part of making an entry for this compo was deciding what
game to do that could fit in 512 bytes. The ubiquitous snake clones and
scroll/dodge games abound; finding something different is truly a challenge.
I wanted to do a Mastermind game a looong time ago, with a full-on "We're
getting attacked, solve to code to save the world!" type of setting with lots
of animation, but usually found myself in application and library/utility
programming instead. But luckily this idea popped back into my head as
something different that would fit in the 2-page limit.

The concept is pretty simple (as most 512-byte games are). There's a 4-color
secret key, you get 10 guesses at it. Two numbers reflect the score for the
current guess: a black number shows how many slots have the right color, and a
white number shows how many correct colors are in the guess, but in the wrong
slot. Experienced players should be able to deduce the correct answer in 4-6
guesses.


Random number generation

There are 3 obvious ways to get random numbers easily in the 64: SID
oscillator with noise waveform, BASIC prng, or the raster counter. I decided
to go for the raster solution. Obviously, you need to wait a certain amount
of time between reads of the raster location, or they'd be the same or linear.
To fix this, every time I wanted a random number, I'd read the raster and call
a loop that many times. When it was done, $d012 should have a "random enough"
number in its lower 3 bits (to get a color from 0 to 5).

randomDelay:
ldx $d012
randomLoop:
jsr delay
dex
bne randomLoop;
delay:
dey
bne delay
rts


Code space saving

What's interesting is that it actually took less bytes of code and data to
have custom draw loops for the 3 textual messages than setting a pointer and
calling a drawText function.

ldy #22
titleloop:
lda titletext,y
sta screenLoc-48,y
dey
bpl titleloop:

These functions assume that you're not running on an ancient kernal, so color
memory is initialized to the cursor color (white).

I also had all variables (except the secret key) use screen memory, color
memory, or register .Y instead of manipulating off-screen variables and then
updating the screen to match:

- For entering a guess, color memory is INC'd and DEC'd, with .Y holding the
current offset from the fixed screen location. JSR $e8ea is used to scroll
the screen, and everything is drawn to screen line 23.

- To find out if you've already made 10 guesses, a memory location near the
top of the screen was polled to see if it was still a space. As guesses are
made, the screen was scrolled, and if the title line ever scrolled into that
memory location, it's game over.

- The current guess score '0 0' is drawn to the screen first, then those
screen values are INC'd or DEC'd as matches are found.

- Winning constitutes checking to see if the screen has a '4' in the black
score location.


Score calculation

This was a bit of a pain. At first I trying to calculate the white score by
the scoring rules (every matching color that was in the wrong slot), INCing
the white score on every hit, then looping for exact matches, INCing the black
score on every hit. This code ended up being way too big, so I got the idea
to simplify the white loop: INC the white score on every color match, whether
in the right spot or not, then on every black match, DEC the white score and
INC the black score. This gave the same result, but with much simpler code.


Conclusion

All in all, it was an interesting challenge for me, and had I not gotten stuck
in Phoenix for an extra week, I'd have finished my 2k entry. The code is
pretty straightforward, and at 415 bytes total, I'm quite happy to claim
having the smallest entry, even if I didn't win (why Pacman didn't beat out a
low-res scroll/dodge/single-shot game is a mystery to me). Being a hobbyist
C64 programmer, and my career history being more on the research end of
things, having a deadline is a nice change in terms of getting a project
completed. :)


Publisher: Gasman/Raww Arse
Number of full-time developers: 1 programmer
Number of contractors: 0
Estimated budget: 0 USD (0 EUR)
Development time: About 5 hours
Release date: August 22, 2001
Platforms: Commodore 64 and compatibles
Development hardware: pee sea (1.4GHz T-bird, 512MB RAM, 100GB HD, Win98SE)
3rd party tools: xa, Vice, Notepad
In-house tools: none
Project size: 290 lines of 6502 assembly code, containing 1 line of tokenized
BASIC

--- cb.asm ---
.word $0801 ;Starting address for loader

* = $0801

.word nextLine ;Line link
.word $0 ;Line number
.byte 158 ;SYS
.byte 50,48,54,54 ;"2066"
.byte 0
nextLine .byte 0,0 ;end of basic

screenGameOver = $0400+(4*40)+16
screenLoc = $0400+(23*40)+16
colorLoc = $d800+(23*40)+16
screen = $fd
color = $fb
peg = 81
cursor = 87
crsr_up = 145
crsr_down = 17
crsr_left = 157
crsr_right = 29

WaitStart:
jsr $ffe4
beq WaitStart

Start:

;Change to white
lda #$05
jsr $ffd2

;Clear screen
lda #147
jsr $ffd2

;Set bg colors
lda #$00
sta $d020
sta colorLoc
sta colorLoc+1
sta colorLoc+2
sta colorLoc+3
sta $ff ;for cursor
lda #11 ;dark gray
sta $d021

;Init zp pointers
lda #<screenLoc
sta screen
sta color
lda #>screenLoc
sta screen+1
lda #>colorLoc
sta color+1

;Draw text
ldy #22
titleloop:
lda titletext,y
sta screenLoc-48,y
dey
bpl titleloop:

;Randomize values

iny
newGuess
sty $02
guessAgain
jsr randomDelay
lda $d012
and #$07
cmp #$06
bpl guessAgain
ldy $02
checkLoop: ;See if the color's already in the key
dey
bmi goodGuess
cmp secret,y
beq guessAgain
bne checkLoop

goodGuess: ;Save the color and put a peg on the screen to show progress
ldy $02
sta secret,y
lda #peg
sta (screen),y
iny
cpy #$04
bne newGuess

Game_Loop:

;Get cursor pos
ldy $ff

;Draw cursor
lda #cursor
sta (screen),y

Key_Loop:

;Wait for key
getin: jsr $ffe4
beq getin
ldy $ff

;If up, bump color up
cmp #crsr_up
bne notUp
lda (color),y
and #$07
adc #$00 ;carry is always set after the cmp
cmp #$06
bne colorNotHigh
lda #$00
colorNotHigh:
sta (color),y
notUp:

;If down, bump color down
cmp #crsr_down
bne notDown
lda (color),y
and #$07
bne colorNotLow
lda #$06
colorNotLow:
sbc #$01 ;carry is always set after the cmp
sta (color),y
notDown:

;If left,
cmp #crsr_left
bne notLeft
; Erase cursor
lda #peg
sta (screen),y
; Bump cursor left
dey
bpl cursorOKLeft
ldy #$03
cursorOKLeft:
; Draw cursor
lda #cursor
sta (screen),y
notLeft:

;If right,
cmp #crsr_right
bne notRight
; Erase cursor
lda #peg
sta (screen),y
iny
cpy #$04
bne cursorOKRight
ldy #$00
cursorOKRight:
lda #cursor
sta (screen),y
notRight:

;Store cursor pos
sty $ff

;If not enter, goto Key_Loop
cmp #13
bne Key_Loop

;------------------------

;Erase cursor
lda #peg
sta (screen),y

;Check score

;Draw "0 0"
lda #'0
sta screenLoc+5
sta screenLoc+7
lda #0 ;black
sta colorLoc+5

;Calculate white score (any matches between the guess & key)
ldy #3
yLoop:
ldx #3
xLoop:
lda (color),y
and #$07
cmp secret,x
bne noWhiteMatch
inc screenLoc+7
noWhiteMatch
dex
bpl xLoop
dey
bpl yLoop

;Calculate black score (exact matches between the guess & key)
ldy #3
blackLoop:
lda (color),y
and #$07
cmp secret,y
bne noBlackMatch
inc screenLoc+5
dec screenLoc+7
noBlackMatch:
dey
bpl blackLoop

;Check win (black score = "4")
lda #'4
cmp screenLoc+5
beq Win

;Check for Game Over (screen's getting full)
lda screenGameOver
cmp #32
bne Game_Over

;Scroll screen up 2 rows
jsr $e8ea
jsr $e8ea

;Copy last hand
ldy #3
copyLoop:
lda colorLoc-80,y
sta (color),y
lda #peg
sta (screen),y
dey
bpl copyLoop:

jmp Game_Loop

Game_Over:
jsr $e8ea
ldy #8 ;Show "YOU SUCK!"
loseloop:
lda losetext,y
sta (screen),y
dey
bpl loseloop:

jsr $e8ea
ldy #3 ;Reveal the secret key
losecopy:
lda secret,y
sta (color),y
lda #peg
sta (screen),y
dey
bpl losecopy
jmp WaitStart

Win:
jsr $e8ea
ldy #15 ;Show Zero Wing reference
winloop:
lda wintext,y
sta screenLoc-4,y
dey
bpl winloop
rts

randomDelay:
ldx $d012 ;Loop n times, n = current raster location
randomLoop:
jsr delay
dex
bne randomLoop;

delay:
dey
bne delay
rts

wintext
.byte 1,32,23,9,14,14,5,18,32,9,19,32,25,15,21,33
losetext
.byte 25,15,21,32,19,21,3,11,33
titletext .byte 58,58,58,32,3,54,52,32,3,15,4,5,2,18,5,1,11,5,18,32,58,58,58
secret = *
--- end cb.asm ---
.......
....
..
. C=H 20

=====================================Part 2===================================

Tinyplay by SLJ
--------

Like many people, I was a serious Ultima guy growing up. Not only did
I love the games (and still do), but I absolutely loved the music (and still
do). In fact, I think the music from III and IV is some of the finest,
most musical composing ever done on the C64, and the best example of how
appropriate, atmospheric music can add volumes (so to speak) to a game.

It also turns out that I had been thinking about some new ideas for a new
music system for a while. So when Robin said he was writing an Ultima-like
game -- well! I'm certainly no Kenneth W. Arnold, but a set of tiny tunes
sure sounded like a neat and fun thing to do, especially in the Ultima style!
Robin removed a lot of neat features and text to add in the music, so I feel
pretty honored to have been included.

The player and tunes were written on pretty short notice; I think we started
a week or so before the deadline. Moreover, I had to go out of town the
weekend of the deadline, getting back that Sunday, where many of the
optimizations took place in frantic coding sessions! So it shouldn't be
surprising that the code is not as optimal as it could be -- if you get
around to looking at the source code, you'll see an awful lot of weird,
duplicate, and otherwise embarrasing lines of code which anyone with a clear
head would never have used. But as is, the player and three tunes take up
some 428 bytes (I think), which isn't too bad. (Later on I'll describe
a music compression routine which would have worked great, but that I didn't
have time to implement.)

Broadly speaking, the player is a note-duration based player, using two
sawtooth voices for that Ultima sound (originally, before reality kicked in,
the thinking was to put some sound fx in the third voice). It uses a single
play routine, using the .X register to tell which voice to play, and can
play multiple tunes. The original version has several effects included like
major and minor arpeggios, but memory constraints forced these to be taken
out (along with the original tune, alas, alas).

The full source code for the player is at the end of this article, and can be
divided into four primary parts: a one-time init routine, a routine to select
a tune, a routine to play notes, and the music data. But perhaps the best
way to understand the player is to begin by looking at how the data for a
tune is stored.


Tune encoding
-------------

Here is the third, "castle" tune, in its entirety:

* Tune 3: Castle

t2v1
dfb 97 ;gate
dfb 24+$80-1
dfb 70,96,72
dfb 6+$80-1
dfb 96,96,70,72
dfb 12+$80-1
dfb 74,74,96,75
dfb 24+$80-1
dfb 77,75,74,96,72
dfb 6+$80-1
dfb 96,96,70,72
dfb 24+$80-1
dfb 70,96,96,96
dfb 00
t2v2
dfb 97
dfb 24+$80-1
dfb 34,41,46,41
dfb 34,41,46,41
dfb 58,62,53,57
dfb 58,53,46,96
dfb 0

end



The first chunk of data (t2v1) is the voice 1 data, and the second chunk the
voice 2 data (t2v2).

There are 12 notes in an octave, and eight octaves total, for 96 possible
notes. In the player, data bytes 0-95 represent these notes -- the player
just uses the data as an index into a frequency table. This leaves bytes 96
and above free for other uses; also, since the lower note values are rarely
used, values like 0, 1, etc. are also free in principle.

Byte values >= 128 are used to specify the note duration. When the player
encounters these byte values it simply strips the high bit and stores the
result. When a new note is read in, this stored duration is placed into a
counter which is decremented on each player call.

In the castle tune above, the duration for voice 2 is only set once, at the
very beginning -- every note has the same duration. In the voice 1 data
it is set several times, to change the duration when appropriate.

For some reason the code decrements the duration counter until it becomes
negative, instead of decrementing it until it becomes zero. I'm sure it was
because of something relevant in an earlier version of the player, I just
totally don't remember what. But that is why the durations above are encoded
as 24+$80-1 -- the -1 is to make it go negative, instead of to zero. Looks
weird.

Byte values >95 and <128 are available for 'special' features. The player
supports one "effect": the gate toggle. The player can either leave the gate
bit on all the time or can turn it off right as a note is about to finish
(say, when the duration counter decrements down to 2). A byte value of 97
turns the gate toggle on -- in the castle tune above it is the first byte
of data. In other tunes, a value of 100 is used to turn off gate toggling.
(Historically speaking, values 98 and 99 were used for major and minor
arpeggios, and 100 was for no fx at all.)

The byte value 96 is used as a "rest" (or hold) -- the player resets the
duration counter but does not reset the gate bit or the note value. When
there is no gate toggling this provides a way of holding a note for longer
than the basic duration. When there is gate toggling, it starts the release
cycle, and lets notes fade away. In the castle tune above, the first two
notes in voice 1 are "70 96" -- a note, then a rest, during which the note
fades away.

Finally, the value 0 is used to indicate the end of data. While 0 is
technically a note it never gets used as such. When the player hits a zero
it simply resets the data pointer.

So with all that in mind, let's take a look at the player.


The Player
----------

Before playing a tune, a one-time routine is called which sets up the
frequency tables; when the player reads a new note, it looks up the SID
frequency settings in this table, along the lines of:

ldy note,x
lda freqlo,y
sta $d400
lda freqhi,y
sta $d401

Frequencies in different octaves are related by powers of 2. When you go
up an octave, the frequency doubles -- for example, the frequency of C-5
is twice the frequency of C-4. The routine starts with a table of twelve
frequency values for the highest octaves, and copies those values into
the end of the frequency table (e.g. freqlo+95, freqlo+94, ..., freqlo+84).
After copying, it divides each value by 2 and repeats the process -- now
the frequencies correspond to the next lower octave -- and this is done until
the frequency table is full. Piece of cake.

Once the frequency table is set up, the main play routine is called with
the tune number in .A. If the tune number is different than the current
tune, the player simply resets the music data pointers and note durations,
and falls through to the main player routine.

And the main player routine simply decrements the duration, and when it
becomes negative reads in the next data.

Because the process -- decrement duration, read next data, store frequencies
in SID -- is the same for each voice, a single routine is used for both voices
one and two, using .X as the voice index. That is, instead of

dec duration

the code can use

dec duration,x

and instead of

sta $d400

the code can use something like

ldy SidOffset,x
sta $d400,x

where .X is 0 or 1.

And that's about all there is to it -- the code is pretty straightforward.


Compressing Music
-----------------

Most music has the property that it doesn't jump around all over the place,
but rather notes progress in relatively small intervals (seconds, thirds,
fifths, etc.). For example, a major chord (as used in a typical arpeggio) is

note note+4 note+7 note

So the idea is to use a differential scheme, encoding the note _intervals_
instead of the note _values_. That is, the above could instead be encoded as

x 4 3 -7

where each value is added to the current note value. Thus, if intervals are
restricted to -8..7 then only four bits are required, cutting the data size
in half. Well, not quite of course, because an escape code is needed to
specify notes outside of the -8..7 range, not to mention durations and fx
settings. Luckily, not all interval values are used, so these can be used
for escape codes; code 96 -- the rest or "hold" note -- can also use an
otherwise unused value.

So the decompression code looks something like

read next four bits
if escape code then read next byte
otherwise add to current note value

Pretty simple, and leads to quite substantial savings. Ah, but for a little
more time...

Another possibility arises if not all note values are used. I set up the
frequency tables to contain all 12 notes in eight octaves. If not all twelve
notes are actually used -- if none of the tunes contain a D# or whatever --
then a few bytes can be shaved off of the frequency table. But this means
a lot of renumbering and such... it all depends on just how much you want
those extra bytes!

And that's about all there is to it.


Source code
-----------

*
* Tiny music player, for the
* minigame contest.
*
* slj 9/23/01
* sjudd@ffd2.com
*

; org $2800-491
org $2800-425

ARPDEL = 2
SR = $c9
GATEDEL = 2

curtune = $fe
curdur = $fc

tunepos = $b0 ;position in note sequence
seqpos = $b2 ;position in seq list
dur = $b4 ;2 bytes each
notefx = $b6
curnote = $b8
newnote = $ba

arpdur = $bc
arpoff = $bd

noteptr = $fa
temp = $fa
freqtab = $c000




*
* Code begins here
*
start

*-------------------------------
do 0
jmp InitFreq
jmp Play
jmp Reset
fin
*-------------------------------

*
* InitFreq -- set up note table
*
InitFreq
ldy #192
:l2 ldx #24
:loop lda freq-1,x
sta freqtab-1,y
dex
dey
lda freq-1,x
sta freqtab-1,y
lsr freq,x
ror freq-1,x
dey
beq :xit
dex
bne :loop
beq :l2
:xit rts

*
* Play routine
* tune in .A
*

Reset ; Force a tune reset
sec
ror curtune
; ldx #255
; stx curtune
Play
ldx #1
cmp curtune
beq PlayTune
sta curtune

ldy #00
sty tunepos
; sty tunepos+1
sty dur
sty dur+1

lda #15
sta $d418
lda #SR
sta $d406
sta $d406+7


*
* Main routine
*
* voice in .X (0, 1)
*

SetTune
ldy curtune
txa
beq :v1
:v2 lda v2tunepos,y
bne :sta
:v1 lda v1tunepos,y
:sta sta tunepos,x


PlayTune

Get
lda #00
sta newnote,x
dec dur,x
bpl DoFx

:next ldy tunepos,x
inc tunepos,x
lda tunes,y
beq SetTune
bpl :c1
and #$7f ;$80+dur
sta curdur,x
bne :next
:c1 cmp #96
bcc :setfreq
beq :hold
sta notefx,x
bne :next
:note ;sta curnote,x
; sta newnote,x
; jsr setfreq
:SetFreq
asl
tay
lda Freqtab,y
pha
lda Freqtab+1,y
ldy SidOffset,x
sta $d401,y
pla
sta $d400,y
:gate
lda #$21
ldy SidOffset,x
sta $d404,y


:hold lda curdur,x
sta dur,x



DoFx
ldy notefx,x
cpy #97
bne AllDone

ldy dur,x
cpy #GATEDEL
bcs :c1
lda #$20
ldy SidOffset,x
sta $d404,y
:c1

AllDone
dex
bpl PlayTune
RTS rts


*
* Frequencies
*
freq
da 34334
da 36377
da 38539
da 40831
da 43258
da 45831
da 48557
da 51444
da 54502
da 57743
da 61177
da 64815

;arptab dfb 0,4,7

*
* Tune positions (offset into sequence
* list)
*
v1tunepos
dfb t0v1-tunes
dfb t1v1-tunes
dfb t2v1-tunes
v2tunepos
dfb t0v2-tunes
dfb t1v2-tunes
dfb t2v2-tunes


tunes
; dfb 00 ;dummy byte
*
* Sid offset table
*
SidOffset
dfb 00
dfb 07


t0v1
dfb 97
dfb 10+$80-1 ;dur=10
dfb 42,47,51
dfb 42,47,51
dfb 42,47,47,47
dfb 51,51,51
dfb 54,52,51
dfb 52,51,49
dfb 51,51,51
dfb 49,49,49
dfb 51,49,47
dfb 49,47,46
dfb 44,44,44
dfb 46,46,46
dfb 47,51,47
dfb 46,42,46
dfb 47,96,96
dfb 96,96
dfb 00

t0v2 ;v2
dfb 97
dfb 10+$80-1
dfb 35,35,96
dfb 35,35,96
dfb 35,35,96
dfb 35,39,96
dfb 39,42,96
dfb 42,30,96
dfb 30,35,96
dfb 35,34,96
dfb 34,32,96
dfb 32,32,96
dfb 32,32,96
dfb 32,34,96
dfb 34,35,96
dfb 35,30,96
dfb 35,35,96
dfb 35,35,96
dfb 00

* Tune 2: Combat

t1v1
dfb 100 ;No gate
dfb 11+$80-1 ;dur=11
dfb 51,54,58,59,96
dfb 58,57,54,51,54,58,63,96
dfb 62,61,60,59,56,53
dfb 58,54,51,56,53
dfb 46,50,53,56
dfb 59,58,57,54
dfb 0

t1v2 ;v2
dfb 100
dfb 11+$80-1
dfb 15,27,15,39
dfb 15,27,15,39
dfb 15,27,15,39
dfb 15,27,15,39
dfb 47,44,41,51,46,42,47,44
dfb 22,34
dfb 22,34
dfb 22,34
dfb 22,34
dfb 0

* Tune 3: Castle

t2v1
dfb 97 ;gate
dfb 24+$80-1
dfb 70,96,72
dfb 6+$80-1
dfb 96,96,70,72
dfb 12+$80-1
dfb 74,74,96,75
dfb 24+$80-1
dfb 77,75,74,96,72
dfb 6+$80-1
dfb 96,96,70,72
dfb 24+$80-1
dfb 70,96,96,96
dfb 00
t2v2
dfb 97
dfb 24+$80-1
dfb 34,41,46,41
dfb 34,41,46,41
dfb 58,62,53,57
dfb 58,53,46,96
dfb 0

end

.......
....
..
. C=H 20

=====================================Part 3===================================

MagerTris -- by Per Olofsson <magervalp@cling.gu.se>

Way back in 1990 or thereabouts I wrote a Tetris clone in basic on the
Amiga. When the compo was announced I scratched my head for a bit and
figured that I should have a good chance of fitting Tetris in 512
bytes. If I couldn't I would at least have a good engine for the 2K
compo.

The Game

This puzzle game was invented in the 80s by the Russian programmer
Alexey Pazhitnov. The game has a vertical field where one of seven
puzzle pieces appears at the top and falls towards the bottom. The
player can rotate the piece 90 degrees or move it left or right. When
it reaches the bottom the game field is checked for filled lines that
are removed. The player is awarded points for removing lines (with a
bonus for clearing more than one in one go) and the game is over when
the screen fills up so that new pieces can't enter the game field.

Drawing the Screen

The screen is filled with inverted space and the game pieces and the
border is drawn into the color ram. There aren't any real tricks used
here, the only one is that the color of the border is a side effect
of the last character printed, ascii 13 (light green).

Random Numbers

We need a decent source of random numbers to generate a random Tetris
piece. The SID chip's noise waveform is probably the best one
available on the C64, and fortunately the code to access it is really
short. To initialize I used:

initrnd subroutine
lda #$81 ; enable noise
sta $d412 ; voice 3 control register
sta $d40f ; $81xx as the frequency

and to get an 8-bit random number all you have to do is lda $d41b.

The Tetris Pieces

Tetris has 7 puzzle pieces that you have to store in a table. As every
piece consists of four boxes we need a total of 7*4 = 28 entries. The
smallest tetris table is probably the one where each piece is
represented by 8 bits, like this:

0010 0000 0011 0110 0100 0010 0110
0111 1111 0110 0011 0111 1110 0110

The table is very compact, a mere 7 bytes, but then you need code to
rotate every piece and accessing the table is somewhat clunky on the
6502. I've seen an x86 version that did this though, with a total size
of less then 256 bytes. However, on the 6502 I needed four bytes for
each piece for the code to be reasonably compact and I also kept all
the rotated pieces in the table, for a total of 112 bytes.
Fortunately, as every piece has a center box we only need to store 3
which trims the table to 84 bytes. But what do we actually store? As
the screen is painted to color ram I used a pointer to the current
position. In the table I simply stored the offset in color ram from
the center box. With indirect indexed addressing the routine is nice
and short.

Failure

The rest of the program is fairly straight forward. User input is just
a cmp loop, there are dec zp timers for most events, and a raster wait
to time everything. This is also where I failed to fit everything
inside the 512 byte limit, as everything put together took about 600
bytes, even after some serious optimizations. There are two very
similar routines: the one that draws a piece on the screen and the one
that checks for collisions. Both iterate through the four boxes in a
piece, but because of the way I used some zeropage pointers I couldn't
merge the two routines into one without rewriting most of the code.

Success

As rewriting everything meant too much work, I just decided to go for
plan B: write a 2K entry. With a 600 byte engine there was plenty of
space to add features, and at around 700 bytes or so compressÃŒon
starts to make sense -- pucrunch breaks even around there somewhere,
and the final binary is actually 6500 bytes uncompressed. I could
easily fit a title screen complete with custom graphics, a nice tile
set for the tetris pieces, and even a hiscore list that saves to
disk. The only trick I used here was that the custom character set was
EOR:d with the ROM font, and as most characters are the same or very
similar it compresses much better that way.

And that's the story of the 2K MagerTris.
.......
....
..
. C=H 20

=====================================Part 4===================================

Tuff -- Compressing tiny programs
-----------------------------------> Stephen L. Judd <sjudd@ffd2.com>

Most compression schemes address the problem of taking a large program or
data file and compressing it down. But have you given much thought to the
problem of trying to compress a _small_ program, even a 512-byte program?

I spent some time trying to come up with a compression scheme for tiny
programs -- saving even 2 or 3 bytes in tetrattack would have been helpful.
The effort was basically a failure, as the best I thought I could do was to
break even (on paper, for tetrattack). I never actually implemented the code
either -- just worked it out on paper.

It is fairly interesting though, and maybe these preliminary efforts will
give other people ideas (especially for compressing 2k programs). And after
writing this article, I now think it might have really worked, and saved
a handful of bytes -- maybe next year?

Tiny Compression Basics
-----------------------

If you remember your C=Hacking #16, there are two basic approaches to
compression: taking a fixed-length input while giving a variable-length
output, or taking a variable-length input and producing a fixed-length output.
An example of the latter is LZ compression -- looking for repeated
strings of length 2, 3, etc. and replacing those with a single byte (or two
bytes, or whatever). An example of the former is a Huffman tree, which
replaces each byte with a string of bits, using fewer bits for the most
common bytes.

To decompress the compressed data, of course, a decompression routine is
needed. For tiny compression, this routine obviously needs to be as tiny
as possible. Most C64 decompression programs copy themselves somewhere
before decompressing the data stream to wherever it is supposed to go.
Obviously, if we put a few restrictions on the data to be decompressed
the decompression routine can be made smaller, and more specific to the
task.

Finally, it is worth mentioning that a tiny program in general does not
use all possible byte values.

Now, first consider Huffman encoding. In a typical Huffman decoding scheme,
the decoder reads a bit at a time, traversing the Huffman tree with each
bit, until a byte is hit. This implies a fairly complicated decoder, and
worse the tree must be stored. Finally, the tree can be fairly large, if
the number of possible symbols is large. So Huffman doesn't seem like a
good approach.

So, consider LZ-style encoding: replacing n-byte sequences with a shorter
code. In LZ, the n-byte code is replaced with a reference to the previously
decompressed bytes -- a code which says how far back to go, and how many
bytes to copy. What a drag -- an escape code, the distance backwards, and
the number of bytes to copy. For a byte-aligned decompressor, this implies
at least two bytes (since the decompression routine must be small, a byte-
aligned decompressor is of greater interest).

So far so bad. But, since a tiny program doesn't use all possible byte
values, perhaps a simple substitution method is possible -- that is, replacing
various 2-, 3-, 4- etc. byte sequences with one of those bytes. Alas, the
numbers don't work out very well: in the above scheme, let's say an
n-byte sequence appears m times, for a total of n*m bytes. We replace
each of those sequences with a single byte, giving a savings of

n*m - m

bytes. But we also need to store the original sequence in a table somewhere,
plus the byte corresponding to that sequence. This means another n+1 bytes,
so the total savings is

n*m - (m+n+1)

So, for example, if we have a 2-byte sequence repeated 3 times, the savings
is

2*3 - (2+3+1) = 0 bytes!

The issue is that replacing a 2-byte sequence with a 1-byte sequence saves
one byte, and this must happen at least three times to overcome the three
bytes of table storage. With 4 repetitions, one byte is saved. And so
on. The net result is that you have to have a _lot_ of sequences repeated
a lot of times to get any savings, and somehow this savings needs to be
greater than the decompression code.

Note that, in general, if there are only a few sequences to replace then
some custom code can be shorter, but as the number of sequences increases
then generic code, with sequences stored in tables, quickly becomes
shorter.

The final dagger in the heart of these schemes is the program itself. I
wrote a simple program to find all the repeated strings in tetrattack.
There were numerous 2-byte sequences, but rarely repeated more than three
times. There were a few 3-byte sequences, and even one 4-byte sequence.
But overall, there just aren't enough sequences, especially long sequences,
to make any sequence-shortening scheme worthwhile.

There are, however, lots of repeated bytes. For example, there are an
awful lot of STA instructions and zp variables, and usually a lot of
zeros. This suggests taking another look at a Huffman-style scheme to
replace those bytes with a smaller number of bits.

One major problem identified earlier is storing the Huffman tree -- it's
just too big. So, the first thing to consider is to just compress the
most common bytes -- maybe the top five or eight bytes or whatever --
to make the tree smaller. But that still leaves a substantial decompression
code to traverse the tree, and a tree with several empty nodes in it.

Once again, if there are only a few nodes then custom code might be shorter,
but beyond a few nodes it's more practical to store the tree and have some
generic traversal code.

But consider the following alternative to a Huffman tree: what if we had a
tree which looked like

0/\1
/ \
lit 0/\1
/ \
b1 0/\1
/ \
b2 ...

where "lit" is a literal byte and b1, b2, etc. are encoded bytes. This would
encode the most common byte with two bits (10), the next-most common byte
with three bits (110), and so on, and encode literal bytes with nine bits
(0byte).

To see if this is a viable scheme, check out the numbers. Let's say that
we encoded just the top three most common bytes in a program; in an early
version of tetratack those bytes were

00 $2F occurances
03 $1A occurances (ZP variable)
8D $18 occurances (STA opcode)

In a real implementation, I'd choose zp variables to be common opcodes
like $8D, which would double the number of occurances of that byte, but
let's just consider the above numbers in a 512-byte program. Those top
three bytes total $5A bytes, almost 1/8 of the total bytes (the top seven
bytes take a little less than 1/3 of the total, by contrast). The number
of bits required is

9*($200-$5A) + 2*$2F + 3*$1A + 4*$18 = $FE2 bits = $1FD bytes.

so, a whopping 3 byte savings -- just enough to store the three "tree"
values. For the top seven bytes, a similar calculation suggests $01D8 bytes
of compressed data, again for code not optimized for compression. For a
more optimized code, with zp variables the same as common opcodes and such,
the total bytes could conceivably be on the order of $1C0 bytes -- maybe
64 bytes of savings (but again, I never actually tried coding a more
optimized version so I don't know for sure).

Of course, we haven't talked about decompression yet. One important
feature of this tree is that the "treeness" of it is unimportant -- it's
really just a substitution code. Remember that the data is encoded as

0+8 bits literal byte
10 byte 1
110 byte 2
1110 byte 3

and so on. A decompression algorithm simply reads bits until a 0 is found,
and then looks up the relevant value in a table. The exception is if
the first bit read is 0, in which case the code needs to read the next
eight bits and output the byte. There is no tree traversal, in the sense
of looking up nodes and moving left/right, to speak of.

Reading bits is trickier than reading bytes. After every eight bits read
some sort of memory pointer has to be incremented to the next eight bits.
Moreover, when reading codes like 110 or 1110 the number of 1's read has
to be counted too, to be used as an index into the lookup table. And if
there's a literal byte then the next eight bits need to be read, regardless
of their value.

When reading the bitstream, after every eight bits the pointer into that
bitstream must be incremented. To count eight "global" bits at a time, I use
a zp variable "count" and just rotate a bit through the different bit
positions. Every time it turns zero, the data stream pointer is incremented.
The advantage of this method is that it frees up a register (and winds up
saving a byte overall).

To count the number of "local" bits read, I use the .Y register. Initially
it is set to zero, and increments as 1s are read. For a literal byte,
I set .Y to -8, and put a little check in the bit reader for negative .Y;
that way, it simply increments to .Y=0.

When bit=0 is finally found, the code simply does an lda data,y to get the
coded data. Since .Y counts the number of bits read, .Y will be at least 1
for any coded data, but .Y=0 for a literal byte as described above.
Therefore, for a literal byte, the code can simply rotate bits into the
location "data", and use the same lda data,y instruction to fetch the
literal byte.

With all that in mind, here's the code I finally came up with (on paper!):

Tuff+1
[data lookup table]

Source [compressed data bitstream]

:lit ldy #-8
:getbit lsr count ;Increment pointer every eight bits
bne :skip
ror count
inx
bne :skip
inc :src+2
:skip
:src asl Source,x ;Get next bit
bcs :next
tya ;.y=0 means first bit is 0
beq :lit ;so read next 8 bits
bpl :found ;If .y<0 then literal
:next rol Tuff ;Save bit
iny
bne :getbit ;until .y=0
:found
lda Tuff,y ;Fetch code/literal byte
ldy #00
sta (zp),y ;Output data
inc zp
bne :getbit
inc zp+1
bpl :getbit
[Uncompressed code -- (zp) points here!]

for a total of 45 bytes, plus 7 bytes for the "Tuff" byte table -- total
of 52 bytes, compared with a possible 64 byte savings (maybe I should have
tried to implement this after all!). Hopefully someone can improve on
this further. (And "Tuff", btw, is a contraction of "TinyHuff").

Now, there may be a few raised eyebrows here which I hope I'll lower. This
code makes several assumptions. It can either be entered in at ":src",
or at ":getbit" depending on the value of "count", the global bit counter.
I assume count is already initialized (e.g. default kernal/restart value),
either to 1 (in which case :getbit is the entry point) or $80, in which case
:src is the entry point, with the "Source" stream modified accordingly in
each case.

I further assume that .axy have their default values of 0 when the routine
is called from BASIC.

I also assume "zp" is initialized to the correct value already. One easy
way to do this is to use $ae/$af, the end load address+1, but other
options are certainly available.

Finally, you may have noticed that the thing just keeps outputting data
until zp=$8000 -- correct. Unless you really really care about garbage in
memory above the program, there's no reason to waste bytes on an "end of data"
check.

Another variant
---------------

I spent a little time looking at a variant of this scheme: using a fixed
length bit sequence, for example, assigning four bits to each substitution
byte, i.e.

0+ literal byte
1000 byte 1
1001 byte 2
1010 byte 3
...
1111 byte 8

The numbers work out pretty well, compression-wise, but at the time I felt
the decompression code was more complicated and that it wouldn't break
even. Looking at it now, though, I'm starting to think otherwise -- that
loading three bits at a time isn't any more complicated that fetching
eight bits so code can be reused, and there isn't any issue about checking
for a 1 bit, so the decompression code ought to be even simpler. So this
may actually be the better scheme.

Sigh... I don't know about you, but every time I revisit a problem I wind
up feeling much stupider.

Final thoughts
--------------

So, that's it: assign shorter bit codes to the most common bytes, and use
a highly optimized decompression routine. This method is very specialized,
obviously, but might, just might, make it possible to save a few bytes on
a tiny program that has been optimized for compression -- using zp locations
the same value as common opcodes, reusing vars as often as possible, and
so forth.

The reason this works is that, in general, repeated long strings in a small
program just don't happen, whereas repeated bytes are common and make up
a substantial fraction of the total.

Pelle (magervalp@cling.gu.se) has suggested the possibility using the first
eight bytes of the compressed program for the 'Tuffman' table -- to structure
the code to make these common bytes, and thereby save eight bytes by not
having to store the 'tree'. Might save a few more bytes with some programs.

Finally, while writing this article, I started to wonder about byte sequences
with "holes", for example, <byte skip byte skip> for things like STA zp1
STA zp2. I wonder if there's some way of taking advantage of these repeated
"mask" patterns.

But I leave that thought for another person, for another day.
.......
....
..
. C=H 20

=====================================Part 5===================================

Tinyrinth
by Mark Seelye (aka Burning Horizon)
mseelye@yahoo.com
http://www,burninghorizon.com


Introduction
When the mini-game competition was first announced I was originally not
interested because of time considerations -- I spend most of my time at work,
or watching the kids. But once I saw that nearly everyone on EFNet #c-64 was
doing one, I thought it would be fun if I could just keep it simple.

I had no idea what game to do. I had a few other ideas but I settled on
a maze game because I thought it'd be an interesting challenge doing a maze
algorithm and have game code in under 512 bytes. As it turned out, I ended up
getting the code down to like 475 bytes, but the version I turned in was 509
bytes. (I ended up shaving an additional 34 bytes just before the due date.)

The other reason for the maze game is I have this weird thing I do where
in each language I know I code this little maze generation algorithm that I
made a long time ago. I had used the algorithm in ASP, VB, C, and a few
others but for some reason I had never done it in 6502! Imagine that! It was
going to be a bit of a challenge because I had never done it in so few
instructions, and never in ASM. It would have to lose some weight and fluff -
but I knew I could do it. So obviously there are MANY better ways to generate
a maze, but in light of trying to make it small this routine ended up being
such a strange mutation of my idea that I'm surprised it worked at all.

Setup
Before I do anything I must setup some things. This is also used when a
player dies OR enters a new level. The only difference between a reset of the
game and moving to the next level is the ldx #$00: sta $53. The $53 is also
used to set the color of the maze so it rotates from a level one of white, to
a level 16 of BLACK (which is evil because then you can't see the maze except
for the small hint of the corners on a turn.)

; Initialization
setup = *
gameover = *
;Setup game variables
ldx #$00
stx $53 ; Start at level 1 (to be inc'd)
init = *
;Setup level
inc $53 ; next level
ldx $53
stx $54 ;Counter for drawing Keys (next level)
stx $0286 ;Character Color to use on clear (e544)

;Set Render Cursor Start Pos / Player Color
lda #$05
sta EBCM1 ;Set ebcm color PLAYER to GREEN ($d022)
sta $45 ; Cursor/Player Position X (0-39)
sta $46 ; Cursor/Player Position Y (0-24)
;Clear Screen
jsr $E544 ;clear screen set to char color ($0286)

lda #$5b
sta $d011 ;turn on EBCM
lda #$18
sta $d018 ;Activate Cset

As you can see that I mention extended color background mode, I'll talk about
that later. You also might notice I use a character set, but that I will talk
about a little later as well, first I want to jump right into the maze
generation.


Maze Algorithm
The easiest way to describe how the algorithm works is to compare it to a
worm. The worm eats through an area until it gets blocked by one of its
former meals. While eating the worm is counting how many meals he has, once
he gets to a certain number he knows he can eat no more.
With that said, I'm sure you're completely confused and probably think
(know?) I'm crazy. So now for the technical explanation. Starting with a
2-dimensional grid of some size, initialized with 0's, set a start point for
the formation of the maze within the grid somewhere. After that has been
taken care of the rest follows simple logic:

Have we reached every part of the grid?
if so: done;
if not: Can we grow the maze into an adjacent area from the
current position?
if so: Grow in one of those directions at random, count it,
continue;
if not: Find a place in the maze that can grow, continue

I had to change the order of the logic around to get it to better fit into
a smaller number of bytes. So I moved the "find a place to grow from"
routine, and the "Have we reached every part of the grid" check after the
check if we can grow in any direction from the current position.

To check in all directions from the current position I used a subroutine.
The subroutine has 5 entry points which translate to: cangrow up/right/down/
left, and the 5th entry point is actually the first entry point - it checks
to see what is in .a and checks the direction that value represents. To check
the direction I use a kernal routine to set some zp so I can easily check the
value on the screen to the left, right, below, or above the current position.
This is also where the boundaries are set: 40 columns wide and 25 characters
high.

popgridloop = *
ldy $45 ;xpos
ldx $46 ;ypos
cangrowxy = * ;Can grow in any direction?
jsr cgup ;check up
beq _cgxy ;if 0 then we can grow
inx ;offset up check
jsr cgright ;check right
beq _cgxy ;if 0 then we can grow
dey ;offset right check
jsr cgdown ;check down
beq _cgxy ;if 0 then we can grow
dex ;offset down check
jsr cgleft ;check left
_cgxy beq growloop ;if 0 then we can grow

....The subroutine!... There is a weird "injection" in the bottom of this
routine to save a couple bytes.. it is explained in the comments to the code.

;Check if a cell can grow a direction
;1-up 2-right 3-down 4-left
; (y xpos, x ypos, a=dir) x/y switched for indirect lda($xx),y below
; return: a == 0 : true (can move)
; a <> 0 : false (cannot move)
cangrow = *
cmp #$01
beq cgup
cmp #$02
beq cgright
cmp #$03
beq cgdown
;cmp #$04
;beq cgleft *** not needed falling through
cgleft = *
dey ;set xpos - 1
cpy #$ff ;check xpos
beq cgno
bne cgloadcell
cgright = *
iny ;set xpos + 1
cpy #$28 ;check xpos (<40)
beq cgno
bne cgloadcell
cgup = *
dex ;set ypos - 1
cpx #$ff ;check xpos
beq cgno
bne cgloadcell
cgdown = *
inx ;set ypos + 1
cpx #$19 ;check ypos (<25)
beq cgno
;*** fallthrough, bne cgloadcell not needed
cgloadcell = *
lda #$1f
loadcell = * ;x = ypos, y = xpos, a = and value
sta $50
jsr $e9f0 ; * sets $d1 $d2
lda ($d1),y ;load byte (x pos in y reg!)
cgno = *
and $50 ;#$1f = use only low 5 bits!
;rts see below!
; This is mixed with the rts because the first byte would
; be wasted
growfrom = *
rts
.byte 1
; this is part growfrom part growto 48 is used for both
; again first byte would be wasted so we overlap with the previous
growinto = *
.byte 2, 4, 8, 1, 2

;explanation of above
; 0 1 2 4 8
; 0 4 8 1 2
;rts 1 2 4 8
; 4 8 1 2

*From Mapping: ;59888 $E9F0
;Set Pointer to Screen Address of Start of Line
;This subroutine puts the address of the first byte of the
;screen line designated by the .X register into locations
;209-210 ($D1-$D2).

Well, if the code at "cangrowxy" discovers that it cannot grow in any
direction from the current position, it has to find a place to grow from. To
do this it falls into the "findgrow" routine; the findgrow routine is
interesting because it is a "stateful" routine, meaning that upon reentry it
resumes what it was doing before. The reason for this is I didn't want it
restarting at the top of the maze each time it entered this find routine - I
wanted it to continue to search from where it left off. This routine works by
"walking" the current x and y positions through the grid. To walk it through
it keeps track of where it left off and it just keeps setting the x and y
pointers to the next place and "returns" to the start of the generation
portion of the code.
The other advantage to doing it this way is I can reset this routine to
place a "key" at a dead end in the maze. This reset is done elsewhere in the
maze by setting a zp value to 0. When the reset is done the findgrow routine
runs a small piece of code to reset itself, but before it does this it places
a key if it has not exceeded the max number of keys.

Here is the find grow routine, each line of code is documented to hint at
what it is doing:

; *** fall into findgrow
findgrow = *
lda $4b ; Check byte 0 != resume findgrow
bne _fgresume
sta $49 ;Reset Findgrow Xpos
sta $4a ;Reset Findgrow Ypos
inc $4b ;Set findgrow flag to resume (<>0)

;Place keys in corners (injected here for ease of placement, d1/d2 is
;pointed at a dead end)
iny ; offset left check
beq _fgresume ;Do not try when column is 0, it freaks out
lda $54
beq _fgresume ;if 0 then keys are done
dec $54 ;dec # of keys left to place
inc $51 ;actual num keys left
lda ($d1),y ;load byte
ora #$c0 ;EBCM value for key!
sta ($d1),y ;store new value
;(end of injection)

_fgresume = *
_fgx ldx $4a ;Findgrow ypos
_fgy ldy $49 ;Findgrow xpos
inc $49 ;Next xpos (next round)
cpy #$28 ; < 40
beq _fgcr ; next line if >= 40
jsr cgloadcell ; load cell byte
beq _fgy ; if 0 then get next xpos/byte
sty $45 ;Set Current xpos
stx $46 ;Set Current ypos
jmp cangrowxy ;Check if this can grow
_fgcr lda #$00 ;Reset Findgrow xpos
sta $49 ; 0->xpos
inc $4a ;Next Findgrow ypos
lda $4a
cmp #$19 ;check ypos (<25)
bne _fgx ;If we're at x40y25 we are ready to play!
beq gameinit ;Start game logic

That code also checks to see if we have completed generating the maze, at
which point it enters the game logic. But before we jump into that I have to
explain what happens when the "cangrowxy" routine discovers that it CAN GROW,
instead of falling into the "findgrow" routine.

In the "cangrowxy" routine described a bit above there was the final line
of:

_cgxy beq growloop ;if 0 then we can grow

This "growloop" routine is the place where the actual "growing" or
"eating" occurs. This routine works by choosing a direction at random, and
attempting to grow in that direction. It reuses the same "cgup" etc. routines
that the "cangrowxy" uses because although we know we can grow in a direction,
we don't know which direction. In a larger version of this algorithm I
usually link all of this stuff together, but this is how it worked out in
tinyrinth.

growloop = *
randdir = *
;jsr getrand; not a func, not reused yet
getrand = *
lda #$80
sta $028a ; Key Repeat; (injected here for #$80) just using #$80 for
;smaller code
sta $d412 ;sta $d404 d412 is V3, d404 is V1!!
sta $d40f ;set v3 random # gen bits
lda $d41b ; read random number from V3
and #$03 ; Force Random number to 0-3
clc
adc #$01 ; Add 1 to get 1-4
sta $4c ; store rand direction
ldy $45 ; Current Xpos
ldx $46 ; Current Ypos
jsr cangrow ; Check if we can grow in that direction
bne randdir ; if <> 0 then Try again
sta $4b ; reset findgrow flag (injected here for .a==0)
grow = *
ldx $4c ;Get saved rand direction
lda growinto,x ; 1-4 (4, 8, 1, 2) Get bit set for new cell
sta ($d1),y ; write new location
lda growfrom,x ; 1-4 (1, 2, 4, 8) Get bit to set for old
sta $4d ; Save growfrom bit
ldy $45 ; Reload Current xpos
ldx $46 ; Reload Current ypos
jsr cgloadcell ; Load base cell again
ora $4d ; Combine with growfrom bit
sta ($d1),y ;Modify old cell to connect to new loc
;Change current position
lda $4c ; Get saved rand direction
jsr cangrow ; Get new x y again - (this will only perform next x/y
;adj, returns <>0)
sty $45 ; xpos set to new location
stx $46 ; ypos
jmp popgridloop ; Return to populate grid loop

As you can see above I use the random number generator from V3 to get the
random direction. A HUGE bonus here is I can use the value returned to
directly call that main entry point in "cangrow" so it will figure out which
direction to check by itself. If the direction it checks is bad it just loops
and tries again. In a larger version of the program I'd make this much more
efficient, but efficiency costs bytes. Many coders would be afraid of using a
loop like this but since I pre-checked that I can grow before entering this
loop, unless my random number generator never returns that direction, I should
be fine!
Once we have a good direction I use the already set $d1 zp to read and
update the screen data. Nice bonus of using the routine that sets it! The
code that updates the data must update the cell you are growing to, but ALSO
it must update the cell you are growing from, which is why I have to reload
the original data up there. When it is finished doing the growing it moves
the x/y positions ($45, $46) to the place it grew to; and resumes the maze
generation from there.

As I said before, the findgrow routine will jump into the game logic once
it knows it is finished, but before we talk about that I should describe
another problem that I needed to solve.

 

In a 2 dimensional maze there are 16 possible pieces that can be formed.
Starting with completely closed ending with completely open. I represent
these pieces with numbers 0-15; each bit represents a side of a cell that is
open/closed. So when I finish with the maze generation I have a grid of
numbers 0-15. (Actually 1-14; in this game I never end up with 0's or 15's
as all pieces are used.) In Tinyrinth the "grid" I used was the screen memory
(40x25), so I ended up with a screen full of characters A-N.
Navigating a maze of characters A-N would be rather difficult so I had to
decide whether to use the built in character set and transform what was on the
screen or somehow use a character set. Well, another hitch was I wanted this
to be a game and losing the simplicity of 0-15 for detecting which way a
player could move would be bad - so I opted for a character set of 16
characters. Hmmm 8 bytes per character time 16 characters.. 128 bytes!
Yikes!

Character Set Generation
So I was going to need a 128 byte character set, but I didn't want to have
to lose 128 bytes. So what did I do? I coded about 70 bytes of code to
generate the character set. This too pained me because I was sure I could
make it smaller somehow, but once I got it working and a decent percentage
smaller than the size of what it produced I stopped worrying about it.
Here is how it works: we know we want to represent the 16 pieces with 0-15
so the bits must give some hint to the shape of the piece. What I did was
create a loop to check the current counter value and either set or skip the
top, sides, bottom of the piece.
Here is how it works, I have commented each line of code to describe
what it is doing.

; Generate Cset!
lda #$20 ; write hi
sta $48 ; use zp
lda #$00 ; write lo

;Initialize Screen, variables (injected here to save bytes - using
;lda #$00)
sta $51 ; Clear actual num keys placed counter (see findgrow)
sta $d020
sta EBCM0; Set BG Color ($d021)
;(end of injection)

tax ; counter = 0
_again sta $47 ; use zp
ldy #$00 ; index
txa ; counter to a
and #$01 ; check for top
beq _ytop ; yes top
eor #%01111111 ; 00000001 -> 011111110 -> 10000001
_ytop eor #%11111111 ; 00000000 -> 111111111
_6sides sta ($47),y ; store top/sides to cset
iny ; next mem location
txa ; counter to a
and #$02 ; check for right
eor #%00000010 ; flip
lsr ; 00000010 -> 00000001 || 0->0
sta $49 ; store for right side
txa ; counter to a
and #$08 ; check for left side
bne _noleft ; no left
eor #%10001000 ; 00000000 -> 10001000 -> 10000000
_noleft eor #%00001000 ; 00001000 -> 00000000
ora $49 ; merge with right
cpy #$07 ; 7->15->23->...
bne _6sides ; total of 6 side pieces
txa ; counter to a
and #$04 ; check for bottom
beq _ybot ; no bottom
eor #%01111010 ; 00000100 -> 01111110 -> 10000001
_ybot eor #%11111111 ; 00000000 -> 11111111
sta ($47),y ; store bottom to cset
inx ; next counter
clc ; clear carry
lda $47 ; inc zp
adc #$08 ; by 8
bne _again ; do it again
inc $48
ldy $48
cpy #$28 ;repeat through cset 2000-2800
bne _again
sty EBCM2 ;Set Death color ($d023) (using result of cset gen for
;color value!)

I still think I could have made that smaller; either way though it was fun
making a little routine to generate the cest. With that all said, we now know
how the maze is generated. But now I needed a way to play it, as suggesting
players use dry erase markers on their screens was not a great idea.


Game Loop
Ok, once "findgrow" reaches the end of its maximum size it branches to
"gameinit". Game init sets up some quick things and gets going on the game
play. The game loop loops until the player either gets all the keys or dies.
When the player gets all the keys the level number ($53) is increased which is
used to increase the maximum number of keys and the color of the maze!
Here is the game loop, I have commented much of the code and separated it
into sections:

*** Flashes the keys, sets speed of "minitaur" based on the level

; Game Initialization and Game Loop
gameinit = *
gameloop=*
inc EBCM3 ; Flash Keys ($d024)
inc $58 ; Increase Speed counter #1 (0-255)
bne moveplayer ; Skip move
inc $57 ; Increase Speed counter #2 ($57|#$f0 - 255)
bne moveplayer ; Skip Move

;set death speed
lda $53 ;Use level for Speed value
cmp #%11111000 ;If more than this use default speed
bmi _dsp
lda #%11111000 ;Default speed
_dsp ora #%11110000 ;Set high nibble so counter counts up to 255
sta $57 ; Set Speed counter #2

*** Moves the minitaur to the next position if it is time. (Checks for player
*** hits)

;move death
movedeath = *
ldy $55 ;Baddy Xpos
ldx $56 ;Baddy Ypos
jsr cgloadcell ; load the cell/point the zps (ANDs by #$1f)
sta ($d1),y ;store cleared value
_newy iny ;increase xpos
cpy #$28 ;less than 40?
bmi _go ;don't reset
ldx $46 ;ypos of player
stx $56 ;ypos of death
ldy #$00 ;clear xpos counter
_go sty $55 ;Set baddy xpos
lda #$ff ;Get all bits! (see loadcell)
jsr loadcell ;load the cell/point the zps
sta $59 ;Save cell value (withh all possible bits)
and #%11000000 ;and by EBCM bits
cmp #%11000000 ;Check for KEY - (so it can skip over)
beq _newy ;Jump ahead 1 more to skip key position
cmp #%01000000 ;Check for player hit
bne _nodie ;Player is not dead
jmp gameover ;Game Over!
_nodie lda $59 ;Reload stored value
ora #$80 ;EBCM for Death
sta ($d1),y ;store value
; *** fall through to Move Player

*** Checks the keyboard for input, moves player if possible

;Move Player
moveplayer=*
_ffe4 jsr $ffe4 ;Get keypress
beq gameloop ;no key - goto gameloop
and #%00000011 ;.a == 0-7 at this point
tax
lda keytodir,x ;Loads from keytodir
;Move entity in game
; .a=direction 1-up 2-right 3-down 4-left
glmove tax
stx $4e ; store direction
lda growinto,x ; get check bit
sta $4f ; store check bit
ldy $45 ; current xpos
ldx $46 ; current ypos
jsr cgloadcell ; load the cell (and with #$1f)
sta ($d1),y ; store the data (clear the EBCM)
lda #$00 ; Bottom "fall out" fix
sta $50 ; clear and of cangrow Bottom "fall out" fix
lda $4e ; load direction
jsr cangrow ; call cangrow to move xpos/ypos
and $4f ; check bit
bne glmyes ; if we have a bit then we can move!
ldy $45 ; reload xpos - do not move
ldx $46 ; reload ypos - do not move
glmyes lda #$ff ; bits to obtain from loadcell
jsr loadcell ; load the cell/point the zps
pha ; temp store value for later checks
and #$1f ; clear other EBCM bits
ora #$40 ; EBCM ORA Player/Baddy
sta ($d1),y ; store new data
sty $45 ; store xpos of new position
stx $46 ; store ypos of new position

*** Checks for key hits, minitaur hits, Check for end-level:
(jumps to next level if it is the end of the level, loops to the beginning
of the game loop if not.)

;Hit checks
pla ; load previous value
and #$c0 ; check for hits "11xxxxxx"
cmp #$c0 ; check for key hit
bne _back ;_notkey ; to next check
dec $51 ; dec number of keys left in level
bne _back ; if 0 then we should go to the next level
jmp init ; gen maze again
;_notkey cmp #$80 ; check for death hit!
; bne _back
; jmp gameover ; game over
_back jmp gameloop
;more checks here?

keytodir=*
.byte 2,1,4,3


To make this all work, I reused the "cangrow" routine again! Because
basically it is the exact same logic. One thing that I found unfortunate is I
never coded a way for the minitaurs to have any intelligence, there was an
attempt that worked - but it was FAR too many bytes (over by like 20-30). So
I just axed it. I also at one time made the movement routines all one
subroutine that took parameters that would move any "entity". This also took
too many bytes to be useful for this version of the game. Perhaps some day
I'll write a kick butt 1k or 2k version with all those features coded in.

Various Space Saving Techniques Used

If you look throughout the code, you'll notice I comment a lot on
"injections". These are little bits of code that are conveniently put in
places to take advantage of the state of a register, or a memory location.
Here is an example: in the "getrand" routine I injected a line to set the
key repeat, using the value that I wanted to also put into $d412 for the
random number generation.

lda #$80
sta $028a; Ket Repeat; (injected here for #$80) just using #$80 for smaller code
sta $d412 ;sta $d404 d412 is V3, d404 is V1!!


Another example, here I mixed some static data together to save some bytes and
just labeled it so some would be reused. To save additional bytes I also
positioned the label in front of the "rts". (This was because the code
accessing the data never used an index of 0.) This is mixed with the rts
because the first byte would be wasted

growfrom = *
rts
.byte 1
; this is part growfrom part growto 48 is used for both
; again first byte would be wasted so we over lap with the previous
growinto = *
.byte 2, 4, 8, 1, 2

;explanation of above
; 0 1 2 4 8
; 0 4 8 1 2
;rts 1 2 4 8
; 4 8 1 2

I also reused the "cangrow" routine as much as possible, and on top of
that I made it more usable by making multiple entry points so it could be used
in different way.

Finally the best thing I did was use kernal routines, like $E544 to clear
the screen to the set char color at $0286. This allowed me to change the
color of the maze each level just by changing the value at $0286.
I also used $e9f0, Mapping the c64 says, "Set Pointer to Screen Address of
Start of Line. This subroutine puts the address of the first byte of the
screen line designated by the .X register into locations 209-210 ($D1-$D2)."
This was handy because I could use this to read from and write to the screen.

Extended Color Background Mode
I could not use sprites. This made me sad, but there was just no room for
it. So I just decided to allow the cset to repeat itself through the whole
cset area ($2000-$2800) and use ECBM! This allowed me to just set a bit and
make a maze piece a different color, what more when I read this value I could
easily tell what the piece represented! That made collision detection very
easy, and also allowed me to have cool flashing keys.

Conclusion
I had a LOT of fun coding this and look forward to another competition. I
was surprised and amazed by the entries! This even inspired me to make a
better version of Tinyrinth, of which I have not completed. I did add to it,
I made a version in which the minitaurs move through the maze. I made another
version that would also increase the number of minitaurs in the maze. I made
yet another version that would change the size of the maze, and thought about
changing the shape too. All of these ideas were easy to implement once I had
the base game working.
Oh by the way, I know how to spell Minotaur, the Minitaur thing is just a
bad joke. :)


Final Code Stats
Total source compiled: 445 lines in 1 file(s)
Symbol table: 29 global and 1 local constants, 0 global variables,
2 global and 15 local labels, 0 source labels.
Data size: 10 bytes
Code size: 475 bytes in 231 instructions
Memory block affected: $1000 - $11E4 (total size: 485 bytes)

Source
Note: This source is not the source I released for the event. It is a bit
smaller, I cannot locate my original source for some reason. Also, some of it
may differ a tiny bit from the code documented above. I had written the
article using the wrong copy of the source, I have attempted to go back and
update it though. (Note to self: keep better records!) :)

Another note: This source compiles with c64asm by Balint Toth. :)

-- tinyrinth.asm --
; Tinyrinth
; entry for a 512b game contest
; Burning Horizon/FTA
; Mark Seelye
; mseelye@yahoo.com
; ===================================================================

* = $1000

;name loc desc color ecbm bits
EBCM0 = $d021 ; untouched black $00 00
EBCM1 = $d022 ; cursor red $40 01
EBCM2 = $d023 ; touched green $80 10
EBCM3 = $d024 ; keys yellow $c0 11

;ZPs used: (Consolidation Possible if needed)
;43/44 - Not Used
;45/46 - Current X/Y Position, Maze Generation & Game
;47/48 - CSet Location, CSet Generation
;49 - Temp Storage, CSet Generation
;49/4a - Xpos/Ypos Findgrow, Maze Generation
;4b - Flag, Findgrow, Maze Generation
;4c - Temp Storage, randdir, Maze Generation
;4d - Temp Storage, grow, Maze Generation
;4e/4f - Temp Storage, glmove, Game
;50 - Temp Storage, loadcell, Maze Generation & Game
;51 - NumKeys Left in level (affected by: destroyed & found keys)
;52 - not used
;53 - Current Level
;54 - # of Keys to try and place, gameinit, Game
;55/56 - X/Y Pos of Death
;57/58 - speed counter for death

;Collect all keys
; gen crummb path for visited areas?
; once all keys collected next maze is gen'd

; Initialization
setup = *
gameover = *
;Setup game variables
ldx #$00
stx $53 ; Start at level 1 (to be inc'd)
init = *
;Setup level
inc $53 ; next level
ldx $53
stx $54 ;Counter for drawing Keys (next level)
stx $0286 ;Character Color to use on clear (e544)

;Set Render Cursor Start Pos / Player Color
lda #$05
sta EBCM1 ;Set ebcm color PLAYER to GREEN ($d022)
sta $45 ; Cursor/Player Position X (0-39)
sta $46 ; Cursor/Player Position Y (0-24)
;Clear Screen
jsr $E544 ;clear screen set to char color ($0286)

lda #$5b
sta $d011 ;turn on EBCM
lda #$18
sta $d018 ;Activate Cset

; Generate Cset!
lda #$20 ; write hi
sta $48 ; use zp
lda #$00 ; write lo

;Initialize Screen, variables (injected here to save bytes - using lda #$00)
sta $51 ; Clear actual num keys placed counter (see findgrow)
sta $d020
sta EBCM0; Set BG Color ($d021)
;(end of injection)

tax ; counter = 0
_again sta $47 ; use zp
ldy #$00 ; index
txa ; counter to a
and #$01 ; check for top
beq _ytop ; yes top
eor #%01111111 ; 00000001 -> 011111110 -> 10000001
_ytop eor #%11111111 ; 00000000 -> 111111111
_6sides sta ($47),y ; store top/sides to cset
iny ; next mem location
txa ; counter to a
and #$02 ; check for right
eor #%00000010 ; flip
lsr ; 00000010 -> 00000001 || 0->0
sta $49 ; store for right side
txa ; counter to a
and #$08 ; check for left side
bne _noleft ; no left
eor #%10001000 ; 00000000 -> 10001000 -> 10000000
_noleft eor #%00001000 ; 00001000 -> 00000000
ora $49 ; merge with right
cpy #$07 ; 7->15->23->...
bne _6sides ; total of 6 side pieces
txa ; counter to a
and #$04 ; check for bottom
beq _ybot ; no bottom
eor #%01111010 ; 00000100 -> 01111110 -> 10000001
_ybot eor #%11111111 ; 00000000 -> 11111111
sta ($47),y ; store bottom to cset
inx ; next counter
clc ; clear carry
lda $47 ; inc zp
adc #$08 ; by 8
bne _again ; do it again
inc $48
ldy $48
cpy #$28 ;repeat through cset 2000-2800
bne _again

sty EBCM2 ;Set Death color ($d023) (using result of cset gen for color value!)


popgridloop = *
;can grow from current?
ldy $45 ;xpos
ldx $46 ;ypos
;Can grow in any direction?
cangrowxy = *
jsr cgup ;check up
beq _cgxy ;if 0 then we can grow
inx ;offset up check
jsr cgright ;check right
beq _cgxy ;if 0 then we can grow
dey ;offset right check
jsr cgdown ;check down
beq _cgxy ;if 0 then we can grow
dex ;offset down check
jsr cgleft ;check left
_cgxy beq growloop ;if 0 then we can grow
; *** fall into findgrow
findgrow = *
lda $4b ; Check byte 0 != resume findgrow
bne _fgresume
sta $49 ;Reset Findgrow Xpos
sta $4a ;Reset Findgrow Ypos
inc $4b ;Set findgrow flag to resume (<>0)

;Place keys in corners (injected here for ease of placement, d1/d2 is pointed at a dead end)
iny ; offset left check
beq _fgresume ;Do not try when column is 0, it freaks out
lda $54
beq _fgresume ;if 0 then keys are done
dec $54 ;dec # of keys left to place
inc $51 ;actual num keys left
lda ($d1),y ;load byte
ora #$c0 ;EBCM value for key!
sta ($d1),y ;store new value
;(end of injection)

_fgresume = *
_fgx ldx $4a ;Findgrow ypos
_fgy ldy $49 ;Findgrow xpos
inc $49 ;Next xpos (next round)
cpy #$28 ; < 40
beq _fgcr ; next line if >= 40
jsr cgloadcell ; load cell byte
beq _fgy ; if 0 then get next xpos/byte
sty $45 ;Set Current xpos
stx $46 ;Set Current ypos
jmp cangrowxy ;Check if this can grow
_fgcr lda #$00 ;Reset Findgrow xpos
sta $49 ; 0->xpos
inc $4a ;Next Findgrow ypos
lda $4a
cmp #$19 ;check ypos (<25)
bne _fgx ;If we're at x40y25 we are ready to play!
beq gameinit ;Start game logic


growloop = *
randdir = *
;jsr getrand; not a func, not reused yet
getrand = *
lda #$80
sta $028a; Ket Repeat; (injected here for #$80) just using #$80 for smaller code
sta $d412 ;sta $d404 d412 is V3, d404 is V1!!
sta $d40f ;set v3 random # gen bits
lda $d41b ; read random number from V3
and #$03 ; Force Random number to 0-3
clc
adc #$01 ; Add 1 to get 1-4
sta $4c ; store rand direction
ldy $45 ; Current Xpos
ldx $46 ; Current Ypos
jsr cangrow ; Check if we can grow in that direction
bne randdir ; if <> 0 then Try again
sta $4b ; reset findgrow flag (injected here for .a==0)
grow = *
ldx $4c ;Get saved rand direction
lda growinto,x ; 1-4 (4, 8, 1, 2) Get bit set for new cell
sta ($d1),y ; write new location
lda growfrom,x ; 1-4 (1, 2, 4, 8) Get bit to set for old
sta $4d ; Save growfrom bit
ldy $45 ; Reload Current xpos
ldx $46 ; Reload Current ypos
jsr cgloadcell ; Load base cell again
ora $4d ; Combine with growfrom bit
sta ($d1),y ;Modify old cell to connect to new loc
;Change current position
lda $4c ; Get saved rand direction
jsr cangrow ; Get new x y again - (this will only perform next x/y adj, returns <>0)
sty $45 ; xpos set to new location
stx $46 ; ypos
jmp popgridloop ; Return to populate grid loop


; Game Initialization and Game Loop
gameinit = *
gameloop=*
inc EBCM3 ; Flash Keys ($d024)
inc $58 ; Increase Speed counter #1 (0-255)
bne moveplayer ; Skip move
inc $57 ; Increase Speed counter #2 ($57|#$f0 - 255)
bne moveplayer ; Skip Move

;set death speed
lda $53 ;Use level for Speed value
cmp #%11111000 ;If more than this use default speed
bmi _dsp
lda #%11111000 ;Default speed
_dsp ora #%11110000 ;Set high nibble so counter counts up to 255
sta $57 ; Set Speed counter #2


;move death
movedeath = *
ldy $55 ;Baddy Xpos
ldx $56 ;Baddy Ypos
jsr cgloadcell ; load the cell/point the zps (ANDs by #$1f)
sta ($d1),y ;store cleared value
_newy iny ;increase xpos
cpy #$28 ;less than 40?
bmi _go ;don't reset
ldx $46 ;ypos of player
stx $56 ;ypos of death
ldy #$00 ;clear xpos counter
_go sty $55 ;Set baddy xpos
lda #$ff ;Get all bits! (see loadcell)
jsr loadcell ;load the cell/point the zps
sta $59 ;Save cell value (withh all possible bits)
and #%11000000 ;and by EBCM bits
cmp #%11000000 ;Check for KEY - (so it can skip over)
beq _newy ;Jump ahead 1 more to skip key position
cmp #%01000000 ;Check for player hit
bne _nodie ;Player is not dead
jmp gameover ;Game Over!
_nodie lda $59 ;Reload stored value
ora #$80 ;EBCM for Death
sta ($d1),y ;store value
; *** fall through to Move Player

;Move Player
moveplayer=*
_ffe4 jsr $ffe4 ;Get keypress
beq gameloop ;no key - goto gameloop
and #%00000011 ;.a == 0-7 at this point
tax
lda keytodir,x ;Loads from keytodir
;Move entity in game
; .a=direction 1-up 2-right 3-down 4-left
glmove tax
stx $4e ; store direction
lda growinto,x ; get check bit
sta $4f ; store check bit
ldy $45 ; current xpos
ldx $46 ; current ypos
jsr cgloadcell ; load the cell (and with #$1f)
sta ($d1),y ; store the data (clear the EBCM)
lda #$00 ; Bottom "fall out" fix
sta $50 ; clear and of cangrow Bottom "fall out" fix
lda $4e ; load direction
jsr cangrow ; call cangrow to move xpos/ypos
and $4f ; check bit
bne glmyes ; if we have a bit then we can move!
ldy $45 ; reload xpos - do not move
ldx $46 ; reload ypos - do not move
glmyes lda #$ff ; bits to obtain from loadcell
jsr loadcell ; load the cell/point the zps
pha ; temp store value for later checks
and #$1f ; clear other EBCM bits
ora #$40 ; EBCM ORA Player/Baddy
sta ($d1),y ; store new data
sty $45 ; store xpos of new position
stx $46 ; store ypos of new position

;Hit checks
pla ; load previous value
and #$c0 ; check for hits "11xxxxxx"
cmp #$c0 ; check for key hit
bne _back ;_notkey ; to next check
dec $51 ; dec number of keys left in level
bne _back ; if 0 then we should go to the next level
jmp init ; gen maze again
;_notkey cmp #$80 ; check for death hit!
; bne _back
; jmp gameover ; game over
_back jmp gameloop
;more checks here?

keytodir=*
.byte 2,1,4,3


;Check if a cell can grow a direction
;1-up 2-right 3-down 4-left
; (y xpos, x ypos, a=dir) x/y switched for indirect lda($xx),y below
; return: a == 0 : true (can move)
; a <> 0 : false (can not move)
cangrow = *
cmp #$01
beq cgup
cmp #$02
beq cgright
cmp #$03
beq cgdown
;cmp #$04
;beq cgleft *** not needed falling through
cgleft = *
dey ;set xpos - 1
cpy #$ff ;check xpos
beq cgno
bne cgloadcell
cgright = *
iny ;set xpos + 1
cpy #$28 ;check xpos (<40)
beq cgno
bne cgloadcell
cgup = *
dex ;set ypos - 1
cpx #$ff ;check xpos
beq cgno
bne cgloadcell
cgdown = *
inx ;set ypos + 1
cpx #$19 ;check ypos (<25)
beq cgno
;*** fallthrough, bne cgloadcell not needed
cgloadcell = *
lda #$1f
loadcell = * ;x = ypos, y = xpos, a = and value
sta $50
jsr $e9f0 ; sets $d1 $d2
;59888 $E9F0
;Set Pointer to Screen Address of Start of Line
;This subroutine puts the address of the first byte of the screen line
;designated by the .X register into locations 209-210 ($D1-$D2).
lda ($d1),y ;load byte (x pos in y reg!)
cgno = *
and $50 ;#$1f = use only low 5 bits!
;rts see below!
; This is mixed with the rts because the first byte would
; be wasted
growfrom = *
rts
.byte 1
; this is part growfrom part growto 48 is used for both
; again first byte would be wasted so we over lap with the previous
growinto = *
.byte 2, 4, 8, 1, 2

;explanation of above
; 0 1 2 4 8
; 0 4 8 1 2
;rts 1 2 4 8
; 4 8 1 2



;Notes

; 1
;8 2
; 4
;
;
;*** * * *** * *
;*0* *1* *2 *3 @ A B C
;*** *** *** ***
;
;*** * * *** * *
;*4* *5* *6 *7 D E F G
;* * * * * * * *
;
;*** * * *** * *
; 8* 9* a b H I J K
;*** *** *** ***
;
;*** * * *** * *
; c* d* e f L M N O
;* * * * * * * *

-- end: tinyrinth.asm --
.......
....
..
. C=H 20

=====================================Part 6===================================


Tetrattack!

by Stephen L. Judd <sjudd@ffd2.com>

Introduction
------------

The main point of Tetrattack!, of course, was to try and write a 3D engine
in 512 bytes. I had a few other routine ideas but couldn't think of anything
remotely fun to do with them, but I finally thought of a neat 3D game idea
and decided to go for it. Alas, as code kept getting bigger that idea had
to be discarded, and the eventual result was tetrattack, the simplest/smallest
3D game that wasn't (totally) stupid.

But it is a 3D engine in 512 bytes!

At first blush a 512-byte 3D engine might seem ridiculous but think about it:
a basic 3d program doesn't need to store any graphics definitions or things
like that -- it draws its own as it goes. In fact, all that is needed is a
line routine, a rotation/projection routine, and some routines to keep track
of angles and positions and such. What could possibly be hard about that,
right?

Well, it seemed reasonable at the time, anyways. At exactly 512 bytes, the
code really had to sweat out the bytes, and every single byte saved was
important; meanwhile, the routines still had to be reasonably fast. So this
article will discuss the line routine, the rotation/projection routine, and
the object and game manipulation routines, and the various optimizations used.

I must say that writing this program really was fun; I'm sure that everyone
else had the same problems I had, trying to shift from a "cycle-optimization"
mindset to a "byte-optimization" mindset. Finding those extra few bytes
here and there was a mighty challenge, sometimes taking hours or even days
just to save one or two bytes. That part was frustrating, of course, as
is the fact that the 'game' aspect of the program is thoroughly lame. But
it was a neat challenge, and there was never an issue of adding too many
features to the program and never finishing it!


Quick Summary
-------------

A full-on 3D program uses matrix multiplications to compute rotations, but
3D calculations get a whole lot easier in a Battlezone or driving-type
situation, when you only turn left or right. There are no longer rotation
matrices to keep track of -- just a single angle. This means that rotations
are done with a simple INC/DEC, and there are no matrix multiplies to deal
with. Rotations are done with some simple table lookups.

The line drawing routine is pretty standard. Self-modifying code is
used to manipulate a single routine to draw different lines, by swapping
INX/INY instructions (which turns out to be less efficient, memory-wise --
sigh, oh well). Lines are plotted to a double-buffered 128x128 charmap,
and once again setting up the screen took extra bytes.

The "game code" is the simplest possible -- tetrahedrons are moved by
just incrementing their z-coordinate, and rotated by incrementing their
angle. Their locations are determined by whatever zero page is initialized
to -- the program never sets explicit locations. Instead, I just chose a
portion of zero page that had a reasonable range of default values (default
zero-page values are used extensively in the code)

In addition to default zero-page values the code uses several kernal
routines (including a few unusual ones, for example the shift-C= routine
for swapping charset buffers), self-modifying code, lots of zero-page
instructions, and a whole bunch of tightly optimized code.

The code was written over a few weeks -- a good week of planning and writing
out possible routines, and a good week or two of coding.

A couple of times I froze the program (using AR) and the freeze almost always
happened in the same place: the buffer clearing routines. So the rotation
and line drawing routines turned out to be fast enough, even with all the
byte savings.

The code was developed using the Sirius assembler on a 128D+SCPU -- and btw,
the Jammon debugging mode has turned out to be awfully cool, with being able
to visually single-step through code while simultaneously viewing memory and
registers. Action Replay was downright clunky afterwards (but could freeze!).

Yep, just a plug, with a bit of amazement that it actually all worked.

I also tested it with VICE, to make sure it worked OK. VICE initializes
zero page to different values than my 128, unfortunately, so some tweaking
had to be done.

And that's about it. Now on to more detail.


----------
Setting up
----------

The setup routines initialize VIC, initialize tables used by the line and
rotation routines, and set up the screen. I note that when I started
planning this program I computed the sizes of line and rotation routines
without considering the code needed to set up tables and such for those
routines. Oops!

The default initial values for .X, .Y, and .A are 0 after a SYS, so the
code made a little use of this.

Line routine setup
------------------

The line routine uses tables to look up pixel addresses and bit locations
for each x,y coordinate (the line routine stores the x and y coordinates
directly in the .X and .Y registers, so looking up pixel addresses amounts
to LDA Bitp,X kinds of instructions). The "bitmap" is a 128x128 charmap,
so the calculations are pretty easy. The bit patterns which correspond to
different x-coordinates (10000000 01000000 etc.) are computed using

cmp #$80
rol

to cyclically rotate a single bit.

Rotation routine setup
----------------------

Another little piece of code is piggybacked into this loop, which extends a
sine table from 0..pi. In this program, "angles" can go from 0..63,
corresponding to 0..2*pi; the most memory-efficient way to store the sine
table is to store the values for 0..pi/2, and then extend the table to
pi/2..pi. That is, angle=16 (angles go from 0..63, remember) corresponds to
pi/2, so the job of extending the table amounts to copying table values 15..0
to values 17..32 (value 16 is not copied because it is pi/2 -- the values
are 'mirrored' through pi/2. To put it another way: where would it be copied
to?).

This amounts to starting one index register at 15 and decrementing it, while
starting the other register at 17 and incrementing:

ldy #15
ldx #17
lda table,y
sta table,x
inx
dey

The problem here is piggybacking into a loop where .Y is initialized to
ldy #$7F instead of ldy #15. The solution is to initialize .X carefully

[ldy #$7f]
ldx #$90

and use the following:

lda sin0,y
sta sin0+17,x
inx
[dey]

The idea is that when .Y=15, .X will equal 0 and start incrementing. This
copies a lot of junk into the table that is never used but doesn't overwrite
the existing 0..16 table values, and saves several bytes.

The rotation tables are tables of r*sin(theta), one table for every allowed
value of theta=0..64 and r=-128..127. The idea is that every 256-byte page
contains r*sin(theta) for one value of theta, so that a table lookup amounts
to

lda theta
ora #>high byte of sin tables
sta zp+1
ldy r
lda (zp),y

The table values are computed using a standard shift-and-add multiply routine,
with a little extra code to handle negative values of r. What about
negative values of sin(theta), though?

Recall that theta=0..63 corresponds to 0..2*pi, but the init code only
extends the sine table from 0..pi (i.e. theta=0..32). The table values
for pi..2*pi are computed simply by taking the negative of the values
for 0..pi:

sta (p4000),y
eor #$ff
clc
adc #1
STA (p6000),y

(The clc adc #1 code was painful, byte-wise, but necessary). The pointers
p4000 and p6000 are initialized to $4000 and $6000 by default -- or at
least are supposed to be. Unfortunately, VICE and my 128D had different
ideas about this, so a little extra code was necessary to initialize them.

A big reason for putting a table at $6000 is that it ends at $8000, to
provide an easy exit condition for the loop:

inc p6000+1
bpl :mult8 ;to $8000

Screen setup
------------

A couple of things need to be done here: clear the screen, set the color RAM
to the background color, and put a 16x16 block of characters in the middle
of the screen (and do so in as few bytes as possible!).

Clearing the screen is a piece of cake:

sta $d021
sta $0286 ;clear color
jsr $e544 ;clr scr

The $0286 STA is to clear the color RAM to black, and work with different
kernals. (With both color RAM and background the same, pixels won't
appear where they aren't wanted).

Storing a block of chars isn't quite so straightforward; rather, the
straightforward methods use up a fair number of bytes, especially since both
screen and color RAM have to be set up. Here's the method I finally
came up with:

:l2a
jsr $e8ea ;scroll up
;.A = ($AC)
;.X=00
clc ;necessary?
:l2 ;set one line
sta $0720+12,x
inc $db20+12,x
inx
adc #16
bcc :l2
; adc #$00
inc $ac

cmp #15
bne :l2a

Kernal routine $e8ea is the routine to scroll the entire screen up one line,
and disassembly shows that when it exits .A is equal to the contents of $AC
and .X is 0. So the routine stores one line of chars to the middle of the
screen, increments the corresponding color RAM (to white), and moves the
screen up (chars and color RAM). By incrementing $AC before moving the
screen up ($AC starts at zero normally), .A is re-initialized to its previous
value+1, so this prints the next line of chars to the screen (that is, the
previous line of chars + 1). By doing this 16 times a 16x16 block of chars
is produced, with color RAM set to white inside the block, and set to black
everywhere else.

And that brings us to the main program loop.


---------
Main Loop
---------

The main loop is pretty straightforward: clear buffer, get input, update
positions and angles, render the buffer, and swap buffers.

The only thing special in this process is swapping buffers via the kernal
routine at $eb59 -- this is the routine that is called when C= and shift
are held down.

There is also a lame dot that pulses to the screen when you hold the fire
button down (how cool some laser lines would have been). Alas...

But that's pretty much it. So now let's talk a little about some of the
optimizations that can be made for a 3D program, followed by a discussion of
the line routine and the rotation routine.

--------
3D Stuff
--------

For detailed information on how to "do" 3D worlds I suggest reading issue #16
of C=Hacking, but the basics of rendering a 3D world are:

1) figuring out where an object is, relative to you,
2) if the object is visible, computing what it looks like from your
perspective, and
3) rendering the object.

To define and operate on an "object", we need to know the vertices of the
object, and how to connect them together with lines, i.e. to draw the edges.
The simplest object of all is a tetrahedron -- four points, and every point
is connected to every other point. This turns out to be pretty important,
as a lot of memory is saved by using only tetrahedrons. To see why, first
consider what the code to draw edges from a list of vertices must look like:

rotate points

ldx #number_of_edges
:loop stx temp
ldy edge_vertex1,x ;index into some list of vertices
lda edge_vertex2,x
tax
jsr drawline
ldx temp
dex
bpl :loop

Now consider a cube; in my original code idea, I was going to use cubes
and tetrahedrons (pyramids). A cube has eight vertices and 12 edges.
Each vertex has an x, y, and z coordinate, and from the code above each edge
needs two bytes, one for each vertex of the edge. 48 bytes, just to store a
cube -- almost 1/10 of the entire 512-byte code! (Using code to generate the
vertices is possible, but still takes a fair number of bytes.)

Compare this to the humble tetrahedron, which only requires twelve bytes:
three for each vertex. What about the edges? Remember that every point
is connected to every other point in a tetrahedron. This means that we do
not have to store the edge connections, but can compute them manually, and
draw the whole thing using the following code:

[count = 3]
; Plot
:ploop ldy count
:l2 ldx count
dey
tya
pha
jsr DrawLine
pla
tay
bne :l2
dec count
bne :ploop

This code simply draws lines between all vertices: 3-2, 3-1, 3-0, and so on.
It uses pretty much the same number of bytes as the earlier code, but
doesn't require any storage of the edge connections (12 bytes to store
the connections).

Next consider the vertices of a tetrahedron. It turns out that by carefully
selecting the vertices, default zero-page values can be used, something
that isn't possible with e.g. a cube. Consider the following tetrahedron
vertices:

Px = 0, -4, 16, -4
Py = 16, 0, 8, 0
Pz = 0, 16, 0, -16

(just the vertices of a more or less regular tetrahedron which I drew on
a piece of paper). The trick is to arrange these coordinates in the right
way.

Location $ae/$af is used by the load routine (among others) to store to
memory while loading; at the end of a load, this location contains the
end address+1. And, it is surrounded by zeros:

>C:00ab 00 00 00 00 00 00 00 3c 03 00 00 00 00 00 00 00 .......<........

which means that by engineering the end address right we can store two
coordinates here. In this case, I chose the end address+1 to be $10f0, which
stores $f0 and $10 in $ae/$af -- -16 and 16, two of the Pz values, the other
two being zero.

This was my original plan anyways, until I re-discovered that the screen
setup routine altered the value of $ac (the kernal routine to move the screen
up exits with .A=$ac, so incrementing it re-initializes .A in every loop
iteration.) On the other hand, it increments it 16 times, so that after the
load and the init the zp values are

>C:00ab 00 10 00 f0 10 ..

The values 00 10 00 f0 are exactly the Pz values above: 0, 16, 0, -16.
Careful study of default zero-page values then shows

>C:0040 00 00 08 00 00 00 00 24 00 00 00 00 00 00 00 00 .......$........

This is pretty much Py, except for the first value -- so all the code has
to do is store a 16 at location $40, and it turns out the screen init loop
ends with .X=$10. So, one STX does the job, saving yet more bytes.
Moreover, by locating the vertices in zero page, zero-page instructions
can be used to access the point lists, saving even more bytes.

Now, the above are the vertices, but what about the _centers_, i.e. the
object locations? Once again, to save space I just used some default
zero-page values, with the x-coordinates at $14 and the z-coordinates at $7d.
The reason these were chosen is pretty straightforwards. Since objects
just move 'downstream' with increasing z-coordinate, the z-coords should
be somewhat spread out; the values at $7d are:

>C:007d 3a b0 0a c9 20 f0 ef 38 e9 30 38 e9 d0 60 80 4f

which are somewhat spread-out. The x-coords, on the other hand, need to
be clustered around 0. The camera (you) is located at x=0, and if objects
have a large value for the center x-coordinate they are way off to the side
(you've probably noticed a few objects like this). Location $14 has some
zeros and some nonzero numbers that are near zero, i.e. that cluster around
zero:

>C:0014 00 00 19 16 00 0a 76 a3 00 00 00 00 00 00 76 a3

Not great values, but good enough -- beggars can't be choosers.

Of course, poking different values into locations 20 and above will put the
tetrahedrons in different places.


-----------------------
Rotation and Projection
-----------------------

Rotations are done very simply, using tables. As outlined earlier, there
is a complete set of tables of r*sin(theta). At first blush, the 'natural'
way of setting up these tables would be to have a table of 1*sin(theta) in
one page, a table of 2*sin(theta) in the next page, and so on, maybe fitting
a table of cos(theta) in the same page. The location of r*cos(theta) would
then be given by

page = offset+256*r (e.g. $40+r for tables starting at $4000)
value = page+theta

It makes a lot more sense, though, to instead let _theta_ be the page index,
and let each page contain r*sin(theta) for different values of _r_; the lookup
is then

page = offset+256*theta (e.g. $5300 for theta=$13)
value = page+r

For one thing, the pointer setup is very easy -- there's only one value
of theta to set up, instead of different values of r (for different x,y
coordinates). For another, the entire range r=-128..127 can be used,
which means the same tables can be used to rotate object _centers_ in
addition to object _vertices_.

(If you recall programs like lib3d, the object centers are 16-bit signed
values and have their own rotation routine, whereas the object vertices
are limited to -96..95 and have a separate, optimized rotation routine; in
this case, both the vertices and centers are eight-bit signed numbers.)

This all means that rotations (e.g. y*cos(theta) - x*sin(theta)) amount to

LDY Py,X
LDA (cos),Y
LDY Px,X
SEC
SBC (sin),Y

to rotate a point.

Now, the usual 3D world calculations look like:

rotate object centers
rotate vertices, add to rotated center, and project

Consider rotating the z-coordinate; the code will look like

LDA (cos),Y
LDY Px,X
SEC
SBC (sin),Y
CLC
RotM1 ADC CZ

that is: rotate vertex (z*cos(theta) - x*sin(theta)) and add to center
(ADC CZ). Here's the great part: if we instead want to rotate the _center_
of the object, we want to _store_ the rotated point in CZ; the code would
look like

LDA (cos),Y
LDY Px,X
SEC
SBC (sin),Y
STA CZ

This is exactly the same code as above, except the ADC is changed to a STA.
Therefore only one routine is needed, using self-modifying code to decide
whether to STA the center coordinate or ADC it. In the code, the appropriate
instruction is passed into the routine in the .Y register.

There's one other thing to be mentioned about the rotation routine. The
rotation tables are of r*sin(theta), and r*cos(theta) is of course done
by adding pi/2 to the theta-coordinate (theta+16 in the program). There
are two ways to do this -- add 16 to the theta coordinate and correct
any overflow (AND #63, for example), or extend the tables by an extra pi/2.

Byte-wise, the first method is the best, but check this out: instead of
adding pi/2, we can also subtract 3*pi/2. The setup code then becomes:

RotProj
AND #$3F
ORA #>SINTAB
STA sin+1
SBC #$2F ;-3*pi/4
; ADC #$10
; AND #$3F
ORA #>SINTAB
STA cos+1

The trick works here because of the ORA #>SINTAB, which is $4000 -- this
allows negative angles to work, by borrowing higher bits. Consider:

.A ORA #>SINTAB SBC #$2F ORA #>SINTAB
-- ------------ -------- ------------
$12 $52 - sin tab at $5200 $22 $62 - cos tab at $6200
$22 $62 - sin tab at $6200 $32 $72 - cos tab at $7200
$32 $72 - sin tab at $7200 $42 $42 - cos tab at $4200

(C is clear in the SBC above). By using an SBC, angles automatically
wrap around from $8000 to $4000. This saves two bytes over the ADC #$10
AND #$3F method (commented out above). Every byte counts!!!

Projections
-----------

The easiest way to do projections, once again, is via tables of f(x,z)=x/z,
much like the tables of f(r,theta) = r*sin(theta). Alas, the setup code
to generate these tables took up too much space, so the projects are
computed 'manually', using a shift-and-add division routine.

Now, it turns out that the routine really computes 64*x/z -- the 64 is the
magnification factor. The nice thing about a shift-and-add routine is that
multiplying by 64 is possible just by changing the number of shifts. So
the routine adds an extra 6 shifts into the division loop -- a little
inefficient cycle-wise, but pretty thrifty byte-wise.


-------------
Drawing lines
-------------

Obviously, to render an object we need a line drawing routine. The line
routine in the program is pretty standard, and described in several earlier
issues of C=Hacking, so I won't go into too much detail. Most of the
paragraphs below are just random tidbits, so feel free to skip any (there
won't be a test).

The simplest and most efficient routines draw to a charmap -- you can read
about this in detail in C=Hacking #8 and beyond -- since the .Y register maps
directly to the y-coordinate, and the column-offsets are easy to compute.
The biggest charmap available is 128x128, so that's what I used.

Overall, the line routine is pretty small, while being reasonably fast.
The plot routine is 'dumb', and recalculated for every pixel, but the
recalculation itself is fairly efficient, with a few table lookups. It
draws into a 128x128 buffer and can handle off-screen coordinates, and
the buffer can be anywhere (that it, it supports a double-buffer display).
So, I think it turned out OK, all things considered.

With that background, here's the main line drawing loop:

plot
pha
tya
bmi calcline
lda bitphi,x
ora base
sta point+1
lda bitplo,x
sta point
lda bitp,x
beq calcline
ora (point),y
sta (point),y
calcline
pla
mod2 sbc #dx
bcs mod5
mod3 adc #dy
mod4 inx
mod5 iny
dec temp
bne plot


The basic line routine iteration goes like

plot x,y
y=y+1
a = a - dx
if a<0 then a=a+dy, x=x+1

for slope>1. The code for slope<1 is exactly the same, except for
swapping x and y, and dx and dy. Therefore, I just used one routine and
self-modifying code to set INX and INY etc. in the right places (mod2, mod3,
mod4, and mod5 above). In retrospect this was not smart, and I think it
would have saved 5 or 6 bytes to use two routines, and jsr to the plot
routine -- the routines to set up the self-modifying lines take more space
than the basic iteration.

The x and y coordinates are stored directly in the .x and .y registers.
This makes the plotting really easy and relatively fast, and works well
with the 128x128 charmap.

Off-screen points are a must, so the allowed coordinates are -64..192,
giving 64 pixels to either side of the 128x128 visible area. Checking for
these is easy, since the high bit will be set for coords <0 and >127.
These are checked in the plot routine. The y-coordinate is checked directly,
using tya; the x-coordinate is checked for range by use of the bitp table --
this table contains the bit pattern for the x-coordinate (%1000000 %01000000
etc.), and all of the values for x>127 are set to zero. Once again, in
retrospect it would have been more memory-efficient to do this with txa (two
more bytes), but darn it all, SOMETHING has to be at least a little bit cycle
efficient here! Old habits die hard!


The line routine always draws from left to right, which means the
initialization routine calculates dx = x2-x1, and if dx<0 then it swaps the
two endpoints. This creates a problem when coordinates are -64..192 though;
after subtracting x2 and x1 do you check the negative bit, or the carry bit?
Cases like x1=-1, x2=1 should work, so the carry bit is useless; on the other
hand, long lines, say x1=1, x2=130, should also work, so the high bit is also
useless. The solution I used was to offset coordinates by +64 (and once
again, in retrospect...).

The point coordinates are stored in two lists, one of x-coords and one of
y-coords. This means that the endpoints are simply indexes into those
point lists, i.e.:

xlist contains list of x-coords; ylist list of y-coords
xlist,x and ylist,x contain left endpoint
xlist,y and ylist,y contain right endpoint

That simplified a lot of things -- the lists can be in zero-page, swapping
coordinates is very simple, and so on.

The routine is double-buffered -- flickery single-buffers are just too
painful. Since $2000 and $2800 are the locations of the two charset buffers,
the current buffer is stored in zero-page variable $ce ("base" in the
plot code above), which starts out initialized to #$20 (location $81 would
also work) -- one more variable to not initialize.

So, in summary: a pretty standard line routine, with a few memory optimizations
thrown in.


-------
Wrap-up
-------

Well, that's about it. As you can see, the code really runs on the edge --
zero-page has to be initialized correctly, little bits and pieces are missing
from certain routines, things have to line up just right... but it works!
There are undoubtably other places where bytes can be saved, of course,
but I think it does a pretty good job of shaving bytes overall.

See you at next year's contest!

*
* tetrattack!
*
* Mini-3D rendering package
*
* - Motion in the x-z plane => rotations about one axis only
* - Plots to charmap
*
* SLJ 9/01
*

; org $1000
org $0ef0

NUMOBJS = 9

temp = $02
dx = $03
dy = $04
point = $05

AUX = $03
ACC = $04

CX = $07
CZ = $08
rottemp = $09 ;0 initially

zpoint = $0a ;low byte already set to zero
zpoint2 = $0c

count = $0e
angle = $0f

cenx = $14

Py = $40

XCoord = $50
YCoord = $60

theta = $70

cenz = $7d

; $ac taken by screen setup routine
; final value = $10

Pz = $ab


base = $ce ;$20 initially
;alternate: ($81)

p4000 = $8f ;$4000 initially
p6000 = $89 ;$60xx initially
p8000 = $9c ;$8000

tagged = $e0

sin = $f7 ;0 low byte
cos = $f9

curobj = $ff


bitp = $1c00
bitplo = $1d00
bitphi = $1e00
bitphi2 = $1f00
SINTAB = $4000
PROJTAB = $8000


DEY = $88
INY = $c8
INX = $e8
ADC = $65
STA = $85

start
sta $8f ;set up p4000

*
* Set up VIC
*
sta $d021
sta $0286 ;clear color
jsr $e544 ;clr scr
lda #$18
sta $d018


*
* Line drawing tables
*

ldx #$90
ldy #$7f
lda #$01
:l1 sta bitp,y
cmp #$80
rol

pha
tya
lsr
lsr
lsr
lsr
sta bitphi,y
lda #00
sta bitp+128,y ;0 = dont draw
ror
sta bitplo,y

*
* Extend sin table to 0..31
*
lda sin0,y
sta sin0+17,x
;sta sin0+32,x
inx

pla

dey
bpl :l1
;.y=$ff
;.x=$10

*
* Set up sin tables
*
InitRot

iny
sty p6000 ;rats :(

; ldy #00 ;rats :(
;
; sty temp
;:l3 ldx rottemp
; lda sin0,x
; sta aux ;sin(theta)
;
;:l4
;
; clc
; jsr MULT8 ;.Y * AUX

;
; AUX*.Y -> ACC,.A (low,hi) 16-bit result
; AUX, .Y unaffected
; .Y can be negative
;

:Mult8
sty temp

;dumb multiply
; sty acc
; lda #00
; tay
;:mloop clc
; adc sin0-$10,x
; bcc :skip
; iny
;:skip dec acc
; bne :mloop
; tya

sty acc
lda #00
ldy #9
clc
:mloop
ror
ror acc
bcc :mul2
; clc
;dec sin0 table by 1 instead
adc sin0-$10,x
:mul2 dey
bne :mloop

ldy temp
bpl :pos
sec
sbc sin0-$10,x
:pos

sta (p4000),y
eor #$ff
clc
adc #1

STA (p6000),y
iny
bne :mult8
inx
inc p4000+1
inc p6000+1
bpl :mult8 ;to $8000

;.x=$20
;.y=0

*
* Set up Screen
*


; lda #00 ;bah!
; tya
:l2a
; pha
jsr $e8ea ;scroll up
; pla
;.A = ($AC)
;.X=00

clc ;necessary?
:l2 ;set one line
sta $0720+12,x
inc $db20+12,x
inx
adc #16
bcc :l2
; adc #$00
inc $ac

cmp #15
bne :l2a
:done ;.X=$10

stx $40 ;rats :(
;init Py

*
* Main program loop:
* - get input
* - update positions/angles
* - render
* - swap buffers
*

MainLoop

; Clear buffer and swap

lda base
eor #$08
sta base

sta zpoint+1
ldy #00
tya
ldx #8
:l0 sta (zpoint),y
iny
bne :l0
inc zpoint+1
dex
bne :l0

; Get Input
; .X=0

lda $dc00
lsr
lsr
lsr
bcs :c1
inc theta
:c1 lsr
bcs :c2
dec theta
:c2 and #$01
sta tagged
eor #$01
ora $23c0+6
sta $23c0+6 ;shot fired



; lda $cb
; and #3
; beq :skip
; lsr
; bcc :inc
; dec theta
; lsr
; bcc :skip
; jsr Forwards
;:inc inc theta
;:skip

ObjLoop

; Compute relative center

inx
stx curobj

dec cenz,x ;Update pos

lda cenx,x
; sec
; sbc cenx
sta Px+4
lda cenz,x
; sbc cenz
sta Pz+4


; Rotate
Rot

lda theta,x
adc theta
sta angle

lda theta
ldx #4
ldy #STA
jsr RotProj
lda cz
sbc #14
; bmi :skip
cmp #125
bcs NoDraw
cmp cx
bmi NoDraw
adc cx
bmi NoDraw

; If in view, rotate project & plot

dex
stx count
:loop lda angle
ldy #ADC
jsr RotProj
sta YCoord,X
dex
bpl :loop

; Plot

:ploop ldy count
:l2 ldx count
dey
tya
pha
jsr DrawLine
pla
tay
bne :l2
dec count
bne :ploop
;.y=0
NoDraw
ldx curobj

; Update

lda tagged,x
beq :zip
lda cx
ora tagged
sta tagged,x
dfb $2c
:zip inc theta,x


cpx #NUMOBJS
bcc ObjLoop

jsr $eb59 ;Swap buffer
;.Y=1 .A=$7F

jmp MainLoop


*
* Line routine
* - Draws from L to R
* - Forces dx>0 (x1 < x2)
* - Numbers are +128 offset, to allow
* - DX and DY to be 0..255 for long lines
* - Uses same core routine for stepinx
* and stepiny by changing variables
*
* On input:
* .X = index into point list, point 1
* .Y = index of point 2
*

Pswap
sty temp
txa
tay
ldx temp
DrawLine
lda xcoord,y
sec
sbc xcoord,x
bcc Pswap ;x2 >= x1
sta dx
lda ycoord,y
sbc ycoord,x ;dy = y2-y1
bcs :posy
ldy #DEY
eor #$ff
adc #1
dfb $2c
:posy ldy #INY
sta dy
cmp dx
lda #INX
; bcs stepiny
bcs :noswap
tya
ldy #INX
:noswap sty mod5
sta mod4
stepinx
; sta mod5
; sty mod4 ;INY/DEY
lda dy
ldy dx
bcc mod
stepiny
; sta mod4
; sty mod5 ;iny/dey
lda dx
ldy dy
mod
; sty mod1+1
sty temp
sty mod3+1
sta mod2+1

Draw
lda ycoord,x
sec
sbc #64
tay
lda xcoord,x
sec
sbc #64
tax
;mod1 lda #dy
lda temp
beq done ;no steps!
; sta temp
lsr
plot
pha
tya
bmi calcline
lda bitphi,x
ora base
sta point+1
lda bitplo,x
sta point
lda bitp,x
beq calcline
ora (point),y
sta (point),y
calcline
pla
mod2 sbc #dx
bcs mod5
mod3 adc #dy
mod4 inx
mod5 iny
dec temp
bne plot
done
rts

 

*
* Rotate and project point
*
* On entry:
* .X = index into point list
* .A = rotation angle
* .Y = ADC/STA for points/centers
* CX,CZ = location of object
*
* On exit:
* Points stored in xcoord,x ycoord,x
*
RotProj
AND #$3F
ORA #>SINTAB
STA sin+1
SBC #$2F ;-3*pi/4
; ADC #$10
; AND #$3F
ORA #>SINTAB
STA cos+1
STY RotM1
STY RotM2
RotProj2
LDY Pz,X
LDA (sin),Y
PHA
LDA (cos),Y
LDY Px,X
SEC
SBC (sin),Y
CLC
RotM1 ADC CZ
STA aux

; LDY Px,X
; LDA (cos),Y
; LDY Pz,X
; CLC
; ADC (sin),Y
PLA
CLC
ADC (cos),Y
CLC
RotM2 ADC CX
JSR Div8
STA XCoord,X

LDA Py,X
; JSR Div8
; STA YCoord,X
; RTS

*
* Signed division routine
*
* 64*.A/AUX -> .A
* Only valid for .Y/AUX < 1
* .A pos or neg, AUX > 0
* AUX, .X unaffected
*

DIV8
; STY ACC
; tya
php
bpl :pos ;.A just loaded
eor #$ff
:pos sta acc
LDA #0
LDY #14
:DLOOP ASL ACC

ROL
CMP AUX
BCC :DIV2
SBC AUX
INC ACC
:DIV2
DEY
BNE :DLOOP
;RTS

lda acc
plp
bpl :pos2
eor #$ff
:pos2
eor #$80 ;+128 offset
rts

*
* Point list
* Last point is for relative cx,cz
*

;Pz dfb 0,16,0,-16,0
;Py dfb 16,0,8,0
Px dfb 0,-4,16,-4

;note: px+4 intrudes on table below

*
* 128*sin(t), t=0..15
*
* Table must be at end of code!
*

;sin0 dfb 0,13,25,37,49,60,71,81
; dfb 91,99,106,113
; dfb 118,122,126,127,128

;sin0 dfb 0,25,50,74,98,120,142,162
; dfb 180,197,212,225,236,244,250,254,255
sin0 dfb 0,24,49,73,97,119,141,161
dfb 179,196,211,224,235,243,249,253,255


end

.......
....
..
. - fin -

← previous
loading
sending ...
New to Neperos ? Sign Up for free
download Neperos from Google Play

Let's discover also

Recent Articles

Recent Comments

Neperos cookies
This website uses cookies to store your preferences and improve the service. Cookies authorization will allow me and / or my partners to process personal data such as browsing behaviour.

By pressing OK you agree to the Terms of Service and acknowledge the Privacy Policy

By pressing REJECT you will be able to continue to use Neperos (like read articles or write comments) but some important cookies will not be set. This may affect certain features and functions of the platform.
OK
REJECT