raphnet.net banner

mobile8088: An elementary emulator for porting RATillery to Android

twitter@raphnetlabs
Contents

Introduction

This is an ongoing project. Updates coming soon!
Writing my own emulator is something I had been thinking of doing for a long time. However I lacked a precise goal and a reason for not using what was already available. That is, until I decided to port a PC game to Android (an old OUYA console actually). The game in question: RATillery, a simple game written in 8086/8088 assembly.

I want the Android version of the game to use the assembly code from the original DOS game, because I think it is unusual and fascinating to do it that way. And I do need an emulator to make this work.

An obvious choice would be DOSBox of course, but I discarded this option for a few reasons[1]:

[1] ok, ok, excuses, I know. I just really want to write an emulator.

goto top


Part 1: Android and Java

Java seems to be the easiest language to start developing on Android, thanks to the endless examples and huge collection of questions with answers that are easily found on the web, especially on Stack Overflow. As I'm new to Android development, I don't really want to try new languages or a combination of languages. So even if I feel this is a very bad choice for an emulator, I will attempt to do everything in Java.

I never really liked Java much and I spent the last 18 years laughing at this language that sacrifices performance to protect the programmer against itself too much for my taste. I'll just stop there... I can't wait to see if it will be possible to emulate an old 16 bit Intel CPU running at 4.77MHz, or will even that be too slow? If so, I wonder what kind of optimisations I'll have to use. Sounds interesting.

I followed a few tutorials and guides to familiarise myself with Android, discover what's new in Java since the last time I seriously looked at the language (For instance: Generics, For-Each loops, Varargs...), and learn how one uses an IDE such as Android Studio. (I normally use vim and write my own Makefiles.. But the adaptation was easier than I expected)

When I had made enough test applications with classic "Hello World!" buttons, I dove into coding the emulator.

goto top


Part 2: General architecture

I created a class for the CPU which I named Cpu8088, as well as a few support interfaces: One for memory (Memory.class), one for system calls using the INT instruction (Int.class) and one for IO ports (IOPort.class).

Creating a CPU instance is simple:
    mCPU = new Cpu8088();

Memory ranges

This won't do anything useful without memory! The Cpu8088 object needs at least one Memory object to work with. Here is the Memory interface:
public interface Memory {
    boolean isAddressInRange(int address);
    void write8(int address, int b);
    void write16(int address, int w);
    int read8(int address);
    int read16(int address);
}
A class named MemoryBlock implements this and uses a byte array for storage. The code below shows how I instantiate the memory blocks required for RATillery.
	MemoryBlock stack = new MemoryBlock(0x400, MemoryBlock.SIZE_64K);
	MemoryBlock program = new MemoryBlock(0x10000, MemoryBlock.SIZE_64K * 4);
	cga_vram = new MemoryBlock(Cpu8088.getLinearAddress(0xB800, 0), MemoryBlock.SIZE_64K/2);
	MemoryBlock ivt = new MemoryBlock(0, 0x400);

	mCPU.addMemoryRange(program);
	mCPU.addMemoryRange(cga_vram);
	mCPU.addMemoryRange(stack);
	mCPU.addMemoryRange(ivt);

	/* Load the game executable (gametga.com) in the 'program' MemoryBlock at offset 0x100.
	 * (The .COM executable file format use an origin of 0x100). */
	program.load(this.getResources().openRawResource(R.raw.gametga), 0x100);
The memory blocks above:

BIOS Services

My game calls a few BIOS services using the INT instruction. Cpu8088 accepts objects implementing the Int class.

The Int interface has two methods: One to declare the interrupt number it implements, and another one to actually handle a request.
public interface Int {
	public void handle(Cpu8088 cpu);
	public int getNumber();
}
On a real PC, the INT instruction causes the program execution to jump into the BIOS. But with this emulator, it is Java code that is called. The Java code looks at the CPU registers or memory, does what it is supposed to do, updates memory or registers with the result and returns control to the program.
(note: The concept is not unusual for emulators)

Generally, when a service is called using the INT instruction, the sub-service is selected using the AH register. This emulator does not need to implement all possible sub-services, and it is not even required to provide a perfect or a real implementation. For instance, Int10 AH=0x0, used to select the video mode, simply returns the requested mode without any error checking. (I know which mode RATillery will set, I just need to make sure the game stays happy and does not quit thinking setting the video mode failed).

Here is the code for Int 10h, video services:
public class Int10 implements Int, Constants {
    private int current_mode = 0x03; // 80x25 color

    @Override
    public void handle(Cpu8088 cpu) {
        switch(cpu.AX >> 8)
        {
            case 0x0: // Set video mode
                current_mode = cpu.AX & 0xff;
                Log.d(LOG, "int10: Set video mode to 0x" + Integer.toHexString(current_mode));
				// return video mode in AL. (already in AL)
                break;
            case 0xF: // Get Video State
                Log.d(LOG, "int10: Get video state");
                // AH : number of screen columns
                // AL : mode currently set
                // BH : current display page
                cpu.setAH(80);
                cpu.setAL(current_mode);
                cpu.setBH(0);
                break;
            case 0xB: // Set palette
                Log.d(LOG, "int10: Set palette (ignored)");
                break;
            default:
                cpu.halt("Int10: subservice 0x" + Integer.toHexString(cpu.AX >> 8));
        }
    }

    @Override
    public int getNumber() {
        return 0x10;
    }
RATillery only requires the following services: And therefore:
	mCPU.addIntHandler(new Int1a()); // Time
	mCPU.addIntHandler(new Int10()); // Video
	keyboardInt = new Int16();
	mCPU.addIntHandler(keyboardInt);
(Int16 is kept in a variable because it will have a method to inject key presses)

IO Ports

The game also touches a few IO ports (using the IN and OUT instructions). So the Cpu8088 class also knows about objects implementing the IOPort interface:
public interface IOPort {
	public boolean isPortHandled(int port);
	public int in8(int port);
	public int in16(int port);
	public void out8(int port, int value);
	public void out16(int port, int value);
}
RATillery uses the following ports: And as a result:
	vcard_ports = new VideoCardPorts();
	mCPU.addIOPort(vcard_ports);
	mCPU.addIOPort(new PITPorts());

goto top


Part 3: Implementing the CPU

As one would expect, Cpu8088 has member variables for the 8088 registers, constants for the flags and lists for the Memory, Int and IOPort objects that are registered.

Cpu8088 also has a void runOneInstruction() method which reads and executes a single instruction each time it is called. It is mostly a very long switch/case. Here are a few lines of it:
switch (opcode)
{
	case 0x50: push16(AX); break; // PUSH AX
	case 0x58: AX = pop16(); break; // POP AX
	case 0x40: AX = alu.add(AX, 1, ALU.WORDMODE); break; // INC AX
	case 0x34: // XOR AL,IMMED8
		immed8 = getInstructionByte();
		 setAL(alu.xor(AX & 0xff, immed8, ALU.BYTEMODE));
		 break;
	case 0xAC: // LODSB
		setAL(read8(getLinearAddress(DS_CUR, SI)));
		if ((FLAGS & DF) == DF) {
			SI = (SI-1) & 0xffff;
		} else {
			SI = (SI+1) & 0xffff;
		}
		break;
	...
}

Well, this may take time. Let's only implement what we need... - Fail!

With my "8086/8088 User's Manual, Programmer's and Hardware Reference" on hand, I began at opcode 0x00 and so on, in order. But very soon, I changed my mind. I thought, well, I only use common instructions when I write assembly, I probably just need to implement a small subset. That would be faster than implementing the complete list.

I made it so that the emulator stops when it encounters an unimplemented instruction, displaying its number so I know which to implement.

Past that point, here are some of my thoughts, roughly in the order they occurred:

- Great!! I can see CALLs and RETs are working!

- Still, I encounter unimplemented instructions a lot...

- Oh, how will I implement the REP prefix? Let's try recursion.

- Well now, there really are a lot of unimplemented instructions...

- Oh my, it's nice that many instructions can operate on memory or on registers, but it's more work for me now...

- and it actually gets worse. There may be 8 or 16 bit displacement, and it is used...

- The 0x80,0x81,0x83 instructions are hell! The following byte selects from 8 other operations.

When I finally understood what I was into, it was too late to stop. I had written too many lines. Stopping now would be a shame. And it was far from finished.

So I did not give up and eventually the emulator stopped stumbling on unimplemented instructions. But RATillery was stuck in a loop that was polling IO port 0x3DA, waiting for a specific bit to change. This was the vertical retrace bit. I quickly modified my VideoCardPorts class to toggle the bit each time the port would be read. This is wrong, but at that point it did not matter. And thanks to this, execution did get past where it was stuck at before. But only to meet more unimplemented instructions...

A First Emulation Bug

There are times when we programmers are confronted with a program that misbehaves in ways that should not be possible, at least until we solve the mystery. In instances where the bug keeps eluding us, we may start to construct questionable theories about the problem. This is what I call "derailing". (In fact I use French: « Dérailler »). I have seen other programmers as well as myself "derail" on some occasions. In such cases, common scapegoats for the "unsolvable" bug are the CPU, the memory... in other words, the hardware! However it is extrrreeeemely rare that this really is the case. More often than not, it really is the software.

I find it amusing that this time, for once, I have all the rights to suspect the hardware. The "hardware" here is.... software! And knowing that RATillery works fine on all the computers I've tried it on, and in emulators such as DOSBox, it is evident where to look first.

My first problem with the emulator "Hardware" not working properly:

- The program is stuck forever in a loop. I look at the RATillery listing to understand where... I trace the execution instruction by instruction.... Ok, I see. Some conditional jumps do not work correctly because I got the flags wrong for some operations. Let's review everything.

A few unimplemented instructions later, the title screen code is finally reached, and it is looping peacefully, waiting for keyboard input. I can finally take a break from the CPU implementation. The byte array backing the VRAM MemoryBlock object must contain an image at this point. I want to see it!

goto top


Part 4: A First Image

I created a Thread that simply calls Cpu8088.runOneInstruction() in a loop. The 8088 emulation will just run as fast as it can for now. (I'll revisit later).

On the user interface side, I simply dropped an ImageView into an empty Activity. In the onCreate method, I retrieve the ImageView and create a Bitmap matching the size of the game screen (320x200). I also build a lookup table for the 16 colors that are used.
	private ImageView mScreen;
	private Bitmap mScreenBitmap;
	private final int clut[] = new int[16];
	...
	@Override
	protected void onCreate(Bundle savedInstanceState) {
		...
		mScreen = findViewById(R.id.screen);

		mScreenBitmap = Bitmap.createBitmap(320, 200, Bitmap.Config.ARGB_8888);
		mScreenBitmap.eraseColor(Color.GREEN);
		mScreen.setImageBitmap(mScreenBitmap);
		...
		clut[0] = Color.rgb(0,0,0);
        clut[1] = Color.rgb(0x00, 0x00, 0xAA);
        clut[2] = Color.rgb(0x00, 0xAA, 0x00);
        clut[3] = Color.rgb(0x00, 0xAA, 0xAA);
        clut[4] = Color.rgb(0xAA, 0, 0);
        clut[5] = Color.rgb(0xAA,0x00,0xAA);
        clut[6] = Color.rgb(0xAA, 0x55, 0);
        clut[7] = Color.rgb(0xAA,0xAA,0xAA);
        clut[8] = Color.rgb(0x55, 0x55, 0x55);
        clut[9] = Color.rgb(0x55, 0x55, 0xff);
        clut[10] = Color.rgb(0x55, 0xff, 0xff);
        clut[11] = Color.rgb(0x55, 0xff, 0xff);
        clut[12] = Color.rgb(0xff,0x55,0x55);
        clut[13] = Color.rgb(0xff,0x55,0xff);
        clut[14] = Color.rgb(0xff, 0xff, 0x55);
        clut[15] = Color.rgb(0xff,0xff,0xff);

	}
To update the Bitmap (and ultimately, the ImageView), I use handler.postDelayed() to repeatedly execute my syncVram() method which copies the pixels from the vram MemoryBlock to the Bitmap, converting the colors using the lookup table above. The Bitmap.setPixels() method is used to write to the Bitmap, and the ImageView is updated by passing this bitmap to an ImageView.setImageBitmap() call.

I have no idea at the moment if this is a bad idea and if there are more efficient ways to do this, but it works. I got this image:

The first image

The first image



Hurrah! How thrilling it is for me to finally see something! Even though the image is wrong, this is an important and very encouraging milestone.

goto top


Part 5: Tandy 16 color video to Bitmap

The 320x200x16 video mode (16-Color Medium-Resolution)

For the moment, my goal is to support the Tandy version of RATillery. The video mode used packs two pixels per byte (4 bit per pixel):

First pixelSecond pixel
7654 3210
IRGB IRGB

Each line on the screen uses 160 bytes. In memory, however, the lines are not in consecutive order.

There are 4 groups of lines:
LinesAdresse
0, 4, 8 ... 1960x0000 - 0x1F3F
1, 5, 9 ... 1970x2000 - 0x3F3F
2, 6, 10 ... 1980x4000 - 0x5F3F
3, 7, 11 ... 1990x6000 - 0x7F3F

A first attempt

I first wrote something similar to the code below to copy and convert the pixels from the vram byte array to the Bitmap:
	byte[] tandy16 = ratillery.getVRAM(); // Returns the vram byte array

	for (y=0; y < 200; y++) {
		for (x=0; x < 320; x+=2) {
			pixel1 = tandy16[y * 160+ x/2] >> 4;
			pixel2 = tandy16[y * 160 * x/2 + 1] & 0xf;
			mScreenBitmap.putPixel( x, y, clut[pixel1] );
			mScreenBitmap.putPixel( x+1, y, clut[pixel2] );
		}
	}
(Yes, calling putPixel in a loop like this is a great performance-killing tactic. To be avoided!)

Attentive readers may notice that the code above completely ignores what I just wrote about there being 4 groups of lines. And this is why the first screenshot looks like 4 squashed copies of the screen distributed vertically. I don't know what I was thinking, overlooking something like that!

The Correct Way

I reworked the code to do things line-by-line where a int[] array is filled and then passed to Bitmap.putPixels(). This is probably faster, but I plan to time execution and try a few things later. It may be possible to do better... For instance, is it useful in Java to move calculations with a fixed result (such as y*160) outside the loop? Is there something to gain by replacing multiplications and divisions by shifts where possible? Does the compiler understand that accesses to tandy16[] will never be out of bound so no need to check before every single access? I look forward to finding out answers to those questions.

For the time being, here's what the reworked code looks like:
	private int rowbuf[] = new int[320];
...
	private void doScanline(byte tandy16[], int src_offset, int screen_y)
	{
		int x;
		for (x=0; x<320; x+=2) {
			int pixelpair = tandy16[src_offset+(x/2)] &0xff;

			rowbuf[x] = clut[pixelpair >> 4];
			rowbuf[x+1] = clut[pixelpair & 0xf];
		}
		mScreenBitmap.setPixels(rowbuf, 0, 320, 0, screen_y, 320, 1);
	}

	private void syncVram() {
		int y;
		MemoryBlock vram = ratillery.getVRAM();
		final byte[] data = vram.getRawBytes();

		for (y=0; y<50; y++) {
			doScanline(data, 0 + y * 160, y * 4);
		}
		for (y=0; y<50; y++) {
			doScanline(data, 0x2000 + y * 160, y*4 + 1);
		}
		for (y=0; y<50; y++) {
			doScanline(data, 0x4000 + y * 160, y*4 + 2);
		}
		for (y=0; y<50; y++) {
			doScanline(data, 0x6000 + y * 160, y*4 + 3);
		}
	}
This new code gave a better result. Here are the first and second screenshots to compare:

First screenshot

First screenshot

Second screenshot

Second screenshot



Slowly getting there. The menu test is in the right spot, albeit not legible. Strange... But the problem was in the CPU implementation.

goto top


Part 6: Implementation errors

Let's use a simple test program

Using RATillery to troubleshoot was complicating things and muddying the waters. So I switched for a simple test executable which displays a few color frames and a line of text. Nothing else.

This test program, when running on a non-buggy emulator, displays this screen:
The simple test running in DOSBox

The simple test running in DOSBox



But in my emulator, all I got was a black screen. I found the reason in the logs:

That's it. Unimplemented instructions are back...


This time, it was the flavor of XOR that accepts a 8-bit immediate value...



After adding a few other instructions, the test screen comes up, but incorrectly. After a few minor corrections, the text was almost OK.
Simple test (incorrect result)

Simple test (incorrect result)



Anomaly 1: Extra lines in the text!

No, this is NOT Standard Galactic Alphabet:



The test program uses the same 8x8 font RATillery does. So why can I see 10 lines?



It took me a very long time to find the cause (a very simple and basic thing of course) and only a few seconds to correct it.

What an idiot! The LOOP instruction decrements CX before checking if it is zero! And the manual was quite clear on this...

So yes, some loops were doing extra laps...

Anomaly 2: Missing pixels in the frame

I also spent a good amount of time trying to understand why a few pixels were missing in action in the upper left corner:



The problem was my STOSB implementation. Can you spot my mistake?


STOSB wires AL to ES:DI, then increments or decrements DI. Why am I touching SI exactly? I blame lack of sleep and the cut-and-paste functions.

Test OK



Back to RATillery:


Oh no! I was hoping everything would be fine now...

...too bad. The background is stored in a compressed format, so it must be that the decompressor is not happy about something. But at least, the text is working fine now, and the fire animation is running.

Where is my error now? I had to dig into the decompression code for clues. A few hours later:

Ah ha! My implementation of the REP prefix is incorrect when CX = 0

When CX=0, the instruction that follows (for instance, MOVSB) was still executed 1 time!
After correcting this:



Victory! Another important milestone: The game correctly displays the title screen!


Next step: Add a way to simulate key presses to start the game.

goto top


Part 7: Keyboard support

At last, instead of passively observing glitches and encountering unimplemented instructions, the time has come to interact with the game, and then observe new glitches and more unimplemented instructions!

The game uses the BIOS keyboard services provided by the Int16 class. I've implemented single keystroke event queue. (to be revisited later):
public class Int16 implements Int {
	// TODO : Need a FIFO
	private boolean hasKeystroke = false;
	private int scancode;

	public void injectKeyStroke(int scancode) {
		this.scancode = scancode;
		hasKeystroke = true;
	}

	@Override
	public void handle(Cpu8088 cpu) {
		switch(cpu.AX >> 8)
		{
			case 1: // Get keyboard status
				if (hasKeystroke) {
					cpu.AX = scancode;
					cpu.FLAGS &= ~cpu.ZF;
				} else {
					cpu.AX = 0; // scancode (none)
					cpu.FLAGS |= cpu.ZF;
				}
				break;
			case 0: // Note: This is incorrect as it does not block.
				cpu.AX = scancode;
				hasKeystroke = false;
				break;
			default:
				Log.e(LOG, String.format("int16: service AH=0x%02x not implemented", cpu.AX >> 8));
		}
	}

	...
}

A temporary UI

For the end result, I have something better in mind than merely placing lots of buttons around the emulated screen. But this will come later. For now, I just need a way to continue testing quickly.

So I first added a bunch of buttons to the UI to quickly try the game. Each button simply ends up calling the injectKeyStroke() method with the ASCII character corresponding to the button. Scancodes are normally 16 bit values where the lower part is the ASCII character and the upper part corresponds to the physical ID of the key on the keyboard. I know RATillery only looks at the lower 8 bits, so this simple approach works.



HID Keyboard support

Later I wanted to open the story and credits screen, but this requires the S and C keys. Instead of adding buttons, I decided to connect a USB Keyboard to the OUYA console I'm using for development. Transferring keystrokes to the game was a simple matter of overriding the onKeyUp method in the Activity:
	@Override
	public boolean onKeyUp(int keyCode, KeyEvent event) {
		int unichar = event.getUnicodeChar();
		if (unichar >= 32 && unichar <= 128) {
			ratillery.pressAsciiKey((char)event.getUnicodeChar());
		}
		// extra code for ESC, ENTER and BACKSPACE keys not shown
		return true;
	}

goto top


Part 8: In-game bugs

So I started a game, and it mostly worked! There were a few things I had to fix though.

1 - It's always player 1's turn.

This subroutine makes the value of [player_turn] alternate between 0 and 1:
togglePlayerTurn:
	push ax
	mov al, [player_turn]
	not al
	and al, 1
	mov [player_turn], al
	pop ax
	ret
Note: Yes... I was an x86 assembly beginner when I wrote RATillery, and apparently I was still thinking like I was programming an AVR micro-controller where one must always load values in registers first. Today I would simply write NOT byte [player_turn], possibly in a macro...

The code above is known to work, despite it's superfluous complexity. So why wouldn't turns alternate? The NOT instruction was not implemented! This had not been caught due to a missing break in a switch/case.

2 - Cannot quit game



Pressing ESCAPE, and then confirming that I really want to quit the game by pressing Y did not work. The game would just go on as if I'd said no. What was wrong? I instantly suspected flags/alu incorrectness.

In RATillery, there is a subroutine called askYesNoQuestion. It returns 0 in CX if no was the answer, 1 in CX if yes was the answer. After calling it to confirm quitting, the game adds 0xffff to CX in order to set the carry if the value returned in CX is non-zero. (the carry flag is then checked later, after CX has been reused)

By adding a breakpoint right after this addition, I could see that CX = 0. That's correct since 1 + 0xffff overflows to 0 in a 16-bit register. But only the zero flag was set! Why not the carry flag?

This addition was done using:
AND CX, 0xFFFF
This is assembled to 83 C1 FF. Opcode 83 uses a 8-bit immediate that is sign extended before being used on a 16-bit memory location or register. In other words, 0xFF becomes 0xFFFF (or -1).

Java does not have unsigned integers, and I decided to use ints (32-bit) everywhere instead of unsigned shorts, only using the lower 16 bits. It turns out that the way I did the sign extension gave a 32 bit result. So the integer holding the sign extended immediate contained -1 (0xffffffff) instead of 0xffff.

This potentially does not matter, as it does not change the 16 lower bits and adding 1 and 0xffff gives the same result as adding 1 and -1, if you only look at the 16 lower bits. But it breaks my code that decides if the carry flag must be set:
int add16(int a, int b) {
	int result = a+b;

	if (result > 0xffff) {
		// set CF flag here
	} else {
		// clear CF flag here
	}

	// Code for other flags (OF, SF, ZF) not shown

	return result & 0xffff;
}
I have a lot of code assuming that the upper 16 bits are always 0, so I took the easy route and simply AND the result of the sign extension with 0xffff. This keeps the value positive and fixed the problem.

3 - The credits screen does not display completely



It stops right in the middle of writing my first name! As I thought, the French ë letter was the culprit. And this was once again a case of unimplemented instruction (this time, it was SUB REG8/MEM8, IMMED8).

goto top


Part 9: Optimisations for faster emulation

I could tell by looking at transition effects between screens that the CPU emulation was running a bit slower than it should, so I decided to have a go at optimising stuff.

The emulator thread runs the CPU in blocks of 250 instructions. It monitors how much time has elapsed using System.nanoTime. Each time 15.5ms has elapsed, it calls a function that changes what io port 3DA reads (that's the register indicating, among other things, vertical refresh.) and run the CPU for another 250 instructions. Then the value of 3DA returns to "not in retrace". So the game logic and animation should be running more or less at 60fps (1s / 60 = 16.66667ms).

I decided to use the title screen as a benchmark, since there are a few things going on: There is screen drawing for the 32x32 fire animation, vertical retrace is polled (through a IO port) to keep speed constant, and BIOS services are called to detect keystrokes.

In order to measure performance, I have code to count how many instructions are executed in a slice of 1 second. A 15 second average is used to stabilize the reading of instructions per second (IPS) displayed on-screen.

All right, initial IPS reading of 280000. Let's see what I can do.

I started by changing the Debuggable option to false in the build type settings. IPS increased to 278000. A very small difference, I was hoping for more. Not worth it.

Then I thought: the title screen code is polling port 3DA a lot. The IOPort handlers are stored in an ArrayList. Each time a port must be accessed, the code has to iterate over the list, calling each IOPort object's isPortHandled(int) method until it finds the correct IOPort object. This is a lot of method calls. There are in theory 65536 ports. That's not too much to use a lookup table nowadays in this land of supercomputers. So I did that.

LUT for IOPorts: IPS of 317000. Nice, 13% faster. Now what else?

I asked myself, what else is being done repeatedly? Memory accesses, of course! And similarly, an ArrayList of Memory objects was used to find an instance handling the source or destination address. Why not also use a lookup table? The 8088 has a 20 bit bus, which means it can only address up to 1 Megabyte. The LUT would need 1 million entries... Assuming Java object references are 64 bit pointers, this table will occupy 8 megabytes in memory. So I accepted the tradeoff and implemented that idea too. Here is what it looks like:
Memory memoryZones[] = new memoryZones[0x100000];
Index 0 is for address 0, index 1 is for address 1, and so on. The Cpu8088.addMemoryRange(Memory m) method simply copies the Memory object reference it was passed to all applicable positions in the array. Finding the correct Memory object to operate at a given memory address is a simple matter of doing memoryZones[address].

LUT for Memory: IPS of 1384000. Excellent, a very nice 400% improvement right there!

Was there anything else that could provide a significant speedup? Yes! In many places, the correct Memory object for performing an access is found by first calling a function (either getMemForAddress(int seg, int offset) or getMemForAddress(int linear_address). Those used to iterate over an ArrayList of Memory objects, calling isAddressInRange on each object until the one handling the particular address was found. But now that there is a lookup table, all those do is return memoryZones[address];

Cpu8088 has methods for reading or writing memory, like this:
	private Memory getMemForAddress(int address) {
		return memoryZones[address];
	}

	public int read16(int address) {
		return getMemForAddress(address).read16(address);
	}
It seemed to me that the trivial method getMemForAddress could and should be inlined. However, not knowing if I could trust the Java compiler or JVM to do this, I did it manually. So I did away with the getMemForAddress() method and rewrote all memory io methods to directly use the lookup table. For instance, the rewritten version of the code above is:
	public int read16(int address) {
		return memoryZones[address].read16(address);
	}
(There is read8, read16, write8 and write16, in linear address and segment + offset flavours)

And what do you know! IPS went up to 2003000 by inlining methods!

Ok, so more than 7 times faster so far. But why stop here? More inlining would be possible, but I decided to keep read8/16 and write8/16 and not inline them because they are useful and used in many places. But I can still inline them in places where I think it matters.

Cpu8088 has a few methods that are used to read from the instruction stream. In other words, read from CS:IP, then increment IP:
Those are continuously used as Cpu8088 reads an opcode (getInstructionByte), then depending on the opcode, use one or more of the above depending on what additional information is present (source/destination register or memory addresses, immediate values, displacements...). Here is one example implementation:
public int getInstructionWord() {
	int b = read16(CS, IP);
	IP = (IP + 2) & 0xFFFF;
	return b;
}
is rewritten as:
public int getInstructionWord() {
	int b = memoryZones[(CS<<4)+IP].read16(CX, IP);
	IP = (IP + 2) & 0xFFFF;
	return b;
}

Not bad, IPS is now 2227600

I am not willing to inline the getInstructionX() methods, it becomes too messy. At some point, maintainability and readability starts to matter. But I decided to inline at least the getInstructionByte() call that was used by the huge opcode switch/case because I know it's called a lot (over 2 million times per second right now).

The effect: Now running at 2599800 IPS

If you read this far, I'm sure you get the idea by now. I continued inlining a few things where I judged it an acceptable tradeoff. Then a bit later...

Final result: 2700000 instructions per second

This is 9.64 times faster than in the beginning. I decided to stop here, for now at least, since I'd like to work on other aspects of the project.


Ideas for the future?

There are other optimisations I know I could try. One change that would likely provide a huge performance boost would be to use a single byte array for all memory. In other words, do away with having multiple Memory objects for different zones, reducing indirection by not needing to access it through a Memory instance. (And using less memory, since the huge 8 Megabyte lookup table won't be required anymore).

Ok, as I wrote this I couldn't resist and did it. Now at 3545000 IPS! (More than 12 times the original speed)


Another interesting trick with good potential would be to implement repeated string instruction (for instance, REP STOSB) in Java directly. This means looking at what instruction is being repeated, and doing the equivalent loop in Java (for REP STOSB/STOSW/LODSB/LODSW), or even calling System.arraycopy() in the case of REP MOVSB/MOVSW (using a single byte array for memory as suggested above would make this easy).

Ok, I'll really keep this one for later, if graphics are too slow.

goto top


To be continued

This is an ongoing project. Updates coming soon!

goto top


Any trademarks used on this site are the property of their respective owners.
Copyright © 2002-2018, Raphaël Assénat
Website coded withWebsite coded with vimLast update: August 15, 2018 (Wednesday)