Thursday, December 30, 2010

common Orc opcodes

I've been going through the liboil 0.3 source to rewrite the oil_yuv2rgbx_sub2_u8 function, which we use for Theora decoding, in Orc pseudo-assembly.

Because the Orc opcode documentation splits opcode descriptions and processor support between two tables, I wrote a quick Python script to build a single table of the Orc opcodes common to SSE (x86), Altivec (PPC/Cell), and NEON (ARM Cortex) processors.

Here's that table for reference, at least until I put in the time to format it for a wiki:
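The script itself isn't reproduced here, but the idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the two dictionaries stand in for the two documentation tables, their contents are hypothetical sample data ("exampleop" is a made-up opcode), and the name common_opcodes is invented for this sketch.

```python
# Hypothetical sample data standing in for the two tables in the Orc docs.
# "exampleop" is a made-up opcode used only to show the filtering.
descriptions = {
    "addb": ("1", "1", "1", "add", "a + b"),
    "exampleop": ("4", "4", "", "made-up opcode", "special"),
}
support = {
    "addb": {"sse", "altivec", "neon"},
    "exampleop": {"sse"},
}

REQUIRED = {"sse", "altivec", "neon"}


def common_opcodes(descriptions, support, required=REQUIRED):
    """Return one row per opcode that every required target supports."""
    rows = []
    for opcode, fields in descriptions.items():
        # Set comparison: keep the opcode only if all required targets
        # appear in its support set.
        if required <= support.get(opcode, set()):
            rows.append((opcode,) + tuple(fields))
    return rows


rows = common_opcodes(descriptions, support)
for row in rows:
    print("\t".join(row))
```

Joining on the opcode name and filtering with a set comparison keeps the script independent of how many targets the documentation lists.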

opcode      dst  src1  src2  description                                   pseudo code
absb        1    1     -     absolute value                                (a < 0) ? -a : a
addb        1    1     1     add                                           a + b
addssb      1    1     1     add with signed saturate                      clamp(a + b)
addusb      1    1     1     add with unsigned saturate                    clamp(a + b)
andb        1    1     1     bitwise AND                                   a & b
andnb       1    1     1     bitwise AND NOT                               a & (~b)
avgsb       1    1     1     signed average                                (a + b + 1) >> 1
avgub       1    1     1     unsigned average                              (a + b + 1) >> 1
cmpeqb      1    1     1     compare equal                                 (a == b) ? (~0) : 0
cmpgtsb     1    1     1     compare greater than                          (a > b) ? (~0) : 0
copyb       1    1     -     copy                                          a
loadb       1    1     -     load from memory                              array[i]
loadpb      1    1     -     load parameter or constant                    scalar
maxsb       1    1     1     signed maximum                                (a > b) ? a : b
maxub       1    1     1     unsigned maximum                              (a > b) ? a : b
minsb       1    1     1     signed minimum                                (a < b) ? a : b
minub       1    1     1     unsigned minimum                              (a < b) ? a : b
mullb       1    1     1     low bits of multiply                          a * b
mulhsb      1    1     1     high bits of signed multiply                  (a * b) >> 8
mulhub      1    1     1     high bits of unsigned multiply                (a * b) >> 8
orb         1    1     1     bitwise OR                                    a | b
shlb        1    1     1S    shift left                                    a << b
shrsb       1    1     1S    signed shift right                            a >> b
shrub       1    1     1S    unsigned shift right                          a >> b
signb       1    1     -     sign                                          sign(a)
storeb      1    1     -     store to memory                               special
subb        1    1     1     subtract                                      a - b
subssb      1    1     1     subtract with signed saturate                 clamp(a - b)
subusb      1    1     1     subtract with unsigned saturate               clamp(a - b)
xorb        1    1     1     bitwise XOR                                   a ^ b
absw        2    2     -     absolute value                                (a < 0) ? -a : a
addw        2    2     2     add                                           a + b
addssw      2    2     2     add with signed saturate                      clamp(a + b)
addusw      2    2     2     add with unsigned saturate                    clamp(a + b)
andw        2    2     2     bitwise AND                                   a & b
andnw       2    2     2     bitwise AND NOT                               a & (~b)
avgsw       2    2     2     signed average                                (a + b + 1) >> 1
avguw       2    2     2     unsigned average                              (a + b + 1) >> 1
cmpeqw      2    2     2     compare equal                                 (a == b) ? (~0) : 0
cmpgtsw     2    2     2     compare greater than                          (a > b) ? (~0) : 0
copyw       2    2     -     copy                                          a
div255w     2    2     -     divide by 255                                 a / 255
loadw       2    2     -     load from memory                              array[i]
loadpw      2    2     -     load parameter or constant                    scalar
maxsw       2    2     2     signed maximum                                (a > b) ? a : b
maxuw       2    2     2     unsigned maximum                              (a > b) ? a : b
minsw       2    2     2     signed minimum                                (a < b) ? a : b
minuw       2    2     2     unsigned minimum                              (a < b) ? a : b
mullw       2    2     2     low bits of multiply                          a * b
mulhsw      2    2     2     high bits of signed multiply                  (a * b) >> 16
mulhuw      2    2     2     high bits of unsigned multiply                (a * b) >> 16
orw         2    2     2     bitwise OR                                    a | b
shlw        2    2     2S    shift left                                    a << b
shrsw       2    2     2S    signed shift right                            a >> b
shruw       2    2     2S    unsigned shift right                          a >> b
signw       2    2     -     sign                                          sign(a)
storew      2    2     -     store to memory                               special
subw        2    2     2     subtract                                      a - b
subssw      2    2     2     subtract with signed saturate                 clamp(a - b)
subusw      2    2     2     subtract with unsigned saturate               clamp(a - b)
xorw        2    2     2     bitwise XOR                                   a ^ b
absl        4    4     -     absolute value                                (a < 0) ? -a : a
addl        4    4     4     add                                           a + b
addssl      4    4     4     add with signed saturate                      clamp(a + b)
addusl      4    4     4     add with unsigned saturate                    clamp(a + b)
andl        4    4     4     bitwise AND                                   a & b
andnl       4    4     4     bitwise AND NOT                               a & (~b)
avgsl       4    4     4     signed average                                (a + b + 1) >> 1
avgul       4    4     4     unsigned average                              (a + b + 1) >> 1
cmpeql      4    4     4     compare equal                                 (a == b) ? (~0) : 0
cmpgtsl     4    4     4     compare greater than                          (a > b) ? (~0) : 0
copyl       4    4     -     copy                                          a
loadl       4    4     -     load from memory                              array[i]
loadpl      4    4     -     load parameter or constant                    scalar
maxsl       4    4     4     signed maximum                                (a > b) ? a : b
maxul       4    4     4     unsigned maximum                              (a > b) ? a : b
minsl       4    4     4     signed minimum                                (a < b) ? a : b
minul       4    4     4     unsigned minimum                              (a < b) ? a : b
orl         4    4     4     bitwise OR                                    a | b
shll        4    4     4S    shift left                                    a << b
shrsl       4    4     4S    signed shift right                            a >> b
shrul       4    4     4S    unsigned shift right                          a >> b
signl       4    4     -     sign                                          sign(a)
storel      4    4     -     store to memory                               special
subl        4    4     4     subtract                                      a - b
subssl      4    4     4     subtract with signed saturate                 clamp(a - b)
subusl      4    4     4     subtract with unsigned saturate               clamp(a - b)
xorl        4    4     4     bitwise XOR                                   a ^ b
loadq       8    8     -     load from memory                              array[i]
storeq      8    8     -     store to memory                               special
splatw3q    8    8     -     duplicates high 16-bits to lower 48 bits      special
convsbw     2    1     -     convert signed                                a
convubw     2    1     -     convert unsigned                              a
splatbw     2    1     -     duplicates 8 bits to both halves of 16 bits   special
splatbl     4    1     -     duplicates 8 bits to all parts of 32 bits     special
convswl     4    2     -     convert signed                                a
convuwl     4    2     -     convert unsigned                              a
convslq     8    4     -     signed convert                                a
convulq     8    4     -     unsigned convert                              a
convwb      1    2     -     convert                                       a
convhwb     1    2     -     shift and convert                             a >> 8
convssswb   1    2     -     convert signed to signed with saturation      clamp(a)
convsuswb   1    2     -     convert signed to unsigned with saturation    clamp(a)
convuuswb   1    2     -     convert unsigned to unsigned with saturation  clamp(a)
convlw      2    4     -     convert                                       a
convhlw     2    4     -     shift and convert                             a >> 16
convssslw   2    4     -     convert signed to signed with saturation      clamp(a)
convql      4    8     -     convert                                       a
mulsbw      2    1     1     multiply signed                               a * b
mulubw      2    1     1     multiply unsigned                             a * b
mulswl      4    2     2     multiply signed                               a * b
muluwl      4    2     2     multiply unsigned                             a * b
accl        4    4     -     accumulate                                    += a
swapw       2    2     -     endianness swap                               special
swapl       4    4     -     endianness swap                               special
select0wb   1    2     -     select first half                             special
select1wb   1    2     -     select second half                            special
select0lw   2    4     -     select first half                             special
select1lw   2    4     -     select second half                            special
mergewl     4    2     2     merge halves                                  special
mergebw     2    1     1     merge halves                                  special
splitlw     2    4     -     split first/second words                      special
splitwb     1    2     -     split first/second bytes                      special
addf        4    4     4     add                                           a + b
subf        4    4     4     subtract                                      a - b
mulf        4    4     4     multiply                                      a * b
maxf        4    4     4     maximum                                       max(a,b)
minf        4    4     4     minimum                                       min(a,b)
cmpeqf      4    4     4     compare equal                                 (a == b) ? (~0) : 0
convfl      4    4     -     convert floating point to integer             a
convlf      4    4     -     convert integer to floating point             a
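
To make the pseudo code column a little more concrete, here is a rough Python model of a few of the byte-sized opcodes, applied to one element at a time. It is a reading aid only, not how Orc implements them: the clamp ranges are the usual 8-bit signed and unsigned limits, and Python integers stand in for the fixed-width registers.

```python
def clamp(value, lo, hi):
    """Saturate value into the inclusive range [lo, hi]."""
    return max(lo, min(hi, value))


def addssb(a, b):
    # add with signed saturate: result stays within int8
    return clamp(a + b, -128, 127)


def addusb(a, b):
    # add with unsigned saturate: result stays within uint8
    return clamp(a + b, 0, 255)


def avgub(a, b):
    # unsigned average, rounding up on ties
    return (a + b + 1) >> 1


def mulhsb(a, b):
    # high 8 bits of the 16-bit signed product
    return (a * b) >> 8


def cmpeqb(a, b):
    # all bits set (~0, shown here as an unsigned byte) when equal
    return 0xFF if a == b else 0


print(addssb(100, 100), addusb(200, 100), avgub(1, 2), mulhsb(64, 64))
```

The saturating forms are the ones that matter most for pixel work, since a plain add of two bytes would wrap around instead of pinning at the limit.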

Wednesday, December 29, 2010

stuck, back to PySoy

I've been working on a streaming XML parser for Python, but need a break. At this point there's no way Concordance is getting out by Jan 1st, but it should certainly be out by the end of winter.

Our migration of PySoy to libsoy got pretty far. We were migrating from Pyrex to Genie, essentially moving the core engine from PyObject to GObject to remove the Python dependency in game clients and enable further multicore processing on both clients and servers. Much of the rendering area of the engine has been migrated, but the process has been held up in two areas:

First, while libsoy is in pretty good shape, we still lack Python bindings, aka PySoy itself, which is what we intend games to be written and run with. Our original plan to use GObject Introspection failed in a horrible mess that I've documented in previous postings. We've also looked at using SWIG and even at building our own bindings generator, with little measurable success. To get us moving forward again, I'm going to just drop out some .c templates and write the custom wrapper classes by hand. The time it'd take to write and maintain these cannot possibly be greater than the time we've wasted talking about a more elegant solution that only exists conceptually.

When GObject Introspection reaches a state of even remote maturity, where it can offer a Pythonic API, we'll look at it again. We'd even help get it there if the current GIR developers would just document the .gir XML schema or typelib format so we wouldn't have to refer to their source code as the sole definition of these.

Second is our physics code. As I've posted, ODE worked for us in the past but has numerous issues with packaging for various Linux distros (it is also short on features, slow, and extremely difficult to port to mobile devices). We attempted to migrate to Bullet, but this burned us out: virtually no work has gone into that in the past 6 months. We're all pretty frustrated with Bullet's haphazard API (whereas ODE's is fairly clean), and its C++-only API doesn't play well with GObject (or anything other than C++, for that matter). Bullet's C API is minimal at best.

When it comes right down to it, the biggest barrier we face with physics is processing power on mobile devices, an issue that using Bullet would not solve. Most of the devices we're interested in include ARMv6/ARMv7 processors from Qualcomm or TI. Many do not include an FPU (floating point unit), but they all seem to offer a fairly powerful DSP used extensively for processing multimedia. We do not, however, want to rewrite and maintain our physics processing for each platform.

A solution I've come up with is to write our physics, greatly simplified from even what ODE offers, using Orc. It's yet another metalanguage (first Pyrex, then Vala/Genie, now this..), but it is the successor to liboil (which we and much of the GNOME community use) and already supports many interesting platforms.

My plan is to first migrate our liboil-based YUV-RGB conversion code to Orc to get my feet wet, then implement a greatly simplified collision system using it. I expect the next release (or two) to still use ODE for at least rigid body physics, with the plan to eventually replace even that with our own physics solver. It should be much faster, and the same Orc code we write now should be able to compile to DSP code for Android handsets and other mobile devices in the future.

Orc already supports ARM Cortex (NEON), so if we were to finish this work today we'd be able to run PySoy clients on more modern Android handsets without touching DSP code. DSP support in Orc would also be very useful for future hardware for PySoy game servers.

While we'd all really like to get the next PySoy release out ASAP, we'd also like to avoid rewriting the engine again down the road.

Wednesday, December 08, 2010

XML parsing in Python

It's been a couple of months, so I'm going to give a brief update on what I've been working on.

Concordance is getting close to release; I plan to have the first release (0.1) out January 1st. More on this toward the end of December.

One of the roadblocks I've hit (again and again) is the lack of a decent XML parsing package for Python. The standard library is a shame when it comes to XML: there are at least four different modules (expat, sax, dom, etree) to choose from, and none of them even supports XPath. The most popular option, etree (or ElementTree), cannot even process an XML file with the namespace prefix intact.

There's lxml, which offers an etree-compatible API and fixes many of ElementTree's major faults (namespace prefix preservation, XPath/XSLT support), but it still cannot handle stream processing and, due to ElementTree's API, does not expose multiple text nodes broken up by a child element, such as "<div>first string <br/> second string</div>".
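
For reference, this is how the standard library's ElementTree represents that example. There is no second text node; the trailing string is stored on the child element's tail attribute instead. The snippet is a small illustration using only xml.etree.ElementTree, and the variable names are just for this sketch.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring("<div>first string <br/> second string</div>")
br = div[0]

first = div.text    # text before the first child belongs to the parent
second = br.tail    # text after <br/> hangs off the child, not the parent

# itertext() walks text and tail values in document order
all_text = "".join(div.itertext())

print(repr(first), repr(second), repr(all_text))
```

Code that only reads div.text silently loses everything after the first child, which is the kind of surprise described above.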

To support XMPP streams we need to use expat or sax to handle the stream event-by-event, since the full XML document is only available once the root element closes at the end of the stream, but the direct children of the root element (what we call "stanzas" in XMPP) need to be processed as complete objects. While we may be able to hack something together using lxml, it would likely be less work to implement a new XML parsing package. As long as the resulting API doesn't diverge greatly from etree, the work necessary to switch should be minimal.

Besides this, I've been working on a host of different packages around Concordance, from getting a JavaScript BOSH/XMPP library together to getting distutils2 ready for Python 3. I've even managed to ship a pitiful little serial library for Python 3, PyTTY, that we're using to interface with some Arduinos.
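
The stanza-at-a-time idea can be sketched with the standard library's expat bindings. This is a minimal illustration under assumptions, not the parser described in this post: the StanzaParser class and its on_stanza callback are hypothetical names, namespaces are ignored, and the stream in the usage lines is a simplified stand-in for real XMPP.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class StanzaParser:
    """Hand each direct child of the stream root to a callback as an Element."""

    def __init__(self, on_stanza):
        self.on_stanza = on_stanza
        self.depth = 0
        self.stack = []  # open elements below the root
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self._start
        self.parser.EndElementHandler = self._end
        self.parser.CharacterDataHandler = self._text

    def feed(self, data):
        # False: more data may follow, the document is not finished
        self.parser.Parse(data, False)

    def _start(self, name, attrs):
        self.depth += 1
        if self.depth == 1:
            return  # the root stays open for the life of the stream
        elem = ET.Element(name, attrs)
        if self.stack:
            self.stack[-1].append(elem)
        self.stack.append(elem)

    def _end(self, name):
        self.depth -= 1
        if not self.stack:
            return  # the root itself is closing
        elem = self.stack.pop()
        if not self.stack:
            self.on_stanza(elem)  # a complete stanza

    def _text(self, data):
        if not self.stack:
            return
        elem = self.stack[-1]
        if len(elem):
            # text after a child element goes on that child's tail
            elem[-1].tail = (elem[-1].tail or "") + data
        else:
            elem.text = (elem.text or "") + data


stanzas = []
parser = StanzaParser(stanzas.append)
parser.feed("<stream><message to='a'><body>hi</bo")
parser.feed("dy></message><presence/>")
print([s.tag for s in stanzas])
```

Each stanza arrives as a complete tree even though the root never closes, and the chunks fed to the parser can be split at arbitrary points in the stream.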