Resumes, etc

Applying for jobs is hard for a lot of reasons, but the two I’m annoyed with are:

Everything is easy once you’ve done it.
If you can describe how hard/awesome it was, you can’t do it more than once on a page. And apparently people like one page resumes.

So, I’m going to talk about some of the hard things I’ve done. Who knows, maybe it’ll help someone. You’ll at least learn something about what I’ve been working on for the last few years.

LUSA | Words With Friends Max | Crashplan on FreeNAS | Wrap-up

LUSA

LeTourneau University Schedule Advisor

The first problem talking about this is that it’s written in PHP. There, I’ve lost half my readers. It’s not a bad language, it’s a bad community. Moving on.

The core problem for students is “I want to take these 7 classes. At what times can I do this?”. Sounds easy, but it’s actually a largish combinatorics problem. This class has 8 sections and 12 lab sections. Get a few general electives together, and you have a ton of possibilities. I shouldn’t have to tell you that PHP is slow, but it turns out to be just fast enough to cope with this problem. If you know what you’re doing.

After the obvious things like caching as much as possible, things get hard. The first thing you need to do is figure out an efficient way to run through all the possibilities. This is complicated.

First, we sort courses by the number of sections (line 13). Since we have to start somewhere, we start at the beginning of the course list and try to eliminate from there. Conflicts early mean fast returns, but we’ll get to that…

Next, the array $indexes holds an index into the course section we’re considering. We do some other initialization, but the key is getting the indexes straight, and that’s what the for loop on line 34 does. Most fun for loop I’ve ever written. The comment is helpful, but if you still don’t get it, I’ll try to explain.^[1]

Most people think of for loops as incrementing a variable i from start to end. What they really do is run some section of initialization code, some code at the beginning of the loop to see if the body should be executed (again), and some code at the end of the loop before the second thing gets re-checked. So, to start out, we increment the current section index (pre-increment because we’re using the updated index now). Then, we check to see if we’re out of sections for this class. If we are, we enter the loop body and
34: reset the course section object in the classes array to the first section
35: reset the section index to 0 and move on to the next section, so that
36: We can check to see if there is no next course(in which case we’ve hit all the possibilities, and we’re done. Break 2 is wonderful BTW, because it allows our interpreter to make better decisions about the outer while loop)
37:Otherwise, run the initialization code again (it’s the same, except we’re using $i instead of 0 because we’re no longer in the special case where $i is 0).

And there you have it. As far as I know, that’s the fastest way to enumerate combinations of a 2D array with variable length subarrays in PHP. But this still isn’t fast enough (in fact, it’s so not fast enough that there’s a special implementation of it in Student.php:84ff for the special case of checking a single class against a schedule set that’s known to be valid). I spent a lot of time in webgrind (and I even submitted a patch that he’ll never actually merge, but that’s another rant) doing things like getting the conditional in Course.php:186 right. We’re talking about actual half seconds from the ordering of those conditionals (and smaller gains in the functions it calls).

I looked at things like “is it worth having a special conditional to make sure we don’t compare sections to themselves?” (Answer: It is, and it’s a bigger difference than you expect). I also did profiling on different ways to enumerate arrays (for, foreach, array_walk, array_map, and array_walk_recursive). The answer is it really depends on what you’re doing, but in general, PHP function calls have a high overhead (never ever write recursive algorithms in PHP), although sometimes array_walk is still more efficient than foreach. :/ Lots of work, but lots of fun too. And everyone’s schedules should display inside three seconds from reasonably modern hardware, which (at least at the time) is just tolerable for users.

Words With Friends Max

With digressions on academic code

What’s the maximum score in words with friends? Yes, I know you don’t care, but there are a lot of interesting challenges here. This problem may be intractable on consumer hardware – ongoing debate. Assume for a moment it’s worth studying. The first thing you need to do is get the language out of the way. It turns out objective C’s message passing is expensive (objc_message_send). How expensive? More than a millisecond. Yes, I care. You would too if it was 1+ milliseconds times hundreds of thousands of calls. So, C.^[2]

We should probably start in functions.m. If everything about the game board is a number, you can write switch statements and let the compiler worry about making lookups (that you do all the time) really fast. It’s easy for letters (see valuel), but you can do it for 2d grids too as long as the concatenation of your points will fit in a primitive. So, the y coordinate of the board gets shifted up 4 bits (15×15 grid, the pch file has macros to assist), and we can write scoreSquare* as a single switch statement and never see it in our instruments traces again! Oh, and lets not forget string length. That’s precomputed for all words, because Schlemiel is a terrible painter.

The next thing we’re going to look at is the subword code. You know how you only have 7 tiles? This doesn’t limit you to 7 letter words. You are limited to 15 letter words by the board (the app’s dictionary actually ships with several hundred unplayable words), but by combining words and stealing letters from other vertical words, we have a lot more options. Which reminds me, since the board is symmetrical, we only consider the cases where the word being played is horizontal.

Anyway, this brings us to subwordsAtLocation, which is one of the rare times you’ll see a ** object that’s not NSError. This is both because ARC has not simplified your life – it’s hidden it behind autorelease pools^[3] – and because creating an NSMutableSet tens of thousands of times is slow. Also notice that goto can be useful. Here, it’s turning what would otherwise be a boolean, extra conditional, and a register into a jump (the particular one depends on your optimization level) and fall through for the else branch. Much faster. Much easier to read too. Also, this is not the first time I’ve checked assembly output to verify that I’m getting the optimizations I expect.

Much time has gone into optimizing the rest of this, but I don’t really remember what was interesting about that anymore. Loop orders (how you iterate things affects when you can cache things), lots of caching, and an emphasis on low memory overhead. Wait, low memory overhead? I’m glad you asked.

Memory access is slow. People have written white papers and blog articles about this. In fact, there’s a researcher that specializes in small data structures for english. This guy is very serious about performance, but like other academic code I’ve encountered^[4]:

It’s buggy.
We can do better.

That’s right, it’s not small enough. Some things are easier because my working dictionary is smaller than his, but he has some extra bits in the List-Format-Array that can be used to eliminate lists. It’s also theoretically possible to save a few bits in the Node-Array integer by moving related nodes closer together so that the child mask (which is an index into the node-array (like a jump table)) could use relative offsets instead of absolute offsets. A single extra bit in the node would allow me to eliminate the zeroed end of word list nodes that pepper the node array (which is the most significant optimization I’ve though of). Additional bits could also be used to mask child lists again or to potentially generate other child nodes in the list from existing nodes, although both of these are complicated and probably won’t have much effect. But relative jumps. This turns out to be an integer linear programming optimization problem, which is hard. Yes, there are libraries; But I’ve turned my attention elsewhere for the moment.

The current problem is walking the word graph in a threadsafe manner (because the whole problem is highly concurrent,^[5] and I take advantage of this). Since the sample enumeration code is recursive, we have to convert this to an iterative solution and track the stack ourselves, which has proven challenging. And I’ve been sidetracked trying to run Crashplan on FreeNAS.

CrashPlan on FreeNAS

The not-so-portable Java

There’s a whole separate blog post if you just want to do this. I’m here to describe how hard it was. There are posts going back at least 6 months on the FreeNAS forum asking how to do this, and there are hints that people might have accomplished it, but I seem to be the first to have documented a solution.

The obvious thing to do is to follow Crashplan on FreeBSD instructions (because FreeNAS is nanoBSD, which is FreeBSD ~~with none of the tools you need~~ for embedded systems). This leads you to learning you need to load the linux kernel extension (among other things, but you should start here). You don’t have this extension, of course, which the FreeBSD community finds odd, but helpfully suggests that you can compile it from kernel source. Great! Don’t have that either. The FreeBSD community finds this very odd, but there are a few souls that will suggest you use cvsup to get it. The cvsup servers, of course, don’t know what a “FreeNAS” is, and refuse to help you. You have left the reservation, the BSD community can no longer help you. So, you install FreeBSD somewhere else and copy the linux.ko file from it, because your other option is maintaining a fork of FreeNAS. This is your last hope to simply modify official builds after installation. But this won’t work either. Why is a little bit of a mystery, but it’s basically because your kernel wasn’t compiled to support such things.

Fine. But I really want to run Crashplan on FreeNAS! So, pull up the “build FreeNAS” docs, which helpfully tell you what commands to run, but nothing about the build environment or what anything does. This developer has a nice summary of how the codebase is doing (hint: opaque is the polite term)…

Ok, but how hard can this be? Search around a little, figure out how to add packages, add a package, start the build, … and wait a long time (this is a several hour test-debug cycle. In the meantime, Anime!). And… the build crashes. Why did it crash? Dunno. Seriously, the error message provides no more information than “this package failed to install”. So, go trace down how packages are installed. Find some promising places for print statements. Put them in and try again. Several hours later – crashed. Why? Dunno. What about our print statements? Dunno. … huh? Well, can we redirect output to a file? Run run run (Anime Anime) – yes! But it doesn’t seem to have captured standard error. Run run run. Couldn’t find the jdk binaries. What? But I copied them to distfiles. “find / -iname ‘distfiles'”. Copy to all the distfiles. Run run run. Couldn’t find the jdk binary. Frustration.

Long story short, FreeNAS is installed from FreeBSD, but it brings its own FreeBSD to the party too; And older FreeBSD wants an older java binary. Repeat this debug cycle for loading linux.ko and some kernel build flags (after giving up on installing linux support as a module… again. No, that earlier linux.ko reference was for the install system. There are 3 BSDs running here – it’s complicated), and you’re good to go (read: the internet can help you again)! Except that the default poll provider is pathologically stupid. Now I’m back to repeating this process to bring FreeBSD’s JDK 1.6 port to FreeNAS (which works in principle, but in practice won’t patch correctly in FreeNAS’ build process. Why not? Dunno. Run run run. Hopefully I’ll get back to you someday).

Wrap-Up

Or something

And lets not forget the projects I’m not boring you with (ok, so I am):

Hand-soldering an ARM based microcomputer I designed and built (For the LQFP chips, lots of flux + microscope and patience. You get good by the third board).
Image recognition with OpenCV cross-compiled for C#.NET (seriously, why are they using ANSI C?) and iOS. If you didn’t know, the OpenCV documentation sucks, if it even exists. And they’ve overloaded ‘=’ to do conversion between C and C++ data structures, so that with the memory management differences (C++ is kind of refcounted, and C needs explicit frees), the transitive property no longer applies (necessarily) to ‘=’ after you’ve freed the C structure. “a = b; … free(b); … foo(a); foo(b);”. Different results from foo, different crashes. Oh, and UIImage needs an alpha channel, but jpegs don’t have alpha channels. Think about the implementation implications for a minute. And don’t even get me started on the fwrite buffer overflow in sample code…
Unofficial Mac client for the online game Neptune’s Pride II.
Research into using the Apple trackpad as a chorded keyboard (Result: it’s not quite responsive enough, and you need more tactile feedback. It works, it’s just much slower than typing).

Also, fun fact: If your language supports variable variables (${$foo} in PHP), it also support a hash table with lexical scope. This might be useful if the only datastructure your language has is arrays, but seriously – never ever do this. Also, just because you’ve written your own language doesn’t make it a DSL. But it does make it something you have to maintain. The joys of internships :)

^[1] Back to “everything’s easy when you’ve done it”, I documented this loop several years after I wrote it, and I probably spent as much time re-understanding it as I did writing and debugging it in the first place. So yes, document your code; But also, it’s hard to know what’s so complicated when you just wrote it.
^[2] C++ you say? I have in fact worked with it. I’ll let this guy explain it though.
^[3] Seriously though, ARC was a really good idea, and you really don’t have to think about memory 95% of the time now. But that other 5%… If only you could un-ARC that one stupid variable… Oh, and retain cycles with blocks. Those are bad.
^[4] Digression time! So, Antlr v3… it’s a rather nice compiler generator with a very nice IDE. Altogether wonderful project, hope they fixed it in v4. But in v3, their NFA to DFA converter was broken. In case you didn’t know, Antlr is written in Java. So, if you know anything about this problem, you’re thinking “NFAs (and DFAs) are graphs. Since we’re in Java, we can just create objects for each graph node and link it all together. In fact, there’s probably a library for this!”. Hahahaha! Wrong. I mean, you should do this – it’s the most obvious and maintainable architecture – but he didn’t. Oh no. I don’t know what he did, but it wasn’t that. It involved lots of arrays and lookup tables spanning multiple large files. With, of course, no documentation. I never did figure out how he stored them, but I did manage to print them with graphviz and verify the bug.

I still have tons of notes, graphs, and test environments if anyone cares, but the answer is that it’s a bad, bad codebase. I learned way more about DFA conversion and the internals of Antlr v3 (and v2 somehow) than I needed to know. I had 2 versions of antlr, his string templating library, and source for the IDE by the end of it. And yes, I escalated the issue, but I couldn’t find a way to communicate the bug effectively without saying “this is terrible and you should rewrite it”. Oh, and that whole journey of discovery only took 80 hours or so of my life (but I still want it back). Now, try fitting that on a resume.
^[5]. Oh, GPU programming then, right? Stanford has looked into this, and the short answer is no. The long answer is that GPUs weren’t designed for this, so you can spend a lot of time refactoring your code, but you won’t gain anything. It’s possible that you could spend a lot more time hyper optimizing your code, and they provide some ideas for further exploration, but the general conclusion is that probably won’t work out for you. Intel’s new Haswell CPUs are supposed to have faster integer and caching pipelines though, so that’s exciting :)

LUSA | Words With Friends Max | Crashplan on FreeNAS | Wrap-up

LUSA

LeTourneau University Schedule Advisor

Words With Friends Max

With digressions on academic code

CrashPlan on FreeNAS

The not-so-portable Java

Wrap-Up

Or something

About Bion

One Response to Resumes, etc

Leave a Reply Cancel reply

Archives

Meta