2013. március 29., péntek

Finding a Hamiltonian path - a randomized aproach

The problem

There is an international programming contest in Hungary held in every year. I like to participate, the problems are very entertaining.

One of the problems in 2003 was an idealized DNA sequence assembly based on short reads. The sequences of course were generated by a computer (they weren't actually sequenced DNA data), and the input was very "clean":
  • There were given N reads, each L long.
  • The sequences overlapping exactly at five bases (five characters)
  • There were no read errors, or any kind of noise
That's it. It felt tempting to build a graph of that data.

Doing it in Python, the adjacency dictionaries (one forward, one reversed) was built in a couple dozen seconds for the largest files. (Python dictionaries are hash maps, they're pretty fast, and scale well.)


Possible solutions

Question is, how do I build a path in such a graph that contains all nodes exactly once?

Such a path is called a Hamiltonian path, and finding one in an arbitrary graph is an NP-complete problem. This means that it is hopeless to come up with an efficient algorithm that finds such paths quickly for small graphs, and only reasonably slower in bigger graphs.

Since the biggest input for this problem contained 10.000 nodes, it looked totally hopeless to exhaustively search through all possible paths, it would take a very-very long time (sun-would-turn-into-a-red-giant long).

Backtrack

...is the name of the process that would be still running under a red sun (worst case). It would also take insanely lot of memory to work.

However, if the graphs would have only a very few (like 1, or two) connections from a node towards other nodes (few branches along the possible paths), then it would work OK for even a few thousand nodes.

Building every possible Hamiltonian path

The idea is to start from a set of paths that contain only one node. Create a set and put a one-long path for every node.

In each iteration:
  • Try to extend the paths in every way, and build a new set from the extensions.
  • If it cannot be extended, it won't be in the new set.
  • When the new set is ready, discard the old one.
Do this until you will have a path that is Hamiltonian.

The problem with this is that the number of paths will be very large after 6-8 steps, and will eat many gigabytes of memory. Handling such a data structure is also very slow.

Discarding paths

Since, we only need a single path, we can try to get rid some of the paths.

I found out that if I randomly throw away paths so that only a few thousand remain after every iteration, I can solve the problems under one thousand nodes.

This is because there are many solutions, and we have a reasonable chance to find one even if we throw some candidates away.

Random path extension

The algorithm above is really just a complicated way of building random paths. Random path building can be done more quickly - therefore it's faster. We start from a random node, and add a random neighbour. If we can't add more, we check if the path is Hamiltonian. If it isn't, we start again.

This algorithm below can solve the problem for every input in reasonable time.

The problem is, that sometimes it takes two minutes to solve the hardest input, and sometimes it takes half an hour. Good thing is that it's very easy to parallelize: just run the same program on every processor core you have. One of them will find the solution quicker than the others.


It is surprising how a simple algorithm like this can solve such a hard problem.

Never underestimate the power of simple randomized algorithms!