Wednesday 16 January 2013

The power of the Matrix

Do you want to make your code run 100 times faster, at least in python? Try rewriting it in matrix form using numpy.

I came across this when implementing the Floyd Warshall Algorithm for finding all shortest paths in a graph as part of th Algorithms 2 course at Coursera. The guts of Floyd Warshall are 3 nested for loops that iterate over all the nodes, here's the Python code :


# v_sz = number of vertices
for k in range(1,v_sz) :
  for i in range(0,v_sz) :
    for j in range(0,v_sz) :
      new_len = A[k-1][i][k] + A[k-1][k][j]
      old_len = A[k-1][i][j]
      if new_len < old_len :
        A[k][i][j] = new_len
      else :
        A[k][i][j] = old_len


This takes around 30 minutes to run on my PC for the supplied graph of 1000 vertices.

Other people were getting much better times and the hint was given to try rewriting this in matrix form -took me a while but I had used Octave as part of the Machine Learning and Neural Nets courses, and together with the Numpy for Matlab Users page I got to :

for k in range(1,v_sz) :
  #tile() is the equivalent to repmat -copies a vector repeatedly into a matrix
  i2k = np.tile(M[0:v_sz][:,k],(1,v_sz))
  k2j = np.tile(M[k][:],(v_sz,1))
  M = np.minimum(M, i2k + k2j)

This runs in ~10 seconds, a good chunk of which will be loading in the data! Pretty impressive and there would be more to go as I think I can shave time off the data load.

What it shows is that numpy is heavily optimised, probably through use of underlying C libraries, for matrix computation and that it is always going to be worth looking to see if a set of for loops can be recast into vector and matrix manipulations instead.

3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. I've been thinking about this, could the numpy use GPGPU if its there?

    There is a lot of optimisation for seconds level caches, which if done correctly can also give a huge speed up; 1000 vertices sounds large, but if it can fit snugly inside the processors front porch it doesn't need to make a call out to the next country if you follow my L2 cache, and RAM metaphor.

    I've watched a fair proportion of Algo2, and it has been good.

    Geoff

    ReplyDelete
  3. Looks like numpy won't out of the box -but there are potential ways of making it do so, using say Clyther.

    http://srossross.github.com/Clyther/for_numpy_users.html

    ReplyDelete

linkedin