I'm talking about Big O Notation analysis of algorithms.

“Although developed as a part of pure mathematics, this notation is now frequently also used in the analysis of algorithms to describe an algorithm's usage of computational resources: the worst case or average case running time or memory usage of an algorithm is often expressed as a function of the length of its input using big O notation. This allows algorithm designers to predict the behavior of their algorithms and to determine which of multiple algorithms to use, in a way that is independent of computer architecture or clock rate. Because big O notation discards multiplicative constants on the running time, and ignores efficiency for low input sizes, it does not always reveal the fastest algorithm in practice or for practically-sized data sets, but the approach is still very effective for comparing the scalability of various algorithms as input sizes become large.” - [http://en.wikipedia.org/wiki/Big_O_notation]

We are still talking about code, specifically your code and how well it performs when pushed to its limits, and more specifically helping you understand those limits and plan for what will happen when they are reached. I'm not going into a full class on Big O; I'm just highlighting an example from code I am working on at the moment that reminded me of Big O.

At CF Webtools we have taken on the task of doing RETS [http://www.rets.org/] data import. I have over 5 years of experience in developing Real Estate search platforms and working with Real Estate data. However, I am not a RETS expert, although many of my former co-workers are, and I discussed the issues and methods of RETS with them on a regular basis. Taking on a RETS import, writing your own RETS client and managing the data is not for the weak of code.

RETS: Even though the S stands for “Standard”, standard is a term that is used loosely. Each RETS server usually has different rules, protocols and limits. A RETS client lets you access a RETS server by making HTTP calls to the RETS server and retrieving limited amounts of data in a variant of XML. The particular RETS server I am getting data from limits me to 2500 records at a time, and there are well over 200,000 records. When I run a single HTTP RETS request I get back an XML file that I need to parse, clean up a bit and bulk insert into the database.
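For context, a single paged fetch looks something like the sketch below: a plain HTTP GET against the server's Search transaction, paged with Limit and Offset. Everything here is illustrative, not the actual client I inherited: the URL, SearchType, Class, DMQL query and total record count are hypothetical, and a real RETS client has to perform a Login transaction first and use whatever capability URLs and session the server hands back.

<cfset batchSize = 500>
<cfset totalRecords = 200000> <!--- hypothetical; a real client reads the count the server returns --->

<cfloop from="1" to="#ceiling( totalRecords / batchSize )#" index="batch">
    <!--- RETS Search transaction, paged batchSize records at a time (illustrative values) --->
    <cfhttp url="https://rets.example.com/rets/search" method="get" result="retsResponse">
        <cfhttpparam type="url" name="SearchType" value="Property">
        <cfhttpparam type="url" name="Class" value="RES">
        <cfhttpparam type="url" name="QueryType" value="DMQL2">
        <cfhttpparam type="url" name="Query" value="(Status=|A)">
        <cfhttpparam type="url" name="Format" value="COMPACT">
        <cfhttpparam type="url" name="Limit" value="#batchSize#">
        <cfhttpparam type="url" name="Offset" value="#(batch - 1) * batchSize + 1#">
    </cfhttp>

    <!--- retsResponse.fileContent now holds the XML chunk to parse, clean up and bulk insert --->
</cfloop>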

So the situation I landed in is to take an existing RETS client that was written with ColdFusion, finish the code, test it and add additional data imports. I chose to accept this mission even though I suspect it will shorten my lifespan. After digging into the code, reading developer notes and running a lot of tests, I ran into several BIG issues. Each issue was performance based. I'll cover only the code performance issue today.

This will lead us to the Big O, I promise!

As I was saying, the RETS server limits us to 2500 records at a time. The DB choked on 2500-record bulk inserts (that's a different blog post), so I dropped the RETS requests to 500 records at a time. The code performance issue is as follows.

  • Retrieve data in XML
  • Parse the XML
  • Get one row of data
  • Look at each column of data in that row and 'do some cleanup and logic'
  • Append this row of data to the bulk insert statement being built as a string
  • Repeat for each row of data

If you are reading that correctly you see a nested loop. Only one, so hey, not too bad. But there are two key questions you should be asking about now. If you are asking how many rows of data and how many columns per row, then you are asking the correct questions. This is where Big O tells me how badly the above algorithm will perform when scaled up to production. The answers are 200,000-plus rows and 450 columns per row. A basic rule in Big O is that nested loops multiply. With n rows and m columns the work is O(n*m), and in the worst case, where both dimensions grow together, that is f(n)=O(n*n). If you know any math at all this tells you that the algorithm is quadratic: as n increases, the time to run grows with the square of n. So n, being the largest value in this case, is 200,000, meaning the worst case is 200,000^2, or 40 billion loops. O! Yup, that's the Big O, and O! this could be a problem. The practical application of this algorithm gave me a still staggering number: 200,000 rows of data with 450 columns per row calculates to an actual 200,000 * 450 = 90,000,000 loops. Oh my! While the actual practical application of this algorithm is not nearly as scary as the worst case, it's still very scary.
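To make that multiplication concrete, here is a stripped-down sketch of the shape of the original code. The variable names and per-column details are mine, not the original client's, and myData stands in for the parsed rows from the XML; the point is only the structure: an outer loop over every row and an inner loop over every column of that row.

<cfset sqlInsert3 = "">
<cfloop array="#myData#" index="dataRow"> <!--- outer loop: ~200,000 rows --->
    <cfset rowSql = "">
    <cfloop list="#dataRow#" index="columnValue" delimiters="|"> <!--- inner loop: ~450 columns --->
        <!--- the per-column cleanup lived here: escape quotes, swap the 'xxx'
              placeholder for NULL, decide whether the value needs quoting, etc. --->
        <cfset rowSql = listAppend( rowSql, "'" & columnValue & "'" )>
    </cfloop>
    <cfset sqlInsert3 = sqlInsert3 & "(" & rowSql & "),">
</cfloop>
<!--- total work = rows * columns: O(n*m), or O(n^2) when both grow together --->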

So, what do you do?

  1. accept the code and get bigger, badder, better hardware
  2. figure out a better way to write the code

I saw a way to improve the code. The best way to do so is to get rid of the multiplication factor, i.e. kill the nested loop. After researching the code and figuring out the inner loop, I discovered that I could replace the functionality of the inner loop with some built-in ColdFusion functions.

Let's look at the inner loop. It was replacing any column that had a “no data placeholder” value of “xxx” with NULL, escaping single quotes, replacing the | delimiter with a comma, placing single quotes around the values that need to be quoted, and wrapping the entire row in parentheses with a comma added to the end. The inner loop had a massive amount of logic to look at each data column and decide on each of these items one at a time. It was so convoluted that I didn't even try to reuse any of it. The code below shows how I fixed this with ColdFusion.

<cfloop array="#myData#" index="tempDataRow">
    <cfscript>
        tempDataRow = replace( tempDataRow, "'", "''", "ALL" );       // escape single quotes
        tempDataRow = listQualify( tempDataRow, "'", "|", "ALL" );    // wrap each item in single quotes
        tempDataRow = replace( tempDataRow, "'xxx'", "NULL", "ALL" ); // replace 'xxx' placeholders with NULL
        tempDataRow = listChangeDelims( tempDataRow, ",", "|" );      // change the list delimiter from | to ,
        sqlInsert3 = sqlInsert3 & "(" & tempDataRow & "),";           // wrap the row in parentheses and append it
    </cfscript>
</cfloop>

The first thing I did was escape all the single quotes. Since the raw data did not have any of the values wrapped in single quotes, this was safe and quick. Next, I added single quotes around all the data items using listQualify(). Then I added in NULL by replacing 'xxx' with NULL. By using listQualify() first and then replacing the full quoted value 'xxx' with NULL, I eliminated the need for complex logic to determine which fields needed to be quoted. Finally, I replaced the | delimiter with a comma using the listChangeDelims() function, wrapped the whole line in parentheses, and added a trailing comma.
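To see what each step does, here is a hypothetical single row traced through the same calls. The sample values are made up by me, not real MLS data; the comments show what each function returns.

<cfscript>
// hypothetical pipe-delimited row as it comes out of the parsed XML
tempDataRow = "123 Main St|O'Brien|xxx|250000";

tempDataRow = replace( tempDataRow, "'", "''", "ALL" );
// 123 Main St|O''Brien|xxx|250000

tempDataRow = listQualify( tempDataRow, "'", "|", "ALL" );
// '123 Main St'|'O''Brien'|'xxx'|'250000'

tempDataRow = replace( tempDataRow, "'xxx'", "NULL", "ALL" );
// '123 Main St'|'O''Brien'|NULL|'250000'

tempDataRow = listChangeDelims( tempDataRow, ",", "|" );
// '123 Main St','O''Brien',NULL,'250000'

writeOutput( "(" & tempDataRow & ")," );
// ('123 Main St','O''Brien',NULL,'250000'),
</cfscript>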

Of course this leads to one last little problem, which is that the SQL I just generated has a comma at the end, and that will throw an error. The old code used complex logic to keep track of the position in the loop and add a comma unless it was the last item. It did this logic for every row even though the only time it mattered was the last time. Why run all that extra logic when you know the last character of the string is going to be a comma and you can simply strip the comma with a single command?

<cfset SqlInsert3 = left( SqlInsert3, len( SqlInsert3 ) - 1 )>

Now there is only one loop per row of data, for a Big O of f(n)=O(n), which in the practical application means that the code will only run 200,000 loops or so, depending on the actual number of data rows it has to retrieve. This is a much faster solution and you should be saying O! Wow! about now. (Now you've had multiple Big O's)

Until next time, when I talk about digging into MySQL server tuning.