How Good Is Your Data? by Sunny Harris
All data is equal — at least that’s what we think.
Is all data equal? If truth be told, I never gave it much thought. I have been using one vendor nearly exclusively for about 20 years. My fills are good enough. My closing prices seem to match what I see on television or find online. As long as the profits roll in, there has been no reason to question the data.
But then I was told by another vendor that my vendor’s data is off by just enough to generate a side income for the vendor, through the slippage between the actual price and the price I am presented. My curiosity was piqued, and so I decided to investigate. First, I set up a spreadsheet and compared the two vendors. To keep it simple, I considered only the past five years of data. My data experiment ran from June 30, 2005, to June 29, 2010.
I began by exporting the data for a single symbol from each software application to a comma-separated values (CSV) text file. The instrument I chose was the Russell 2000 index, which has different symbols in different software, like RUT, $RUT, and RU2000. I selected the Russell 2000 because of its high liquidity, its ease of use, and the fact that it is something little guys like us can trade.
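For readers who would rather script this step than build it by hand, here is a minimal Python sketch of loading two vendor exports and lining them up by date. The column names (date, open, high, low, close), the ISO date format, and the sample prices are all assumptions made for illustration; they are not the actual export layout or data from either vendor.

```python
import csv
import io

FIELDS = ("open", "high", "low", "close")

def load_bars(csv_text):
    """Parse CSV text with date,open,high,low,close columns.

    Returns a dict mapping each date string to a dict of float prices.
    The column names are assumed; adjust them to match the real export.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["date"]: {f: float(row[f]) for f in FIELDS} for row in reader}

def common_dates(bars_a, bars_b):
    """Return the sorted dates present in both data sets.

    Comparing only shared dates avoids mismatching rows when one
    vendor is missing a session the other has.
    """
    return sorted(set(bars_a) & set(bars_b))

# Made-up sample rows, not real vendor data.
vendor_t_csv = """date,open,high,low,close
2005-06-30,600.01,605.10,598.20,603.00
2005-07-01,603.00,607.50,601.75,606.25
2005-07-05,606.30,610.00,604.00,609.10
"""
vendor_m_csv = """date,open,high,low,close
2005-06-30,600.02,605.10,598.19,603.01
2005-07-01,603.00,607.52,601.75,606.24
"""

bars_t = load_bars(vendor_t_csv)
bars_m = load_bars(vendor_m_csv)
shared = common_dates(bars_t, bars_m)
print(shared)
```

In practice the text would come from the exported files (for example, by reading each file into a string first). Sorting works here only because the assumed dates are in year-month-day form.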
Figure 1 shows the beginning of the spreadsheet, with the data of the two vendors (T and M) in the columns. At first glance it appeared that everything was in order, with small discrepancies here and there. The differences in the data, where there are any, seem to be out in the hundredths place, like 600.01 vs. 600.02. That wouldn’t make much difference over time, with some errors to the positive and some to the negative. It seems like it should be a wash.
Next, I put columns in the spreadsheet to calculate the differences between the open, high, low, and close (OHLC) of each vendor. Part of that spreadsheet is shown in Figure 2. At the top of each column, in the first row of data, is the result of calculating the sum of all the differences between the two vendors’ OHLC data. I wouldn’t have been surprised if each component had been consistently lower or higher than the other. But these summation numbers show that the data is all over the map. The closes are 52 points lower, the opens are 40 points higher, the highs are 65 points lower, and the lows are 48 points higher. The sign of the difference alternates between positive and negative from one component to the next. Could it be, as one vendor suggested, that there is enough of a spread in there for vendor T to cash in on the spread alone?
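The difference columns and their totals can also be sketched in a few lines of Python. This is only an illustration of the arithmetic: the prices are made up, and the sign convention (vendor T minus vendor M) is an assumption, since the spreadsheet's subtraction order is not stated here.

```python
FIELDS = ("open", "high", "low", "close")

# Made-up sample bars keyed by date, not real vendor data.
bars_t = {
    "2005-06-30": {"open": 600.01, "high": 605.10, "low": 598.20, "close": 603.00},
    "2005-07-01": {"open": 603.00, "high": 607.50, "low": 601.75, "close": 606.25},
}
bars_m = {
    "2005-06-30": {"open": 600.02, "high": 605.10, "low": 598.19, "close": 603.01},
    "2005-07-01": {"open": 603.00, "high": 607.52, "low": 601.75, "close": 606.24},
}

def ohlc_differences(t, m):
    """Per-date difference (T minus M, an assumed convention) for each field.

    Values are rounded to two decimals to suppress floating-point noise.
    """
    dates = sorted(set(t) & set(m))
    return [(d, {f: round(t[d][f] - m[d][f], 2) for f in FIELDS}) for d in dates]

def sum_differences(diffs):
    """Total of the per-date differences for each field, like the top row of the sheet."""
    return {f: round(sum(row[f] for _, row in diffs), 2) for f in FIELDS}

diffs = ohlc_differences(bars_t, bars_m)
totals = sum_differences(diffs)
for field in FIELDS:
    print(field, totals[field])
```

A total near zero for a field means the individual discrepancies cancel out, which is the "wash" one would expect from random rounding. A total that drifts steadily in one direction is what deserves a closer look.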