The challenges of getting data

January 29, 2008

I recently had a call with a client who asked for a rich content feed covering several top US cities. The feed was in the food & dining space and included cuisine, parking, payment methods, description, reviews, ambiance, menus, health details and a bunch of other business attributes.

During our “sales pitch”, Malcolm (our VP of Business Development) and I explained how easy it would be for us to collect, clean up and send the data with quarterly updates. Later that day I had a chat with Nave, our VP of R&D, and we discussed the process involved in getting that data. I won’t bore you with all the details, but it was amazing to see how complex it really is.

For example, it turns out that some of the top US websites use different character encodings – one decided to use English, Swedish and Greek encodings all in the same site. Then there’s the issue of dealing with different ways of serving content, such as Ajax, cookies and other weird navigation methods. Stripping rogue characters and images out of the content and normalizing it into a single format is a whole new challenge, and even when it’s all good and ready, how do you reconcile discrepancies (e.g. one source claimed that a restaurant was European while another insisted that it’s French…)?
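
To give a flavor of the reconciliation problem, here is a minimal sketch in Python. It is not how our pipeline actually works; the `CUISINE_PARENT` table and the `reconcile_cuisine` function are names I made up for illustration. The idea is that "European" and "French" don't really disagree, one is just more specific, so we keep the most specific label and only flag a conflict when the sources truly contradict each other.

```python
from typing import Iterable, List, Optional

# Hypothetical cuisine hierarchy (child -> parent). A real taxonomy would be
# far larger and would need to be curated by hand.
CUISINE_PARENT = {
    "French": "European",
    "Italian": "European",
    "Swedish": "European",
    "Greek": "European",
    "European": None,
}


def ancestors(label: str) -> List[str]:
    """Return the label followed by its ancestors, most specific first."""
    chain = []
    current: Optional[str] = label
    while current is not None:
        chain.append(current)
        # Labels missing from the table are treated as having no parent.
        current = CUISINE_PARENT.get(current)
    return chain


def reconcile_cuisine(labels: Iterable[str]) -> Optional[str]:
    """Return the most specific label if the sources are compatible.

    Sources are compatible when every label is either the most specific
    label itself or one of its ancestors. Returns None when there are no
    labels or when the sources genuinely conflict.
    """
    unique = set(labels)
    if not unique:
        return None
    # The label with the longest ancestor chain is the most specific one.
    best = max(unique, key=lambda label: len(ancestors(label)))
    if unique <= set(ancestors(best)):
        return best
    return None  # real conflict: needs a human or a source-ranking rule


print(reconcile_cuisine(["European", "French"]))  # compatible labels
print(reconcile_cuisine(["French", "Italian"]))   # conflicting labels
```

In this sketch a genuine conflict is simply handed back as `None`. In practice you would probably also want to weigh how much you trust each source, which is a whole separate headache.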

After about 30 minutes of going over daunting buzzwords and technical challenges ranging from finding high quality sources to keeping the data fresh, I recalled our insistence that getting the data is “quite easy for us”. It makes one think of Adam Smith’s concept of Division of Labour and thank God that the sales guy doesn’t have to develop the product and that the R&D guy doesn’t have to sell it. 🙂


