I had an interesting project which involved a CD-ROM containing tens, if not hundreds, of thousands of Piaggio/Vespa/Aprilia parts: a parts catalogue nicely wrapped up in a stand-alone Java web application, served by Jetty I think.
The task: extract all the data and images from the CD. After looking through the application's file system, I concluded it was using a Derby database, with the images wrapped up in .res files. I had no luck extracting the images from these files with the various tools around on the internets, and after wasting too much time trying to get the data out of them, I gave up on that approach.
There had to be a better way of doing this and, as with most things, there was. Some years ago I created CraigsCrawl, an awesome project which crawled CraigsList ads and extracted data (sorry, CraigsList). Since the parts data was presented in a web interface, out comes C#.
Once I had targeted the information needed, I automated the whole process, extracting all the ranges, catalogues, parts and images from the CD. All the data was then stored in a nice MS SQL database, and the images were named and saved to a folder, linked (by filename) to records in the database. Creating the application to crawl the parts data took roughly 2-3 days (on and off) and countless cups of fresh-ground, fruity coffee. The extraction itself only took around 3 hours.
I mainly used the WebBrowser control to extract the data, and the HttpWebRequest class for some of the data which was hidden behind Ajax POST calls. The coffee was sourced from Sainsbury's: the fresh, fruity beans that need grinding up. Nice stuff.
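For the Ajax POST calls, the general shape of the code looked something like the sketch below. The URL and form parameters here are made-up placeholders (the real endpoint and parameter names came from watching the catalogue app's requests), so treat this as an illustration of the HttpWebRequest pattern rather than the actual crawler code:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

class AjaxFetcher
{
    // Posts a URL-encoded form body and returns the raw response text.
    // Endpoint and parameters below are hypothetical examples.
    static string PostForm(string url, string formData)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";

        byte[] body = Encoding.UTF8.GetBytes(formData);
        request.ContentLength = body.Length;
        using (Stream stream = request.GetRequestStream())
            stream.Write(body, 0, body.Length);

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
            return reader.ReadToEnd();
    }

    static void Main()
    {
        // Hypothetical call: fetch the parts list for one catalogue page.
        string html = PostForm("http://localhost:8080/parts/list",
                               "catalogueId=123&page=1");
        Console.WriteLine(html);
    }
}
```

The WebBrowser control handled the pages that rendered normally; this kind of direct request was only needed where the data never appeared in the DOM until a POST fired behind the scenes.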
The total number of parts extracted across the 3 brands came to 210k! Can you imagine the time it would take to extract that amount of data manually?