Web scraping: where to start?

By Dario Iacampo

One of my friends was having a tough time trying to scrape some content out of a bunch of websites. I reviewed his code and noticed the problem: he was trying to do the job with regular expressions in PHP. Bad idea.

YOU SHOULD NEVER DO ANY KIND OF WEB SCRAPING THROUGH REGULAR EXPRESSIONS.

A better approach to the job is XPath queries. First of all you need an HTML/XML parser, then you have to figure out which queries are appropriate.
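To make the parser-plus-query idea concrete, here is a minimal sketch using only Python's standard library. The page fragment and the function name are made up for illustration, and ElementTree only understands a subset of XPath and only accepts well-formed markup, so a real page would usually need a more tolerant HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div id="restaurant-menu">
      <table>
        <tbody>
          <tr><th><cite>Starters</cite></th></tr>
          <tr><td>Bruschetta</td></tr>
        </tbody>
      </table>
    </div>
  </body>
</html>
"""


def first_menu_heading(markup):
    """Return the text of the first <cite> heading in the menu table, or None."""
    root = ET.fromstring(markup)
    # ElementTree paths are relative to the element you search from, hence ".//".
    node = root.find(".//*[@id='restaurant-menu']/table[1]/tbody/tr[1]/th/cite")
    return node.text if node is not None else None


print(first_menu_heading(PAGE))
```

The query says exactly which node you want by its position in the document tree, which is what a regular expression cannot do reliably.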

There are a couple of tools that are great for prototyping: the YQL console and Google Chrome. Here is why:


The YQL console allows you to write and execute SQL-like queries in this form:

select * from html where url='http://...' and xpath='...'

You only have to provide the URL of the page you are working with and the XPath query, and it will spit out some JSON that you can easily parse and use in your app. It also provides an API that is easy to call and consume in JavaScript through JSONP.
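If you want to call the API from a script rather than the console, the request is just that query placed in a URL. The sketch below only builds the URL. The endpoint, the `format` and `callback` parameter names, the example page address and the function name are my assumptions, so check them against the YQL documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumed public endpoint; verify it against the YQL documentation.
YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"


def build_yql_url(page_url, xpath, callback=None):
    """Build a YQL request URL for an XPath query against one page.

    Assumes neither argument contains a single quote, since both are
    wrapped in single quotes inside the query.
    """
    query = f"select * from html where url='{page_url}' and xpath='{xpath}'"
    params = {"q": query, "format": "json"}
    if callback:
        # A callback name asks for a JSONP response instead of plain JSON.
        params["callback"] = callback
    return f"{YQL_ENDPOINT}?{urlencode(params)}"


# Hypothetical page URL, for illustration only.
print(build_yql_url("http://example.com/menu", '//*[@id="restaurant-menu"]//cite'))
```

`urlencode` takes care of escaping the quotes, slashes and brackets in the XPath expression, which is easy to get wrong by hand.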

If you are particularly lazy and don't want to waste time constructing XPath queries, just fire up Chrome, press F12 and


Now, obviously, you have to understand how XPath queries work to get exactly what you want, but this is handy.
This is what you get in YQL after slightly modifying the original XPath expression (//*[@id="restaurant-menu"]/table[1]/tbody/tr[1]/th/cite)


Now all this can be consumed directly in JavaScript, or you can make an HTTP request with a web client implemented in any language and parse the results, or come up with a different approach. The only thing you have to remember is that YQL has usage limits: if you are writing a spider and need to fire billions of requests a day, this quickly becomes a bad approach.
I'd suggest Python and Beautiful Soup for this job.
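As a rough idea of what that looks like, here is a small sketch with Beautiful Soup (a third-party package, installed with `pip install beautifulsoup4`). The markup and the function name are invented for the example, and it uses a CSS selector instead of XPath because that is what Beautiful Soup's `select` method accepts. In a real spider the markup would come from whatever HTTP client you prefer.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical markup, loosely shaped like the menu page queried earlier.
PAGE = """
<div id="restaurant-menu">
  <table>
    <tr><th><cite>Starters</cite></th></tr>
    <tr><td>Bruschetta<br>Olives</td></tr>
    <tr><th><cite>Mains</cite></th></tr>
    <tr><td>Lasagne</td></tr>
  </table>
</div>
"""


def menu_headings(markup):
    """Return the text of every <cite> inside a header cell of the menu table."""
    # "html.parser" is the parser bundled with Python, so no extra install is needed.
    soup = BeautifulSoup(markup, "html.parser")
    cites = soup.select("#restaurant-menu table th cite")
    return [cite.get_text(strip=True) for cite in cites]


print(menu_headings(PAGE))
```

Because the parsing happens on your own machine, the only request limits you have to worry about are those of the sites you are scraping.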