Web Scraping

Web scraping - what to do when websites don't offer structured information

XML For PHP Developers

On 30 October 2010, the Bangalore PHP User Group met at Microsoft. I gave a talk, XML For PHP Developers, at the event.

I'm sharing the slides in this entry.

It was an introductory talk on XML for PHP developers. There are hundreds of technologies built on top of XML; we have all heard about RSS, Atom, XML-RPC, SOAP, etc. The goal of the talk was to get PHP developers to start using XML. In the talk, I presented three recipes.


Using Cookie Jar With urllib2

A while ago, we discussed how to scrape information from websites that don't offer it in a structured format like XML or JSON. We noted that urllib and lxml are indispensable tools in web scraping: while urllib lets us connect to websites and retrieve information, lxml converts HTML, broken or not, into valid XML and parses it. In this post, I will demonstrate how to retrieve information from web pages that require a login session.
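The login-session flow can be sketched with the modern Python 3 equivalents of those modules (`http.cookiejar` and `urllib.request` replaced the old `cookielib` and `urllib2`). The login URL and form field names below are hypothetical placeholders for a real site's login form:

```python
# Sketch: keeping a login session alive with a cookie jar.
# Python 3 names are used here; in Python 2 the same pieces lived in
# cookielib and urllib2. URLs and form fields are hypothetical.
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

# The jar stores cookies set by responses; the HTTPCookieProcessor
# sends them back on every later request made through this opener.
jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar)
)

# Encode the login form fields as an application/x-www-form-urlencoded body.
credentials = urllib.parse.urlencode(
    {"username": "alice", "password": "secret"}
).encode("ascii")

# Posting the form stores the session cookie in `jar`; subsequent
# requests through the same opener carry it automatically.
# (Network calls commented out so the sketch runs offline.)
# opener.open("https://example.com/login", credentials)
# page = opener.open("https://example.com/members-only").read()
```

The key point is that both requests go through the same opener object, so the session cookie received at login is replayed on the members-only request without any manual header handling.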


Web Scraping With lxml

More and more websites are offering APIs nowadays. Previously, we've talked about XML-RPC and REST. Even though web services are growing rapidly, many websites still offer information only in an unstructured format, especially government websites. If you want to consume information from those websites, web scraping is your only choice.

What is web scraping?

Web scraping is a technique in which a program mimics a human browsing a website. To scrape a website from your programs, you need tools to:

  • Make HTTP requests to websites
  • Parse the HTTP response and extract content
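As a minimal sketch of the second step, here is lxml repairing and querying a deliberately broken HTML fragment. The markup is a local string here; in practice it would be the body of an HTTP response:

```python
# Sketch: lxml turns broken HTML into a proper tree we can query.
from lxml import html

# Deliberately malformed markup: unclosed <li> tags, no closing </ul>.
broken = "<ul><li>Alpha<li>Beta<li>Gamma"

# lxml's HTML parser tolerates the breakage and builds a valid tree.
tree = html.fromstring(broken)

# Extract the text of every list item in document order.
items = [li.text_content() for li in tree.findall(".//li")]
print(items)  # ['Alpha', 'Beta', 'Gamma']
```

This forgiveness toward malformed markup is exactly why lxml pairs so well with urllib for scraping: real-world pages are rarely valid XML, but once lxml has built the tree, you can query it with ordinary XPath or ElementTree-style searches.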