Web Scraping With lxml

Attachment: members-of-the-parliment.html (455.22 KB)

More and more websites are offering APIs nowadays. Previously, we've talked about XML-RPC and REST. Even though web services are growing rapidly, there are still a lot of websites out there that offer information only in an unstructured format, especially government websites. If you want to consume information from those websites, web scraping is your only choice.

What is web scraping?

Web scraping is a technique in which a program mimics a human browsing a website. In order to scrape a website from your programs, you need tools to

  • Make HTTP requests to websites
  • Parse the HTTP response and extract content

Making HTTP requests is a snap with urllib, a Python standard library module. Once you have the raw HTML returned by the website, you need an efficient technique to extract content.

Many programmers immediately think of regular expressions when talking about extracting information from text documents. But there are better tools at your disposal. Enter lxml. Using tools like lxml, you can transform an HTML document into an XML document. After all, an XHTML document is an XML document. As we all know, web authors seldom care about standards-compliant HTML; the majority of websites have broken HTML. We have to deal with it. But hey, lxml is cool with it. Even if you supply a broken HTML document, lxml's HTML parser can transform it into a valid XML document. However, regular expressions are still useful in web scraping. You can use them in conjunction with lxml, specifically when you're dealing with text nodes.
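To illustrate, here's a minimal sketch (using a made-up snippet of invalid HTML) of lxml's HTML parser closing the tags the author forgot. It isn't part of the exercise below, just a quick demonstration:

import StringIO
from lxml import etree

# Deliberately broken markup: unclosed <b> and <p> tags.
broken_html = "<html><body><p>Hello <b>world<p>Another paragraph"

parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(broken_html), parser)

# lxml closes the dangling tags and hands us a well-formed tree.
print etree.tostring(tree.getroot(), pretty_print=True)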

What should you know before you start?

  • XML
  • XPath
  • A little bit of Python

W3Schools.com has good tutorials on these subjects. Head over to the XML tutorial and the XPath tutorial to brush up on your knowledge.

Let's write a Python script to put our newfound skills into practice.

The government of India has a web page where it lists the honourable members of the parliament. The goal of this exercise is to scrape that web page and extract the list of names of the members of parliament.

The web page in question is http://164.100.47.132/LssNew/Members/Alphabaticallist.aspx

Without further ado, let's begin coding.

import urllib
from lxml import etree
import StringIO

We can grab the web page using the urllib module. lxml.etree has the required parser objects.

result = urllib.urlopen("http://164.100.47.132/LssNew/Members/Alphabaticallist.aspx")
html = result.read()

At this point, we have the raw HTML in the html variable.

parser = etree.HTMLParser()
tree   = etree.parse(StringIO.StringIO(html), parser)

We create the HTML parser object and pass it to etree.parse; in other words, we tell etree.parse to use the HTML parser. We also wrap the raw HTML string in a file-like object using StringIO.StringIO.
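If you're curious what the cleaned-up document looks like, you can serialize the tree back to text. This is just a sanity check, not part of the scraping itself:

# Optional sanity check: print the first few hundred characters of the
# cleaned-up document that lxml built from the raw HTML.
print etree.tostring(tree.getroot(), pretty_print=True)[:300]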

Now, take a look at the source of the document.

The information we want is in the table whose id is "ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1".

Let's begin constructing the XPath expression to drill down the document to those parts we care about.

//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']

The above XPath expression grabs the table node having the id "ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1" irrespective of its location in the document.
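You can run each partial expression against the tree as you build it up, to make sure it matches what you expect. For example, assuming the page structure hasn't changed, this intermediate check should report exactly one table:

# Intermediate check: the expression should match exactly one table node.
tables = tree.xpath("//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']")
print len(tables)   # expected: 1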

The first row, <tr>, is not required since it contains the table heading. Let's grab all the rows of the table element except the first row.

//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]
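Again, a quick check helps; the number printed below should equal the number of members listed on the page, assuming one member per row:

# Every row except the heading row; one row per member.
rows = tree.xpath("//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]")
print len(rows)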

In each table row, the name of the member of parliament is contained in the second cell, <td>.

Filter the XPath expression to return only the second cell of each row.

//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]/td[position()=2]

Within our target cell node, the name of the member of parliament is contained in the anchor, <a>, element.

Further refine the XPath expression to grab the text nodes.

//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]/td[position()=2]/a/child::text()

Apply the XPath expression to our tree.

xpath = "//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]/td[position()=2]/a/child::text()"
filtered_html = tree.xpath(xpath)

That's all we need to do to grab the names of the members of parliament.

The filtered_html variable is a Python list. The elements of the list are the names of the members of parliament.

Try it and see for yourself.

print filtered_html

Here's the sample output:

['Aaroon Rasheed,Shri J.M.', 'Abdul Rahman,Shri ', 'Abdullah,Dr. Farooq', 'Acharia,Shri Basudeb', 'Adhalrao Patil,Shri Shivaji', 'Adhi Sankar,Shri ', 'Adhikari ,Shri Sisir Kumar', ...]
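As mentioned earlier, regular expressions come in handy once you're down to text nodes. Here's a small sketch of post-processing the scraped names; it simply normalises stray whitespace (note 'Adhikari ,Shri Sisir Kumar' above) and splits the surname from the rest, assuming the names keep their comma-separated layout:

import re

for name in filtered_html:
    # Collapse whitespace around the comma, e.g. 'Adhikari ,Shri Sisir Kumar'.
    cleaned = re.sub(r'\s*,\s*', ',', name.strip())
    surname, _, rest = cleaned.partition(',')
    print '%s -- %s' % (surname, rest)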

If the web page has moved or its contents have been altered by the time you read this, refer to the attached HTML document.

The complete script is posted as a gist.


Comments

Try Vietspider - an excellent web scraping software from Vietspider

This:

result = urllib.urlopen("http://164.100.47.132/LssNew/Members/Alphabaticallist.aspx")
html = result.read()
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)

is a rather unwieldy way of spelling this:


parser = etree.HTMLParser()
tree = etree.parse("http://164.100.47.132/LssNew/Members/Alphabaticallist.aspx", parser)

Stefan

Thanks for the comment. I knew someone would point it out sooner or later :)

If you are scraping only publicly available pages, this short and sweet method works very well.

Good one!

I would like to know about the memory utilization. For example, in BeautifulSoup, we can use SoupStrainer to load only part of the HTML into memory and parse it.


$ wget -O- http://164.100.47.132/LssNew/Members/Alphabaticallist.aspx 2>/dev/null | sed -nr 's/^.*mpsno=[0-9]+">([^>]*)<.*$/\1/p' | head
Aaroon Rasheed,Shri J.M.
Abdul Rahman,Shri
Abdullah,Dr. Farooq
Acharia,Shri Basudeb
Adhalrao Patil,Shri Shivaji
Adhi Sankar,Shri
Adhikari ,Shri Sisir Kumar
Adhikari,Shri Suvendu
Adityanath ,Shri Yogi
Adsul,Shri Anandrao
$

Jeff Atwood calls it Parsing Html The Cthulhu Way. I agree with him.

http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html
