Web Scraping With lxml
More and more websites offer APIs nowadays. Previously, we've talked about XML-RPC and REST. Even though web services are growing rapidly, many websites still publish information in an unstructured format, government websites especially. If you want to consume information from those websites, web scraping is your only choice.
What is web scraping?
Web scraping is a technique in which a program mimics a human browsing a website. To scrape a website from your programs, you need tools to:
- Make HTTP requests to websites
- Parse the HTTP response and extract content
Making HTTP requests is a snap with urllib, a Python standard library module. Once you have the raw HTML returned by the website, you need an efficient technique to extract content from it.
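As a rough sketch of the first step, the helper below fetches a page body with the standard library. It assumes Python 3, where the function lives in urllib.request (the script in this article uses the Python 2 spelling, urllib.urlopen). The name fetch_html and the 10-second timeout are illustrative choices, not part of the original script.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Return the raw response body of `url` as bytes (hypothetical helper)."""
    # urlopen returns a file-like response object; read() yields the body as bytes.
    with urlopen(url, timeout=timeout) as response:
        return response.read()
```

Calling fetch_html with the address of the page you want to scrape gives you the raw bytes to hand to a parser.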
Many programmers immediately think of regular expressions when it comes to extracting information from text documents, but there are better tools at your disposal. Enter lxml. With lxml you can transform an HTML document into an XML document. After all, an XHTML document is an XML document.
As we all know, web authors seldom care about standards-compliant HTML, and the majority of websites have broken HTML. We have to deal with it. But hey, lxml is cool with it. Even if you supply a broken HTML document, lxml's HTML parser can transform it into a well-formed XML document.
Regular expressions are still useful in web scraping, though. You can use them in conjunction with lxml, specifically when you're dealing with text nodes.
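To illustrate the parser's tolerance for broken markup, here is a minimal sketch. The HTML fragment is made up for the example; it leaves several tags unclosed on purpose.

```python
from lxml import etree

# Deliberately broken markup: <b>, <p> and <body> are never closed.
broken_html = "<body><p>Hello <b>world"

parser = etree.HTMLParser()
root = etree.fromstring(broken_html, parser)

# Serialising the tree shows the repaired markup with every tag closed.
repaired = etree.tostring(root, encoding="unicode")
print(repaired)
```

The parser supplies the missing closing tags and wraps the fragment in an html element, so the result can be queried like any other XML tree.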
What should you know before you start?
- A little bit of Python
Let's write a Python script to put our newfound skills into practice.
The government of India has a web page that lists the honourable members of the parliament. The goal of this exercise is to scrape that web page and extract the names of the members of the parliament.
The web page in question is http://18.104.22.168/LssNew/Members/Alphabaticallist.aspx
Without further ado, let's begin coding.
import urllib
from lxml import etree
import StringIO
We can grab the web page using the urllib module. lxml.etree has the required parser objects.
result = urllib.urlopen("http://22.214.171.124/LssNew/Members/Alphabaticallist.aspx")
html = result.read()
At this point, we have the raw HTML in the html variable.
parser = etree.HTMLParser()
tree = etree.parse(StringIO.StringIO(html), parser)
We create the HTML parser object and then pass it to etree.parse. In other words, we tell etree.parse to use the HTML parser object. Since etree.parse expects a file or file-like object, we wrap the HTML string in a file-like object using StringIO.StringIO.
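The script above is written for Python 2. As a hedged sketch of the same parsing step on Python 3, where the response body is bytes, io.BytesIO can play the role of StringIO.StringIO. The markup below is a made-up stand-in for the downloaded page.

```python
import io

from lxml import etree

# Stand-in for the bytes returned by result.read(); this markup is invented.
html = b"<html><body><h1>Members</h1></body></html>"

parser = etree.HTMLParser()
# etree.parse expects a filename or a file-like object, so wrap the bytes.
tree = etree.parse(io.BytesIO(html), parser)

print(tree.getroot().tag)
```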
Now, take a look at the source of the document.
The information we want is in the table whose id is "ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1".
Let's begin constructing the XPath expression to drill down the document to the parts we care about.
//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']
The above XPath expression grabs the table node having the id "ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1" irrespective of its location in the document.
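A small sketch of why the leading // matters: it matches the table wherever it sits in the tree. The page and the id "members" below are invented for the example and are much simpler than the real document.

```python
from lxml import etree

# Made-up page: the target table is nested inside two <div> elements.
sample = """
<html><body>
  <table id="other"><tr><td>ignore me</td></tr></table>
  <div><div>
    <table id="members"><tr><td>keep me</td></tr></table>
  </div></div>
</body></html>
"""

root = etree.fromstring(sample, etree.HTMLParser())

# "//" searches the whole document; the predicate keeps only the matching id.
tables = root.xpath("//table[@id='members']")
print(len(tables), tables[0].get("id"))
```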
The first row, <tr>, is not required since it contains the table heading. Let's grab all the rows of the table element except the first row.
//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]
In each table row, the name of the member of the parliament is contained in the second cell, <td>.
Filter the XPath expression to return only the second cell of each row.
//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]/td[position()=2]
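Within our target cell node, the name of the member of the parliament is contained in the anchor, <a>, element.
Further refine the XPath expression to grab the text nodes.
//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]/td[position()=2]/a/child::text()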
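The successive refinements can be tried out on a small table. This is only a sketch: the table, its id "members", and the names in it are invented, and it assumes, as the expressions in this article do, that the rows are direct children of the table element.

```python
from lxml import etree

# Invented table with a heading row and two data rows.
sample = """
<table id="members">
  <tr><th>No.</th><th>Name</th><th>Party</th></tr>
  <tr><td>1</td><td><a href="#">Doe,Shri John</a></td><td>X</td></tr>
  <tr><td>2</td><td><a href="#">Roe,Dr. Jane</a></td><td>Y</td></tr>
</table>
"""
root = etree.fromstring(sample, etree.HTMLParser())

table = "//table[@id='members']"

# Step 1: every row except the heading row.
rows = root.xpath(table + "/tr[position()>1]")

# Step 2: only the second cell of each of those rows.
cells = root.xpath(table + "/tr[position()>1]/td[position()=2]")

# Step 3: the text nodes inside the anchor of each cell.
names = root.xpath(table + "/tr[position()>1]/td[position()=2]/a/child::text()")

print(len(rows), len(cells), names)
```

Each step only appends to the previous expression, which makes it easy to check the intermediate results while building up the final query.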
Apply the XPath expression to our tree.
xpath = "//table[@id='ctl00_ContPlaceHolderMain_Alphabaticallist1_dg1']/tr[position()>1]/td[position()=2]/a/child::text()"
filtered_html = tree.xpath(xpath)
That's all we need to do to grab the names of the members of the parliament.
The filtered_html variable is a Python list. The elements of the list are the names of the members of the parliament.
Try it and see for yourself.
Here's the sample output:
['Aaroon Rasheed,Shri J.M.', 'Abdul Rahman,Shri ', 'Abdullah,Dr. Farooq', 'Acharia,Shri Basudeb', 'Adhalrao Patil,Shri Shivaji', 'Adhi Sankar,Shri ', 'Adhikari ,Shri Sisir Kumar', ...]
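Some entries in the sample output carry stray spaces around the comma or at the end. This is one place where a regular expression applied to the text nodes can help. The sketch below is an optional clean-up step, not part of the original script; the function name tidy is hypothetical.

```python
import re

# A few entries copied from the sample output above.
names = ['Abdul Rahman,Shri ', 'Adhikari ,Shri Sisir Kumar', 'Abdullah,Dr. Farooq']


def tidy(name):
    # Normalise whitespace around the comma, then trim both ends.
    return re.sub(r"\s*,\s*", ", ", name).strip()


cleaned = [tidy(n) for n in names]
print(cleaned)
```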
If the web page has moved or its contents have changed by the time you read this document, refer to the attached HTML document.
The complete script is posted as a gist.