Using Cookie Jar With urllib2

A while ago, we discussed how to scrape information from websites that don't offer information in a structured format like XML or JSON. We noted that urllib and lxml are indispensable tools in web scraping. While urllib enables us to connect to websites and retrieve information, lxml helps convert HTML, broken or not, to valid XML and parse it. In this post, I will demonstrate how to retrieve information from web pages that require a login session.

I have created a sample website for this task - http://toscrape.techchorus.net

The website has a page that requires a login session - http://toscrape.techchorus.net/only_authenticated.php

To log in to the sample website, use the credentials:
username: admin
password: password

If you visit http://toscrape.techchorus.net/, you will notice that the server sends the response with these headers:

Date: Tue, 19 Oct 2010 17:33:43 GMT
Server: Apache mod_fcgid/2.3.5 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635
X-Powered-By: PHP/5.2.14
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: PHPSESSID=c84456d65e5b9da95b09abd4092f860b; path=/
Location: /login.php
Content-Length: 0
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/html

In order to maintain the session, you have to send the cookie PHPSESSID with the value c84456d65e5b9da95b09abd4092f860b in all subsequent requests. Of course, the value varies for each user. If the server resets the value of the cookie in a subsequent response, you have to send the updated cookie value in further requests.

The Python standard library offers the cookielib module to manage cookies on the client side. We can use it as a cookie jar: in essence, cookielib provides a container to hold cookies, and urllib2 uses this container when making requests.

Let's start writing the code.

Import the libraries

import urllib2
import urllib
from cookielib import CookieJar

Create the cookie jar and the opener

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

You can register handler instances with urllib2 when building an opener. In the example, we register urllib2.HTTPCookieProcessor as the handler. Notice the cookie jar instance, cj, being passed to the HTTPCookieProcessor constructor. At this point, our cookie jar is configured.

In order to initiate a login session, you have to send the username and password POST parameters to http://toscrape.techchorus.net/do_login.php. If the POST parameters are correct, the server internally maintains a login session.

values = {'username': 'admin', 'password': 'password'}
data = urllib.urlencode(values)
response = opener.open("http://toscrape.techchorus.net/do_login.php", data)

Print the response and view the output.

print response.read()

The output on my computer:

Hello, admin. Welcome to scrape tester.
Logout

Now open http://toscrape.techchorus.net/only_authenticated.php and print the response.

response2 = opener.open('http://toscrape.techchorus.net/only_authenticated.php')
print response2.read()

The output:

Hello, authenticated user.

Don't you think the batteries are included?

I'm posting the complete program so that you can copy it in one shot.

import urllib2
import urllib
from cookielib import CookieJar

# Cookie jar to hold session cookies across requests
cj = CookieJar()

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))

# Log in; the server's session cookie lands in the jar
values = {'username': 'admin', 'password': 'password'}
data = urllib.urlencode(values)
response = opener.open("http://toscrape.techchorus.net/do_login.php", data)
print response.read()

# The jar sends the session cookie automatically on this request
response2 = opener.open('http://toscrape.techchorus.net/only_authenticated.php')
print response2.read()

The source code is available as a gist.
