Fire's of May: HTTP, Web Scraping and Python

Today we'll see what web scraping is. We'll also learn the HTTP protocol, but I promise I'll keep it hands-on rather than all jargon, which you can read online anyway :)

So what is this HTTP thing?

Simply put, it's how our computers (clients) talk to big computers (servers) and get the cool stuff done for us.

So when you go to Wikipedia and open a link, HTTP requests are made behind the scenes, and that is what gets the page onto your screen via the browser.

Let's check how it's done.
Go to this page and it loads up a wiki page. But what really happened? Here's a sample HTTP request that was made...

----------Request From Client to Server----------

GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://en.wikipedia.org/wiki/Python
Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O; mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore; mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow

----------End of Request From Client to Server----------

----------Response From Server to Client----------

HTTP/1.0 200 OK
Date: Mon, 10 Oct 2011 12:44:46 GMT
Server: Apache
X-Content-Type-Options: nosniff
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT
Content-Encoding: gzip
Content-Length: 47407
Content-Type: text/html; charset=UTF-8
Age: 10932
X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org
X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80
Connection: keep-alive

----------End of Response From Server to Client----------

Let's walk through the important parts of the request and response.

HTTP has several kinds of requests, one of which is the GET request.

That's why we had this line:
GET /wiki/Python_(programming_language) HTTP/1.1

It says, "Oye, you big computer (server), GET me this Python page whose address is /wiki/Python_(programming_language). I am talking in the language HTTP, version 1.1, and the website I want is mentioned in the Host parameter (below)."
Host: en.wikipedia.org
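
To make the shape of those two lines concrete, here is a small sketch (not how a browser does it internally) that builds the minimal request text by hand and then splits the first line back into its three parts. build_request is a made-up helper name, used only for this illustration.

```python
def build_request(method, path, host):
    """Build the text of a minimal HTTP/1.1 request (hypothetical helper)."""
    lines = [
        "{} {} HTTP/1.1".format(method, path),  # the request line
        "Host: {}".format(host),                # which website we want
        "",                                     # blank line: end of headers
        "",
    ]
    # HTTP separates lines with carriage return + line feed.
    return "\r\n".join(lines)


raw = build_request("GET", "/wiki/Python_(programming_language)", "en.wikipedia.org")

# The first line always has three parts: method, path and protocol version.
request_line = raw.split("\r\n")[0]
method, path, version = request_line.split(" ")

print(method)
print(path)
print(version)
```

The blank line at the end matters: it is how the server knows the headers are finished.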

Have you ever wondered how websites know what browser and operating system you are using? When your browser makes a request, it adds what are called headers to the request (even the Host parameter is a header). One of these headers, User-Agent, specifies where the request is coming from. Here I am using Linux with Firefox version 7.0.1 (keeping it simple here).

User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
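
Here is a rough sketch of what a website could do with those headers on its side: turn the header lines into a dictionary and peek at User-Agent. The parse_headers helper is my own name for illustration, and the header lines are copied from the sample request above. Real servers use more careful parsing than this.

```python
raw_headers = """Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept-Language: en-us,en;q=0.5
Connection: keep-alive"""


def parse_headers(text):
    """Turn 'Name: value' lines into a dict (hypothetical helper)."""
    headers = {}
    for line in text.splitlines():
        # Split on the first colon only; values may contain colons too.
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return headers


headers = parse_headers(raw_headers)
user_agent = headers["user-agent"]

print(user_agent)
print("Linux" in user_agent)
print("Firefox/7.0.1" in user_agent)
```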

I'll cover the other parts of the request (cookies and stuff) later. In the response we get

HTTP/1.0 200 OK
Server: Apache

The above simply means "Hey... nice to see you again... everything is fine :), I have done my job and btw I am Apache"

(Ignore this if you don't know what Apache is.)
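
The first line of the response can be pulled apart the same way as the request line. This is a minimal sketch assuming Python 3; parse_status_line is a made-up helper, while HTTPStatus comes from Python's standard library and knows the usual text for each status code.

```python
from http import HTTPStatus


def parse_status_line(line):
    """Split a status line into (version, code, reason) (hypothetical helper)."""
    # Split at most twice, because the reason can contain spaces ("Not Found").
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason


version, code, reason = parse_status_line("HTTP/1.0 200 OK")

print(version)
print(code)
print(reason)

# The standard library's usual phrase for this status code.
print(HTTPStatus(code).phrase)

# Codes from 200 to 299 mean the request succeeded.
print(200 <= code < 300)
```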

So let's fire this up in Python and see how we can do this on our own...

Note: anything on a line after a # is a comment (except the first line, which tells the system to run the script with Python).

-----------http_Get_Request_1.py---------------   

#!/usr/bin/env python
# Python 2 library for making HTTP connections/requests
import httplib

# Make a connection to the host over HTTP on port 80
connection = httplib.HTTPConnection("en.wikipedia.org", 80)

# Make a GET request for the resource mentioned
connection.request("GET", "/wiki/Hello_world")

# Get the response and save it in resp
resp = connection.getresponse()

# Print the status code and its textual description
print resp.status, resp.reason

# Read the response body
data = resp.read()

# Write the HTML data to a file
page = open("file.html", "w")
page.write(data)
page.close()

# Close the HTTP connection
connection.close()

-------End of http_Get_Request_1.py---------------

If you run the above code you should get something like this

mankaj $ python http_Get_Request_1.py
200 OK
mankaj $ ls
file.html http_Get_Request_1.py
mankaj $

You can see that you got a new file called file.html. If you open that file you can see the web page. Notice that the pictures are missing. Does that give you any clue about how requests are made internally? Any ideas?

The requests are not all made in one go! The browser first fetches the skeleton of the web page, and as it encounters new links, it makes a separate request for each picture. You might have known this, but now you saw it yourself :)
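
To get a feel for those extra requests, here is a small sketch (assuming Python 3) that reads a page's HTML and lists every picture a browser would have to fetch separately. The sample HTML and the ImageCollector class are invented for the illustration; a real page like file.html would have many more tags.

```python
from html.parser import HTMLParser


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


sample_html = """<html><body>
<h1>Hello world</h1>
<img src="/images/logo.png" alt="logo">
<p>Some text</p>
<img src="/images/screenshot.png">
</body></html>"""

collector = ImageCollector()
collector.feed(sample_html)

# Each of these would be one more GET request made by the browser.
for src in collector.sources:
    print("GET " + src)
```

To try it on the saved page, you could read file.html and pass its contents to collector.feed() instead of sample_html.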

Let's wrap up with one thing you might be wondering: how the hell did I see that HTTP request and response? :)

There is a Firefox plugin called Live HTTP Headers which lets you see what is happening internally. (Other browsers might have something similar, just google for it.) Download it and restart Firefox. Go to that wiki page, then go to Tools > Live HTTP Headers and reload the page. You can see all the requests made. Scroll to the top and you'll see the first request, followed by the later internal requests made to load the rest of the page. Don't get confused. It's just to give you a taste of what is about to come in the next set of tutorials :)
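
If you'd rather not install a plugin, Python can show you roughly the same thing. The sketch below assumes Python 3 (where httplib is named http.client). To keep it runnable without touching the internet, it talks to a tiny throwaway server on your own machine instead of Wikipedia. HelloHandler and the my-python-script User-Agent value are made up for the illustration.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for a real web server."""

    def do_GET(self):
        body = b"<html><body><h1>Hello world</h1></body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=UTF-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence the server's own request log


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

connection = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=10)
# Debug level 1 makes http.client print the request it sends and the reply headers.
connection.set_debuglevel(1)
connection.request("GET", "/wiki/Hello_world", headers={"User-Agent": "my-python-script"})
resp = connection.getresponse()

# The status line, then every response header the server sent back.
print(resp.status, resp.reason)
for name, value in resp.getheaders():
    print(name + ":", value)

content_type = resp.getheader("Content-Type")
data = resp.read()

connection.close()
server.shutdown()
server.server_close()
```

Pointing HTTPConnection at a real host and port instead of the local server should show that site's headers in the same way.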

Have Fun!! :)

Update: Read Part 2 here

References:

Atul Alex Cherian's Script (which I modified and used)
Siddhant's PyCon Slides
HTTP Definitive Guide (Amazing Book)
Python For Newbies Video Series (Good if you don't know python at all!!)


Monday, October 10, 2011

HTTP, Web Scraping and Python - Part 1
