Today we'll see what this Web Scraping thing is all about. We'll also learn the HTTP protocol, but I promise I'll make it more hands-on rather than all the jargon you can read online anyway :)
So what is this HTTP thing?
Simply put, it's how our computers (clients) talk to big computers (servers) and get the cool stuff done for us.
So when you go to a wiki and open a link, HTTP requests are made internally, and that's what gets the page onto your screen via the browser.
Let's check how it's done.
Go to this page and it loads up a wiki page. But what really happened? Here's a sample HTTP request that was made...
----------Request From Client to Server----------
GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://en.wikipedia.org/wiki/Python
Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O; mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore; mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow
----------End of Request From Client to Server----------
----------Response From Server to Client----------
HTTP/1.0 200 OK
Date: Mon, 10 Oct 2011 12:44:46 GMT
Server: Apache
X-Content-Type-Options: nosniff
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT
Content-Encoding: gzip
Content-Length: 47407
Content-Type: text/html; charset=UTF-8
Age: 10932
X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org
X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80
Connection: keep-alive
----------End of Response From Server to Client----------
Let's walk through the important parts of the request and the response.
HTTP has several kinds of requests, one of which is the GET request. That's why we had this line:
GET /wiki/Python_(programming_language) HTTP/1.1
It says "Oye you Big Computer (Server) GET me this python page whose address is /wiki/Python_(programming_language) and I am talking in language HTTP whose version is 1.1 and the website that i want is mentioned in Host Parameter (Below)"
Host: en.wikipedia.org
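By the way, this whole conversation is just plain text. Here's a minimal sketch that speaks HTTP by hand over a raw socket, sending only the request line and the Host header (plus Connection: close, which tells the server we're done after one response):
-----------raw_http_sketch.py---------------
#!/usr/bin/env python
import socket
#Open a plain TCP connection to the web server on port 80
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.connect(("en.wikipedia.org", 80))
#Every line ends with \r\n, and a blank line marks the end of the request
sock.sendall("GET /wiki/Python_(programming_language) HTTP/1.1\r\n"
             "Host: en.wikipedia.org\r\n"
             "Connection: close\r\n"
             "\r\n")
#Print the first chunk of the reply: the status line and the headers
print sock.recv(4096)
sock.close()
-----------End of raw_http_sketch.py---------------
Don't worry about sockets for now; the point is just that there's no magic, only text going back and forth.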
Ever wondered how websites know what browser you are using, what operating system you are using....? When your browser makes a request, it attaches what are called headers (even the Host parameter is part of the headers). One of these headers, User-Agent, tells the server where the request is coming from. So here I am using Linux with Firefox version 7.0.1 (keeping it simple here):
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
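And here's the fun part: when we make requests from Python ourselves, we can set this header to whatever we like. A small sketch (the headers dict is our own invention here; the rest uses the same httplib module as the script below):
-----------custom_User_Agent_sketch.py---------------
#!/usr/bin/env python
import httplib
#Our own headers dict; httplib sends these along with the request
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1"}
connection = httplib.HTTPConnection("en.wikipedia.org", 80)
connection.request("GET", "/wiki/Python_(programming_language)", headers=headers)
resp = connection.getresponse()
print resp.status, resp.reason
connection.close()
-----------End of custom_User_Agent_sketch.py---------------
That's how scrapers often pretend to be a regular browser.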
I'll cover the other parts of the request later (cookies and stuff). In response we get:
HTTP/1.0 200 OK
Server: Apache
The above simply means, "Hey... nice to see you again... everything is fine :), I have done my job, and by the way, I am Apache."
(Ignore that last bit if you don't know what Apache is.)
So let's fire this up in Python and see how we can do this on our own....
Note: any line (except the first line) starting with a # is a comment.
-----------http_Get_Request_1.py---------------
#!/usr/bin/env python
#Python library for making HTTP connections/Requests
import httplib
#make connection with the host using http protocol with port 80
connection = httplib.HTTPConnection("en.wikipedia.org",80)
#make a GET request for the resource mentioned
connection.request("GET", "/wiki/Hello_world")
#Get the response and save it in resp
resp = connection.getresponse()
#Print the response status and its textual description
print resp.status, resp.reason
#Save the data
data = resp.read()
#write the html data to file
page = open("file.html", "w")
page.write(data)
page.close()
#Close the HTTP Connection
connection.close()
-----------End of http_Get_Request_1.py---------------
If you run the above code, you should get something like this:
mankaj $ python http_Get_Request_1.py
200 OK
mankaj $ ls
file.html http_Get_Request_1.py
mankaj $
You can see that you got a new file called file.html. If you open that file, you can see the web page. Notice that the pictures are missing. Does that give you any clue about how requests are made internally?? Any ideas???
Not all the requests are made in one go! The browser first fetches the skeleton of the web page, and as it encounters new links, it makes a separate request for each picture. You might have known this, but now you have seen it yourself :)
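Want proof? We can play the browser's role and fetch one picture with its own separate request. Here's a minimal sketch; the image path below is just a hypothetical example, so substitute the path of any picture you spot in file.html:
-----------fetch_One_Picture.py---------------
#!/usr/bin/env python
import httplib
#When a browser sees an <img> tag, it fires off a request like this one
connection = httplib.HTTPConnection("en.wikipedia.org", 80)
#Hypothetical image path; replace it with a real one from file.html
connection.request("GET", "/static/images/project-logos/enwiki.png")
resp = connection.getresponse()
print resp.status, resp.reason
#The Content-Type header tells us what kind of data came back
print resp.getheader("Content-Type")
#Pictures are binary data, so write the file in "wb" mode
picture = open("picture.png", "wb")
picture.write(resp.read())
picture.close()
#Close the HTTP Connection
connection.close()
-----------End of fetch_One_Picture.py---------------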
Let's wrap it up with one thing you might be wondering: how the hell did I see that HTTP request and response?? :)
There is something called Live HTTP Headers. It's a Firefox plugin which allows you to see what is happening internally. (Other browsers might have something like this; just google for it.) Just download it and restart your Firefox browser. Go to that wiki page, then go to Tools > Live HTTP Headers and reload the page. You can see all the requests that were made. Scroll to the top and you can see the first request, followed by the internal requests made to load the whole page. Don't get confused. It's just to give you a taste of what is about to come in the next set of tutorials :)
Have Fun!! :)
Update : Read Part 2 here
References :-
- Atul Alex Cherian's script (which I modified and used)
- Siddhant's PyCon slides
- HTTP: The Definitive Guide (an amazing book)
- Python for Newbies video series (good if you don't know Python at all!)