
Tuesday, October 11, 2011

HTTP, Web Scraping and Python - Part 2

In Part 1, I talked about User Agents. Today we'll try to see whether what I said is actually true. That is: do servers really see that User Agent value? Do they really identify you by it?


Last time we proposed this as a hypothesis; today we'll see whether it's a fact or not ;)

We'll do a little experiment. For this you'll need Firefox, so switch to Firefox if you haven't already.

Now, go to this URL. It's an add-on which allows you to switch User Agents. Just download it and restart Firefox.

Oye. Stop. Go and download that plugin before you move on! Such a lazy person you are! :) Just kidding ;) (but you'll really learn a lot more if you do this)


Now go to
Tools > Default User Agent > Edit User Agents.
Select New > New User Agent.


Now fill in random crap in each text field. Yes, you heard me. Fill it all with crap!! Utter nonsense, or write poetry. At least change the User Agent field.


Okay. Done? Click OK.
Now go again to
Tools > Default User Agent > and select the user agent you made.


Okay, done?


Now go back to the same add-ons page. Notice anything at the top? It says :- "To try the thousands of add-ons available here, download Mozilla Firefox, a fast, free way to surf the Web!"

Huh?? Download Firefox? But I am in Firefox!!! :)
Okay, cool.


Now go again to
Tools > {Your User Agent Name} > and select the default user agent.


Now reload the page again. Bam! That banner is gone!!


This shows that the value of the User-Agent string does matter, and that servers do read that value to identify who you are! Why this is important to know, you'll see in the later parts of the series...

Hence, Proved! :)
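
If you'd rather prove it from code, here's a minimal Python sketch using the httplib module from Part 1 (below). The addons.mozilla.org front page is just my assumption for the test target, and httplib doesn't follow redirects, so you might see a 3xx status instead of the banner page:

-----------http_UA_Experiment_1.py---------------
#!/usr/bin/env python

#Fetch the same page twice with two different User-Agent values
import httplib

def fetch(user_agent):
    connection = httplib.HTTPConnection("addons.mozilla.org", 80)
    #Send our own User-Agent header instead of the browser's default
    connection.request("GET", "/", headers={"User-Agent": user_agent})
    resp = connection.getresponse()
    body = resp.read()
    connection.close()
    return resp.status, body

firefox = fetch("Mozilla/5.0 (X11; Linux i686; rv:7.0.1) "
                "Gecko/20100101 Firefox/7.0.1")
nonsense = fetch("utter-nonsense-poetry/1.0")

#If the server keys anything on the User-Agent, the answers differ
print "Firefox UA :", firefox[0], len(firefox[1]), "bytes"
print "Nonsense UA:", nonsense[0], len(nonsense[1]), "bytes"
print "Same response for both?", firefox[1] == nonsense[1]
-------End of http_UA_Experiment_1.py---------------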

PS - This idea came from a problem I had. For the past few months, my Firefox was not being recognized by Gmail, and it always used to go to the default basic mode rather than the standard mode. I tried googling for it, but I didn't understand what was going on, or even what to search for! It was a weird problem! I knew about User Agents, but it was all theory!! :) I had never played with them in practice, so it never clicked that my User Agent might be the problem!

While writing Part 1, I noticed that my User-Agent header value was some rubbish, and it just struck me!! I set it back to the default and everything was fine again :)

Moral of this post - Don't just believe what you read anywhere, no matter who says it; try it out on your own and really test it before you believe it! Ask questions! Challenge what you learn! Learn the same thing in different ways, and you'll get awesome every day :)

Have Fun! :)


Monday, October 10, 2011

HTTP, Web Scraping and Python - Part 1

Today we'll see what this Web Scraping thing is. We'll also learn the HTTP protocol, but I promise I'll make it more hands-on rather than all jargon, which you can read online anyway :)

So what is this HTTP thing?

Simply put, it's how our computers (clients) talk to big computers (servers) and get the cool stuff done for us.

So when you go to Wikipedia and open a link, HTTP requests are made internally, and that's what gets the page to your screen via the browser.


Let's check how it's done.
Go to this page and it loads up a wiki page. But what really happened? Here's a sample of the HTTP request that was made...

----------Request From Client to Server----------
GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://en.wikipedia.org/wiki/Python
Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O; mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore; mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow
----------End of Request From Client to Server----------

----------Response From Server to Client----------
HTTP/1.0 200 OK
Date: Mon, 10 Oct 2011 12:44:46 GMT
Server: Apache
X-Content-Type-Options: nosniff
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT
Content-Encoding: gzip
Content-Length: 47407
Content-Type: text/html; charset=UTF-8
Age: 10932
X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org
X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80
Connection: keep-alive
----------End of Response From Server to Client----------

Let's go through the important parts of the request and response.

HTTP has several kinds of requests, one of which is the GET request.
That's why we had this line :-
GET /wiki/Python_(programming_language) HTTP/1.1




It says, "Oye, you Big Computer (Server)! GET me this Python page whose address is /wiki/Python_(programming_language). I am talking in the language HTTP, version 1.1, and the website I want is mentioned in the Host header (below)."
Host: en.wikipedia.org
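
Just to show there's no magic here, you can speak that exact line to the server yourself over a plain socket. Here's a minimal sketch in Python (assuming the server still answers plain HTTP on port 80):

-----------http_Raw_Socket_1.py---------------
#!/usr/bin/env python

#Plain TCP socket; we'll write the HTTP request by hand
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("en.wikipedia.org", 80))

#The request line, the Host header, and a blank line that says
#"that's all, server!" (each line ends with \r\n)
s.sendall("GET /wiki/Python_(programming_language) HTTP/1.1\r\n"
          "Host: en.wikipedia.org\r\n"
          "Connection: close\r\n"
          "\r\n")

#Read the start of the reply and print just the status line
response = s.recv(4096)
print response.split("\r\n")[0]
s.close()
-------End of http_Raw_Socket_1.py---------------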

Ever wondered how websites know which browser you are using, or which operating system? When your browser makes a request, it adds what are called headers to the request (even the Host parameter is a header). One of these headers, User-Agent, specifies where the request is coming from. So here I am using Linux with Firefox version 7.0.1 (keeping it simple here):

User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 

I'll cover the other things in the request later (cookies and stuff). In the response we get

HTTP/1.0 200 OK
Server: Apache


The above simply means "Hey... nice to see you again... everything is fine :), I have done my job, and by the way, I am Apache."

(Ignore this if you don't know what Apache is.)

So let's fire this up in Python and see how we can do this on our own...

Note: anything in a line (except the first line) after a # is a comment.

-----------http_Get_Request_1.py---------------  
#!/usr/bin/env python

#Python library for making HTTP connections/requests
import httplib

#make connection with the host using http protocol with port 80
connection = httplib.HTTPConnection("en.wikipedia.org",80)

#make a GET request for the resource mentioned
connection.request("GET", "/wiki/Hello_world")

#Get the response and save it in resp
resp = connection.getresponse()

#Print the Response and see the textual description of it
print resp.status, resp.reason

#Save the data
data = resp.read()

#write the html data to file
page = open("file.html", "w")
page.write(data)
page.close()

#Close the HTTP Connection
connection.close()

-------End of http_Get_Request_1.py--------------- 



If you run the above code, you should get something like this:
 
mankaj $ python http_Get_Request_1.py
200 OK
mankaj $ ls
file.html  http_Get_Request_1.py
mankaj $
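
By the way, those response headers we looked at earlier (like Server: Apache) are available from Python too; httplib's getheaders() hands them over as a list of (name, value) pairs. A quick sketch:

-----------http_Response_Headers_1.py---------------
#!/usr/bin/env python

import httplib

connection = httplib.HTTPConnection("en.wikipedia.org", 80)
connection.request("GET", "/wiki/Hello_world")
resp = connection.getresponse()

#The same kind of lines we saw in the raw response above
print "HTTP status :", resp.status, resp.reason
for name, value in resp.getheaders():
    print name + ":", value

connection.close()
-------End of http_Response_Headers_1.py---------------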



You can see that you got a new file called file.html. If you open that file, you can see the web page. Notice that the pictures are missing. Does that give you any clue about how requests are made internally?? Any ideas???

Not all the requests are made in one go! The browser fetches the skeleton of the web page first, and as it encounters new links, it makes a separate request for each picture. You might have known this already, but now you've seen it yourself :)
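
You can even mimic that second round of requests yourself. Here's a rough sketch that digs the image links out of file.html and fetches the first few separately. A real browser parses the HTML properly; a regex is just enough for a taste, and the upload.wikimedia.org pattern is my assumption about where Wikipedia keeps its pictures:

-----------http_Image_Requests_1.py---------------
#!/usr/bin/env python

#Pull image links out of the saved page and fetch each one separately
import httplib
import re

html = open("file.html").read()
image_paths = re.findall(r'src="(?:https?:)?//upload\.wikimedia\.org(/[^"]+)"',
                         html)

for path in image_paths[:3]:  #just the first few, to be polite
    connection = httplib.HTTPConnection("upload.wikimedia.org", 80)
    connection.request("GET", path)
    resp = connection.getresponse()
    print resp.status, resp.reason, path
    resp.read()
    connection.close()
-------End of http_Image_Requests_1.py---------------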

Let's wrap up with one thing you might be wondering: how the hell did I see that HTTP request and response?? :)

There is something called Live HTTP Headers. It's a Firefox plugin which allows you to see what is happening internally. (Other browsers might have something similar; just google for it.) Just download it and restart your Firefox browser. Go to that wiki page, then go to Tools > Live HTTP Headers and reload the page. You can see all the requests that were made. Go to the top and you can see the first request, followed by the later internal requests made to load the whole page. Don't get confused; it's just to give you a taste of what is about to come in the next set of tutorials :)
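
And if you want a Python-side version of that peek, httplib has a debug switch that makes it print the request it sends and the reply headers it receives. A tiny sketch:

-----------http_Debug_Headers_1.py---------------
#!/usr/bin/env python

import httplib

connection = httplib.HTTPConnection("en.wikipedia.org", 80)
#Level 1 makes httplib print the outgoing request and incoming headers
connection.set_debuglevel(1)
connection.request("GET", "/wiki/Hello_world")
resp = connection.getresponse()
resp.read()
connection.close()
-------End of http_Debug_Headers_1.py---------------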

Have Fun!! :)

Update : Read Part 2 here
