Fire's of May: Python

Monday, October 10, 2011

HTTP, Web Scraping and Python - Part 1

Today we'll see what Web Scraping is. We'll also learn the HTTP protocol, but I promise I'll keep it hands-on rather than all jargon, which you can read online anyway :)

So what is this HTTP thing?

Simply put, it's how our computers (clients) talk to big computers (servers) and get the cool stuff done for us.

So when you go to Wikipedia and open a link, your browser makes HTTP requests behind the scenes, and that is how the page gets to your screen.

Let's check how it's done.
Go to this page and it loads up a wiki page. But what really happened? Here's a sample HTTP request that was made...

----------Request From Client to Server----------

GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: http://en.wikipedia.org/wiki/Python
Cookie: clicktracking-session=QgVKVqIpsfsgsgszgvwBCASkSOdw2O; mediaWiki.user.bucket:ext.articleFeedback-tracking=8%3Aignore; mediaWiki.user.bucket:ext.articleFeedback-options=8%3Ashow

----------End of Request From Client to Server----------

----------Response From Server to Client----------

HTTP/1.0 200 OK
Date: Mon, 10 Oct 2011 12:44:46 GMT
Server: Apache
X-Content-Type-Options: nosniff
Cache-Control: private, s-maxage=0, max-age=0, must-revalidate
Content-Language: en
Vary: Accept-Encoding,Cookie
Last-Modified: Sun, 09 Oct 2011 05:01:32 GMT
Content-Encoding: gzip
Content-Length: 47407
Content-Type: text/html; charset=UTF-8
Age: 10932
X-Cache: HIT from sq66.wikimedia.org, MISS from sq65.wikimedia.org
X-Cache-Lookup: HIT from sq66.wikimedia.org:3128, MISS from sq65.wikimedia.org:80
Connection: keep-alive

----------End of Response From Server to Client----------

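The important parts are the first line of the request (GET ...), the Host and User-Agent headers, and in the response the first line (HTTP/1.0 200 OK) and the Server header. Let's go through them.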

HTTP has several kinds of requests, and GET is one of them.

That's why we had this line:
GET /wiki/Python_(programming_language) HTTP/1.1

It says: "Oye you Big Computer (server), GET me this Python page whose address is /wiki/Python_(programming_language). I am talking in the language HTTP, version 1.1, and the website that I want is mentioned in the Host parameter (below)."
Host: en.wikipedia.org
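
To make the three parts of that first line concrete, here is a minimal Python 3 sketch that splits it up. The function name parse_request_line is made up for this post, it is not part of any library.

```python
def parse_request_line(line):
    """Split an HTTP request line into (method, path, version)."""
    # The three parts are separated by single spaces:
    #   METHOD <space> PATH <space> VERSION
    method, path, version = line.strip().split(" ")
    return method, path, version


request_line = "GET /wiki/Python_(programming_language) HTTP/1.1"
method, path, version = parse_request_line(request_line)
print(method)   # GET
print(path)     # /wiki/Python_(programming_language)
print(version)  # HTTP/1.1
```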

Ever wondered how websites know what browser and operating system you are using? When your browser makes a request, it adds what are called headers to it (even the Host parameter is a header). One of these headers, User-Agent, says where the request is coming from. So here I am using Linux with Firefox version 7.0.1 (keeping it simple here).

User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
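
Here is a rough Python 3 sketch of how a server could pull the headers out of the raw request text, roughly the way it would look for User-Agent. The helper parse_headers is hypothetical, and real servers handle many more corner cases than this.

```python
def parse_headers(raw_request):
    """Return the headers of a raw HTTP request as a dict."""
    lines = raw_request.strip().splitlines()
    headers = {}
    # lines[0] is the request line ("GET ... HTTP/1.1"); headers follow it.
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        # Split on the first colon only, values may contain colons too
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers


raw = """GET /wiki/Python_(programming_language) HTTP/1.1
Host: en.wikipedia.org
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:7.0.1) Gecko/20100101 Firefox/7.0.1
Connection: keep-alive
"""

headers = parse_headers(raw)
print(headers["Host"])        # en.wikipedia.org
print(headers["User-Agent"])  # the full browser/OS string from the request
```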

I'll cover the other parts of the request (cookies and stuff) later. In the response we get

HTTP/1.0 200 OK
Server: Apache

The above simply means "Hey... nice to see you again... everything is fine :) I have done my job, and by the way I am Apache."

(Ignore this if you don't know what Apache is.)
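
The first line of the response can be taken apart the same way as the request line. A small Python 3 sketch, with a made-up helper name parse_status_line:

```python
def parse_status_line(line):
    """Split an HTTP status line into (version, status_code, reason)."""
    # Split at most twice, because the reason phrase may contain spaces
    # (for example "404 Not Found").
    version, code, reason = line.strip().split(" ", 2)
    return version, int(code), reason


print(parse_status_line("HTTP/1.0 200 OK"))         # ('HTTP/1.0', 200, 'OK')
print(parse_status_line("HTTP/1.1 404 Not Found"))  # ('HTTP/1.1', 404, 'Not Found')
```

The number (200) is what programs look at, and the text after it (OK) is there for humans.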

So let's fire this up in Python and see how we can do this on our own...

Note: any line that starts with a # (except the first line) is a comment.

-----------http_Get_Request_1.py---------------   

#!/usr/bin/env python
# Python library for making HTTP connections/requests
import httplib

# Make a connection to the host over HTTP on port 80
connection = httplib.HTTPConnection("en.wikipedia.org", 80)

# Make a GET request for the resource mentioned
connection.request("GET", "/wiki/Hello_world")

# Get the response and save it in resp
resp = connection.getresponse()

# Print the status code and its textual description
print resp.status, resp.reason

# Save the data
data = resp.read()

# Write the HTML data to a file
page = open("file.html", "w")
page.write(data)
page.close()

# Close the HTTP connection
connection.close()

-------End of http_Get_Request_1.py---------------

If you run the above code you should get something like this

mankaj $ python http_Get_Request_1.py
200 OK
mankaj $ ls
file.html http_Get_Request_1.py
mankaj $
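
The script above is Python 2. In Python 3 the httplib module was renamed to http.client and print became a function, so here is a rough Python 3 sketch of the same idea. The function names fetch and save_page are my own, and depending on the site you may get a redirect status instead of 200 OK.

```python
import http.client


def fetch(host, path, port=80):
    """Make one GET request and return (status, reason, body_bytes)."""
    connection = http.client.HTTPConnection(host, port, timeout=10)
    try:
        connection.request("GET", path)
        resp = connection.getresponse()
        # read() returns bytes in Python 3
        return resp.status, resp.reason, resp.read()
    finally:
        # Close the HTTP connection even if something went wrong
        connection.close()


def save_page(data, filename="file.html"):
    """Write the raw response body to a file (binary mode, since data is bytes)."""
    with open(filename, "wb") as page:
        page.write(data)


# Example usage (needs network access, so it is left commented out):
# status, reason, data = fetch("en.wikipedia.org", "/wiki/Hello_world")
# print(status, reason)
# save_page(data)
```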

You can see that you got a new file called file.html. If you open that file you can see the web page. Notice that the pictures are missing. Does that give you a clue about how requests are made internally? Any ideas?

The requests are not all made in one go! The browser first fetches the skeleton of the web page (the HTML), and as it encounters links to pictures and other resources in it, it makes a separate request for each one. You might have known this, but now you saw it yourself :)
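
To get a feel for those extra requests, here is a small Python 3 sketch that lists the picture addresses found in a piece of HTML, using the standard html.parser module. The class ImageFinder and the sample HTML are made up for illustration. A real browser also fetches stylesheets, scripts and so on.

```python
from html.parser import HTMLParser


class ImageFinder(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def find_images(html_text):
    """Return the list of image addresses a browser would have to request."""
    finder = ImageFinder()
    finder.feed(html_text)
    return finder.sources


sample = '<html><body><h1>Hi</h1><img src="/a.png"><img src="/b.jpg" alt="b"></body></html>'
print(find_images(sample))  # ['/a.png', '/b.jpg']
```

Each address in that list would be one more GET request from the browser.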

Let's wrap it up with one thing you might be wondering: how the hell did I see that HTTP request and response? :)

There is a Firefox plugin called Live HTTP Headers which lets you see what is happening internally. (Other browsers might have something similar, just google for it.) Download it and restart Firefox. Go to that wiki page, then go to Tools > Live HTTP Headers and reload the page. You can see all the requests made: the first request is at the top, followed by the internal requests made to load the rest of the page. Don't get confused. It's just to give you a taste of what is about to come in the next set of tutorials :)

Have Fun!! :)

Update : Read Part 2 here

References :-

Atul Alex Cherian's Script (which I modified and used)
Siddhant's PyCon Slides
HTTP Definitive Guide (Amazing Book)
Python For Newbies Video Series (Good if you don't know python at all!!)

Thursday, October 6, 2011

Simple App on Google App Engine using Python

Hi,
Today we'll make a simple app in Python which just says 'YourName Rulz!!' and upload it to Google App Engine :)

To see what I mean, go to this page and check it out :)

That's what we are going to make today. Isn't that cool! You are about to write your name in history! Well, on the web, and only for as long as the server lasts, but it still rocks! :)

Once you are done with this tutorial, you can try to solve this activity at reliscore, Google AppEngine using Python, to see how much you really understood :)

Now what is Google App Engine? Simply put, it's a way to deploy your web apps and let Google's infrastructure do all the hard work for you! It will do all the cloud computing *cool* stuff for you!

So Let's get started.

Go to this URL and download the App Engine for your OS. I am using Fedora 15 KDE. And I recommend you start using Linux if you haven't already. Here's a link to my friend's blog post on Getting Started with Linux. It's a tutorial for absolute newbies.

Okay, back to the game. Now that you have downloaded it: for Windows/Mac, please check the installation notes.

For Linux you just have to unzip the folder somewhere, let's say /home/yourusername/workspace/gapp

Fire up your terminal and move into the directory using the cd command.
Just in case you don't know how:

$ cd /home/yourusername/workspace/gapp

Create a new directory called hello, and inside that directory create a file called hello.py with these contents:

------------hello.py------------
print 'Content-Type: text/plain'
print ''
print 'yourname Rulz'
------------X--X---------------

Save it.

Note 1: Make sure you print an empty line after the Content-Type line, else you will have issues.

Note 2: Replace yourname with YOUR name :p ... in case you haven't figured that out ;)
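
Why does that empty line matter? Whatever reads the script's output has to know where the headers stop and the page itself starts, and the empty line is the separator. Here is a rough Python 3 sketch of that splitting. The function split_cgi_output is made up for illustration, it is not how App Engine itself does it.

```python
def split_cgi_output(output):
    """Split CGI-style output into (headers_dict, body)."""
    # Headers and body are separated by the first empty line.
    head, _, body = output.partition("\n\n")
    headers = {}
    for line in head.splitlines():
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers, body


# What hello.py prints: a header line, an empty line, then the body.
output = "Content-Type: text/plain\n\nyourname Rulz\n"
headers, body = split_cgi_output(output)
print(headers)  # {'Content-Type': 'text/plain'}
print(body)     # yourname Rulz
```

If you leave the empty line out, this sketch finds no body at all, because everything looks like headers.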

Open another file, call it app.yaml, and write these contents:

------------app.yaml------------
application: helloworld
version: 1
runtime: python
api_version: 1

handlers:
- url: /.*
  script: hello.py
------------X--X---------------

Note: Make sure url and script are aligned, that is, script is indented so that it starts in the same column as url. I had issues when they were not aligned.

Basically, this is the app's configuration file. It says that the application name is helloworld (we will change this later, and we will see why), the version is 1, and the language is Python. The handlers section says that whatever sub-URL you request, it will be handled by our script hello.py. If you don't get it, it's okay...
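
If you are curious how a pattern like /.* ends up catching every URL, here is a tiny Python 3 sketch that treats each handler pattern as a regular expression and picks the first one that matches. The names HANDLERS and pick_script are hypothetical. This only illustrates the idea and is not App Engine's real code.

```python
import re

# Hypothetical in-memory copy of the handlers section of app.yaml.
HANDLERS = [
    ("/.*", "hello.py"),
]


def pick_script(path, handlers=HANDLERS):
    """Return the script of the first handler whose pattern matches path."""
    for pattern, script in handlers:
        # fullmatch: the pattern has to cover the entire path
        if re.fullmatch(pattern, path):
            return script
    return None


print(pick_script("/"))           # hello.py
print(pick_script("/any/thing"))  # hello.py
```

Since . means "any character" and * means "zero or more of them", /.* matches anything that starts with a slash.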

Save it.

To test it, run this command:

$ /home/yourusername/workspace/gapp/dev_appserver.py /home/yourusername/workspace/gapp/hello/

(Note: You don't need to give the full path, but to keep it simple and universal, I am writing the full path. You can write it whatever way you like as long as it's correct.)

Once it works, open your favourite browser and check it out! :)

http://localhost:8080/

Ta ra! :)

"Hey, but that's not deployed on the web," you say! Aye. That's next ;)

Okay, go to this site and log in with your Google account.

Once you are done, choose Create Application. Choose a nice "Application Identifier", as that's what you will be sharing with others. In my case it was firesofmay.

Fill in the Application Title, leave everything else at the defaults, and click "Create Application".

Now go back to the hello folder, open up the app.yaml file and change
"application: helloworld" to whatever your application identifier was. In my case it was
"application: firesofmay"

Once you are done, issue this command :)

$ /home/yourusername/workspace/gapp/appcfg.py update /home/yourusername/workspace/gapp/hello/

If you have done everything correctly you should see something like this as output...

-------------------------
Application: firesofmay; version: 1
Host: appengine.google.com

Starting update of app: firesofmay, version: 1
Scanning files on local disk.
Cloning 2 application files.
Compilation starting.
Compilation completed.
Starting deployment.
Checking if deployment succeeded.
Deployment successful.
Checking if updated app version is serving.
Will check again in 1 seconds.
Checking if updated app version is serving.
Will check again in 1 seconds.
Checking if updated app version is serving.
Completed update of app: firesofmay, version: 1
-------------------------

Open up your URL and check it out... isn't that super cool!! :)
It's time to go to Facebook, post it and show it off to your friends!

Here are the references :-

I want to thank Navin Kabra for being a great support and for starting such a cool website, reliscore, for programmers like you and me who love coding real-world problems. It's also a kick-ass website for those who want to show off and get a job out of it too ;) Now go and sharpen your coding skills on that website.

Cya :)