When I first saw the course description, I thought it was just another 'mission impossible'. At that time, as an undergraduate majoring in EE, my programming work was almost always low-level system design in C and assembly (most of the time I just read assembly and made sure I understood what it meant). I had written many applications for MCUs and embedded systems rather than for PCs, so I really didn't know how far I could go after enrolling.
After taking Unit 1, hope appeared. Dave showed us the basic idea behind writing a search engine, and the first step is to build a simple web crawler. A web crawler is essentially a URL collector: it keeps capturing every hyperlink on every page it visits (a more detailed definition is easy to find online). To write a web crawler, we first need to know what a web page really is. Nowadays a web page can show almost anything: pictures, videos, music, games, articles, and so on. But a web page is actually nothing but plain text written in HTML; the only reason we see so much on a page is that the browser does plenty of work to parse that HTML. This means we can treat a web page as one very long string, so dealing with strings became the first lesson. Python's powerful built-in string methods are more than enough for our needs.
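Since everything here boils down to string handling, here is a tiny illustration of the two operations we leaned on most, find() and slicing (the example string is my own, not from the course):

#a web page is just one long string, so built-in string
#operations are enough to pull pieces out of it
page = '<html><body>Hello, search engine!</body></html>'
start = page.find('<body>')              #index where '<body>' begins, or -1 if absent
end = page.find('</body>')               #index where '</body>' begins
text = page[start + len('<body>'):end]   #slicing gives the substring in between
#text is now 'Hello, search engine!'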
If we have a snippet of HTML code like this:
...
<a href="whateverurl.com">Blablabla</a>
...
The hyperlink 'whateverurl.com' is what we want our crawler to find when it parses a web page. Every hyperlink on a page sits between an "<a href=" and a ">" in the HTML text; you can check the HTML source of this very page to see them. Once we knew what to do, writing the Python was an easy shot with the string.find() function and the string[a:b] slice. We get every link on one page, open each of those links to reach a new (or already seen) page, parse those pages again to get more links, and so on; in theory we could collect every link this way.
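As a tiny worked example (the variable names here are my own), pulling the first URL out of a page takes three find() calls and one slice:

page = 'some text <a href="whateverurl.com">Blablabla</a> more text'
start_pos = page.find('<a href=')              #start of the anchor tag
start_quote = page.find('"', start_pos)        #opening quote of the URL
end_quote = page.find('"', start_quote + 1)    #closing quote of the URL
url = page[start_quote + 1:end_quote]          #url is now 'whateverurl.com'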
This is just the basic idea of how to crawl the web, and we managed to implement it in the Python interpreter. The source code looks like this:
import urllib

#this is just a very simple web crawler; it cannot do
#everything a real web crawler does :)

#get the next link on the page; here 'page' is the HTML text
#of the page, i.e. a string
def get_next_link(page):
    start_pos = page.find("<a href=")
    if start_pos == -1:                          #no more links on this page
        return None, 0
    start_pos = page.find('"', start_pos)        #opening quote of the URL
    end_pos = page.find('"', start_pos + 1)      #closing quote of the URL
    url = page[start_pos + 1:end_pos]
    return url, end_pos

#get all links on one page
def get_all_link(page):
    links = []
    while True:
        url, end_pos = get_next_link(page)
        if url is not None:
            links.append(url)
            page = page[end_pos:]    #the NEW page starts where the last link ended
        else:
            break
    return links

#add every element of b that is not already in a
def union(a, b):
    for element in b:
        if element not in a:
            a.append(element)

#crawl from a seed URL and collect every link that is directly or
#indirectly connected to the seed page
def crawl(seed):
    tocrawl = [seed]    #pages that have not been crawled yet
    crawled = []        #pages that have already been crawled
    last_url = ''
    while len(tocrawl) > 0:
        page = tocrawl.pop()
        if page not in crawled:
            url = str(page)
            if url[0] == '/':
                url = last_url + url    #turn a relative path into an absolute one
            if url.find('http') != -1:
                last_url = url
                try:
                    source = urllib.urlopen(url)
                    content = source.read()
                    source.close()
                except:
                    content = ''        #skip pages that cannot be fetched
                union(tocrawl, get_all_link(content))
                crawled.append(page)
                print url

crawl('http://www.google.com')
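One caveat worth noting: tocrawl.pop() always takes the most recently added link, so the crawl goes depth-first, and starting from a real seed like google.com it would effectively run forever. A minimal sketch of one possible safeguard, reusing the helpers above (the max_pages limit is my own addition, not part of the course code):

#stop after a fixed number of pages instead of crawling indefinitely
def crawl_limited(seed, max_pages):
    tocrawl = [seed]
    crawled = []
    while len(tocrawl) > 0 and len(crawled) < max_pages:
        url = tocrawl.pop()
        if url not in crawled and url.find('http') != -1:
            try:
                content = urllib.urlopen(url).read()
            except:
                content = ''
            union(tocrawl, get_all_link(content))
            crawled.append(url)
    return crawled

#for example, collect at most 10 pages reachable from the seed
#crawl_limited('http://www.google.com', 10)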