Apr 26, 2012

Udacity CS101 review part1

    The main task of CS101 is to build a search engine, which sounds really challenging and interesting. Most courses on Udacity teach in Python, because Python is easy to get started with and very readable. By the way, Professor Dave and TA Peter are very responsible and humorous.
    When I first saw the course description, I thought it was just another 'mission impossible'. At that time, as an undergraduate majoring in EE, my programming work was mostly low-level system design in C and assembly (most of the time I only read assembly and made sure I understood what it meant). I wrote applications for MCUs/embedded systems rather than PCs, so I really didn't know how far I could go after enrolling.
    After taking Unit1, hope appeared. Dave showed us the basic idea behind writing a search engine. The first step is to build a simple web crawler. A web crawler is essentially a URL collector: it keeps capturing every hyperlink on every page it visits. To write one, we need to know what a web page really is. Nowadays a web page can show us almost anything: pictures, videos, music, games, articles and so on. But web pages are actually nothing but plain text written in HTML; the only reason we see so much on a page is that browsers do plenty of work to parse that HTML text. So we can treat a web page as one very long string, and dealing with strings becomes the first lesson. Python has a very powerful built-in string library that totally meets our needs.
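    Before parsing anything, the crawler needs a page's source as one long string. The course has its own way of fetching pages, which I won't reproduce here; a minimal sketch using only Python's standard urllib could look like this (the name get_page is just what I call it):

import urllib.request

def get_page(url):
    # Fetch the raw HTML of a page and return it as one long string.
    # Return an empty string if anything goes wrong (bad URL, no network, ...).
    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode('utf-8', errors='replace')
    except Exception:
        return ''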
    If we have a snippet of HTML like this:
...
<a href = "whateverurl.com">Blablabla</a>
...
    The hyperlink 'whateverurl.com' is what our crawler wants to find when it parses a page. Every hyperlink on a page sits inside the quotes right after '<a href=' in the HTML text; you can check the HTML source of this very page to see them. Once we know what to look for, the Python code is an easy shot with the string.find() function and the slicing syntax string[a:b].
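    A sketch of that extraction with find() and slicing could look like the following (get_next_target is just my name for the helper; it returns one link plus the position where the next search should start):

def get_next_target(page):
    # Locate the next '<a href=' tag in the page string.
    # Return (url, end_position), or (None, 0) if no more links remain.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote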
    Get every link on one page, open those links, each leading to a new (or already seen) page, parse those pages to collect more links, and so on; in theory we can reach every link this way.
    This is just the basic idea of how to crawl the web, and we managed to run it in the Python interpreter. The source code looks something like this:
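    The following is a rough sketch of that loop, reusing the get_page and get_next_target helpers sketched above (again, my own names and details, not necessarily the course's exact code): a to-do list of pages to visit and a list of pages already crawled keep the crawler from looping forever.

def get_all_links(page):
    # Collect every URL found in one page string.
    links = []
    while True:
        url, endpos = get_next_target(page)
        if not url:
            break
        links.append(url)
        page = page[endpos:]
    return links

def crawl_web(seed):
    # Start from a seed URL, visit pages one by one,
    # and add any newly discovered links to the to-do list.
    tocrawl = [seed]
    crawled = []
    while tocrawl:
        page_url = tocrawl.pop()
        if page_url not in crawled:
            content = get_page(page_url)
            for link in get_all_links(content):
                if link not in tocrawl and link not in crawled:
                    tocrawl.append(link)
            crawled.append(page_url)
    return crawled

For example, crawl_web('http://whateverurl.com') would return the list of all URLs reachable from that seed page, as long as get_page can actually fetch them.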
