Apr 26, 2012

Udacity CS101 review, part 1

    The main task of CS101 is to build a search engine, which sounds really challenging and interesting. Most courses on Udacity use Python, because Python is easy to begin with and highly readable. By the way, Professor Dave and TA Peter are very responsible and humorous.
    When I saw the course description for the first time, I thought it was just another 'mission impossible'. At that time, as an undergraduate student majoring in EE, my programming work had always been related to low-level system design in C and assembly (most of the time, I just read assembly and understood clearly what it meant). I wrote many applications for MCUs/embedded systems instead of PCs. So I really didn't know how far I could go after I enrolled.
    After taking Unit 1, I began to see hope. Dave showed us the basic idea of writing a search engine. The first thing is to build a simple web crawler. A web crawler is actually a URL collector on the web: it keeps capturing every single hyperlink on every page it visits. Here is a detailed definition of a web crawler. To write a web crawler, we must know something about what a web page is. Nowadays, we can see almost everything on a web page: pictures, videos, music, games, articles and so on. But web pages are actually nothing but plain text written in HTML. The only reason we can see so many things on a page is that browsers do plenty of work to parse the HTML text. In the meantime, we can treat a web page as one very long string. Therefore, dealing with strings turned out to be the first lesson. Python has a very powerful built-in string library which totally meets our needs.
    If we have a snippet of HTML code like this:
...
<a href = "whateverurl.com">Blablabla</a>
...
    The hyperlink 'whateverurl.com' is what we want to find when our web crawler parses web pages. Every hyperlink on a page sits after an "<a href=" marker, between the pair of quotes, in the HTML text; you can check the HTML source of this page to see them. After knowing what we should do, we used Python code to do it. With the string.find() function and the string[a:b] slicing syntax, the code became an easy job.
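A minimal sketch of that extraction step, assuming the page is already held in a string (the function name here is my own choice, not necessarily the one used in the course):

```python
# Find the first URL after '<a href=' and return it, together with the
# position where a search for the next link should resume.
def get_next_target(page):
    start_link = page.find('<a href=')
    if start_link == -1:        # no more links on this page
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]   # slice out the URL itself
    return url, end_quote

page = 'some text <a href="whateverurl.com">Blablabla</a> more text'
print(get_next_target(page)[0])  # whateverurl.com
```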
    We get every link on one page and open those links; each of them leads us to a new (or already-seen) page. We parse those pages again, thus getting more links, and so on. Theoretically, we can get all the links on the web this way.
    This is just the basic idea of how to crawl the web, and we managed to implement it in the Python interpreter. The source code looks like this:
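My reconstruction of that crawler looks roughly like the following. To keep the sketch self-contained and runnable, a small dictionary stands in for the real web (in the course we fetched actual pages); the helper names and sample URLs are my own:

```python
# A toy "web": page content keyed by URL, so the sketch runs without network access.
CACHE = {
    'http://a.com': '<a href="http://b.com">B</a> <a href="http://c.com">C</a>',
    'http://b.com': '<a href="http://c.com">C</a>',
    'http://c.com': 'no links here',
}

def get_page(url):
    # In a real crawler this would fetch the page over HTTP.
    return CACHE.get(url, '')

def get_all_links(page):
    # Repeatedly find '<a href=' markers and slice out the quoted URLs.
    links = []
    while True:
        start_link = page.find('<a href=')
        if start_link == -1:
            return links
        start_quote = page.find('"', start_link)
        end_quote = page.find('"', start_quote + 1)
        links.append(page[start_quote + 1:end_quote])
        page = page[end_quote + 1:]   # continue after this link

def crawl_web(seed):
    tocrawl = [seed]   # URLs still to visit
    crawled = []       # URLs already visited
    while tocrawl:
        url = tocrawl.pop()
        if url not in crawled:
            for link in get_all_links(get_page(url)):
                if link not in crawled and link not in tocrawl:
                    tocrawl.append(link)
            crawled.append(url)
    return crawled

print(crawl_web('http://a.com'))
```

The two lists give the "on and on" behavior described above: every newly found link goes into tocrawl, and the crawled list stops us from visiting the same page twice.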

Apr 25, 2012

Hello world!

    After a long time, I've finally come back here. Actually, I kept looking for a convenient blog service around the internet. This blog was set up six years ago. At that time, I was still a high school student who had just entered the most amazing world of programming with the old-enough Pascal and the toy-like Visual Basic. (I don't deny the ease and speed of developing in VB, but all in all, it still looks like a toy compared with the others.) Time passed: I entered a local university and became an undergraduate majoring in Electronic Engineering, and was successfully recommended as a postgraduate in September 2011. Since I'm Chinese, my English may not be as good as a native speaker's, but I will still insist on writing my blog in English for practice. I hope it's a nice choice :). This blog will mainly be updated with my daily studying and working stuff, and sometimes comments about the industry, maybe.
    At the moment, I am spending some leisure time studying Python and some web app programming on Udacity. It's a great website which provides free and excellent education to anyone on the Internet; there are also several other choices. Try googling for them if you are interested. The other part of my current workload is my graduation thesis, which involves a kind of mathematical correction for high-accuracy GPS data analysis.
    I was never a blogging guy before, but recently I think I should find a way to keep notes and draw conclusions from what I learn, because I'm absorbing so much information every day. And for the internet generation, blogging is the best way ever. I have connected this blog with my Google+ account, so if you find this site interesting, you can put me in whatever circle you like :).