1. History of Hadoop
Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop originated from Apache Nutch, an open source web search engine.
1.1. Origin of Name Hadoop
The name Hadoop is not an acronym and has no particular meaning. The project's creator, Doug Cutting, explains how the name came about:
The name was given by my kid to his yellow stuffed elephant. It's short, easy to remember and pronounce, and not used anywhere else: those are my naming criteria.
1.2. How Hadoop came into the picture
According to Mike Cafarella and Doug Cutting, a system for indexing one billion pages would cost millions of dollars in hardware, with a monthly running cost of about $30,000.
Nutch was started in 2002, and a working crawler and search system quickly emerged. However, its developers believed that the architecture wouldn't scale to the billions of pages on the web because of storage issues.
Help arrived in 2003 with the publication of a paper describing the architecture of Google's distributed file system, known as GFS. The Nutch developers believed that GFS would solve their storage needs (for crawling and indexing the whole web) and would free up time being spent on administrative tasks such as managing storage nodes. In 2004 they set out to build an open source file system of their own: the Nutch Distributed File System (NDFS).
1.3. Introduction of MapReduce
In 2004, Google published the paper that introduced MapReduce to the world. After this release, the Nutch developers began porting the Nutch algorithms to run on MapReduce and NDFS. By mid-2005 Nutch was running on MapReduce and NDFS together. In February 2006 these components were moved out of Nutch into a separate, independent project, which was named Hadoop.
1.4. Hadoop's success in 2008
In January 2008, Hadoop was made a top-level project at Apache, confirming its success. By that time it was being used by many companies besides Yahoo!, such as Last.fm, Facebook, and the New York Times.
In April 2008, Hadoop broke a world record by becoming the fastest system to sort a terabyte of data, finishing in 209 seconds. Later the same year, Google reported that its own MapReduce implementation had sorted one terabyte in 68 seconds.
1.5. HDFS (Hadoop Distributed File System)
HDFS is Hadoop's distributed file system. NDFS was renamed HDFS after it became part of Hadoop.
2. Hadoop at Yahoo!
1) 2004—Initial versions of what is now the Hadoop Distributed Filesystem and MapReduce implemented by Doug Cutting and Mike Cafarella.
2) December 2005—Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
3) January 2006—Doug Cutting joins Yahoo!.
4) February 2006—Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS.
5) February 2006—Adoption of Hadoop by the Yahoo! Grid team.
6) April 2006—Sort benchmark (10 GB/node) run on 188 nodes in 47.9 hours.
7) May 2006—Yahoo! set up a Hadoop research cluster—300 nodes.
8) May 2006—Sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark).
9) October 2006—Research cluster reaches 600 nodes.
10) December 2006—Sort benchmark run on 20 nodes in 1.8 hours, 100 nodes in 3.3 hours, 500 nodes in 5.2 hours, 900 nodes in 7.8 hours.
11) January 2007—Research cluster reaches 900 nodes.
12) April 2007—Research clusters—two clusters of 1,000 nodes.
13) April 2008—Won the 1-terabyte sort benchmark in 209 seconds on 900 nodes.
14) October 2008—Loading 10 terabytes of data per day onto research clusters.
15) March 2009—17 clusters with a total of 24,000 nodes.
16) April 2009—Won the minute sort by sorting 500 GB in 59 seconds (on 1,400 nodes) and the 100-terabyte sort in 173 minutes (on 3,400 nodes).
3. Apache Hadoop
Hadoop is widely known for its distributed file system (HDFS) and MapReduce. I'm going to cover every term related to it in detail; the links below are a useful starting point.
3.1. Important links
- Building Nutch: Open Source Search: http://queue.acm.org/detail.cfm?id=988408
- Hadoop wiki: http://wiki.apache.org/hadoop/PoweredBy
- Sorting 1PB with MapReduce: http://googleblog.blogspot.in/2008/11/sorting-1pb-with-mapreduce.html
- NYTimes: Self-Service, Prorated Supercomputing Fun!: http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/