At Google we are fanatical about organizing the world's information. As a result, we spend a lot of time finding better ways to sort information using MapReduce, a key component of our software infrastructure that allows us to run multiple processes simultaneously. MapReduce is a perfect solution for many of the computations we run daily, due in large part to its simplicity, applicability to a wide range of real-world computing tasks, and natural translation to highly scalable distributed implementations that harness the power of thousands of computers.
In our sorting experiments we have followed the rules of a standard terabyte (TB) sort benchmark. Standardized experiments help us understand and compare the benefits of various technologies and also add a competitive spirit. You can think of it as an Olympic event for computations. By pushing the boundaries of these types of programs, we learn about the limitations of current technologies as well as the lessons useful in designing next generation computing platforms. This, in turn, should help everyone have faster access to higher-quality information.
We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previous 1TB sorting record is 209 seconds on 910 computers.
Sometimes you need to sort more than a terabyte, so we were curious to find out what happens when you sort more and gave one petabyte (PB) a try. One petabyte is a thousand terabytes, or, to put this amount in perspective, it is 12 times the amount of archived web data in the U.S. Library of Congress as of May 2008. In comparison, consider that the aggregate size of data processed by all instances of MapReduce at Google was on average 20PB per day in January 2008.
It took six hours and two minutes to sort 1PB (10 trillion 100-byte records) on 4,000 computers. We're not aware of any other sorting experiment at this scale and are obviously very excited to be able to process so much data so quickly.
An interesting question came up while running experiments at such a scale: Where do you put 1PB of sorted data? We were writing it to 48,000 hard drives (we did not use the full capacity of these disks, though), and every time we ran our sort, at least one of our disks managed to break (this is not surprising at all given the duration of the test, the number of disks involved, and the expected lifetime of hard disks). To make sure we kept our sorted petabyte safe, we asked the Google File System to write three copies of each file to three different disks.
Significantly improved handling of the so-called "stragglers" (parts of computation that run slower than expected) was a key software technique that helped sort 1PB. And of course, there are many other factors that contributed to the result. We'll be discussing all of this and more in an upcoming publication. And you can also check out the video from our recent Technology RoundTable Series.
Home
»
»Unlabelled
» Sorting 1PB with MapReduce
Recent Posts
There They Go Again - Merck Threatened Legal Action Against Italian Doctor who Criticized Zetia
08 Jul 20140It seems that when confronted with unfavorable facts or opinions, Merck executives like to threaten ...Read more »
What Does the Blumsohn - Procter and Gamble -Sheffield University Affair Say About the Fitness of the Latest VA Secretary Candidate?
02 Jul 20140Introduction - New Leadership for the US Department of Veterans Affairs After reports of problems wi...Read more »
Health Care Corruption, "No Dirty Little Secret," but "An Open Sore" - Lessons from India for the US
01 Jul 20140Health care corruption is widely prevalent around the globe, but remains the great unmentionable.Int...Read more »
Through a Glass Darkly - How Can So Many Health Care Executives Be Visionaries?
26 Jun 20140Visionary (noun) 1: one whose ideas or projects are impractical : dreamer 2: one w...Read more »
A Real Example of Public Relations Talking Points to Justify Outsize Executive Compensation - and Why We Should No Longer be Fooled
25 Jun 20140We frequently discuss outsize executive compensation in health care organizations as both a symptom ...Read more »
Đăng ký:
Đăng Nhận xét (Atom)
0 nhận xét:
Đăng nhận xét
Click to see the code!
To insert emoticon you must added at least one space before the code.