http://net.pku.edu.cn/~course/cs402/2009/
pb@net.pku.edu.cn 6/30/2009
What is Cloud Computing?
First, write down your own opinion about cloud computing, whatever comes to mind. Questions: What? Who? Why? How? Pros and cons? The most important question: what is its relation to me?
Cloud Computing is
No software to install: access everywhere via the Internet
Power for large-scale data processing
Appealing for startups
Software as a Service (SaaS): a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
Utility Computing
Pay-as-you-go pricing: use less, pay less.
Cloud Computing: Key Characteristics
The illusion of infinite computing resources available on demand
The elimination of an up-front commitment by cloud users
The ability to pay for use of computing resources on a short-term basis, as needed
Enabled by very large datacenters, large-scale software infrastructure, utility-computing billing, and operational expertise
Why now?
pay-as-you-go computing
Key Players
Key Applications
Mobile interactive applications (Tim O'Reilly): mobile devices backed by datacenters, mashups
Parallel batch processing: MapReduce/Hadoop in the cloud; Amazon hosts large public datasets for free
The rise of analytics: beyond transactions, toward transaction-based analytics
Extension of compute-intensive desktop applications (Matlab, Mathematica) into the cloud
Challenges
"It's stupidity. It's worse than stupidity: it's a marketing hype campaign. Somebody is saying this is inevitable, and whenever you hear somebody saying that, it's very likely to be a set of businesses campaigning to make it true." Richard Stallman, quoted in The Guardian, September 29, 2008
Cloud is coming
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law
Happening everywhere!
[Figure: data everywhere: microarray chips in molecular biology (cancer), fiber optics carrying network traffic (spam), microprocessors; volumes on the order of 300M/day, 1B total, 1M/sec]
Internet Archive has 2 PB of data, growing by 20 TB/month
Google processes 20 PB a day (2008)
All words ever spoken by human beings: ~5 EB
CERN's LHC will generate 10-15 PB a year
Sanger anticipates 6 PB of data in 2009
640K ought to be enough for anybody.
Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years
A unique imprint of primordial physics is carried in the tiny anisotropies in temperature and polarization. Extracting these micro-Kelvin fluctuations from inherently noisy data is a serious computational challenge.
Experiment         Nt       Np      Nd      Limiting    Notes
COBE (1989)        2x10^9   6x10^3  3x10^1  Time        Satellite, Workstation
BOOMERanG (1998)   3x10^8   5x10^5  3x10^1  Pixel
WMAP (2001)        7x10^10  4x10^7  1x10^3              Satellite, Analysis-bound
Planck (2007)      5x10^11  6x10^8  6x10^3  Time/Pixel  Satellite, Major HPC/DA effort
POLARBEAR (2007)   8x10^12  6x10^6  1x10^3  Time        Ground, NG-multiplexing
CMBpol (~2020)     10^14    10^9    10^4    Time/Pixel  Satellite, Early planning/design
Experiment: download the entire revision history of Wikipedia: 4.7M pages, 58M revisions, 800 GB
Computation: analyze editing patterns and trends
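A toy version of such an analysis can be sketched in a few lines. This is a hypothetical sketch, assuming each revision in the dump has been reduced to a (page_title, timestamp) pair; it counts edits per page to surface the most-edited articles:

```python
from collections import Counter

def edit_counts(revisions):
    # revisions: iterable of (page_title, timestamp) pairs,
    # one per revision in the dump (assumed pre-parsed format).
    counts = Counter(page for page, _ in revisions)
    # Sorted most-edited first.
    return counts.most_common()

revs = [("Cloud computing", "2009-01-02"),
        ("MapReduce", "2009-01-03"),
        ("Cloud computing", "2009-02-10")]
print(edit_counts(revs)[0])  # -> ('Cloud computing', 2)
```

At 58M revisions this still fits a single machine; the point of the cloud is that richer per-revision analyses (diffs, link graphs) quickly do not.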
Hays, Efros (CMU), Scene Completion Using Millions of Photographs SIGGRAPH, 2007
Computation: classify images with the gist scene detector [Torralba]; color similarity; local context matching
Index images offline: 50 min. scene matching, 20 min. local matching, 4 min. compositing; reduces to 5 minutes total by using 5 machines
Extension: Flickr.com has over 500 million images
Experiment: use a web crawler to gather 151M HTML pages weekly, 11 times; generated 1.2 TB of log information; analyze page statistics and change frequencies
Systems challenge: "Moreover, we experienced a catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl."
[Figure: a sequencer samples the subject genome into many short, partially overlapping, error-prone reads, e.g. GATGCTTACTATGCGGGCCCC, CGGTCTAATGCTTACTATGC]
DNA Sequencing
The genome of an organism encodes its genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)
Shorter reads, but much higher throughput; per-base error rate estimated at 1-2% (Simpson et al., 2009)
Recent studies of entire human genomes have used 3.3 billion (Wang et al., 2008) and 4.0 billion (Bentley et al., 2008) 36 bp reads
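The read-mapping task can be sketched in its most naive, brute-force form: exact substring search of each read against the reference. This is only a sketch; real mappers such as CloudBurst also tolerate the mismatches and indels that sequencing errors and variation introduce:

```python
def map_read(read, reference):
    # Return every offset where the read matches the reference exactly.
    # Brute-force exact matching only; no mismatches or indels allowed.
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

reference = "CGGTCTAGATGCTTATCTATGCGGGCCCCTT"
print(map_read("GCTTATCTAT", reference))  # -> [10]
```

With billions of reads against a 3 Gbp reference, even this trivial inner loop becomes a massive parallel workload, which is why read mapping has been recast as a MapReduce job.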
[Figure: alignment: short subject reads such as CTATGCGGGC and TCTAGATGCT mapped, with occasional mismatches, onto the reference sequence CGGTCTAGATGCTTATCTATGCGGGCCCCTT]
Example: Bioinformatics
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
A del.icio.us crawl yields a bipartite graph covering 802,739 web pages and 1,021,107 tags.
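Building that bipartite graph is straightforward once the crawl is reduced to annotations. A minimal sketch, assuming each bookmark has been parsed into a hypothetical (page_url, tag) pair, keeps adjacency in both directions:

```python
from collections import defaultdict

def build_bipartite(pairs):
    # pairs: iterable of (page_url, tag) annotations from the crawl.
    # Returns two adjacency maps: page -> tags, and tag -> pages.
    pages, tags = defaultdict(set), defaultdict(set)
    for page, tag in pairs:
        pages[page].add(tag)
        tags[tag].add(page)
    return pages, tags

pages, tags = build_bipartite([
    ("example.com/a", "cloud"),
    ("example.com/b", "cloud"),
    ("example.com/a", "mapreduce"),
])
print(sorted(tags["cloud"]))  # -> ['example.com/a', 'example.com/b']
```

On both sides of the graph, analyses such as co-tagging similarity then reduce to set intersections over these adjacency maps.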
An Example
[Figure: the input is partitioned into chunks; workers w1, w2, w3 process the partitions to produce partial results r1, r2, r3; a combine step merges them into the final result]
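The partition / worker / combine pattern shown above can be sketched in plain Python. This is an illustrative single-process sketch (the partial sum is a stand-in for any associative per-chunk computation):

```python
def partition(data, n):
    # Split the input into n roughly equal chunks, one per worker.
    k, m = divmod(len(data), n)
    return [data[i * k + min(i, m):(i + 1) * k + min(i + 1, m)]
            for i in range(n)]

def worker(chunk):
    # Each worker computes a partial result over its chunk;
    # here, a partial sum.
    return sum(chunk)

def combine(results):
    # Merge the partial results r1, r2, r3, ... into the final answer.
    return sum(results)

data = list(range(10))
partials = [worker(c) for c in partition(data, 3)]
print(combine(partials))  # -> 45
```

In a real deployment the workers run on separate machines; correctness rests on the per-chunk results being combinable independently of how the input was partitioned.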
What's MapReduce?
[Figure: MapReduce dataflow: input splits, map, shuffle, reduce, output]
map (in_key, in_value) -> list(out_key, intermediate_value): processes one input key/value pair and produces a set of intermediate key/value pairs
Shuffle: groups all intermediate values associated with the same key
reduce (out_key, list(intermediate_value)) -> list(out_value): combines all intermediate values for a given key and produces a list of output values (usually just one)
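The three stages above can be simulated in plain Python with the canonical word-count example. This is an in-memory sketch for intuition only; Hadoop expresses the same map and reduce functions in Java, and the framework performs the shuffle:

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # One input record (line of text) -> intermediate (word, 1) pairs.
    return [(word, 1) for word in in_value.split()]

def shuffle(pairs):
    # Group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(out_key, values):
    # All values for one key -> (usually) one output value.
    return [sum(values)]

def mapreduce(inputs):
    intermediate = []
    for in_key, in_value in inputs:
        intermediate.extend(map_fn(in_key, in_value))
    return {key: reduce_fn(key, vals)[0]
            for key, vals in shuffle(intermediate).items()}

counts = mapreduce([(0, "the quick brown fox"), (1, "the lazy dog")])
print(counts["the"])  # -> 2
```

Note that map and reduce are pure functions of their inputs, which is exactly what lets the framework run them on thousands of machines and re-execute them on failure.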
Example continued: reduce sums the values for each key and emits <key, sum>
History of Hadoop
2004 - Initial versions of what is now the Hadoop Distributed File System and MapReduce implemented by Doug Cutting and Mike Cafarella
December 2005 - Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project officially started to support the standalone development of MapReduce and HDFS
March 2006 - Formation of the Yahoo! Hadoop team
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Yahoo! sets up a Hadoop research cluster (300 nodes); sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
October 2006 - Research cluster reaches 600 nodes
December 2006 - Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January 2007 - Research cluster reaches 900 nodes
April 2007 - Research clusters: 2 clusters of 1000 nodes
September 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
LEC#  TOPICS
1     Introduction
2     MapReduce
3     Inverted Index with MapReduce
4     PageRank with MapReduce
5     Clustering with MapReduce
6-8   Further MapReduce topics
Grading Policy
Lab 1 - Introduction to Hadoop, Eclipse
Lab 2 - A Simple Inverted Index
Lab 3 - PageRank over Wikipedia Corpus
Lab 4 - Clustering the Netflix Movie Data
Hw1 - Read: Intro to Distributed Systems; Intro to MapReduce Programming
Hw2 - Read: MapReduce [1]
Hw3 - Read: GFS [2]
Hw4 - Read: Pig Latin [3]
Programming Language
Hadoop
Resources
Homework
http://net.pku.edu.cn/~course/cs402/2009/
Course project in teams of 3-4
Lab 1 - Introduction to Hadoop, Eclipse
Intro to Distributed Systems; Intro to Parallel Programming:
http://code.google.com/edu/parallel/dsd-tutorial.html
http://code.google.com/edu/parallel/mapreduce-tutorial.html
Lab 1
HW Reading 1
Summary
Cloud Computing brings:
The possibility of using unlimited resources on demand, anytime and anywhere
The possibility of constructing and deploying applications that automatically scale to tens of thousands of computers
The possibility of constructing and running programs that deal with prodigious volumes of data
Q&A
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.
[2] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System," in Proceedings of the nineteenth ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, NY, USA: ACM Press, 2003.
[3] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins, "Pig Latin: A Not-So-Foreign Language for Data Processing," in Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada: ACM, 2008.
Think RPC: request in, processing, response out. Works well for the web and AJAX; also for other services.
Scales to an effectively infinite number of apps, requests/sec, and storage capacity; the APIs are simple, stupid.
[Figure: App Engine serving architecture: a Python VM process calling stateful APIs such as memcache and the datastore]
Amazon's infrastructure (auto scaling, load balancing)
Elastic Compute Cloud (EC2): scalable virtual private server instances
Simple Storage Service (S3): storage
Simple Queue Service (SQS): messaging
SimpleDB: database
Flexible Payments Service, Mechanical Turk, CloudFront, etc.
Very flexible, lower-level offering (closer to hardware), which means more possibilities and higher performance
Runs the platform you provide (machine images)
Supports all major web languages
Industry-standard services (easy to move off AWS)
Requires much more work and a longer time-to-market