Lecture1-Introduction To Cloud Computing

Introduction to Cloud Computing
http://net.pku.edu.cn/~course/cs402/2009/
pb@net.pku.edu.cn 6/30/2009
(Cloud Computing)?
(Cloud Computing)
What is Cloud Computing?

1.
2.
3.
First write down your own opinion about cloud computing , whatever you thought about in your mind. Question: What ? Who? Why? How? Pros and cons? The most important question is: What is the relation with me?
Cloud Computing is

No software access everywhere by Internet power -- Large-scale data processing Appeal for startups

Cost efficiency Software as platform

Security Data lock-in
5
SaaS PaaS Utility Computing
Cons

Software as a Service (SaaS)
a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.
Platform as a Service (PaaS)
Web ApplicationServicesPaaS Internet Multitenant architecture platform

7
Utility Computing
pay-as-you-go Microsoftpay lessutility computing 500 use less pay lesscloud computing
8
Cloud Computing is
Key Characteristics
illusion of infinite computing resources available on demand; elimination of an up-front commitment by Cloud users; ability to pay for use of computing resources on a very large datacenters short-term basis as needed billing large-scale software infrastructure utility computing operational expertise
10
Why now?

very large-scale datacenter Business
pay-as-you-go computing
11
Key Players
Amazon Web Services Google App Engine Microsoft Windows Azure

12
Key Applications
Mobile Interactive applications, Tim OReilly Mobile datacenter mashup Parallel batch processingCloud ComputingMapReduceHadoop /cloud Amazonhost large public datasets for free The rise of analyticstransaction based analytics Extension of compute-intensive desktop application matlab, mathematicacloud computingwoo~
13
Cloud Computing = Silver Bullet?
Google37 Google
Problem of Data Lock-in
14
Challenges
15
Some other Voices

The interesting thing about Cloud Computing is that weve redefined Cloud Computing to include everything that we already do. . . . I dont understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads. Larry Ellison, quoted in the Wall Street Journal, September 26, 2008
Its stupidity. Its worse than stupidity: its a marketing hype campaign. Somebody is saying this is inevitable and whenever you hear somebody saying that, its very likely to be a set of businesses campaigning to make it true. Richard Stallman, quoted in The Guardian, September 29, 2008
16
Whats matter with ME?!
What you want to do with 1000pcs, or even 100,000 pcs?
17
Cloud is coming
Google alone has 450,000 systems running across 20 datacenters, and Microsoft's Windows Live team is doubling the number of servers it uses every 14 months, which is faster than Moore's Law
Data Center is a Computer

Parallelism everywhere
Massive Scalable Reliable Resource Management Data Management

Programming Model & Tools
18
20
Happening everywhere!
microarray chips Molecular biology (cancer) fiber optics Network traffic (spam)
microprocessors
Simulations (Millennium) particle colliders
300M/day
Particle events (LHC)
1B
21
1M/sec
22
Maximilien Brice, CERN
23
24
25
How much data?

Internet archive has 2 PB of data + 20 TB/month Google processes 20 PB a day (2008) all words ever spoken by human beings ~ 5 EB CERNs LHC will generate 10-15 PB a year Sanger anticipates 6 PB of data in 2009
640K ought to be enough for anybody.
26
NERSC User George Smoot wins 2006 Nobel Prize in Physics
Smoot and Mather 1992 COBE Experiment showed anisotropy of CMB
Cosmic Microwave Background Radiation (CMB): an image of the universe at 400,000 years
27
The Current CMB Map
source J. Borrill, LBNL
Unique imprint of primordial physics through the tiny anisotropies in temperature and polarization. Extracting these Kelvin fluctuations from inherently noisy data is a serious computational challenge.
28
Experiment
Evolution Of CMB Data Sets: Cost > O(Np^3 ) Limiting

Nt Np Nb
Data
Notes
COBE (1989) BOOMERanG (1998) (4yr) WMAP (2001)
2x109
6x103
3x101
Time
Satellite,
Workstation
3x108
5x105
3x101
Pixel
Balloon, 1st HPC/NERSC
7x1010
4x107
1x103
Satellite, Analysis-bound Satellite, Major HPC/DA effort Ground, NGmultiplexing Satellite, Early planning/design
Planck (2007)
5x1011
6x108
6x103
Time/ Pixel
POLARBEAR (2007)
8x1012
6x106
1x103
Time
CMBPol (~2020) data compression
1014
109
104
Time/ Pixel
29
Example: Wikipedia Anthropology

Kittur, Suh, Pendleton (UCLA, PARC), He Says, She Says: Conflict and Coordination in Wikipedia CHI, 2007
Increasing fraction of edits are for work indirectly related to articles
Experiment
Computation
Download entire revision history of Wikipedia 4.7 M pages, 58 M revisions, 800 GB Analyze editing patterns & trends
30
Hadoop on 20-machine cluster
Example: Scene Completion
Hays, Efros (CMU), Scene Completion Using Millions of Photographs SIGGRAPH, 2007
Image Database Grouped by Semantic Content
Computation
30 different Flickr.com groups 2.3 M images total (396 GB).
Select Candidate Images Most Suitable for Filling Hole
Classify images with gist scene detector [Torralba] Color similarity Local context matching
Index images offline 50 min. scene matching, 20 min. local matching, 4 min. compositing Reduces to 5 minutes total by using 5 machines Flickr.com has over 500 million images
Extension
31
Example: Web Page Analysis

Fetterly, Manasse, Najork, Wiener (Microsoft, HP), A Large-Scale Study of the Evolution of Web Pages, Software-Practice & Experience, 2004
Experiment
Systems Challenge
Moreover, we experienced a
Use web crawler to gather 151M HTML pages weekly 11 times Generated 1.2 TB log information Analyze page statistics and change frequencies
catastrophic disk failure during the third crawl, causing us to lose a quarter of the logs of that crawl.
32
GATGCTTACTATGCGGGCCCC CGGTCTAATGCTTACTATGC
?
Subject genome
GCTTACTATGCGGGCCCCTT AATGCTTACTATGCGGGCCCCTT TAATGCTTACTATGC AATGCTTAGCTATGCGGGC AATGCTTACTATGCGGGCCCCTT AATGCTTACTATGCGGGCCCCTT CGGTCTAGATGCTTACTATGC AATGCTTACTATGCGGGCCCCTT CGGTCTAATGCTTAGCTATGC ATGCTTACTATGCGGGCCCCTT
Reads
Sequencer
33
DNA Sequencing
Genome of an organism encodes genetic information in long sequence of 4 DNA nucleotides: ATCG

Bacteria: ~5 million bp Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)

Shorter reads, but much higher throughput Per-base error rate estimated at 1-2% (Simpson, et al, 2009)
ATCTGATAAGTCCCAGGACTTCAGT GCAAGGCAAACCCGAGCCCAGTTT TCCAGTTCTAGAGTTTCACATGATC GGAGTTAGTAAAAGTCCACATTGAG
Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads
~144 GB of compressed sequence data
34
Subject reads CTATGCGGGC A T CTATGCGG TCTAGATGCT GCTTA T CTAT A T CTATGCGG A T CTATGCGG A T CTATGCGG TTA T CTATGC CTATGCGGGC GCTTA T CTAT CTAGATGCTT Alignment CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Reference sequence
35
Subject reads
ATGCGGGCCC CTAGATGCTT CTATGCGGGC TCTAGATGCT ATCTATGCGG CGGTCTAG ATCTATGCGG CTT CGGTCT TTATCTATGC CCTT CGGTC GCTTATCTAT GCCCCTT GCTTATCTAT CGG GGCCCCTT CGGTCTAGATGCTTATCTATGCGGGCCCCTT
Reference sequence
36
Example: Bioinformatics
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
Evaluate running time on local 24 core cluster
Running time increases linearly with the number of reads
37
Example: Data Mining

Haoyuan Li, Yi Wang, Dong Zhang, Ming Zhang, Edward Y. Chang: Pfp: parallel fp-growth for query recommendation. RecSys 2008: 107114
del.icio.us crawl->a bipartite graph covering 802739 Webpages and 1021107 tags.
38
An Example
Try on these collection:
2006870 Million , 2 TB. Google, Yahoo100+ Billion pages
40
Divide and Conquer

Work
Partition
w3
worker
w1
worker
w2
worker
r1
r2
r3
Result
Combine
41
Whats Mapreduce
Parallel/Distributed Computing Programming Model
Input split
shuffle
output
42
Typical problem solved by MapReduce
: key/value Map: extract something
map (in_key, in_value) -> list(out_key, intermediate_value) input key/value pair key/value pairs
Shuffle:
key
reduce (out_key, list(intermediate_value)) -> list(out_value) keyvalues (usually just one)
Reduce: aggregate, summarize, filter, etc.
43
Word Frequencies in Web pages

one document per record map function

key = document URL value = document contents document<word, 1>
map (potentially many) key/value pairs.
44
Example continued:
MapReduce()key (shuffle/sort) reduce functionkeyvalues
sum
Reduce<key, sum>
45
MapReduce Runtime System
46
History of Hadoop

2004 - Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella December 2005 - Nutch ported to the new framework. Hadoop runs reliably on 20 nodes. January 2006 - Doug Cutting joins Yahoo! February 2006 - Apache Hadoop project official started to support the standalone development of Map-Reduce and HDFS. March 2006 - Formation of the Yahoo! Hadoop team May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes April 2006 - Sort benchmark run on 188 nodes in 47.9 hours May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark) October 2006 - Research cluster reaches 600 Nodes December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 January 2006 - Research cluster reaches 900 node April 2007 - Research clusters - 2 clusters of 1000 nodes Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
47
From Theory to Practice

1. Scp data to cluster 2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job 4a. Go back to Step 3 You Hadoop Cluster
5. Move data out of HDFS 6. Scp data from cluster
48
MapReduce MapReduce MapReduce
50

LEC# TOPICS 1 - ABSTRACT MapReduce MapReduce Inverted IndexMapReduce PageRankMapReduce MapReduce ClusteringMapReduce MapReduce MapReduce 51
MapReduce
2 3 4 MapReduce Inverted Index PageRank MapReduce Clustering 5 6 7 8 MapReduce
Grading Policy
Lab Lab Lab Lab 1 2 3 4 - Introduction to Hadoop, Eclipse A Simple Inverted Index - PageRank over Wikipedia Corpus Clustering the Netflix Movie Data
Hw1 - Read - Intro Distributed system; Intro MapReduce Programming. Hw2 - Read MapReduce[1] Hw3 Read GFS[2] Hw4 Read Pig Latin[3]
30% Assignments 20% Readings 50% Course project
52
Programming Language
Lots of java programming practices
53
Teachers and Resources
http://net.pku.edu.cn/~cour se/cs402/2009/ http://groups.google.com/g roup/cs402pku http://hadoop.apache.org/c ore/ http://net.pku.edu.cn/~cour se/cs402/2008/resource.ht ml

54
Hadoop
Resources
Homework
http://net.pku.edu.cn/~course/cs402/2009/ 3-4project Lab 1 - Introduction to Hadoop, Eclipse Intro Distributed system; Intro Parallel Programming. http://code.google.com/edu/parallel/dsd-tutorial.html http://code.google.com/edu/parallel/mapreduce-tutorial.html
Lab1
HW Reading1
55
Summary
CloudComputing brings
Possible of using unlimited resources on-demand, and by anytime and anywhere Possible of construct and deploy applications automatically scale to tens of thousands computers Possible of construct and run programs dealing with prodigious volume of data
How to make it real?
Distributed File System Distributed Computing Framework
56
Q&A
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in Osdi, 2004, pp. 137-150. [2] G. Sanjay, G. Howard, and L. Shun-Tak, "The Google file system," in Proceedings of the
nineteenth ACM symposium on Operating systems principles. Bolton Landing, NY, USA:
ACM Press, 2003. [3] O. Christopher, R. Benjamin, S. Utkarsh, K. Ravi, and T. Andrew, "Pig latin: a not-so-foreign language for data processing," in Proceedings of
the 2008 ACM SIGMOD international conference on Management of data. Vancouver, Canada:
ACM, 2008.
58
Google App Engine
App Engine handles HTTP(S) requests, nothing else

Think RPC: request in, processing, response out Works well for the web and AJAX; also for other services
App configuration is dead simple
No performance tuning needed
Everything is built to scale
infinite number of apps, requests/sec, storage capacity APIs are simple, stupid
59
App Engine Architecture

req/resp stateless APIs urlfech mail images R/O FS stdlib app
Python VM process
stateful APIs
memcache
datastore
60
60
Microsoft Windows Azure
61
Amazon Web Services
Amazons infrastructure (auto scaling, load balancing) Elastic Compute Cloud (EC2) scalable virtual private server instances Simple Storage Service (S3) Simple Queue Service (SQS) messaging SimpleDB - database Flexible Payments Service, Mechanical Turk, CloudFront, etc.
62
Amazon Web Services
Very flexible, lower-level offering (closer to hardware) = more possibilities, higher performing Runs platform you provide (machine images) Supports all major web languages Industry-standard services (move off AWS easily) Require much more work, longer time-to-market
Deployment scripts, configuring images, etc.
Various libraries and GUI plug-ins make AWS do help
63
Price of Amazon EC2
64

Lecture1-Introduction To Cloud Computing

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Lecture1-Introduction To Cloud Computing

Uploaded by

Copyright:

Available Formats

Introduction to Cloud Computing

What is Cloud Computing?

Cost efficiency Software as platform

SaaS PaaS Utility Computing

Software as a Service (SaaS)

Platform as a Service (PaaS)

Web ApplicationServicesPaaS Internet Multitenant architecture platform

very large-scale datacenter Business

Amazon Web Services Google App Engine Microsoft Windows Azure

Cloud Computing = Silver Bullet?

Problem of Data Lock-in

Some other Voices

Whats matter with ME?!

What you want to do with 1000pcs, or even 100,000 pcs?

Data Center is a Computer

Massive Scalable Reliable Resource Management Data Management

Simulations (Millennium) particle colliders

Particle events (LHC)

Maximilien Brice, CERN

Maximilien Brice, CERN

Maximilien Brice, CERN

Maximilien Brice, CERN

How much data?

NERSC User George Smoot wins 2006 Nobel Prize in Physics

Smoot and Mather 1992 COBE Experiment showed anisotropy of CMB

The Current CMB Map

source J. Borrill, LBNL

Evolution Of CMB Data Sets: Cost > O(Np^3 ) Limiting

COBE (1989) BOOMERanG (1998) (4yr) WMAP (2001)

Balloon, 1st HPC/NERSC

CMBPol (~2020) data compression

Example: Wikipedia Anthropology

Increasing fraction of edits are for work indirectly related to articles

Hadoop on 20-machine cluster

Example: Scene Completion

Image Database Grouped by Semantic Content

30 different Flickr.com groups 2.3 M images total (396 GB).

Select Candidate Images Most Suitable for Filling Hole

Example: Web Page Analysis

Bacteria: ~5 million bp Humans: ~3 billion bp

ATCTGATAAGTCCCAGGACTTCAGT GCAAGGCAAACCCGAGCCCAGTTT TCCAGTTCTAGAGTTTCACATGATC GGAGTTAGTAAAAGTCCACATTGAG

~144 GB of compressed sequence data

Evaluate running time on local 24 core cluster

Running time increases linearly with the number of reads

Example: Data Mining

Try on these collection:

2006870 Million , 2 TB. Google, Yahoo100+ Billion pages

Divide and Conquer

Parallel/Distributed Computing Programming Model

Typical problem solved by MapReduce

: key/value Map: extract something

Reduce: aggregate, summarize, filter, etc.

Word Frequencies in Web pages

one document per record map function

key = document URL value = document contents document<word, 1>

map (potentially many) key/value pairs.

MapReduce()key (shuffle/sort) reduce functionkeyvalues

MapReduce Runtime System

From Theory to Practice

3. Develop code locally

4. Submit MapReduce job 4a. Go back to Step 3 You Hadoop Cluster

5. Move data out of HDFS 6. Scp data from cluster

MapReduce MapReduce MapReduce