
google

gaojizuojing@126.com


-- (Web Crawlers)
-- Google 47
(Bayesian Networks)
Bloom Filter

--
2006-04-03 08:15:00
Google

: , Google

Google
Google
(Statistical Language Models)
Google

-
Noam Chomsky

(Claude Shannon)

(Fred Jelinek) IBM (Sabbatical Leave)

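Let S be a sequence of words w1, w2, ..., wn in a particular order, that is, a candidate sentence.

The question is how likely S is to occur in text, written as the probability P(S). By the chain rule for conditional probability, P(S) expands to:

P(S) = P(w1) P(w2|w1) P(w3|w1 w2) ... P(wn|w1 w2 ... wn-1)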


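Here P(w1) is the probability that the first word w1 occurs, and P(w2|w1) is the probability of the second word given the first, and so on.
The probability of wn depends on every word before it, which quickly becomes impossible to estimate.
We therefore assume that each word wi depends only on
the word immediately before it, wi-1 (the Markov assumption). P(S) then simplifies to:

P(S) = P(w1) P(w2|w1) P(w3|w2) ... P(wi|wi-1) ...
(One can also condition each word on the previous N-1 words.)
To estimate P(wi|wi-1), count in a corpus how often the pair
(wi-1, wi) occurs and how often wi-1 occurs on its own;
then P(wi|wi-1) = P(wi-1,wi) / P(wi-1)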
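
As a rough illustration of the estimate P(wi|wi-1) = P(wi-1,wi) / P(wi-1), here is a minimal sketch that counts words and adjacent pairs in a toy corpus. The corpus, the function names, and the lack of smoothing for unseen pairs are all assumptions made for the example, not part of the original article.

```python
from collections import Counter

def train_bigram(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def cond_prob(unigrams, bigrams, prev, word):
    """Estimate P(word | prev) as count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(unigrams, bigrams, words):
    """P(S) = P(w1) * P(w2|w1) * P(w3|w2) * ... under the bigram assumption."""
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for prev, word in zip(words, words[1:]):
        p *= cond_prob(unigrams, bigrams, prev, word)
    return p

# Hypothetical three-sentence corpus, already split into words.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram(corpus)
print(cond_prob(uni, bi, "the", "cat"))
print(sentence_prob(uni, bi, ["the", "cat", "sat"]))
```

A real model would need a far larger corpus and some form of smoothing so that unseen pairs do not get probability zero.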

Google
(NIST)
Google

997 20

Google


--
2006-04-10 08:10:00
: Google

-----

/ / / / / / / / / /


--

--
--
--

90

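Suppose a sentence S can be segmented in several ways. For simplicity, assume there are three:
A1, A2, A3, ..., Ak,
B1, B2, B3, ..., Bm
C1, C2, C3, ..., Cn
where A1, A2, B1, B2, C1, C2, and so on are words.
The best segmentation is the one with the highest probability. That is, if A1, A2, ..., Ak is the best, then

P(A1, A2, A3, ..., Ak) > P(B1, B2, B3, ..., Bm),

and

P(A1, A2, A3, ..., Ak) > P(C1, C2, C3, ..., Cn)

Rather than enumerating every segmentation, the best one can be found efficiently with dynamic programming (Dynamic Programming), using the Viterbi algorithm.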
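
The idea of picking the highest-probability segmentation with dynamic programming can be sketched as follows. This is a simplified illustration, not the article's system: it assumes each word has an independent probability (a unigram model), and the tiny Latin-letter vocabulary and the `max_len` limit are made up for the example.

```python
import math

def segment(sentence, word_prob, max_len=4):
    """Return the segmentation whose word probabilities have the largest product.

    best[i] holds (log-probability, word list) for the prefix sentence[:i];
    a word list of None means that prefix cannot be segmented.
    """
    best = [(0.0, [])] + [(-math.inf, None)] * len(sentence)
    for end in range(1, len(sentence) + 1):
        for start in range(max(0, end - max_len), end):
            word = sentence[start:end]
            if word not in word_prob or best[start][1] is None:
                continue
            score = best[start][0] + math.log(word_prob[word])
            if score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[-1][1]

# Hypothetical vocabulary with made-up probabilities.
word_prob = {"a": 0.05, "b": 0.05, "ab": 0.2, "c": 0.1, "bc": 0.01}
print(segment("abc", word_prob))  # ['ab', 'c']
```

Each prefix is solved once and reused, so the work grows linearly with sentence length instead of with the number of possible segmentations.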


Computational Linguistics

Google

Google

1. http://www.touchwrite.com/demo/LiangNanyuan-JCIP-1987.pdf
2. http://www.touchwrite.com/demo/GuoJin-JCIP-1993.pdf
3. Critical Tokenization and its Properties
   http://acl.ldc.upenn.edu/J/J97/J97-4004.pdf
4. Chinese word segmentation without using lexicon and hand-crafted training data
   http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=980775

--

2006-04-17 08:01:00
Google

--

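A source sends the signals s1, s2, s3, ... and the receiver observes o1, o2, o3, ...

Decoding means recovering the transmitted s1, s2, s3, ... from the observed o1, o2, o3, ...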

Hidden Markov Model


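Given the observed signals o1, o2, o3, ..., we want to infer the transmitted sequence s1, s2, s3, ...

In probabilistic terms, among all possible sequences we look for the s1, s2, s3, ... that maximizes the conditional probability

P(s1,s2,s3,...|o1,o2,o3,...)

By Bayes' rule, and dropping a denominator that does not depend on s1, s2, s3, ..., this is equivalent to maximizing

P(o1,o2,o3,...|s1,s2,s3,...) * P(s1,s2,s3,...)

Here P(o1,o2,o3,...|s1,s2,s3,...) is the probability that s1, s2, s3, ... is received as o1, o2, o3, ...,
and P(s1,s2,s3,...) is the probability that s1, s2, s3, ... is itself a plausible sequence.

We make two assumptions. First, s1, s2, s3, ... is a Markov chain: each si depends only on si-1.
Second, the observation oi at step i depends only on si (the independent output assumption), so that
P(o1,o2,o3,...|s1,s2,s3,...) = P(o1|s1) * P(o2|s2) * P(o3|s3) ...
Under these assumptions the Viterbi algorithm finds the maximizing sequence.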
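
To make the decoding step concrete, here is a minimal sketch of the Viterbi algorithm for a hidden Markov model. It maximizes P(o1|s1) * P(o2|s2) * ... times the Markov-chain probability of s1, s2, .... The two-state model and all of its probabilities are made-up illustrative values, not data from the article.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best hidden state sequence, its joint probability)."""
    # prob[s]: probability of the best path so far that ends in state s
    prob = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    path = {s: [s] for s in states}
    for obs in observations[1:]:
        new_prob, new_path = {}, {}
        for s in states:
            # Best predecessor of s under the Markov (si depends on si-1) assumption
            prev = max(states, key=lambda r: prob[r] * trans_p[r][s])
            new_prob[s] = prob[prev] * trans_p[prev][s] * emit_p[s][obs]
            new_path[s] = path[prev] + [s]
        prob, path = new_prob, new_path
    last = max(states, key=lambda s: prob[s])
    return path[last], prob[last]

# Hypothetical toy model: hidden states and what each tends to emit.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}

best_path, best_prob = viterbi(["normal", "cold", "dizzy"],
                               states, start_p, trans_p, emit_p)
print(best_path)  # ['Healthy', 'Healthy', 'Fever']
```

In practice the products are usually replaced by sums of logarithms to avoid numerical underflow on long sequences.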
s1,s2,s3,...

s1,s2,s3,...
s1,s2,s3,...
o1,o2,o3,...
o1,o2,o3,...


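P(o1,o2,o3,...|s1,s2,s3,...) goes by different names in different applications:
the acoustic model (Acoustic Model) in speech recognition, the translation model (Translation Model) in machine translation,
and the correction model (Correction Model) in spelling correction. P(s1,s2,s3,...) is the language model.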

Baum 60
IBM Fred Jelinek ()
Jim and Janet Baker ()
( 30%
10%)
Sphinx

<http://googlechinablog.com/2006/04/blog-post_17.html>

-- ?
2006-04-26 08:11:00
Google

: Google

1948
(shng)


? 1

32 1-16 ?
1-8 ? 9-16

bit

,
log (log32=5, log64=6

32

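H = -(p1 * log p1 + p2 * log p2 + ... + p32 * log p32)

where p1, p2, ..., p32 are the probabilities of the 32 possible outcomes and the logarithm is base 2.
Shannon called this quantity entropy (Entropy), usually written H and measured in bits. When all 32 outcomes are equally likely, H is exactly 5 bits; otherwise it is smaller.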
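
The entropy formula over 32 outcomes can be checked numerically. The sketch below is only an illustration; the skewed distribution is an invented example showing that unequal probabilities give less than 5 bits.

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)); terms with p == 0 are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1 / 32] * 32                 # all 32 outcomes equally likely
skewed = [0.5] + [0.5 / 31] * 31        # hypothetical: one outcome is a strong favorite

print(entropy(uniform))  # 5.0
print(entropy(skewed))
```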

7000
13 13
10% 95%
8-9
5
250 320KB
1MB
(redundancy)
250


--

2006-05-10 09:10:00
: Google
[
Google Page Rank ()

(George Boole)
1854, An Investigation of the Laws of Thought, on which are founded the Mathematical Theories of Logic and Probabilities

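Boolean algebra has two elements, 1 (TRUE) and 0 (FALSE), and three basic operations: AND, OR, and NOT.

AND | 1 0
----+----
 1  | 1 0
 0  | 0 0

If either operand of AND is 0, the result is 0; the result is 1 only when both operands are 1.

OR  | 1 0
----+----
 1  | 1 1
 0  | 1 0

If either operand of OR is 1, the result is 1; the result is 0 only when both operands are 0.

NOT |
----+--
 1  | 0
 0  | 1

NOT turns 1 into 0 and 0 into 1.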

80 1938

-TRUE, 1 -- FALSE, 0
AND
AND (NOT )
-
-
-
True False

1 0
0100100001100001...

0010100110000001...
AND
0000100000000001...
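
The AND of the two bit strings above can be reproduced directly. In this sketch each bit position stands for one document, and a 1 means the document contains the term; the function names and the 0-based document numbering are assumptions for illustration.

```python
def and_postings(bits_a, bits_b):
    """AND two equal-length bit strings, one bit per document."""
    width = len(bits_a)
    result = int(bits_a, 2) & int(bits_b, 2)
    return format(result, "0{}b".format(width))

def matching_docs(bits):
    """Return the 0-based positions of documents whose bit is 1."""
    return [i for i, bit in enumerate(bits) if bit == "1"]

term_a = "0100100001100001"   # documents containing the first query term
term_b = "0010100110000001"   # documents containing the second query term
both = and_postings(term_a, term_b)
print(both)                 # 0000100000000001
print(matching_docs(both))  # [4, 15]
```

Because the operation is a plain bitwise AND on machine words, it is extremely fast even for very long bit vectors.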


Alta Vista
3-5

(Shards)


--
(Web Crawlers)
2006-05-15 07:15:00
: Google
[

(Web Crawlers)
Google Trends
]

(Traverse)
Leonhard Euler, 1736
Königsberg

(BFS)

(DFS)

(Hyperlinks)

(Robot) (MIT) (Matthew Gray) 1993 ("www wanderer")

(Hash Table)

Google
200 200
634
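
A crawler that traverses the web breadth-first and uses a hash table to remember which pages it has already seen can be sketched as below. The `get_links` callback and the in-memory `toy_web` dictionary are stand-ins invented for the example; a real crawler would download and parse pages instead.

```python
from collections import deque

def bfs_crawl(start_url, get_links, max_pages=1000):
    """Visit pages breadth-first; a set (hash table) records URLs already seen."""
    visited = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for link in get_links(url):
            if link not in visited:   # constant-time lookup on average
                visited.add(link)
                queue.append(link)
    return order

# Hypothetical four-page web: each key links to the pages in its list.
toy_web = {"a": ["b", "c"], "b": ["d", "a"], "c": ["d"], "d": []}
crawl_order = bfs_crawl("a", lambda url: toy_web.get(url, []))
print(crawl_order)  # ['a', 'b', 'c', 'd']
```

Swapping the queue for a stack (popping from the same end that is pushed to) would turn the same loop into a depth-first traversal.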


--

2006-05-25 07:56:00
, Google

(Fred Jelinek)

(Perplexity)
Sphinx
997
997
60
20
(Mutual Information)
(Kullback-Leibler Divergence)

Bush
Kerry ""Kerry

Bush


(Gale)(Church)(Yarowsky)

(Mitch Marcus)

Kullback-Leibler Divergence

-TF/IDF)
TF/IDF TF/IDF
. (Thomas Cover)
""(Elements of Information Theory)
http://www.amazon.com/gp/product/0471062596/ref=nosim/103-7880775-7782209?n=283155
http://www.cnforyou.com/query/bookdetail1.asp?viBookCode=17909


--

2006-06-08 09:15:00
Google

.(Fred Jelinek)

D
A1949

Roman Jakobson (
[](Noam Chomsky)
--

IBM
"
"


Bahl Dragon
(Della Pietra)BCJR (Cocke)(Raviv)

IBM Google,


BCJR

IBM IBM
Almaden BCJR IBM IBM
IBM

IBM IBM

CLSP 20-30 CLSP


CLSP

IBM, AT&T
Google

Pascale


1
2
3
4
5
6


--

2006-06-27 09:53:00
Google
[(Page Rank)

(Term Frequency)
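For example, in a page of 1,000 words where the three query terms occur 2, 35, and 5 times, the term frequencies are 0.002, 0.035, and 0.005, and their sum is
0.042.
In general, if a query contains the keywords w1, w2, ..., wN, and their term frequencies in a given page are
TF1, TF2, ..., TFN (TF: term frequency), then the relevance of the page to the query is TF1 + TF2
+ ... + TFN.
In the example, the second term alone accounts for more than 80% of the total, even though it says almost nothing about the page's topic.
Such words are called stop words (Stopwords) and are left out of the calculation.

Without the stop word, the relevance in the example becomes 0.007:
0.002 from the first term and 0.005 from the third.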


1.

2.

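If a keyword w appears in Dw pages, then the larger Dw is, the smaller the weight of w should be, and vice versa.

The most widely used weight is the inverse document frequency (Inverse document frequency, IDF), log(D/Dw), where D is the total number of pages.
A stop word that appears in every page has Dw = D, so its
IDF is log(D/D) = log(1) = 0.
A specialized term that appears in one page out of every 500 has IDF = log(500) = 6.2.
A common term that appears in half of all pages has IDF = log(2),
which is only 0.7. (These are natural logarithms.)
With IDF, the relevance changes from a plain sum of term frequencies to a weighted sum: TF1*IDF1
+ TF2*IDF2 + ... + TFN*IDFN.
In the example, the relevance becomes 0.0159, of which the specialized term contributes 0.0124 and the common term 0.0035.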
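
The weighted sum TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN can be computed in a few lines. This sketch assumes natural logarithms (which is what gives log(500) = 6.2 and log(2) = 0.7) and an arbitrary collection size `D`; only the ratios D/Dw matter for the result.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency log(D / Dw), using the natural logarithm."""
    return math.log(total_docs / docs_with_term)

def relevance(tfs, idfs):
    """Weighted sum TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN."""
    return sum(tf * w for tf, w in zip(tfs, idfs))

D = 1000000000                      # assumed collection size, for illustration only
idfs = [idf(D, D // 500),           # specialized term: 1 page in 500
        idf(D, D),                  # stop word: appears in every page
        idf(D, D // 2)]             # common term: half of all pages
tfs = [0.002, 0.035, 0.005]         # term frequencies from the running example

print([round(w, 1) for w in idfs])  # [6.2, 0.0, 0.7]
print(round(relevance(tfs, idfs), 4))
```

With unrounded logarithms the weighted sum for the running example comes to about 0.0159, and the stop word contributes nothing because its IDF is zero.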

(term frequency/inverse document frequency)


TF/IDF
IDF [ (Karen Sparck Jones)

log(D/Dw)

(Salton) TF/IDF

2004 60 IDF

IDF
(Kullback-Leibler Divergence)

TF/IDF
TF/IDF
(Page Rank)



2006-07-05 09:09:00
Google

83

Google


AT&T
Mohri, Pereira Riley
C AT&T

AT&T AT&T
Google AT&T
C

AT&T


-- Google 47
.
2006-07-10 09:52:00
Google
(Nicolas Cage) (Lord of War)
47( AK47)(

47
(Google .
(Amit Singhal) Google 47 Google

Google
Matt Cutts
(Spam)

40%

Google

47

(debug)

(Salton) AT&T

AT&T


Google Google

Google
AT&T
Google
Google
2005
40
(RAID) (Randy Katz)


2006-07-20 10:12:00
Google

Google

TF/IDF
/TF/IDF)
TF/IDF TF/IDF


-----------------1
2
3

4
...
789
....
64000

64,000 TF/IDF

TF/IDF

Word index    TF/IDF value
==========    ============
1             0
2             0.0034
3             0
4             0.00052
5             0
...           ...
789           0.034
...           ...
64000         0.075

64,000
64,000

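Let a, b, and c be the sides of a triangle and A, B, and C the opposite angles. The cosine of angle A is

cos A = (b^2 + c^2 - a^2) / (2bc)

Treating the two sides b and c as vectors that start at A, this is equivalent to

cos A = <b, c> / (|b| |c|)

where the numerator is the inner product of b and c, and the denominator is the product of their lengths.
For two documents X and Y with feature vectors
x1,x2,...,x64000 and
y1,y2,...,y64000,
the cosine of the angle between them is

cos(theta) = (x1*y1 + x2*y2 + ... + x64000*y64000) / (sqrt(x1^2 + x2^2 + ... + x64000^2) * sqrt(y1^2 + y2^2 + ... + y64000^2))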
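
Comparing two TF/IDF vectors x and y by the cosine of the angle between them can be sketched as follows. The short five-component vectors are invented stand-ins for the 64,000-component vectors in the text.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical TF/IDF vectors, truncated to five words.
x = [0, 0.0034, 0, 0.00052, 0]
y = [0, 0.0068, 0, 0.00104, 0]   # same direction as x: same topic mix
z = [0.01, 0, 0.02, 0, 0]        # shares no words with x

print(cosine(x, y))  # close to 1: the vectors point the same way
print(cosine(x, z))  # 0.0: no words in common
```

A cosine near 1 means the two documents use words in nearly the same proportions, while a cosine near 0 means they have little vocabulary in common. Document length cancels out because both vectors are normalized.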



2006-08-03 11:17:00
Google

(Fingerprint)

(URL)
Google

http://www.baidu.com/s?ie=gb2312&bs=%CA%FD%D1%A7%D6%AE%C3%C0&sr=&z=&cl=3&f=8&wd=%CE%E2%BE%FC+%CA%FD%D1%A7%D6%AE%C3%C0&ct=0
200 2 TB
GB 50% 4 TB

200 128
16 :

893249432984398432980545454543
16

1/6 16 Fingerprint)

128

(prng) prng

1001 9 01010001 ( 81
0100
MersenneTwister

, ,

Cookie cookie


cookie MersenneTwister

(csprng) MD5 SHA1


128 160
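
A 128-bit fingerprint of a URL can be sketched with MD5 from Python's standard library, which maps a string of any length to 16 bytes. The helper names and the `seen` set are assumptions for illustration; this is one way to build such a fingerprint, not necessarily the one used in production systems.

```python
import hashlib

def fingerprint(url):
    """Map a URL of any length to a 128-bit (16-byte) integer using MD5."""
    digest = hashlib.md5(url.encode("utf-8")).digest()  # always 16 bytes
    return int.from_bytes(digest, "big")

seen = set()  # stores fixed-size fingerprints instead of full URL strings

def first_visit(url):
    """Return True the first time a URL is seen, False on later calls."""
    fp = fingerprint(url)
    if fp in seen:
        return False
    seen.add(fp)
    return True
```

Storing 16-byte fingerprints instead of full URL strings keeps the table small and makes lookups cheap. MD5 is fine for deduplication like this, but it is no longer considered safe where an attacker might craft collisions on purpose.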

SHA1



2006-08-09 09:12:00
Google

[
Google


8-10


Verrier

Google
.
.

.
.

/TF/IDF)
(page rank)

2006-08-23 11:22:00
Google

(Michael Collins)

(Mitch Marcus)
(MIT)

(sentence parser)
(Eric Brill)
Ratnaparkhi Eisner


AT&T

AT&T MIT
MIT

(Eric Brill)

(transformation rule based machine learning)


chang

(part of speech tagging)


Google

Google
Google



--
2006-10-08 07:27:00
Google
[
(the maximum entropy principle)
]
Google

"wang-xiao-bo"
()


(maximum entropy)

AT&T
1/6

1/3 2/15 1/3
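
The fractions above fit a standard illustration of the maximum entropy principle, sketched here under the assumption that they describe a six-sided die. With no information, every face gets 1/6. If one face is known to have probability 1/3, the least committal choice spreads the remaining 2/3 evenly, giving 2/15 to each of the other five faces. The alternative distribution `other` is invented for comparison.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; terms with p == 0 are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

fair = [1 / 6] * 6                         # no constraints: uniform
maxent = [2 / 15] * 5 + [1 / 3]            # one face fixed at 1/3, rest spread evenly
other = [0.3, 0.1, 0.1, 0.1, 1 / 15, 1 / 3]  # hypothetical: same constraint, uneven rest

print(entropy(fair))
print(entropy(maxent))
print(entropy(other))
```

Among distributions that satisfy the same constraint, the evenly spread one has the highest entropy, which is the sense in which it makes no extra assumptions.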

wang-xiao-bo

Csiszar
--
w3
w1 w2
subject

lambda Z


2006-11-16 06:50:00
Google

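GIS (generalized iterative scaling). GIS works roughly as follows:
1. Start at iteration zero with a uniform model.
2. Use the model from iteration N to estimate the distribution of each feature in the training data. Where the estimate exceeds the observed value, make the corresponding parameter smaller; otherwise, make it larger.
3. Repeat step 2 until convergence.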

GIS Darroch Ratcliff


(Csiszar)
Darroch Ratcliff GIS
64
GIS


(Della Pietra) IBM GIS


IIS (improved iterative scaling)

IBM

IBM (Adwait Ratnaparkhi)

IIS


(language model) 20 SUN

Google

IBM
IBM
(hedge fund)---- (Renaissance Technologies)

1988
34% 1988 200
(Berkshire Hathaway)
16


2006-11-28 03:18:00
Google

(SPAM)

(page rank)

Google Google

Matt Cutts Google


(
""

Google


Google (


--

2007-01-01 03:10:00
Google

()

Singular Value Decomposition


(SVD) A


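Here M = 1,000,000 and N = 500,000. The element in row i, column j is the weighted term frequency (for example, TF/IDF) of word j
in article i.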

X B
Y 1.5

Y
A

Google MapReduce
Google
Google
Google



(Bayesian Networks)
2007-01-28 09:53:00
Google

(Markov Chain)

(belief)


(belief networks)

NP-complete

IBM Watson (Geoffrey Zweig)


(Jeff Bilmes)

Google Google



Bloom Filter
2007-07-03 09:35:00
Google

FBI

hash table

Yahoo,Hotmail
Gmail email
spammer email

email 1.6GB
email
googlechinablog.com/2006/08/blog-post.html
50% email
1.6GB GB


1/8 1/4

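For each email address X,
use eight different random number generators (hash functions) F1, F2, ..., F8 to produce eight fingerprints f1, f2, ..., f8.
Then use another generator G to map these fingerprints to eight positions g1, g2, ..., g8 in the bit vector, counting from 1,
and set the bits at those eight positions to 1. Repeating this for every
email address in the list builds the filter.

To test whether a suspect address Y is in the list,
apply the same eight generators F1, F2, ..., F8 to Y
to get eight fingerprints s1, s2, ..., s8, and map them to eight bit positions t1, t2, ..., t8. If Y is in the list,
then the bits at t1, t2, ..., t8 must all be 1.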
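
The scheme with eight functions F1, ..., F8 setting eight bits per address can be sketched as a small Bloom filter class. Two details are assumptions of this sketch rather than anything stated in the article: the eight functions are simulated by salting SHA-1 with an index, and positions are 0-based.

```python
import hashlib

class BloomFilter:
    """A bit vector plus several hash functions; supports add and membership test."""

    def __init__(self, num_bits, num_hashes=8):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Simulate num_hashes independent functions by prefixing an index.
        for i in range(self.num_hashes):
            data = "{}:{}".format(i, item).encode("utf-8")
            digest = hashlib.sha1(data).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        """Set the bit at every position derived from the item."""
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        """False means definitely absent; True means present or a false positive."""
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter(num_bits=1 << 20)
bf.add("spam@example.com")
print(bf.might_contain("spam@example.com"))  # True
```

An address that was added is always reported as present. An address that was never added is almost always reported as absent, but a false positive is possible when all of its bit positions happen to have been set by other entries.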



2007-04-13 07:03:00
Google

(Mitch Marcus)

AT&T

LDC

(corpus)

DARPA
Penn Treebank
Penn Treebank
LDC

LDC

trial-and-error

bioinformatics (




2007-09-13 09:00:00
Google


http://ent.sina.com.cn/v/2005-10-17/ba866985.shtml

EBKTBP CAESAR



0543 0543
054337372947

AF

AF
AF


Caesar
Ascii X=099097101115097114

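1. Pick two large primes P and Q, the larger the better, for example 100 digits long, and compute
N = P * Q, M = (P - 1) * (Q - 1)

2. Pick an integer E that is coprime with M, that is, E and M have no common divisor other than 1.

3. Find an integer D such that E * D leaves remainder 1 when divided by M, that is, E * D mod M = 1.

E is the public key, used for encryption.
D is the private key, used for decryption. N is public as well.

To encrypt X into the ciphertext Y, compute
X^E mod N = Y

Decryption requires the private key D: compute
Y^D mod N = X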
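
The three key-generation steps (N = P * Q, M = (P - 1) * (Q - 1), and E * D mod M = 1) can be tried out with deliberately tiny primes. The values 61, 53, and 17 are illustrative assumptions only; real keys use primes hundreds of digits long, plus padding schemes that this sketch omits. The three-argument `pow` with exponent -1 needs Python 3.8 or later.

```python
from math import gcd

def make_keys(p, q, e):
    """Return (N, E, D) for primes p and q and a public exponent e coprime with M."""
    n = p * q
    m = (p - 1) * (q - 1)
    if gcd(e, m) != 1:
        raise ValueError("E must be coprime with M")
    d = pow(e, -1, m)   # modular inverse: E * D mod M == 1
    return n, e, d

def encrypt(x, e, n):
    """Y = X^E mod N"""
    return pow(x, e, n)

def decrypt(y, d, n):
    """X = Y^D mod N"""
    return pow(y, d, n)

n, e, d = make_keys(61, 53, 17)   # toy primes, far too small for real use
y = encrypt(65, e, n)
print(n, e, d)
print(decrypt(y, d, n))           # 65
```

Anyone who knows N and E can encrypt, but recovering D requires M, and computing M requires the factors P and Q of N.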


1.

2.

N,E D ,

3. E D

N N P Q
P Q
P Q 50
RSA-158

39505874583265144526419767800614481996020776460304936454139376051579355626529450683609727842468219535093544305870490251995655335710209799226484977949442955603
= 3388495837466721394368393204672181522815830368604993048084925840555281177
* 11658823406671259903148376558383270818131012258146392600439520994131344334162924536139

N
N=PQ

P Q

game theory




2007-12-03 10:05:00
Google

6700
26 676
6700

p1, p2, p3, ..., p6700

L1, L2, L3, ..., L6700

p1 * L1 + p2 * L2 + ... + p6700 * L6700

GBK


http://www.googlechinablog.com/2006/04/4.html
H = -p1 * log p1 - ... - p6700 * log p6700

26
log 26 = 4.7, 10/4.7 = 2.1

8
8/4.7=1.7

http://www.googlechinablog.com/2006/04/blog-post.html
6 6/4.7=1.3
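
The divisions above follow one pattern: an entropy in bits divided by log 26 (base 2), the most information a single keystroke on a 26-letter keyboard can carry. The sketch below reproduces them and also shows the average code length p1 * L1 + p2 * L2 + ...; the function names are assumptions for illustration.

```python
import math

def min_keystrokes(entropy_bits, alphabet_size=26):
    """Lower bound on average keystrokes per character: H / log2(alphabet size)."""
    return entropy_bits / math.log2(alphabet_size)

def average_code_length(probs, lengths):
    """Average code length p1*L1 + p2*L2 + ... for given probabilities and lengths."""
    return sum(p * length for p, length in zip(probs, lengths))

for bits in (10, 8, 6):
    print(bits, round(min_keystrokes(bits), 1))
# 10 2.1
# 8 1.7
# 6 1.3
```

The lower the entropy per character, the fewer keystrokes an ideal input method needs, which is why better language models lead to shorter input.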


2.98

100

http://tools.google.com/pinyin/


2008-10-14 08:34:00
Google

GoogleT-Mobile HTC
Android 3G

Dynamic Programming


shortest path

Dynamic Programming
programming

-> ->->

->->->


101015
10 15


Let Y1, Y2, Y3, ..., YN be the input sequence. W11, W12, W13 are the candidates for Y1,
W21, W22, W23, W24 are the candidates for Y2, and so on.
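
Finding a shortest path by dynamic programming, and choosing one candidate W per input Y, share the same structure: a graph arranged in stages where only the best path to each node needs to be kept. The sketch below illustrates this on a small staged graph; the node names and distances are invented for the example.

```python
def shortest_path(stages, cost):
    """Dynamic programming over a staged graph.

    stages: list of lists of node names; edges only join consecutive stages.
    cost(u, v): length of the edge from u to v.
    Returns (total length, list of nodes on the best path).
    """
    # best[node] = (length of the best path reaching node, that path)
    best = {node: (0, [node]) for node in stages[0]}
    for layer in stages[1:]:
        new_best = {}
        for v in layer:
            u = min(best, key=lambda r: best[r][0] + cost(r, v))
            new_best[v] = (best[u][0] + cost(u, v), best[u][1] + [v])
        best = new_best
    end = min(best, key=lambda r: best[r][0])
    return best[end]

# Hypothetical map: start S, two intermediate stages, destination T.
dist = {("S", "A"): 1, ("S", "B"): 4,
        ("A", "C"): 5, ("A", "D"): 2, ("B", "C"): 1, ("B", "D"): 3,
        ("C", "T"): 1, ("D", "T"): 6}
stages = [["S"], ["A", "B"], ["C", "D"], ["T"]]
length, route = shortest_path(stages, lambda u, v: dist[(u, v)])
print(length, route)  # 6 ['S', 'B', 'C', 'T']
```

Each stage is processed once, so the work grows with the number of stages times the square of the stage width, rather than with the number of complete paths. For an input method, the stages would be the candidates for Y1, Y2, ..., and the edge costs would come from a language model.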

