gaojizuojing@126.com
-- (Web Crawlers)
-- Google 47
(Bayesian Networks)
Bloom Filter
--
2006-04-03 08:15:00
Google
Google
Google
(Statistical Language Models)
Google
-
Noam Chomsky
(Claude Shannon)
(Fred Jelinek) IBM (Sabbatical Leave)
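Let S be a sentence made up of a sequence of words w1, w2, ..., wn in a particular order. How likely S is to appear in text is its probability P(S). Assuming that each word wi depends only on the word wi-1 immediately before it (the Markov assumption), P(S) can be written as
P(S) = P(w1) P(w2|w1) P(w3|w2) ... P(wi|wi-1) ...
(If each word is instead assumed to depend on the N-1 words before it, the model is an N-gram model.)
To estimate P(wi|wi-1), count how often the pair (wi-1, wi) occurs in a corpus and how often wi-1 occurs on its own:
P(wi|wi-1) = P(wi-1,wi) / P(wi-1)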
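The estimate P(wi|wi-1) = P(wi-1,wi) / P(wi-1) can be tried out on a toy corpus. The following is a minimal sketch, not the method used in any production system: the function names `train_bigram` and `sentence_probability` and the three-sentence corpus are made up for illustration, probabilities are plain relative frequencies, and no smoothing is applied, so any unseen word pair gives probability zero.

```python
from collections import Counter

def train_bigram(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_probability(words, unigrams, bigrams):
    """P(S) = P(w1) * P(w2|w1) * ... * P(wn|wn-1), estimated from counts."""
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # the previous word was never seen at all
        # count(prev, cur) / count(prev) stands in for P(prev, cur) / P(prev)
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

# Hypothetical toy corpus, already split into words.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
p_seen = sentence_probability(["the", "cat", "sat"], uni, bi)
p_unseen = sentence_probability(["cat", "the", "sat"], uni, bi)
print(p_seen, p_unseen)
```

A sentence in a plausible word order gets a much higher probability than the same words scrambled, which is the property the model relies on.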
Google
(NIST)
Google
997 20
--
2006-04-10 08:10:00
: Google
-----
/ / / / / / / / / /
--
--
--
--
90
S
A1, A2, A3, ..., Ak,
B1, B2, B3, ..., Bm
C1, C2, C3, ..., Cn
A1, A2, B1, B2, C1, C2
A1,A2,..., Ak P
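Choosing, among candidate segmentations such as A1, A2, ..., Ak or B1, B2, ..., Bm, the one with the highest probability P can be sketched with a small dynamic program. This is only an illustration under loud assumptions: words are treated as independent (a unigram model rather than a full language model), the vocabulary and its probabilities are invented, and an unspaced English string stands in for Chinese text.

```python
import math

def segment(text, word_prob):
    """Return the most probable split of `text` into vocabulary words."""
    # best[i] = (log-probability, words) of the best segmentation of text[:i]
    best = [(0.0, [])] + [(-math.inf, [])] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            word = text[start:end]
            if word in word_prob and best[start][0] > -math.inf:
                score = best[start][0] + math.log(word_prob[word])
                if score > best[end][0]:
                    best[end] = (score, best[start][1] + [word])
    return best[-1][1]

# Hypothetical word probabilities, invented for this example.
vocab = {"now": 0.02, "here": 0.01, "no": 0.03, "where": 0.005, "nowhere": 0.0001}
words = segment("nowhere", vocab)
print(words)
```

The three candidates here are "nowhere", "no where" and "now here"; the sketch compares their probabilities without enumerating every split explicitly.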
Computational Linguistics
1. http://www.touchwrite.com/demo/LiangNanyuan-JCIP-1987.pdf
2. http://www.touchwrite.com/demo/GuoJin-JCIP-1993.pdf
3. Critical Tokenization and its Properties, http://acl.ldc.upenn.edu/J/J97/J97-4004.pdf
4. Chinese word segmentation without using lexicon and hand-crafted training data, http://portal.acm.org/citation.cfm?coll=GUIDE&dl=GUIDE&id=980775
--
2006-04-17 08:01:00
Google
--
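The task is to find the sequence s1,s2,s3,... that maximizes P(s1,s2,s3,...|o1,o2,o3,...). By Bayes' rule this conditional probability is proportional to
P(o1,o2,o3,...|s1,s2,s3,...) * P(s1,s2,s3,...)
Two assumptions simplify the problem. First, s1,s2,s3,... is a Markov chain, so si depends only on si-1. Second, the i-th observation oi depends only on si, so that
P(o1,o2,o3,...|s1,s2,s3,...) = P(o1|s1) * P(o2|s2) * P(o3|s3) * ...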
Viterbi
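The Viterbi algorithm finds the state sequence s1,s2,s3,... that maximizes P(o1,o2,o3,...|s1,s2,s3,...) * P(s1,s2,s3,...) under the two assumptions above. The following is a minimal sketch; the function signature and the small health/symptom model used to exercise it are illustrative assumptions, not taken from the text.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state path and its joint probability."""
    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for o in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev, score = max(
                ((r, best[-1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            scores[s] = score * emit_p[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical two-state model used only to exercise the function.
states = ["Healthy", "Fever"]
start_p = {"Healthy": 0.6, "Fever": 0.4}
trans_p = {"Healthy": {"Healthy": 0.7, "Fever": 0.3},
           "Fever": {"Healthy": 0.4, "Fever": 0.6}}
emit_p = {"Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
          "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6}}
path, prob = viterbi(["normal", "cold", "dizzy"], states, start_p, trans_p, emit_p)
print(path, prob)
```

The work grows linearly with the length of the observation sequence, instead of exponentially as it would if every possible state sequence were scored separately.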
s1,s2,s3,...
s1,s2,s3,...
s1,s2,s3,...
o1,o2,o3,...
o1,o2,o3,...
P (o1,o2,o3,...|s1,s2,s3....)
(Acoustic Model) (Translation Model)
(Correction Model) P (s1,s2,s3,...)
Baum 60
IBM Fred Jelinek ()
Jim and Janet Baker ()
( 30%
10%)
Sphinx
<http://googlechinablog.com/2006/04/blog-post_17.html>
-- ?
2006-04-26 08:11:00
Google
1948
(shng)
? 1
32 1-16 ?
1-8 ? 9-16
bit
,
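The amount of information is the logarithm (base 2) of the number of equally likely possibilities (log 32 = 5, log 64 = 6). When the 32 outcomes are not equally likely, the exact amount of information is
H = -(p1 * log p1 + p2 * log p2 + ... + p32 * log p32)
where p1, p2, ..., p32 are the probabilities of the 32 outcomes. Shannon called this quantity entropy (Entropy), written H and measured in bits; it is largest, 5 bits, when all 32 outcomes are equally likely.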
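The entropy formula is easy to check numerically. The sketch below assumes base-2 logarithms (so the result is in bits); the skewed distribution is an invented example of 32 outcomes that are not equally likely.

```python
import math

def entropy(probabilities):
    """H = -(p1*log2 p1 + ... + pn*log2 pn); zero-probability terms contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# 32 equally likely outcomes.
uniform = [1 / 32] * 32
h_uniform = entropy(uniform)

# Hypothetical skewed case: one favourite with probability 1/2,
# the other 31 outcomes sharing the rest equally.
skewed = [0.5] + [0.5 / 31] * 31
h_skewed = entropy(skewed)

print(h_uniform, h_skewed)
```

The uniform case gives exactly 5 bits, and any prior knowledge that makes some outcomes more likely lowers the entropy.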
7000
13 13
10% 95%
8-9
5
250 320KB
1MB
redundancy)
250
--
2006-05-10 09:10:00
: Google
[
Google Page Rank ()
George Boole published his ideas in 1854 in An Investigation of the Laws of Thought, on Which Are Founded the Mathematical Theories of Logic and Probabilities.
Boolean algebra has only two values, 1 (TRUE) and 0 (FALSE), and three basic operations: AND, OR and NOT.
AND | 1  0
----+------
 1  | 1  0
 0  | 0  0
AND 0 0
1 1(0),
10
OR  | 1  0
----+------
 1  | 1  1
 0  | 1  0
OR 1 1 0
00
11
NOT |
----+---
 1  | 0
 0  | 1
NOT 1 0 0 11
0
80 1938
-TRUE, 1 -- FALSE, 0
AND
AND (NOT )
-
-
-
True False
1 0
0100100001100001...
0010100110000001...
AND
0000100000000001...
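The AND of the two bit strings above can be reproduced directly. The sketch assumes that each bit position stands for one document and that a 1 means the keyword occurs in that document; the variable names are illustrative, and positions are counted from 0 at the left.

```python
# One bit per document: 1 means the keyword occurs in that document.
docs_with_first_word = "0100100001100001"
docs_with_second_word = "0010100110000001"

# Bitwise AND keeps only the documents that contain both keywords.
both = int(docs_with_first_word, 2) & int(docs_with_second_word, 2)
result = format(both, "016b")

# Positions (counted from 0 at the left) of the matching documents.
matching = [i for i, bit in enumerate(result) if bit == "1"]
print(result, matching)
```

Because a machine word holds many bits, one AND instruction tests many documents at once, which is why this representation is fast.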
Alta Vista
3-5
Shards)
--
(Web Crawlers)
2006-05-15 07:15:00
: Google
[
(Web Crawlers)
Google Trends
]
Traverse)
Leonhard Euler, 1736
Königsberg
BFS)
DFS)
Hyperlinks)
The first web crawler ("Robot") was written in 1993 by Matthew Gray, a student at MIT, who named it the "www wanderer".
(Hash Table)
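A breadth-first traversal (BFS) that uses a hash table to remember which pages have already been downloaded can be sketched as follows. To stay self-contained the sketch walks an in-memory dictionary instead of fetching real pages; `toy_web` and the `get_links` callback are hypothetical stand-ins for downloading a page and extracting its hyperlinks.

```python
from collections import deque

def bfs_crawl(start, get_links):
    """Visit every page reachable from `start` in breadth-first order."""
    visited = {start}        # hash-based set: pages already queued
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)   # a real crawler would download the page here
        for link in get_links(page):
            if link not in visited:   # average O(1) lookup
                visited.add(link)
                queue.append(link)
    return order

# Hypothetical link graph: page -> pages it links to.
toy_web = {"A": ["B", "C"], "B": ["D"], "C": ["D", "A"], "D": []}
order = bfs_crawl("A", lambda page: toy_web.get(page, []))
print(order)
```

Replacing the queue with a stack (append and pop from the same end) turns the same loop into a depth-first traversal (DFS).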
Google
200 200
634
--
2006-05-25 07:56:00
, Google
(Fred Jelinek)
(Perplexity)
Sphinx
997
997
60
20
Mutual Information)
Kullback-Leibler Divergence)
Bush
Kerry ""Kerry
Bush
(Gale)(Church)(Yarowsky)
(Mitch Marcus)
Kullback-Leibler Divergence
-TF/IDF)
TF/IDF TF/IDF
. (Thomas Cover)
""(Elements of Information Theory)
http://www.amazon.com/gp/product/0471062596/ref=nosim/103-7880775-7782209?n=283155
http://www.cnforyou.com/query/bookdetail1.asp?viBookCode=17909
--
2006-06-08 09:15:00
Google
.(Fred Jelinek)
D
A1949
Roman Jakobson (
[](Noam Chomsky)
--
IBM
"
"
Bahl Dragon
(Della Pietra)BCJR (Cocke)(Raviv)
IBM Google,
BCJR
IBM IBM
Almaden BCJR IBM IBM
IBM
IBM IBM
IBM, AT&T
Google
Pascale
1
2
3
4
5
6
--
2006-06-27 09:53:00
Google
[(Page Rank)
Term
Frequency)
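If the three query terms occur 2, 35 and 5 times in a page of 1,000 words, their term frequencies are 0.002, 0.035 and 0.005. The sum of the three, 0.042, is a simple measure of how relevant the page is to the query.
In general, for a query with keywords w1, w2, ..., wN whose term frequencies in a given page are TF1, TF2, ..., TFN (TF: term frequency), the relevance of the page to the query is: TF1 + TF2 + ... + TFN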
80%
Stopwords)
0.007
0.002 0.005
1.
2.
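If a keyword w appears in Dw web pages, then the larger Dw is, the smaller the weight of w should be. The most widely used weight is the inverse document frequency (IDF), log(D/Dw), where D is the total number of pages.
A word that appears in every page (Dw = D) has IDF = log(D/D) = log(1) = 0. A word with D/Dw = 500 has IDF = log(500) = 6.2, and a word that appears in half of all pages has IDF = log(2) = 0.7 (natural logarithms).
With IDF as the weight, the relevance formula changes from a plain sum of term frequencies to the weighted sum TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN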
0.0161 0.0126 0.0035
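The weighted sum TF1*IDF1 + TF2*IDF2 + ... + TFN*IDFN can be computed in a few lines. This is a minimal sketch on an invented four-document corpus; the function name `tf_idf_score` is an assumption, and natural logarithms are used to match log(500) = 6.2 and log(2) = 0.7 above.

```python
import math
from collections import Counter

def tf_idf_score(query_terms, doc_tokens, corpus):
    """Relevance of one document to a query: sum of TF * IDF over query terms."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus)                            # D
    score = 0.0
    for term in query_terms:
        tf = counts[term] / len(doc_tokens)         # term frequency in this document
        dw = sum(1 for doc in corpus if term in doc)  # Dw: documents containing term
        if dw == 0:
            continue                                # term unknown to the corpus
        score += tf * math.log(n_docs / dw)         # IDF = log(D / Dw)
    return score

# Hypothetical corpus, already split into words.
corpus = [
    "the application of atomic energy".split(),
    "the application of software".split(),
    "the history of the city".split(),
    "the weather".split(),
]
query = ["atomic", "the", "application"]
scores = [tf_idf_score(query, doc, corpus) for doc in corpus]
print(scores)
```

The word "the" occurs in every document, so its IDF is log(1) = 0 and it contributes nothing, however frequent it is.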
log(D/Dw)
Salton) TF/IDF
2004 60 IDF
IDF
Kullback-Leibler Divergence)
TF/IDF
TF/IDF
(Page Rank)
2006-07-05 09:09:00
Google
83
AT&T
Mohri, Pereira Riley
C AT&T
AT&T AT&T
Google AT&T
C
AT&T
-- Google 47
.
2006-07-10 09:52:00
Google
.Nicolas Cage)Lord of War)
47( AK47)(
47
(Google .
(Amit Singhal) Google 47 Google
Google
Matt Cutts
Spam)
40%
47
debug)
(Salton) AT&T
AT&T
Google Google
Google
AT&T
Google
Google
2005
40
RAID)(Randy Katz)
2006-07-20 10:12:00
Google
TF/IDF
/TF/IDF)
TF/IDF TF/IDF
-----------------1
2
3
4
...
789
....
64000
64,000 TF/IDF
TF/IDF
========
1
2
3
4
5
...
789
...
64000
======
0
0.0034
0
0.00052
0
0.034
0.075
64,000
64,000
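Let a, b and c be the three sides of a triangle, opposite the angles A, B and C. The law of cosines gives the cosine of angle A as
cos A = (b^2 + c^2 - a^2) / (2bc)
If the two sides b and c are treated as vectors starting from the vertex A, the same quantity is
cos A = <b, c> / (|b| |c|)
where <b, c> is the inner product of b and c, and |b| and |c| are their lengths.
For two news articles X and Y with feature vectors
x1,x2,...,x64000
y1,y2,...,y64000,
the cosine of the angle between them is
cos(X, Y) = (x1*y1 + x2*y2 + ... + x64000*y64000) / (sqrt(x1^2 + x2^2 + ... + x64000^2) * sqrt(y1^2 + y2^2 + ... + y64000^2))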
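The cosine of the angle between two feature vectors takes only a few lines to compute. The sketch below uses short made-up vectors in place of the 64,000-dimensional TF/IDF vectors; the function name is illustrative.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_x * norm_y)

# Hypothetical feature vectors.
same_direction = cosine_similarity([1, 2, 3], [2, 4, 6])
orthogonal = cosine_similarity([1, 0], [0, 1])
print(same_direction, orthogonal)
```

Two articles whose vectors point in the same direction score close to 1 regardless of article length, while articles with no words in common score 0.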
2006-08-03 11:17:00
Google
Fingerprint)
URL)
Google
http://www.baidu.com/s?ie=gb2312&bs=%CA%FD%D1%A7%D6%AE%C3%C0&sr=&z=&cl=3&f=8&wd=%CE%E2%BE%FC+%CA%FD%D1%A7%D6%AE%C3%C0&ct=0
200 2 TB
GB 50% 4 TB
200 128
16 :
893249432984398432980545454543
16
1/6 16 Fingerprint)
128
prng) prng
1001 9 01010001 ( 81
0100
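The example above, 1001 squared giving 01010001 and the middle four bits 0100, can be reproduced with a short sketch of the middle-square idea. The function name and the `bits` parameter are illustrative; as the text goes on to say, better generators such as MersenneTwister are used in practice.

```python
def middle_square(seed, bits=4):
    """Square a `bits`-wide number and keep the middle `bits` of the result."""
    square = seed * seed                 # at most 2 * bits wide
    drop = bits // 2                     # bits discarded from each end
    return (square >> drop) & ((1 << bits) - 1)

seed = 0b1001                            # 9 in decimal
square_bits = format(seed * seed, "08b") # 81 in decimal, as 8 bits
next_value = middle_square(seed)
print(square_bits, format(next_value, "04b"))
```

Feeding each output back in as the next seed gives a sequence, but it quickly falls into short cycles, which is one reason the method was abandoned.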
MersenneTwister
, ,
Cookie cookie
cookie MersenneTwister
SHA1
2006-08-09 09:12:00
Google
[
Google
8-10
Verrier
Google
.
.
.
.
/TF/IDF)
page rank)
2006-08-23 11:22:00
Google
(Michael Collins)
(Mitch Marcus)
(MIT)
(sentence parser)
(Eric Brill)
Ratnaparkhi Eisner
AT&T
AT&T MIT
MIT
(Eric Brill)
chang
Google
Google
--
2006-10-08 07:27:00
Google
[
(the maximum entropy principle)
]
Google
"wang-xiao-bo"
()
(maximum entropy)
AT&T
1/6
wang-xiao-bo
Csiszar
--
w3
w1 w2
subject
lambda Z
2006-11-16 06:50:00
Google
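GIS (generalized iterative scaling). The GIS procedure can be summarized as:
1. Take the uniform distribution as the initial model for iteration zero.
2. Use the model from iteration N to estimate the distribution of each feature in the training data; if the estimate exceeds the observed value, decrease the corresponding model parameter, otherwise increase it.
3. Repeat step 2 until convergence.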
IBM
IIS
IBM
IBM
(hedge fund)---- (Renaissance Technologies)
1988
34% 1988 200
Berkshire Hathaway)
16
2006-11-28 03:18:00
Google
(SPAM)
(page rank)
Google Google
Google (
--
2007-01-01 03:10:00
Google
()
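Here M = 1,000,000 and N = 500,000. The element in row i and column j is the weighted frequency (for example, TF/IDF) of the j-th dictionary word in the i-th article.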
X B
Y 1.5
Y
A
Google MapReduce
Google
Google
Google
(Bayesian Networks)
2007-01-28 09:53:00
Google
(Markov Chain)
(belief)
(belief networks)
NP-complete
Google Google
Bloom Filter
2007-07-03 09:35:00
Google
FBI
hash table
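Public email providers such as Yahoo, Hotmail and Gmail constantly have to filter out mail sent by spammers. One approach is to record the email addresses that send spam. Storing those email addresses in a hash table takes about 1.6GB of memory.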
email
googlechinablog.com/2006/08/blog-post.html
50% email
1.6GB GB
1/8 1/4
X
F1,F2, ...,F8 f1, f2, ..., f8
G 1 g1, g2, ...,g8
email
email
Y
F1, F2, ..., F8
s1,s2,...,s8 t1,t2,...,t8 Y
t1,t2,..,t8
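The scheme above, with eight functions F1, ..., F8 each mapping an address to one bit that is set to 1, can be sketched as a small class. This is an illustration under stated assumptions rather than the implementation described in the text: salted SHA-256 digests stand in for the eight fingerprint functions, the bit array is far smaller than a real one, and the addresses are invented.

```python
import hashlib

class BloomFilter:
    """Set membership with possible false positives but no false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=8):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)   # all bits start at 0

    def _positions(self, item):
        # Stand-in for F1..F8 followed by the mapping to bit positions g1..g8.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

blacklist = BloomFilter()
blacklist.add("spammer@example.com")             # hypothetical address
print("spammer@example.com" in blacklist)        # always True once added
print("friend@example.com" in blacklist)         # almost certainly False
```

An address that was added is always reported as present. An address that was never added can be reported as present only if all eight of its bits happen to have been set by other addresses, which is the small false-positive rate the method trades for its memory savings.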
2007-04-13 07:03:00
Google
(Mitch
Marcus)
AT&T
LDC
(corpus)
DARPA
PennTree
Bank PennTree Bank
LDC
LDC
trial-and-error
bioinformatics (
2007-09-13 09:00:00
Google
http://ent.sina.com.cn/v/2005-10-17/ba866985.shtml
EBKTBP CAESAR
0543 0543
054337372947
AF
AF
AF
Caesar
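Written as ASCII codes, the word becomes X = 099097101115097114. A public-key cipher for encrypting X is designed as follows:
1. Pick two large prime numbers P and Q, preferably each more than 100 digits long, and compute N = P × Q and M = (P - 1) × (Q - 1).
2. Find an integer E that is coprime with M, that is, M and E have no common factor other than 1.
3. Find an integer D such that E × D divided by M leaves remainder 1, that is, E × D mod M = 1.
E is the public key, which anyone may use to encrypt. D is the private key, used to decrypt, and must be kept secret. N is public.
Encryption turns X into Y: X^E mod N = Y
Decryption recovers X from Y with the key D: Y^D mod N = X. Without D, nobody can recover X from Y.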
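The three steps can be followed with deliberately tiny numbers. The sketch below is for illustration only: the primes 61 and 53, the exponent 17 and the message 65 are arbitrary small values chosen so the arithmetic is easy to follow, nowhere near the 100-digit primes the text calls for, and `make_keys` is a made-up helper name. The three-argument `pow` with exponent -1 needs Python 3.8 or later.

```python
def make_keys(p, q, e):
    """Return (N, E, D) for primes p and q and a public exponent e."""
    n = p * q                  # step 1: N = P * Q
    m = (p - 1) * (q - 1)      #         M = (P - 1) * (Q - 1)
    d = pow(e, -1, m)          # step 3: D with E * D mod M = 1
    return n, e, d             # pow raises ValueError if E and M share a factor

n, e, d = make_keys(61, 53, 17)

x = 65                         # the message, a number smaller than N
y = pow(x, e, n)               # encryption: X^E mod N = Y
recovered = pow(y, d, n)       # decryption: Y^D mod N = X
print(n, d, y, recovered)
```

Anyone who could factor N back into P and Q could recompute M and therefore D, which is why the security of the scheme rests on the difficulty of factoring large numbers.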
1.
2.
N,E D ,
3. E D
N N P Q
P Q
P Q 50
RSA-158
39505874583265144526419767800614481996020776460304936454139376051579355626529450683609727842468219535093544305870490251995655335710209799226484977949442955603
= 3388495837466721394368393204672181522815830368604993048084925840555281177 × 11658823406671259903148376558383270818131012258146392600439520994131344334162924536139
N
N=PQ
P Q
game theory
2007-12-03 10:05:00
Google
6700
26 676
6700
GBK
http://www.googlechinablog.com/2006/04/4.html
H = -p1 * log p1 - ... - p6700 log p6700
26
log26=
4.7 10/4.7= 2.1
8
8/4.7=1.7
http://www.googlechinablog.com/2006/04/blog-post.html
6 6/4.7=1.3
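The divisions above can be checked in a few lines. The sketch assumes base-2 logarithms and takes the figures 10, 8 and 6 to be estimates, in bits, of the information carried by one character; each is divided by the log 26 = 4.7 bits that one keystroke on a 26-letter keyboard can carry.

```python
import math

bits_per_key = math.log2(26)   # about 4.7 bits per keystroke

# Assumed estimates of the information per character, in bits.
keys_needed = {bits: bits / bits_per_key for bits in (10, 8, 6)}

for bits, keys in keys_needed.items():
    print(f"{bits} bits per character -> {keys:.1f} keystrokes")
```

The lower the entropy estimate per character, the fewer keystrokes an ideal input method would need.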
2.98
100
http://tools.google.com/pinyin/
2008-10-14 08:34:00
Google
Google, T-Mobile, HTC
Android 3G
Dynamic
Programming
shortest path
Dynamic Programming
programming
-> ->->
->->->
101015
10 15
Y1,Y2,Y3,,YN W11,W12,W13 Y1
W21,W22,W23,W24 Y2
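Picking one candidate per input (W11, W12, W13 for Y1; W21, W22, W23, W24 for Y2; and so on) so that the total cost of the path is smallest is a shortest-path problem that dynamic programming solves layer by layer. The following is a minimal sketch: the function name, the reduced candidate lists and the cost table are invented for illustration, and in an input method the cost of a step would typically come from a language model.

```python
def best_path(layers, cost):
    """Cheapest path that picks exactly one candidate from each layer."""
    # best[w] = (total cost, path) of the cheapest path that ends at w
    best = {w: (cost(None, w), [w]) for w in layers[0]}
    for layer in layers[1:]:
        new_best = {}
        for w in layer:
            prev = min(best, key=lambda p: best[p][0] + cost(p, w))
            new_best[w] = (best[prev][0] + cost(prev, w), best[prev][1] + [w])
        best = new_best          # only the cheapest path to each node is kept
    end = min(best, key=lambda w: best[w][0])
    return best[end][1], best[end][0]

# Hypothetical candidates for three inputs Y1, Y2, Y3 and made-up step costs.
layers = [["W11", "W12"], ["W21", "W22"], ["W31"]]
costs = {
    (None, "W11"): 1, (None, "W12"): 2,
    ("W11", "W21"): 5, ("W11", "W22"): 1,
    ("W12", "W21"): 1, ("W12", "W22"): 4,
    ("W21", "W31"): 1, ("W22", "W31"): 3,
}
path, total = best_path(layers, lambda prev, cur: costs[(prev, cur)])
print(path, total)
```

Because only the cheapest path into each candidate is carried forward, the work is proportional to the number of inputs times the square of the candidates per input, rather than to the number of complete paths.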