This document summarizes clustering algorithms. It discusses how clustering algorithms group data points into clusters such that points within each cluster are similar to each other and dissimilar to points in other clusters. It then describes hierarchical (agglomerative) clustering which starts with each point in its own cluster and repeatedly merges the closest pairs of clusters. It also discusses k-means clustering which assigns points to k clusters based on minimizing distances between points and cluster centroids. The document provides details on how to implement these algorithms for large datasets.
Original Description:
A document that will clarify the concepts behind clustering
Clustering Algorithms

- Given a set of data points, group them into clusters so that:
  - points within each cluster are similar to each other
  - points from different clusters are dissimilar
- Usually, points are in a high-dimensional space, and similarity is defined using a distance measure
  - Euclidean, cosine, Jaccard, edit distance, ...

[Figure: dog breeds (Chihuahuas, Dachshunds, Beagles) plotted by height and weight]

- A catalog of 2 billion "sky objects" represents objects by their radiation in 7 dimensions (frequency bands).
- Problem: cluster into similar objects, e.g., galaxies, nearby stars, quasars, etc.
- Sloan Sky Survey is a newer, better version.

- Cluster customers based on their purchase histories
- Cluster products based on the sets of customers who purchased them
- Cluster documents based on similar words or shingles
- Cluster DNA sequences based on edit distance

- Hierarchical (Agglomerative):
  - Initially, each point is in a cluster by itself.
  - Repeatedly combine the two "nearest" clusters into one.
- Point Assignment:
  - Maintain a set of clusters.
  - Place points into their "nearest" cluster.

- Key operation: repeatedly combine the two nearest clusters
- Three important questions:
  - How do you represent a cluster of more than one point?
  - How do you determine the "nearness" of clusters?
  - When to stop combining clusters?

- Euclidean case:
  - Each cluster has a well-defined centroid
    - i.e., the average across all the points in the cluster
  - Represent each cluster by its centroid
  - Distance between clusters = distance between centroids

[Figure: data points (o) at (0,0), (1,2), (2,1), (4,1), (5,0), (5,3); centroids (x) at (1.5,1.5), (4.5,0.5), (1,1), (4.7,1.3)]

- Non-Euclidean case:
  - The only "locations" we can talk about are the points themselves.
    - i.e., there is no "average" of two points.
- Approach 1: clustroid = point "closest" to other points.
  - Treat the clustroid as if it were the centroid when computing intercluster distances.

Possible meanings of "closest":
1. Smallest maximum distance to the other points.
2. Smallest average distance to other points.
3. Smallest sum of squares of distances to other points.
4. Etc., etc.

[Figure: two clusters (points 1 to 6), each with its clustroid marked, and the intercluster distance measured between the clustroids]

- Approach 2: intercluster distance = minimum of the distances between any two points, one from each cluster.
- Approach 3: Pick a notion of "cohesion" of clusters, e.g., maximum distance from the clustroid.
  - Merge clusters whose union is most cohesive.

- Measures of cohesion:
  - Approach 1: Use the diameter of the merged cluster = maximum distance between points in the cluster.
  - Approach 2: Use the average distance between points in the cluster.
  - Approach 3: Use a density-based approach: take the diameter or average distance, e.g., and divide by the number of points in the cluster.
    - Perhaps raise the number of points to a power first, e.g., square root.

- Stop when we have $k$ clusters
- Stop when the cohesion of the cluster resulting from the best merger falls below a threshold
- Stop when there is a sudden jump in the cohesion value

- Naive implementation:
  - At each step, compute pairwise distances between each pair of clusters
  - $O(n^3)$
- A careful implementation using a priority queue can reduce the time to $O(n^2 \log n)$
- Too expensive for really big data sets that don't fit in memory

- k-means clustering:
  - Assumes Euclidean space.
  - Start by picking $k$, the number of clusters.
  - Initialize clusters by picking one point per cluster.
    - Example: pick one point at random, then $k-1$ other points, each as far away as possible from the previous points.

1. For each point, place it in the cluster whose current centroid it is nearest, and update the centroid of the cluster.
2. After all points are assigned, fix the centroids of the $k$ clusters.
3. Optional: reassign all points to their closest centroid.
   - Sometimes moves points between clusters.

[Figure: points 1 to 8 and two centroids (x), showing the clusters after the first round and the reassigned points]

- Try different $k$, looking at the change in the average distance to centroid as $k$ increases.
- The average falls rapidly until the right $k$, then changes little.
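The k-means steps and the search over $k$ can be sketched as follows. This is a minimal illustration, not the slides' exact procedure: it recomputes centroids once per round instead of after every single placement, runs a fixed number of rounds, and the function names (`pick_initial`, `k_means`, `average_distance`) and the sample points are assumptions made for the example.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def pick_initial(points, k, rng):
    """One random point, then k-1 more, each as far as possible from those chosen so far."""
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        chosen.append(max(points, key=lambda p: min(dist(p, c) for c in chosen)))
    return chosen


def k_means(points, k, rounds=10, seed=0):
    """Return (centroids, clusters) after a fixed number of assign/update rounds."""
    rng = random.Random(seed)
    centroids = pick_initial(points, k, rng)
    clusters = []
    for _ in range(rounds):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster ended up empty.
        centroids = [centroid(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


def average_distance(centroids, clusters):
    """Average distance from each point to the centroid of its cluster."""
    total = sum(dist(p, c) for c, members in zip(centroids, clusters) for p in members)
    return total / sum(len(m) for m in clusters)


# Try several values of k and watch how the average distance changes.
points = [(0, 0), (1, 2), (2, 1), (4, 1), (5, 0), (5, 3)]
for k in range(1, 5):
    cents, clus = k_means(points, k)
    print(k, round(average_distance(cents, clus), 3))
```

Printing the average distance for increasing $k$ gives the curve to inspect: the value at which it stops falling quickly is the candidate for the right $k$.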
[Figure: average distance to centroid plotted against $k$, with the best value of $k$ marked where the curve flattens]

[Figure: the same point set clustered three ways. Too few clusters: many long distances to centroid. Just right: distances rather short. Too many: little improvement in average distance.]

- BFR (Bradley-Fayyad-Reina) is a variant of k-means designed to handle very large (disk-resident) data sets.
- It assumes that clusters are normally distributed around a centroid in a Euclidean space.
  - Standard deviations in different dimensions may vary.

- Points are read one main-memory-full at a time.
- Most points from previous memory loads are summarized by simple statistics.
- To begin, from the initial load we select the initial $k$ centroids by some sensible approach.

- Possibilities include:
  1. Take a small random sample and cluster optimally.
  2. Take a sample, pick a random point, and then $k-1$ more points, each as far from the previously selected points as possible.

1. The discard set (DS): points close enough to a centroid to be summarized.
2. The compression set (CS): groups of points that are close together but not close to any centroid. They are summarized, but not assigned to a cluster.
3. The retained set (RS): isolated points.

[Figure: a cluster with its centroid, whose points are in the DS; compressed sets, whose points are in the CS; isolated points in the RS]

- For each cluster, the discard set is summarized by:
  1. The number of points, $N$.
  2. The vector SUM: $i$-th component = sum of the coordinates of the points in the $i$-th dimension.
  3. The vector SUMSQ: $i$-th component = sum of squares of the coordinates in the $i$-th dimension.

- $2d + 1$ values represent any number of points.
  - $d$ = number of dimensions.
- Centroid (mean) in the $i$-th dimension = $\mathrm{SUM}_i / N$.
  - $\mathrm{SUM}_i$ = $i$-th component of SUM.
- Variance in dimension $i$ can be computed by: $(\mathrm{SUMSQ}_i / N) - (\mathrm{SUM}_i / N)^2$
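The $N$, SUM, SUMSQ summary can be sketched as a small class. This is an illustration under stated assumptions: the class name `ClusterSummary` and its field and method names are hypothetical, chosen only to mirror the quantities defined above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusterSummary:
    """The 2d + 1 numbers kept for a cluster: N, SUM and SUMSQ."""
    n: int
    vec_sum: List[float]
    vec_sumsq: List[float]

    @classmethod
    def empty(cls, d: int) -> "ClusterSummary":
        """Summary of zero points in d dimensions."""
        return cls(0, [0.0] * d, [0.0] * d)

    def add_point(self, p: Sequence[float]) -> None:
        """Fold one point into the summary; the point itself is not stored."""
        self.n += 1
        for i, x in enumerate(p):
            self.vec_sum[i] += x
            self.vec_sumsq[i] += x * x

    def merge(self, other: "ClusterSummary") -> "ClusterSummary":
        """Summary of the union of two point sets: add the N's, SUM's and SUMSQ's."""
        return ClusterSummary(
            self.n + other.n,
            [a + b for a, b in zip(self.vec_sum, other.vec_sum)],
            [a + b for a, b in zip(self.vec_sumsq, other.vec_sumsq)],
        )

    def centroid(self) -> List[float]:
        """Mean in each dimension: SUM_i / N."""
        return [s / self.n for s in self.vec_sum]

    def variance(self) -> List[float]:
        """Variance in each dimension: SUMSQ_i / N - (SUM_i / N)^2."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.vec_sum, self.vec_sumsq)]


a = ClusterSummary.empty(2)
a.add_point((1, 2))
b = ClusterSummary.empty(2)
b.add_point((3, 6))
merged = a.merge(b)
print(merged.centroid(), merged.variance())  # [2.0, 4.0] [1.0, 4.0]
```

Both `add_point` and `merge` only add numbers, so the cost of an update does not depend on how many points the summary already represents.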
- Question: Why use this representation rather than directly storing the centroid and standard deviation?

1. Find those points that are "sufficiently close" to a cluster centroid; add those points to that cluster and the DS.
2. Use any main-memory clustering algorithm to cluster the remaining points and the old RS.
   - Clusters go to the CS; outlying points to the RS.
3. Adjust the statistics of the clusters to account for the new points.
   - Add N's, SUM's, SUMSQ's.
4. Consider merging compressed sets in the CS.
5. If this is the last round, merge all compressed sets in the CS and all RS points into their nearest cluster.

- How do we decide if a point is "close enough" to a cluster that we will add the point to that cluster?
- How do we decide whether two compressed sets deserve to be combined into one?

- We need a way to decide whether to put a new point into a cluster.
- BFR suggest two ways:
  1. The Mahalanobis distance is less than a threshold.
  2. Low likelihood of the currently nearest centroid changing.

- Mahalanobis distance: normalized Euclidean distance from the centroid.
- For point $(x_1, \ldots, x_d)$ and centroid $(c_1, \ldots, c_d)$:
  1. Normalize in each dimension: $y_i = (x_i - c_i)/\sigma_i$, where $\sigma_i$ is the standard deviation of the cluster in dimension $i$.
  2. Take the sum of the squares of the $y_i$'s.
  3. Take the square root.

- If clusters are normally distributed in $d$ dimensions, then after transformation, one standard deviation = $\sqrt{d}$.
  - i.e., 70% of the points of the cluster will have a Mahalanobis distance < $\sqrt{d}$.
- Accept a point for a cluster if its M.D. is < some threshold, e.g., 4 standard deviations.

- Compute the variance of the combined subcluster.
  - $N$, SUM, and SUMSQ allow us to make that calculation quickly.
- Combine if the variance is below some threshold.
- Many alternatives: treat dimensions differently, consider density.

- Problem with BFR/k-means:
  - Assumes clusters are normally distributed in each dimension.
  - And axes are fixed: ellipses at an angle are not OK.
- CURE:
  - Assumes a Euclidean distance.
  - Allows clusters to assume any shape.

[Figure: scatter plot of salary against age, with points labeled e and h]

1. Pick a random sample of points that fit in main memory.
2. Cluster these points hierarchically: group nearest points/clusters.
3. For each cluster, pick a sample of points, as dispersed as possible.
4. From the sample, pick representatives by moving them (say) 20% toward the centroid of the cluster.

[Figure: the salary/age scatter plot in three stages: the clustered sample; pick (say) 4 remote points for each cluster; move points (say) 20% toward the centroid]

- Now, visit each point $p$ in the data set.
- Place it in the "closest cluster."
- Normal definition of "closest": that cluster with the closest (to $p$) among all the sample points of all the clusters.
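The CURE representative-point steps and the final assignment pass can be sketched as follows. This is a rough illustration, assuming the sample has already been clustered; the greedy rule used to pick "dispersed" points and all function names (`pick_dispersed`, `shrink`, `representatives`, `assign`) are assumptions for the example, not taken from the slides.

```python
import math


def dist(p, q):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def pick_dispersed(points, m):
    """Greedy choice of up to m remote points.

    Start with the point farthest from the centroid, then repeatedly add the
    point whose nearest already-picked point is farthest away.
    """
    c = centroid(points)
    picked = [max(points, key=lambda p: dist(p, c))]
    while len(picked) < min(m, len(points)):
        picked.append(max(points, key=lambda p: min(dist(p, r) for r in picked)))
    return picked


def shrink(reps, c, fraction=0.2):
    """Move each representative the given fraction of the way toward centroid c."""
    return [tuple(x + fraction * (cx - x) for x, cx in zip(r, c)) for r in reps]


def representatives(clusters, m=4, fraction=0.2):
    """For each cluster, m dispersed points moved toward that cluster's centroid."""
    return [shrink(pick_dispersed(cl, m), centroid(cl), fraction) for cl in clusters]


def assign(p, reps_by_cluster):
    """Index of the cluster that owns the representative closest to p."""
    return min(range(len(reps_by_cluster)),
               key=lambda i: min(dist(p, r) for r in reps_by_cluster[i]))


# Two hypothetical clusters from the sampled, hierarchically clustered points.
clusters = [[(0, 0), (0, 2), (2, 0), (2, 2)],
            [(10, 10), (10, 12), (12, 10), (12, 12)]]
reps = representatives(clusters)
print(assign((3, 3), reps), assign((9, 9), reps))
```

Because a point is compared against several representatives per cluster rather than a single centroid, a cluster with an elongated or irregular shape can still claim the points near its edges.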