10/ 26/ 08

11 Replies Latest reply: Apr 27, 2012 9:00 AM by mspak

mspak Apr 26, 2012 8:37 AM

Combine Datasets using Inexact Character Variables in SAS


This question has been Answered.

Dear all,

I downloaded data from 2 different databases that use different identification codes, so it is impossible to match cases by an ID. The only way is to match the cases by company name, which is an inexact character variable.

I have 2 different datasets:

A) uw_match, with the following variables:
- underwriters_names;
- holding_company;
- ipo_date;
- others.....

B) maluw, with the following variables:
- name (labelled as bank name);
- bs_id_number (Bankscope identification code);
- closdate (company fiscal year end);
- others....

I wish to match-merge these two datasets by the following criteria:

STEP 1: the underwriters (either underwriters_names or their holding_company) in dataset A compared to the bank name in dataset B.
Note: I can match-merge either underwriters_names (in A) or holding_company (in A) with the bank name (in B); the pair with the higher matching accuracy will be output/used;
AND
STEP 2: the closdate (in B) closest to the ipo_date (in A).

In short, I wish to match all the variables in A and B by:
1. the bank names (in B) = IPO underwriters (in A; either underwriters_names or their holding_company, whichever gives the highest precision);
2. a similar period (i.e. the bank's fiscal year end closest to the IPO date).

I read an article (see the attached pdf) and I understand that this is possible with SAS, but I do not know how to apply its examples to my context. Any comments and advice are much appreciated.

Thank you.

Regards,
mspak

COMBINE DATASETS USING INEXACT VAR.pdf (156.2 K) maluw.sas7bdat.zip (61.1 K) uw_match.sas7bdat.zip (10.6 K)

Correct Answer by MikeZdeb on Apr 26, 2012 11:01 AM (reply 1)

https://communities.sas.com/message/124707#124707

359 Views Average User Rating (0 ratings)

MikeZdeb Apr 26, 2012 11:01 AM (in response to mspak) Correct Answer 1. Re: Combine Datasets using Inexact Character Variables in SAS

hi ... before you start any of the above, since step #1 relies on matching by literals, have you looked at the names in both files and determined if there are things you should do before you even start ... for example ...

#1 in holding_company, I see ...

    O.S.K. HOLDINGS BERHAD
    OSK HOLDINGS BERHAD

are they the same company and should you get rid of those periods

#2 there's a mix of lower and upper case letters ... should you convert to all uppercase

#3 most (90+%) of all the name variables you cite have "BHD" or "BERHAD" as part of the name ... if you are going to look for similarity in names, you don't want the "BHD" or "BERHAD" part of the name contributing anything to the score given to a name comparison

#4 sometimes a location is in parentheses (MALAYSIA) and sometimes it's not MALAYSIA

just using PROC FREQ on the various name variables would give you some idea as to how to fix up the names before you even try to match names

for example, clean up the names and make some new variables to hold those names ...

    data new_maluw;
    set z.maluw;
    * add a record number for later use;
    mnrec+1;
    * convert to uppercase, only keep numbers/letters/spaces, convert multiple spaces to one space;
    nm = compbl(compress(upcase(name),,'kdas'));
    * get rid of BHD and BERHAD;
    nm = tranwrd(nm,' BHD','');
    nm = tranwrd(nm,' BERHAD','');
    run;

    data new_uw_match;
    set z.uw_match;
    unrec+1;
    nmh = compbl(compress(upcase(holding_company),,'kdas'));
    nmh = tranwrd(nmh,' BHD','');
    nmh = tranwrd(nmh,' BERHAD','');
    nmu = compbl(compress(upcase(underwriters_names),,'kdas'));
    nmu = tranwrd(nmu,' BHD','');
    nmu = tranwrd(nmu,' BERHAD','');
    run;

then run PROC FREQ again on the new variables (nm, nmh, and nmu) and see if there are any other things you should do before you start to match the nm in one file to nmh and nmu in another

once you have done the above, here's a suggestion for a start ... haven't used COMPGED much (maybe other folk know about a "good score" level)

I usually do this stuff in stages, evaluating the success of each step (e.g. the name match) before I move on to the next ...

    * use SQL to match the files by a comparison of names, use the COMPGED function to compare names;
    * you do not have to use all the data since you have pointers (mnrec and unrec);
    * nm_nmh and nm_nmu are matching scores;
    proc sql;
    create table both as
    select mnrec, unrec, compged(nm,nmh) as nm_nmh, compged(nm,nmu) as nm_nmu
    from new_maluw, new_uw_match
    having nm_nmh lt 50 or nm_nmu lt 50;
    quit;

    * reconstruct the data using the pointers;
    * maybe you only add the dates and other vars you need for more work at this point;
    data both;
    set both;
    p1=mnrec;
    p2=unrec;
    set new_maluw (keep=nm closdate) point=p1;
    set new_uw_match (keep=nmh nmu ipo_date) point=p2;
    run;

etc ...
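The reply above stops after the name match, so here is one possible way to continue to step 2 of the question (the fiscal year end closest to the IPO date). It is only a sketch under assumptions: the toy `both` table stands in for the rebuilt table and its values are made up, the column names `unrec`, `mnrec`, `nm_nmh` and `nm_nmu` follow my reading of the code above, and `score`, `gap`, `scored` and `best` are hypothetical names introduced for illustration.

```sas
* toy stand-in for the rebuilt table BOTH, all values are made up;
data both;
   input unrec mnrec nm_nmh nm_nmu closdate :date9. ipo_date :date9.;
   format closdate ipo_date date9.;
   datalines;
1 10 0 40 31DEC2004 15MAR2005
1 11 0 40 31DEC2005 15MAR2005
1 12 30 45 31DEC2004 15MAR2005
2 13 20 10 30JUN2006 01AUG2006
;
run;

* keep the better (lower) of the two COMPGED scores and the gap in days between the two dates;
data scored;
   set both;
   score = min(nm_nmh, nm_nmu);
   gap   = abs(closdate - ipo_date);
run;

* best name match first, then closest fiscal year end;
proc sort data=scored;
   by unrec score gap;
run;

* one row per IPO record, the first row after sorting;
data best;
   set scored;
   by unrec;
   if first.unrec;
run;
```

Sorting by score before gap treats the name match as the primary criterion and the date only as a tie-breaker among equally good names. Whether that ordering suits the data is a judgment call, and it is worth checking the surviving pairs by eye, in the staged spirit of the reply above.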

Like (1)

PGStats Apr 26, 2012 11:09 AM (in response to mspak) 2. Re: Combine Datasets using Inexact Character Variables in SAS Hello MSPak, take a look at the SAS string distance functions: SPEDIS, COMPGED and COMPLEV. They seem far more sophisticated than the function described in the attached article. PG
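To get a feel for the three functions named here, a minimal sketch like the following prints their values for one pair of names in the log. The two strings are made up for illustration, and `ged`, `lev` and `spd` are hypothetical variable names.

```sas
* compare one pair of names with the three distance functions, lower values mean more alike;
data _null_;
   length a b $40;
   a = 'OSK HOLDINGS';
   b = 'OSK HOLDING';
   ged = compged(a, b);   * generalized edit distance;
   lev = complev(a, b);   * Levenshtein edit distance;
   spd = spedis(a, b);    * spelling distance, asymmetric in its two arguments;
   put ged= lev= spd=;
run;
```

The three scales are not comparable with one another, so any cutoff (such as the 50 used for COMPGED elsewhere in this thread) has to be tuned per function against names you have checked by hand.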

Like (1)

MikeZdeb Apr 26, 2012 11:25 AM (in response to PGStats) 3. Re: Combine Datasets using Inexact Character Variables in SAS hi .. I had the same reaction and wondered how there could be an SGF 2012 paper on inexact matching that did not even reference the functions you mentioned (so someone like MSPak could read a new paper and not even be made aware of the functions)

Like (0)

GTickner Apr 26, 2012 12:04 PM (in response to mspak) 4. Re: Combine Datasets using Inexact Character Variables in SAS For your case (with 1,560*940=1,466,400 comparisons) the match functions would be fine, although it's still a good idea to translate common abbreviations (CO, CORP, LTD, INC, DIST, DIV, states such as BHD) and eliminate punctuation before doing an outer join. When I match company names from much larger databases, an outer join is not practical. In that case, you need to transform both sides and join on the transformed fields. I have a rather old program that does a modified soundex transformation to each name from both datasets before matching. Not particularly sophisticated but, having developed the transformations over a period of time, it performs well.
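The "transform both sides and join on the transformed fields" idea can be sketched as follows. This is not the poster's program: plain SOUNDEX is swapped in for the modified soundex transformation described above, the two small tables and their names are made up, and `key`, `candidates` and `score` are hypothetical names.

```sas
* made-up bank names with a phonetic key;
data banks;
   infile datalines truncover;
   input nm $40.;
   key = soundex(nm);
   datalines;
OSK INVESTMENT BANK
PUBLIC BANK
;
run;

* made-up underwriter names, misspelled on purpose, with the same kind of key;
data underwriters;
   infile datalines truncover;
   input nmu $40.;
   key = soundex(nmu);
   datalines;
OSK INVESTMANT BANK
PUBLIK BANK
;
run;

* equijoin on the key, so only rows sharing a key are ever compared;
proc sql;
   create table candidates as
   select b.nm, u.nmu, compged(b.nm, u.nmu) as score
   from banks as b inner join underwriters as u
        on b.key = u.key
   order by score;
quit;
```

Because the join condition is an equality, the number of comparisons grows with the size of each key group rather than with the full product of the two tables. The trade-off is that pairs whose keys differ are never scored at all.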

Like (0)

Ksharp Apr 26, 2012 11:32 PM (in response to mspak) 5. Re: Combine Datasets using Inexact Character Variables in SAS

For your situation, this is very, very hard. SAS has a product named DataFlux that can solve this problem. In my opinion, I would maintain a table containing all of these similar words to identify them. And of course there are the spelling distance functions mentioned by PGStats. The following code is just for fun.

    libname x v9 'D:\Software';
    data uw_match;
     set x.uw_match;
     _name=holding_company;output;
     _name=underwriters_names;output;
     keep _name;
    run;
    proc sql;
     create table x as
      select name,_name
       from x.maluw,uw_match
        where name =* _name;
    quit;

Ksharp

Like (0)

mspak Apr 27, 2012 3:01 AM (in response to Ksharp) 6. Re: Combine Datasets using Inexact Character Variables in SAS Hi Ksharp, What is the meaning of this code "where name =* _name" ? Thanks. Regards, mspak

Like (0)

mspak Apr 27, 2012 3:07 AM (in response to MikeZdeb) 7. Re: Combine Datasets using Inexact Character Variables in SAS Hi MikeZdeb, Your code is extremely useful for me. I applied the code you suggested, and I can say the result is excellent. By the way, I found a good article on the COMPGED function (see the pdf). Thank you very much for your help. Regards, mspak

Fuzzy Matching using the COMPGED Function.pdf (128.6 K)

Like (0)

Ksharp Apr 27, 2012 3:45 AM (in response to mspak) 8. Re: Combine Datasets using Inexact Character Variables in SAS It is the sounds-like operator (=*).
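As a small illustration of the `=*` comparison asked about above, the sketch below filters a made-up list of names against one literal. The operator is based on the Soundex algorithm, so it selects spelling variations that sound alike. The table `names` and its values are hypothetical.

```sas
* made-up names, the second one is a deliberate misspelling of the first;
data names;
   infile datalines truncover;
   input name $20.;
   datalines;
PUBLIC BANK
PUBLIK BANK
AFFIN BANK
;
run;

* keep only the names that sound like the literal on the right;
proc sql;
   select name
   from names
   where name =* 'PUBLIC BANK';
quit;
```

A phonetic test like this is coarse: it gives a yes/no answer instead of a score, so it is usually a first filter and not the final matching rule.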


Like (0)

mspak Apr 27, 2012 4:02 AM (in response to Ksharp) 9. Re: Combine Datasets using Inexact Character Variables in SAS Thank you very much. Today, I learned a lot from all of you.

"There is no royal road to learning, learning SAS with this discussion forum is extremely useful and fun". Hope everyone enjoy your weekend Regards, mspak

Like (0)

Patrick Apr 27, 2012 5:22 AM (in response to mspak) 10. Re: Combine Datasets using Inexact Character Variables in SAS Just to add to Ksharp's mention of DataFlux: if you have SAS Data Quality Server licensed at your site, then you have DataFlux components available which you can use within a data step. http://support.sas.com/documentation/cdl/en/lefunctionsref/63354/HTML/default/viewer.htm#p1dczp3khf9susn1m0b7badlypav.htm It would be interesting for you to create match codes. Similar strings will collapse into a single match code, which you can then use to join your tables. The way DataFlux creates these match codes is far superior to functions like soundex(), as it uses not only a rule-based approach but also lists (e.g. I would assume all major companies, together with a vast list of variants including typos, are in such a list, so "someone" does cluster analysis here and there are regular updates to the "database").

Like (0)

mspak Apr 27, 2012 9:00 AM (in response to Patrick) 11. Re: Combine Datasets using Inexact Character Variables in SAS Hi Patrick and Ksharp, Thank you for letting me know about these special features of DataFlux. I am not sure whether the University has subscribed to the SAS Data Quality Server. I will check with the local SAS representative on this matter. Perhaps I should submit a proposal for a DataFlux subscription if it is not accessible to staff members here. Regards, mspak

Like (0)

Community postings can contain untested user-supplied content. This content is provided as-is without warranty by SAS. For official SAS support, submit a problem report. Copyright 2012 SAS Institute Inc. All Rights Reserved. Community Guidelines | Terms of Use & Legal Information | Privacy Statement
