IPR Seminar, Dr. Martha Bailey, University of Michigan

March 21, 2017
12:30PM - 1:30PM
038 Townshend Hall

Institute for Population Research
popcenter@osu.edu

How Well do Automated Linking Methods Perform? Evidence from the LIFE-M Project

New initiatives to create longitudinal linkages from historical datasets are transforming the study of U.S. economic and demographic history. This paper uses two ground truth samples to provide new evidence on the performance in historical samples of four automated record-linking algorithms, two match disambiguation techniques, and two commonly used phonetic name-cleaning methods, Soundex and NYSIIS. Our results show high match rates for each algorithm, but we document important shortcomings of each. First, no method (including the ground truth) appears representative of the underlying population. Second, the incidence of type I errors is distressingly high in samples generated by automated methods, ranging from 19 percent to 81 percent. Third, the use of phonetic name cleaning universally increases type I errors by 60 to 100 percent. Finally, erroneous links are strongly correlated with baseline sample characteristics, suggesting that systematic measurement error introduced by different automated linking methods could have substantial (and difficult to sign) effects on parameter estimates. As an illustration, we show that different linking methods are associated with very different estimates of intergenerational income elasticities for the 1920 to 1940 period, ranging from 0.33 to a statistical zero. We conclude with constructive suggestions for improving automated methods without using clerical review or genealogical methods.
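To make the concern about phonetic name cleaning concrete, the sketch below implements the standard American Soundex coding in Python. It is an illustration only and not the code used in the paper; the function name `soundex` and the sample names are assumptions chosen for the example. It shows how two different names can collapse to the same code, which is the kind of collision that could turn into a false link when records are matched on cleaned names.

```python
# Illustrative American Soundex; not the implementation used in the paper.
# Consonant groups that share a digit under the standard coding.
_GROUPS = {
    "1": "BFPV",
    "2": "CGJKQSXZ",
    "3": "DT",
    "4": "L",
    "5": "MN",
    "6": "R",
}
_CODES = {ch: digit for digit, letters in _GROUPS.items() for ch in letters}


def soundex(name: str) -> str:
    """Return the 4-character Soundex code for a name ('' if no letters)."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    out = [letters[0]]                 # the first letter is kept as-is
    prev = _CODES.get(letters[0], "")  # its code still suppresses a repeat
    for ch in letters[1:]:
        if ch in "HW":
            continue                   # H and W do not separate equal codes
        code = _CODES.get(ch, "")      # vowels and Y map to "" and reset prev
        if code and code != prev:
            out.append(code)
        prev = code
    return ("".join(out) + "000")[:4]  # pad with zeros, truncate to 4


if __name__ == "__main__":
    for a, b in [("Robert", "Rupert"), ("Smith", "Smyth")]:
        print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Here "Smith" and "Smyth" are plausibly spelling variants of one surname, which is the case phonetic cleaning is meant to help with, while "Robert" and "Rupert" are distinct names that receive the same code. A linking rule that compares only cleaned names cannot tell the second pair apart.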