NIH "Fast Track" Directory Project Definition
DRAFT 9
Tuesday, April 6, 1999

Table of Contents
1.0 Background
2.0 Purpose
3.0 Project Definition
3.1 System Components
3.2 Enterprise Databases and Directories Involved
3.3 Directory Relational Database
3.3.1 Strategy for FTRDB Initial Load
3.3.2 Plan for FTRDB Maintenance
3.3.3 Interface to FTRDB
3.3.4 Personal Identification Numbers
3.4 Record Linking Engine
3.5 Directory Registration and Update Service
3.5.1 Functionality
3.5.2 User Interface
3.6 Interface to PAID
3.7 Interface to Telecommunications Database
3.8 Interface to NIH Email Directory and Forwarding Service
3.9 LDAP Directory Server
3.10 Security Requirements
Appendix A "Fast Track" Database Entity-Relation Diagram
Appendix B "Fast Track" Attribute Descriptions
Appendix C "Fast Track" Database Creation
C.1 Record Linking
C.2 Attribute Selection
Appendix D Probabilistic Record Linkage
D.1 Record Linking Example
D.2 Record Linking References
Appendix E "Fast Track" Update Service User Interface
Appendix F Outstanding Questions and Issues
Appendix G Wish List

List of Tables
Table 3-1 Enterprise Databases and Directories
Table 3-2 NIH Databases Containing Individual Identifying Information
Table B-1 Abbreviations for Data Sources
Table B-2 Private Individual Identifying Information Associated with NIH UIDs
Table B-3 Private Home and Personal Locator Attributes
Table B-4 Public Labeling Attributes
Table B-5 Public Organizational Attributes
Table B-6 Public Locator Attributes
Table B-7 Security Attributes
Table B-8 Ancillary Attributes
Table D-1 Example Attribute Comparison Outcome Frequencies
As a result of the NIH Director’s Retreat of September 1996, the NIH Director commissioned an NIH Information Technology Central Committee (ITCC) to make recommendations for improving IT management at NIH. Among its seven major recommendations, the ITCC’s report of November 1996 included the following:
The NIH Director gave the task of implementing these recommendations to the acting NIH CIO, who in turn commissioned the NIH Architectural Management Group (AMG), which comprises representatives from each NIH ICD, to undertake this work. The AMG’s Report on Interoperability at the NIH, issued in May 1997, made the following recommendations relating to the security and directory strategies:
The purpose of the "Fast Track" directory is to quickly bring up a working, but limited, directory containing NIH "white pages" information. The motivation for this effort is:
Building the "Fast Track" directory will give us valuable experience with:
In order to shorten the development time for the "fast track" directory, we must:
We can accomplish this by adopting the following design rules and limitations:
The major components of the "fast track" directory are:
Table 3-1 summarizes the existing enterprise databases and directories that will either provide information to build the "fast track" directory database or interact with the "fast track" directory update service.
Table 3-1 Enterprise Databases and Directories
| Name | Population Description | Pop. Size | Data Quality | Use |
|---|---|---|---|---|
| Human Resources DB (HRDB) | All NIH FTE employees (Federal Civil and Public Health Service) at all NIH sites | 18,000 | Very good | Initial load, compare to directory and produce exception reports |
| Fellowship Payment System (FPS) | All NIH non-FTE employees (Visiting Fellows and IRTAs) at all NIH sites | 2,800 active | Good | Initial load |
| J.E. Fogarty International Center (JEFIC) DB | All NIH foreign visiting scientists at all NIH sites | 2,100 active | Good | Initial load |
| Parking and ID Badge (PAID) DB | All individuals working at NIH sites in Maryland, except FDA employees who do not have NIH parking permits or participate in TRANSHARE | 32,000 | Fair - Good | Initial load, receive directory updates from AOs |
| Telecommunications DB | Permanent NIH Federal employees, Temporary FTE employees >1 year, Temporary Federal physicians >6 months, and other non-Federal employees (40% of records, no SSN), at all NIH sites | 17,500 | Fair | Initial load, receive directory updates from AOs |
| NIH Email Directory and Forwarding Service (PH) | Most individuals registered for one or more NIH email services | >29,000 | Poor | Initial load |
| Integrated Time and Attendance System (ITAS) | All NIH FTE employees (Federal Civil and Public Health Service) at all NIH sites | 18,000 | Very good | Authentication and possibly authorization of AOs by directory registration and update service |
Of particular importance is individual identifying information, which is associated with an NIH UID so that an individual is assigned the same NIH UID each time they enter the NIH workforce. Table 3-2 summarizes the individual identifying information available in existing NIH databases.
Table 3-2 NIH Databases Containing Individual Identifying Information
| Information | HRDB | JEFIC | FPS | PAID | Telecom | ITAS |
|---|---|---|---|---|---|---|
| SSN | X | | X | | X | X |
| Date of Birth | X | X | | | | |
| Place of Birth | | X | | | | |
| Sex | X | X | | | | |
| Home Address | X | X | X | X | | |
| Home Telephone | | | | | X | |
The "fast track" directory relational database (FTRDB) will be implemented with Oracle on CIT's Digital Alpha Enterprise Open System; however, it will be accessed via the ODBC/SQL standards such that it could be readily moved to a different database product or platform. The tables (See Appendix A) store:
Access to individual identifying information covered by the Privacy Act will be controlled by views, and the NIH UIDs of individuals accessing the attributes listed in "Table B-2 Private Individual Identifying Information Associated with NIH UIDs" will be logged, along with the time of access.
The FTRDB will also contain:
See Appendix B for a detailed description of the main FTRDB fields. While the FTRDB data dictionary will include all nihInetOrgPerson attributes, the only attributes that will actually have values will be those that (1) can be initially loaded from one of the databases listed in Table B-1 or (2) can be entered and updated by AOs via the directory registration and update service.
Strategy for FTRDB Initial Load
The process of building the FTRDB will involve the following operations:
The strategy for loading the FTRDB is:
At each stage, conflicting attribute values for the same individual may be found in different databases, in which case values will be selected as described in Appendix C.
A detailed plan for loading the FTRDB is described in Appendix C.
Plan for FTRDB Maintenance
Once the FTRDB is built, subsequent addition, update, and deactivation of records will be done by AOs by means of the directory registration and update service. However, viewed as a replacement for the current Request for DHHS Identification Card (form NIH 1308-4/5) and Request to Change NIH Directory Information (form NIH 433) business processes, this procedure will not track 100% of the directory population. First, the following groups of individuals are not issued NIH ID badges:
Second, temporary FTE employees, non-FTE employees, contractors, volunteers, and guests are not supposed to be listed in the NIH Telephone and Services Directory. Thus, there are populations of individuals not covered by either business process, as currently defined. But it is hoped that AOs will be willing to maintain directory information for these additional groups.
It will thus be necessary to periodically update the FTRDB with changes made to the HRDB, FPS, JEFIC, PAID, and Email systems that have bypassed the directory registration and update service. [The method for performing such updates needs to be better defined.]
Interface to FTRDB
[More detail to be supplied by Bob on creating new UIDs/entries and interface to the HRDB]
A secret 4–8 digit Personal Identification Number (PIN), perhaps derived from an individual’s SSN, date of birth, or voice mail PIN, will be associated with each UID by storing it in the userPassword attribute. The "fast track" update service (see Section 3.5) will print an individual’s PIN, along with instructions for protection and use, on paper for the registering AO to give to a new employee or contractor. To protect against loss or theft, the paper will not contain any identification of the owning individual. In future phases of the directory project, an individual will be able to use the secret PIN together with their UID to authenticate to automated systems.
Record Linking Engine
As noted previously, record linking refers to the process of determining if two records belong to the same individual. Record linking has several uses in connection with the NIH directory:
Record linking is easy in situations where a decision can be made based on the agreement or disagreement of a single attribute, for example, the SSN. However, it becomes more difficult when the records to be linked do not contain such an attribute, and the decision must be based either on a single attribute that may partially agree (such as a name) or several attributes of which only some may agree (such as organization, office address, and office telephone number).
The more difficult cases may be handled by applying probabilistic record linking, described in more detail in Appendix D. Briefly, the record linking engine calculates a number, called a binit weight, which is the log2 of the odds that two records constitute a linked pair, i.e., that they belong to the same individual. Thus, a positive binit weight of, say, +10 indicates that the odds are about 1,000 to 1 in favor of a linkage, a negative binit weight of –10 indicates odds of about 1,000 to 1 against a linkage (an unlinked pair), and a binit weight of 0 indicates even odds in favor of (or against) a linkage. Depending on the acceptable number of false positive and false negative links, and the number of borderline pairs one is willing to manually review, an upper and lower threshold can be established. Binit weights above the upper threshold are accepted as linked pairs, those below the lower threshold are accepted as unlinked pairs, and those between are subjected to manual review, perhaps suggesting additional tests to be incorporated in the linking engine to improve its discriminating power.
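The two-threshold decision rule described above can be sketched as follows. This is a minimal illustration only: the function names, the `LinkDecision` type, and the threshold values used in the usage note are assumptions made for the example, not part of the planned linking engine.

```python
from enum import Enum


class LinkDecision(Enum):
    """Possible outcomes for a compared pair of records (hypothetical names)."""
    LINKED = "linked"
    UNLINKED = "unlinked"
    MANUAL_REVIEW = "manual review"


def odds_from_weight(binit_weight: float) -> float:
    """Convert a binit weight (log2 of the odds) back into odds."""
    return 2.0 ** binit_weight


def classify_pair(binit_weight: float, lower: float, upper: float) -> LinkDecision:
    """Apply the upper and lower thresholds to a total binit weight."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if binit_weight > upper:
        return LinkDecision.LINKED        # accepted as a linked pair
    if binit_weight < lower:
        return LinkDecision.UNLINKED      # accepted as an unlinked pair
    return LinkDecision.MANUAL_REVIEW     # borderline: route to a reviewer
```

For example, with assumed thresholds of -5 and +8, a pair with a weight of +10 (odds of 1,024 to 1) would be accepted as linked, while a pair with a weight of +3 would go to manual review. Widening the gap between the thresholds reduces false links at the cost of more manual review.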
Directory Registration and Update Service
The "fast track" directory update service (FTUS) will enable AOs to register, update, and de-register NIH employees and contractors. It will be implemented as a web application server that will interact with AOs via Netscape or Microsoft browser clients and HTTP/HTML, and with the FTRDB via ODBC/SQL. The HTTP/HTML browser client will include a trusted certificate authority certificate, which will enable an SSLv2 connection to be made to the FTUS. AOs will supply their SSN and ITAS password to the FTUS over this secured connection. The FTUS will then query the ITAS database to validate the password, confirm that the SSN belongs to an AO, and determine the organization for which the AO is authorized to use the FTUS.
Functionality
The FTUS will allow an authorized AO to:
The Badge Office and Telephone Directory Unit can also be authorized to use the FTUS to update the FTRDB with information received from walk-ins and paper forms. Updates made in this fashion will cause notification to be sent via email to the requesting AO and affected individual. Update access will be permitted to only those attributes present on the current Request for DHHS Identification Card (form NIH 1308-4/5) and Request to Change NIH Directory Information (form NIH 433).
The FTUS will be designed to allow ICs to easily extend it to collect additional information, automatically create LAN or email accounts, or send additional notifications, for example.
During the "Fast Track" phase, NIH UIDs will not be widely distributed, and individuals will in general not know their UID. We particularly need to provide email account administrators and others with a tool they can use to find someone’s UID so they can begin to add the UID to the accounts they manage. Therefore, the FTUS will also provide an interface to allow anyone to search the active records in the FTRDB (using the linking engine) for an individual’s UID by entering the individual’s surname (sn) and any or all of the following public attributes: cn, givenName, nihNickname, middleName, o, ou, nihCompanyName, telephoneNumber, buildingName, roomNumber.
User Interface
Prototype FTUS user interface screens are depicted in Appendix E. These generally adhere to the following guidelines:
[To be supplied by Denney and Diane]
[To be supplied by Dave]
An AO will generally not know an individual’s @nih.gov email alias to enter into the FTRDB via the FTUS. Ideally, all email administrators will add the NIH UID to the email accounts they manage, and include the UID in the information they feed to the NIH Email Directory and Forwarding Service (PH). This would enable PH (or its replacement) to easily recognize duplicate entries, link entries and exchange attribute information with the FTRDB, and handle deregistration. Unfortunately, this is a difficult process to implement for the "Fast Track" because users will generally not know their UIDs, and there are 23 email systems that feed PH, each with its own group of administrators.
The plan for dealing with this situation is:
[Note: Not a very convincing plan.]
LDAP Directory Server
An LDAP directory server, such as Netscape's, will provide read-only access to the active and listed public information contained in the FTRDB. This will be accomplished by transferring daily a copy of this information from the FTRDB to the LDAP directory server.
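The selection step of the daily transfer might look like the sketch below. It assumes, purely for illustration, that FTRDB records are available as Python dictionaries keyed by the attribute names in Appendix B. The `PUBLIC_ATTRIBUTES` subset and the function names are hypothetical, and the actual load into the LDAP server is not shown.

```python
# Hypothetical subset of the public attributes listed in Appendix B.
PUBLIC_ATTRIBUTES = (
    "uniqueIdentifier", "cn", "givenName", "sn", "o", "ou",
    "telephoneNumber", "buildingName", "roomNumber",
)


def is_exportable(record: dict) -> bool:
    """Keep only active (nihPersonStatus 'A') and listed entries."""
    active = record.get("nihPersonStatus") == "A"
    # Assumption: a missing nihDirEntryUnlisted value means the entry is listed.
    listed = record.get("nihDirEntryUnlisted", "N") != "Y"
    return active and listed


def public_view(record: dict) -> dict:
    """Copy only public attributes, so private fields never leave the FTRDB."""
    return {name: record[name] for name in PUBLIC_ATTRIBUTES if name in record}


def daily_export(records: list) -> list:
    """Build the list of entries to hand to the LDAP directory server."""
    return [public_view(r) for r in records if is_exportable(r)]
```

Filtering by an explicit list of public attributes, rather than removing known private ones, means a newly added private attribute is excluded from the export by default.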
The following information is from the DHHS handbook located at http://wwwoirm.nih.gov/policy/aissp.html.
The Central Directory would have a sensitivity level of either 2 or 3, depending on exactly what data is included in the database.
C. Security Level Requirements
The controls required to adequately safeguard a Level 1 AIS, AIS facility, or ITU are those which would normally be considered good management practice. These include, but are not limited to:
The controls required to adequately safeguard a Level 2 AIS, AIS facility, or ITU include all of the requirements for Level 1, plus the following requirements:
The controls required to adequately safeguard a Level 3 AIS, AIS facility, or ITU include all of the requirements for Levels 1 and 2, plus the requirement for an inventory of hardware and software.
CIT will need to develop a complete security plan for the directory before it becomes operational, in which all of these issues will be addressed.
"Fast Track" Database Entity-Relation Diagram

NOTE: An attribute name has the prefix nih when it does not match an X.500, LIPS, or LDAP standard attribute name. This is to avoid conflicts with future standard name usage.
Table B-1 Abbreviations for Data Sources
| Symbol | Mnemonic | Description |
|---|---|---|
| A | AO | Administrative Officer for owner of entry |
| B | PAID | Parking/ID Badge/Transhare DB |
| E | PH | NIH Email Directory and Forwarding Service |
| F | FPS | Fellowship Payment System |
| H | HRDB | NIH Human Resources Database |
| J | JEFIC | J. E. Fogarty International Center DB |
| O | OWN | Owner of entry (individual identified by entry). Not implemented for "Fast Track". |
| P | TELCOM | NIH Telecommunications DB (Phone) |
| S | FTRDB | Fast Track Relational Database System |
| T | ITAS | Integrated Time and Attendance System |
| Y | ANY | Anyone |
Table B-2 Private Individual Identifying Information Associated with NIH UIDs
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read |
|---|---|---|---|---|---|---|---|
| nihSSN | Permanent or temporary social security number (ddd-dd-dddd) | N | N | AFHP | A | P | A |
| nihDateOfBirth | Date of birth (yyyy-mm-dd) | N | N | AH | A | | A |
| nihCityOfBirth | City of birth | N | N | AJ | A | | A |
| nihStateOfBirth | State or province of birth | N | N | AJ | A | | A |
| nihCountryOfBirth | Country of birth (FIPS code via validation table) | N | N | AJ | A | | A |
| nihAliasGivenName | Other given names associated with uniqueIdentifier | N | Y | ABEFHP | A | | A |
| nihAliasMiddleName | Other middle names (or initials) associated with uniqueIdentifier | N | Y | ABEFHP | A | | A |
| nihAliasSn | Other surnames associated with uniqueIdentifier | N | Y | ABEFHP | A | | A |
| nihMothersSurname | Mother's maiden surname | N | N | A | A | | A |
| nihGender | M \| F; M=male; F=female | N | N | AHJ | A | | A |
Table B-3 Private Home and Personal Locator Attributes
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read |
|---|---|---|---|---|---|---|---|
| homePhone | Home telephone number in full international format | N | N | AJ | A | | AO |
| homeFax | Home fax number in full international format | N | N | A | A | | AO |
| homePostalAddress | Home postal address (street-address, city, state, postal-code) (RFC 2252 LDAPv3 postal address syntax, limited to 6 lines of 30 characters each) | N | N | ABFHJ | A | | AO |
| personalMobile | Personal mobile telephone number in full international format | N | N | A | A | | AO |
| personalPager | Personal pager number in full international format | N | N | A | A | | AO |
| nihHomeMail | Personal email address | N | N | A | A | | AO |
| nihEmergencyContactCn | Common name of emergency contact | N | N | A | A | | AO |
| nihEmergencyContactPhone | Telephone number of emergency contact in full international format | N | N | A | A | | AO |
Table B-4 Public Labeling Attributes
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read Access |
|---|---|---|---|---|---|---|---|
| cn | (common name) System-generated from givenName, middleName, and sn. Values both with and without middleName generated if middleName attribute exists. Other values may be added by Update From sources. | Y | Y | S | A | B | Y |
| generationQualifier | e.g. Jr, III from validation table | N | N | ABHJP | A | BP | Y |
| givenName | First name | Y | N | ABFHJP | A | BP | Y |
| initials | Initial letters derived from givenName and middleName | N | N | S | | | Y |
| personalTitle | e.g. Mr., Dr. from validation table | N | N | ABJP | A | BP | Y |
| uniqueIdentifier | Assigned by system | Y | N | S | | | Y |
| middleName | Middle name or initial | N | N | ABHJP | A | BP | Y |
| sn | (surname) Last name | Y | N | ABFHJP | A | BP | Y |
| nihSuffixQualifier | e.g. MD, PhD from validation table | N | N | AJP | A | BP | Y |
| description | Free-form, multi-line | N | N | A | AO | | Y |
| nihEmailNickname | Nicknames from NIH Email Directory and Forwarding Service | N | Y | E | AO | | Y |
| nihNickname | Nicknames (givenNames only) | N | Y | ABFHJP | AO | BP | Y |
| jpegPhoto | Full size ID photo in jpeg binary format | N | N | - | B | | Y(??) |
| thumbnailPhoto | Thumbnail ID photo in jpeg binary format | N | N | - | B | | Y(??) |
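The system-generated cn and initials attributes described in the table above could be derived along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, and the "given middle surname" ordering and single-space separator are guesses, since the source does not specify the exact format of generated values.

```python
from typing import List, Optional


def generate_common_names(given_name: str, sn: str,
                          middle_name: Optional[str] = None) -> List[str]:
    """Return cn values both with and without middleName when one exists."""
    names = []
    if middle_name:
        # Assumed ordering: givenName, middleName, sn.
        names.append(f"{given_name} {middle_name} {sn}")
    names.append(f"{given_name} {sn}")
    return names


def generate_initials(given_name: str, middle_name: Optional[str] = None) -> str:
    """Derive initials from givenName and, if present, middleName."""
    parts = [given_name]
    if middle_name:
        parts.append(middle_name)
    return "".join(part[0].upper() for part in parts if part)
```

Generating both cn forms lets a directory search match an individual whether or not the searcher supplies the middle name.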
Table B-5 Public Organizational Attributes
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read Access |
|---|---|---|---|---|---|---|---|
| title | Job title; designated position or function within organization. (free-form) | N | N | AEH | A | | Y |
| businessCategory | Terms that identify a person’s business, technical, special interest, or functions, e.g. "scientist", "molecular biology" (free-form) | N | Y | A | A | | Y |
| secretary | Timekeeper UID | N | N | AT | A | | Y |
| manager | Supervisor UID (leave approving official from ITAS) | N | N | ATJ | A | | Y |
| o | Institute or Center (IC) abbreviation; "NIH" if no IC | Y | N | ABFHJP | | BP | Y |
| organizationalStatus | C \| F \| G \| N \| V; C=contract; F=fellow; G=guest; N=NIH FTE; V=volunteer | N | N | ABFHJ | A | | Y |
| ou | Name and abbreviation of organization unit, generated by system from nihSAC | N | Y | S | | | Y |
| nihOrgAbbr | Organizational path name (e.g. "/NIH/CIT/OCRS/CFB/DSS/"). Determines distinguished name of entry in organizational view. | N | N | S | | | Y |
| nihSAC | NIH administrative code of person's ou | N | N | AH | A | | Y |
| nihTelecomOu | Organizational abbreviation used by NIH Telecommunications DB without IC component | N | N | AP | A | P | Y |
| nihCompanyName | Person's primary employment affiliation if not NIH | N | N | A | A | B | Y |
| nihCompanyPhone | Company telephone number in international format | N | N | A | A | B | Y |
Table B-6 Public Locator Attributes
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read Access |
|---|---|---|---|---|---|---|---|
| labeledURI | URL of NIH-related web site | N | Y | - | | | Y |
| | Preferred email address | N | N | AE | A | | Y |
| nihUniqueMail | Assigned email address @nih.gov | N | N | ET | AET | T(??) | Y |
| facsimileTelephoneNumber | Office fax number (international format) | N | N | AEP | A | P | Y |
| mobileTelephoneNumber | Office mobile number (international format) | N | N | A | A | | Y |
| pagerTelephoneNumber | Office pager number (international format) | N | N | AE | | | Y |
| telephoneNumber | Office telephone number (international format) | N | N | ABEPJ | | PB | Y |
| buildingName | Office building designator | N | N | ABEPJ | A | PB | Y |
| houseIdentifier | Same as buildingName. This attribute is not stored in the FTRDB or displayed on any forms; it exists only as a read-only, system-generated attribute in the directory. | N | N | S | | | Y |
| roomNumber | Room designator for office | N | N | ABEPJ | A | PB | Y |
| st | (state) State name for office. Generated by system from buildingName. | N | N | S | | PB | Y |
| c | (country) Always "US" | Y | N | S | | | Y |
| l | (locality) City name, or other local designator, for office. Generated by system from buildingName. | N | N | S | | PB | Y |
| nihPhysicalAddress | Physical location of office (RFC 2252 LDAPv3 postal address syntax, limited to 6 lines of 30 characters each) | N | N | S | | | Y |
| street | Street address and name for office | N | N | S | | | Y |
| nihMailstop | NIH mail stop code | N | N | AP | A | P | Y |
| nihDeliveryAddress | Delivery address for private carriers (e.g., FedEx, UPS) | N | N | A | A | | Y |
| postalAddress | Full USPS address, including street address, city, state, postal code, etc., to which mail can be sent. (RFC 2252 LDAPv3 postal address syntax, limited to 6 lines of 30 characters each) | N | N | A | A | | Y |
| postalCode | USPS ZIP | N | N | A | A | | Y |
Table B-7 Security Attributes
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read Access |
|---|---|---|---|---|---|---|---|
| userCertificate | Public Key Certificate | N | Y | S | | | Y |
| userPassword | Password to directory | N | N | S | O | | |
Table B-8 Ancillary Attributes
| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update | Read Access |
|---|---|---|---|---|---|---|---|
| creatorsName | UID of administrator creating entry | Y | N | S | | | |
| createTimestamp | Time stamp of nihInetOrgPerson creation event | Y | N | S | | | |
| modifiersName | UID of last administrator modifying entry | Y | N | S | | | |
| modifyTimestamp | Time stamp of last nihInetOrgPerson modify event | Y | N | S | | | |
| nihPersonStatus | A \| I \| T; A=active; I=inactive; T=transferring | N | N | ABFHJ | A | B | A |
| nihUidQuality | 0=not validated; 1=3rd party; 2=personal contact | Y | N | A | | | Y |
| nihUidValidator | UID of administrator validating UID | N | N | S | | | Y |
| nihUidValidationTimestamp | Time stamp of UID validation event | N | N | S | | | Y |
| nihDirEntryUnlisted | Y \| N; Y=unlisted directory entry. For "Fast Track", causes entry not to be present on LDAP server. NOTE: since public directory information for Federal employees is subject to the Freedom of Information Act, an entry cannot be unlisted without justification. | N | N | A | A | | AO |
| nihDirEntryEffectiveDate | Date directory entry becomes active | N | N | A(FHJ??) | | | Y |
| nihDirEntryExpirationDate | Date directory entry expires | N | N | A(BFHJ??) | | | Y |
Associate the individual identifying information from these with UIDs, and add to the FTRDB. Report UIDs with duplicate SSNs for review. Select and load the name attributes (personalTitle, givenName, middleName, sn, generationQualifier, and nihSuffixQualifier), homePhone, homePostalAddress, organizationalStatus, title (from job series), o (authoritative), ou, nihSAC, nihDirEntryEffectiveDate, and nihDirEntryExpirationDate attributes with the values found in the most recently modified record. Load work address, work telephone, and manager from JEFIC.
Select and load the name attributes, homePostalAddress, buildingName, roomNumber, and telephoneNumber attributes.
At this point, the FTRDB should contain entries for about 21,000 individuals, with UIDs associated with individual identifying information of good quality. The following steps add entries for individuals with UIDs which will have little or no associated individual identifying information.
Load nihEmailNickName attribute. Select and load nihUniqueMail, title, facsimileTelephoneNumber, pagerTelephoneNumber, telephoneNumber, buildingName, and roomNumber attributes.
When multiple records from different databases are joined or linked to the same UID, conflicting values for the same attribute may result. These conflicts will be resolved and attributes selected for loading as follows:
Linkage is the bringing together of information from two database records that relate to the same individual. Calculating the likelihood that the linkage is correct makes the linkage process probabilistic.
The degree of certainty that a linkage is correct depends upon the comparisons of available attributes (or fields) of the records, and the outcomes of these comparisons. Generally, agreement between the values of an attribute in a pair of records argues in favor of accepting them as a linked pair, while disagreement of attribute values is characteristic of an unlinked pair.
However, agreement on different attributes and values carries varying significance. For example:
The odds that a linkage is correct can be calculated by measuring the frequency of the outcomes of a comparison applied to a representative set of linked pairs, and dividing that by the frequency of the outcomes of the same comparison applied to a representative set of unlinked pairs:
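$$\text{odds} = \frac{\text{frequency of outcome } (x, y) \text{ among linked pairs}}{\text{frequency of outcome } (x, y) \text{ among unlinked pairs}}$$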
where:
x indicates the attribute and its value on the record from database A
y indicates the attribute and its value on the record from database B
When multiple comparisons involving various attributes and values are performed on a pair of records, the overall odds of correct linkage are calculated by simply multiplying together the odds of the individual comparisons. However, it is customary to express the odds as a binit weight:
$$\text{binit weight} = \log_2(\text{odds})$$
and to then calculate the total binit weight by summing the binit weights of the individual comparisons.
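The equivalence between multiplying odds and summing binit weights follows directly from the properties of logarithms. The short derivation below uses symbols introduced here for illustration only: $O_i$ for the odds of the $i$-th comparison and $W_i$ for its binit weight, with $n$ comparisons in total. Multiplying the odds in this way treats the comparisons as independent of one another, which is an approximation for real attributes.

```latex
O_{\text{total}} = \prod_{i=1}^{n} O_i
\qquad\Longrightarrow\qquad
W_{\text{total}} = \log_2 O_{\text{total}}
                 = \log_2 \prod_{i=1}^{n} O_i
                 = \sum_{i=1}^{n} \log_2 O_i
                 = \sum_{i=1}^{n} W_i
```

Working with sums rather than products keeps the numbers in a convenient range and lets each comparison's contribution be read off directly as a positive or negative weight.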
Note that the representative set of linked pairs need not be large (a few hundred is sufficient to start with), and it need not be perfect. Applying the linkage process generates more linked pairs, which can be added to the set, and the process iterated.
A representative set of unlinked pairs is not required if simple comparisons are used, because the outcome frequencies can be calculated. Care must be taken when performing complicated (and more powerful) comparisons, which involve:
Alternatively, one can estimate the outcome frequencies using any of several procedures, such as the Expectation-Maximization (EM) algorithm.
Finally, there are a multitude of tips and tricks for handling missing values and comparing surnames, initials, given names, dates and places of birth, and geographic attributes.
Record Linking Example
As a simple example of record linking, suppose that four attributes are compared on representative sample sets of linked and unlinked pairs of records from two files, and the outcome frequencies (expressed as percentages) are obtained as shown in Table D-1.
Table D-1 Example Attribute Comparison Outcome Frequencies
| Attribute | Outcome | Linked Pairs | Unlinked Pairs | Ratio | Binit Weight |
|---|---|---|---|---|---|
| Surname | Agree | 96.5% | 0.1% | 965.0000 | 9.9 |
| | Disagree | 3.5% | 99.9% | 0.0350 | -4.8 |
| Given Name | Agree | 79.0% | 0.9% | 87.7778 | 6.5 |
| | Disagree | 21.0% | 99.1% | 0.2119 | -2.2 |
| Date of Birth | Agree | 93.3% | 8.3% | 11.2410 | 3.5 |
| | Disagree | 6.7% | 91.7% | 0.0731 | -3.8 |
| Place of Birth | Agree | 98.1% | 11.7% | 8.3846 | 3.1 |
| | Disagree | 1.9% | 88.3% | 0.0215 | -5.5 |
Now suppose that a pair of records from the two files are compared, and they agree on surname, date of birth, and place of birth, but disagree on given name. Then the total binit weight of the link for the pair is calculated by adding the binit weights for the corresponding outcomes from Table D-1:
9.9 + (-2.2) + 3.5 + 3.1 = 14.3
This binit weight indicates odds of about $2^{14.3} \approx 20{,}000$ to 1 that the pair should be linked.
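The worked example can be checked with a few lines of code. The sketch below recomputes the binit weights from the Table D-1 percentages and sums them for the example pair. The dictionary layout and function names are assumptions made for illustration.

```python
import math

# (linked %, unlinked %) outcome frequencies copied from Table D-1.
FREQUENCIES = {
    ("surname", "agree"): (96.5, 0.1),
    ("surname", "disagree"): (3.5, 99.9),
    ("given_name", "agree"): (79.0, 0.9),
    ("given_name", "disagree"): (21.0, 99.1),
    ("date_of_birth", "agree"): (93.3, 8.3),
    ("date_of_birth", "disagree"): (6.7, 91.7),
    ("place_of_birth", "agree"): (98.1, 11.7),
    ("place_of_birth", "disagree"): (1.9, 88.3),
}


def binit_weight(attribute: str, outcome: str) -> float:
    """log2 of (frequency among linked pairs / frequency among unlinked pairs)."""
    linked, unlinked = FREQUENCIES[(attribute, outcome)]
    return math.log2(linked / unlinked)


def total_weight(outcomes: dict) -> float:
    """Sum the binit weights of the individual comparison outcomes."""
    return sum(binit_weight(attr, result) for attr, result in outcomes.items())


# The pair from the example: agrees on everything except given name.
example_pair = {
    "surname": "agree",
    "given_name": "disagree",
    "date_of_birth": "agree",
    "place_of_birth": "agree",
}
total = total_weight(example_pair)
odds = 2.0 ** total
```

Summing the unrounded weights gives a total of approximately 14.2, slightly below the 14.3 obtained from the one-decimal values in the table. Either way the odds are well above 10,000 to 1 in favor of a link.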
Wish List