NIH "Fast Track" Directory Project Definition

DRAFT 9

Tuesday, April 6, 1999

Table of Contents

1.0 Background

2.0 Purpose

3.0 Project Definition

3.1 System Components

3.2 Enterprise Databases and Directories Involved

3.3 Directory Relational Database

3.3.1 Strategy for FTRDB Initial Load

3.3.2 Plan for FTRDB Maintenance

3.3.3 Interface to FTRDB

3.3.4 Personal Identification Numbers

3.4 Record Linking Engine

3.5 Directory Registration and Update Service

3.5.1 Functionality

3.5.2 User Interface

3.6 Interface to PAID

3.7 Interface to Telecommunications Database

3.8 Interface to NIH Email Directory and Forwarding Service

3.9 LDAP Directory Server

3.10 Security Requirements

Appendix A "Fast Track" Database Entity-Relation Diagram

Appendix B "Fast Track" Attribute Descriptions

Appendix C "Fast Track" Database Creation

C.1 Record Linking

C.2 Attribute Selection

Appendix D Probabilistic Record Linkage

D.1 Record Linking Example

D.2 Record Linking References

Appendix E "Fast Track" Update Service User Interface

Appendix F Outstanding Questions and Issues

Appendix G Wish List

 

List of Tables

Table 3-1 Enterprise Databases and Directories

Table 3-2 NIH Databases Containing Individual Identifying Information

Table B-1 Abbreviations for Data Sources

Table B-2 Private Individual Identifying Information Associated with NIH UIDs

Table B-3 Private Home and Personal Locator Attributes

Table B-4 Public Labeling Attributes

Table B-5 Public Organizational Attributes

Table B-6 Public Locator Attributes

Table B-7 Security Attributes

Table B-8 Ancillary Attributes

Table D-1 Example Attribute Comparison Outcome Frequencies

1.0 Background

As a result of the NIH Director’s Retreat of September, 1996, the NIH Director commissioned an NIH Information Technology Central Committee (ITCC) to make recommendations for improving IT management at NIH. Among its seven major recommendations, the ITCC’s report of November, 1996 included the following:

The NIH Director gave the acting NIH CIO the task of implementing these recommendations. The CIO in turn commissioned the NIH Architectural Management Group (AMG), which is composed of representatives from each NIH ICD, to undertake this work. The AMG’s Report on Interoperability at the NIH, issued in May 1997, made the following recommendations relating to the security and directory strategies:

2.0 Purpose

The purpose of the "Fast Track" directory is to quickly bring up a working, but limited, directory containing NIH "white pages" information. The motivation for this effort is:

Building the "Fast Track" directory will give us valuable experience with:

3.0 Project Definition

In order to shorten the development time for the "fast track" directory, we must:

We can accomplish this by adopting the following design rules and limitations:

3.1 System Components

The major components of the "fast track" directory are:

3.2 Enterprise Databases and Directories Involved

Table 3-1 summarizes the existing enterprise databases and directories that will either provide information to build the "fast track" directory database or interact with the "fast track" directory update service.

Table 3-1 Enterprise Databases and Directories

| Name | Population Description | Pop. Size | Data Quality | Use |
| --- | --- | --- | --- | --- |
| Human Resources DB (HRDB) | All NIH FTE employees (Federal Civil and Public Health Service) at all NIH sites | 18,000 | Very good | Initial load, compare to directory and produce exception reports |
| Fellowship Payment System (FPS) | All NIH non-FTE employees (Visiting Fellows and IRTAs) at all NIH sites | 2,800 active: 1,200 Visiting Fellows; 1,600 IRTAs | Good | Initial load |
| J.E. Fogarty International Center (JEFIC) DB | All NIH foreign visiting scientists at all NIH sites | 2,100 active: 1,200 Visiting Fellows; 500 Visiting Scientists & Associates; 400 other | Good | Initial load |
| Parking and ID Badge (PAID) DB | All individuals working at NIH sites in Maryland, except FDA employees who do not have NIH parking permits or participate in TRANSHARE | 32,000: 24,000 employees; 6,000 contractors; 1,500 volunteers; 400 guests | Fair - Good | Initial load, receive directory updates from AOs |
| Telecommunications DB | Permanent NIH Federal employees, Temporary FTE employees >1 year, Temporary Federal physicians >6 months, and other non-Federal employees (40% of records, no SSN), at all NIH sites | 17,500 | Fair | Initial load, receive directory updates from AOs |
| NIH Email Directory and Forwarding Service (PH) | Most individuals registered for one or more NIH email services | >29,000 | Poor | Initial load |
| Integrated Time and Attendance System (ITAS) | All NIH FTE employees (Federal Civil and Public Health Service) at all NIH sites | 18,000 | Very good | Authentication and possibly authorization of AOs by directory registration and update service |

Of particular importance is individual identifying information, which is associated with an NIH UID so that an individual is assigned the same NIH UID each time they enter the NIH workforce. Table 3-2 summarizes the individual identifying information available in existing NIH databases.

 

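Table 3-2 NIH Databases Containing Individual Identifying Information

| Information | HRDB | JEFIC | FPS | PAID | Telecom | ITAS |
| --- | --- | --- | --- | --- | --- | --- |
| SSN | X | | X | | X | X |
| Date of Birth | X | X | | | | |
| Place of Birth | | X | | | | |
| Sex | X | X | | | | |
| Home Address | X | X | X | X | | |
| Home Telephone | | | | | X | |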

3.3 Directory Relational Database

The "fast track" directory relational database (FTRDB) will be implemented with Oracle on CIT's Digital Alpha Enterprise Open System; however, it will be accessed via the ODBC/SQL standards such that it could be readily moved to a different database product or platform. The tables (See Appendix A) store:

Access to individual identifying information covered by the Privacy Act will be controlled by views, and the NIH UIDs of individuals accessing the attributes listed in "Table B-2 Private Individual Identifying Information Associated with NIH UIDs" will be logged, along with the time of access.

The FTRDB will also contain:

See Appendix B for a detailed description of the main FTRDB fields. While the FTRDB data dictionary will include all nihInetOrgPerson attributes, the only attributes that will actually have values will be those that (1) can be initially loaded from one of the databases listed in Table B-1 or (2) can be entered and updated by AOs via the directory registration and update service.

3.3.1 Strategy for FTRDB Initial Load

The process of building the FTRDB will involve the following operations:

The strategy for loading the FTRDB is:

  1. Use the best data (in HRDB, FPS, and JEFIC) and the best keys (the SSN and JEFIC case number) to create entries with UIDs associated with quality individual identifying information.
  2. Use the SSN to link to records from ITAS and the Telecommunications databases in order to load in work locator attributes. These provide more common attributes on which to base links to the other databases, which do not contain the SSN.
  3. Link records from PAID on name and other common attributes, loading the additional attributes it contains.
  4. Add new UIDs for active records in the PAID database that have not yet been associated with a UID. These "weak" UIDs will have little or no associated individual identifying information.
  5. Link records from the Telecommunications and Email directories on name and other common attributes, loading the @nih.gov email address and other additional attributes they contain.

At each stage, conflicting attribute values for the same individual may be found in different databases, in which case values will be selected as described in Appendix C.

A detailed plan for loading the FTRDB is described in Appendix C.
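Step 2 of the strategy above (linking on SSN to load additional attributes) might look roughly like the following sketch. It assumes records are plain dictionaries keyed by the attribute names in Appendix B and that SSNs are unique within the FTRDB. The function name link_on_ssn is hypothetical, and only missing attributes are filled here, since conflict resolution is covered separately in Appendix C.

```python
def link_on_ssn(ftrdb_entries, source_records, attributes):
    """Copy the given attributes from source records into FTRDB entries
    that share the same SSN. Returns the source records left unmatched."""
    # Index existing entries by SSN; entries without an SSN cannot be linked here.
    by_ssn = {e["nihSSN"]: e for e in ftrdb_entries if e.get("nihSSN")}
    unmatched = []
    for rec in source_records:
        entry = by_ssn.get(rec.get("nihSSN"))
        if entry is None:
            unmatched.append(rec)
            continue
        for attr in attributes:
            # Fill only attributes the entry does not already have.
            if rec.get(attr) and not entry.get(attr):
                entry[attr] = rec[attr]
    return unmatched
```

Records returned as unmatched would be candidates for the later name-based linking steps.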

3.3.2 Plan for FTRDB Maintenance

Once the FTRDB is built, subsequent addition, update, and deactivation of records will be done by AOs by means of the directory registration and update service. However, viewed as a replacement for the current Request for DHHS Identification Card (form NIH 1308-4/5) and Request to Change NIH Directory Information (form NIH 433) business processes, this procedure will not track 100% of the directory population. First, the following groups of individuals are not issued NIH ID badges:

Second, temporary FTE employees, non-FTE employees, contractors, volunteers, and guests are not supposed to be listed in the NIH Telephone and Services Directory. Thus, there are populations of individuals not covered by either business process, as currently defined. But it is hoped that AOs will be willing to maintain directory information for these additional groups.

It will thus be necessary to periodically update the FTRDB with changes made to the HRDB, FPS, JEFIC, PAID, and Email systems that have bypassed the directory registration and update service. [The method for performing such updates needs to be better defined.]

3.3.3 Interface to FTRDB

[More detail to be supplied by Bob on creating new UIDs/entries and interface to the HRDB]

3.3.4 Personal Identification Numbers

A secret 4- to 8-digit Personal Identification Number (PIN), perhaps derived from an individual’s SSN, date of birth, or voice mail PIN, will be associated with each UID by storing it in the userPassword attribute. The "fast track" update service (see Section 3.5) will print an individual’s PIN, along with instructions for protection and use, on paper for the registering AO to give to a new employee or contractor. To protect against loss or theft, the paper will not contain any identification of the owning individual. In future phases of the directory project, an individual will be able to use the secret PIN together with their UID to authenticate to automated systems.

3.4 Record Linking Engine

As noted previously, record linking refers to the process of determining if two records belong to the same individual. Record linking has several uses in connection with the NIH directory:

  1. elimination of multiple records belonging to the same individual in existing databases,
  2. linking records across two databases (e.g. the CSO Email Directory and the HRDB) so they can be associated with the same NIH UID,
  3. determination of which UID, if any, belongs to a "new" employee who may have previously been assigned a UID, so the same UID can be reassigned, and
  4. in future phases, joining newly-created entries from connected directories into the NIH meta-directory.

Record linking is easy in situations where a decision can be made based on the agreement or disagreement of a single attribute, for example, the SSN. However, it becomes more difficult when the records to be linked do not contain such an attribute, and the decision must be based either on a single attribute that may partially agree (such as a name) or several attributes of which only some may agree (such as organization, office address, and office telephone number).

The more difficult cases may be handled by applying probabilistic record linking, described in more detail in Appendix D. Briefly, the record linking engine calculates a number, called a binit weight, which is the log2 of the odds that two records constitute a linked pair, i.e., that they belong to the same individual. Thus, a positive binit weight of, say, +10 indicates that the odds are about 1,000 to 1 in favor of a linkage, a negative binit weight of –10 indicates odds of about 1,000 to 1 against a linkage (an unlinked pair), and a binit weight of 0 indicates even odds in favor of (or against) a linkage. Depending on the acceptable number of false positive and false negative links, and the number of borderline pairs one is willing to manually review, an upper and lower threshold can be established. Binit weights above the upper threshold are accepted as linked pairs, those below the lower threshold are accepted as unlinked pairs, and those between are subjected to manual review, perhaps suggesting additional tests to be incorporated in the linking engine to improve its discriminating power.
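The relationship between binit weights, odds, and the two thresholds can be illustrated with a short sketch. The function names and the threshold values of +10 and -10 are assumptions chosen for illustration; the actual thresholds would depend on the acceptable error rates and review workload discussed above.

```python
import math


def binit_weight(odds):
    """Binit weight: log2 of the odds that two records are a linked pair."""
    return math.log2(odds)


def odds_from_weight(weight):
    """Inverse of binit_weight: recover the odds from a binit weight."""
    return 2.0 ** weight


def classify(weight, lower=-10.0, upper=10.0):
    """Classify a candidate pair using hypothetical lower and upper thresholds."""
    if weight > upper:
        return "linked"
    if weight < lower:
        return "unlinked"
    return "review"  # borderline pair, subject to manual review
```

For example, a weight of +10 corresponds to odds of 2^10 = 1,024 to 1, which is the "about 1,000 to 1" figure quoted above.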

3.5 Directory Registration and Update Service

The "fast track" directory update service (FTUS) will enable AOs to register, update, and de-register NIH employees and contractors. It will be implemented as a web application server that will interact with AOs via Netscape or Microsoft browser clients and HTTP/HTML, and with the FTRDB via ODBC/SQL. The HTTP/HTML browser client will include a trusted certificate authority certificate, which will enable an SSLv2 connection to be made to the FTUS. AOs will supply their SSN and ITAS passwords to the FTUS over this secured connection. The FTUS will then query the ITAS database to validate the password, confirm that the SSN belongs to an AO, and determine the organization for which the AO is authorized to use the FTUS.

3.5.1 Functionality

The FTUS will allow an authorized AO to:

The Badge Office and Telephone Directory Unit can also be authorized to use the FTUS to update the FTRDB with information received from walk-ins and paper forms. Updates made in this fashion will cause notification to be sent via email to the requesting AO and affected individual. Update access will be permitted to only those attributes present on the current Request for DHHS Identification Card (form NIH 1308-4/5) and Request to Change NIH Directory Information (form NIH 433).

The FTUS will be designed to allow ICs to easily extend it to collect additional information, automatically create LAN or email accounts, or send additional notifications, for example.

During the "Fast Track" phase, NIH UIDs will not be widely distributed, and individuals will in general not know their UID. We particularly need to provide email account administrators and others with a tool they can use to find someone’s UID so they can begin to add the UID to the accounts they manage. Therefore, the FTUS will also provide an interface to allow anyone to search the active records in the FTRDB (using the linking engine) for an individual’s UID by entering the individual’s surname (sn) and any or all of the following public attributes: cn, givenName, nihNickname, middleName, o, ou, nihCompanyName, telephoneNumber, buildingName, roomNumber.
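A minimal sketch of such a UID lookup is shown below. It is a simplification: exact, case-insensitive matching stands in for the linking engine, every attribute is treated as single-valued, and entries are assumed to be dictionaries keyed by the attribute names in Appendix B. The function name find_uids is hypothetical.

```python
# Public attributes that may accompany the surname in a search.
SEARCHABLE = ("cn", "givenName", "nihNickname", "middleName", "o", "ou",
              "nihCompanyName", "telephoneNumber", "buildingName", "roomNumber")


def find_uids(entries, sn, **criteria):
    """Return the UIDs of active entries matching the surname and any
    additional public attributes supplied as keyword arguments."""
    unknown = set(criteria) - set(SEARCHABLE)
    if unknown:
        raise ValueError("not searchable: " + ", ".join(sorted(unknown)))
    matches = []
    for entry in entries:
        if entry.get("nihPersonStatus") != "A":
            continue  # only active records are searched
        if entry.get("sn", "").lower() != sn.lower():
            continue
        if all(str(entry.get(k, "")).lower() == str(v).lower()
               for k, v in criteria.items()):
            matches.append(entry["uniqueIdentifier"])
    return matches
```

Restricting the keyword arguments to the public attributes keeps private attributes such as nihSSN out of the search interface.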

3.5.2 User Interface

Prototype FTUS user interface screens are depicted in Appendix E. These generally adhere to the following guidelines:

  1. The collection of fields that appear together on a screen or tab corresponds to the attributes described in the tables in Appendix B.
  2. Fields for required attributes are visually distinguished from those for optional attributes.
  3. Fields for multi-valued attributes are drop-down or scrollable fill-in lists.
  4. Fields for attributes that a user is not authorized to read are completely filled in with "*"s.
  5. Fields for attributes that a user is authorized to read, but not modify, are visually distinguished from those that a user is authorized to change.
  6. All fields are fixed format, unless explicitly indicated otherwise in Appendix B.
  7. Fields for attributes with FTRDB validation tables (job series, organization names, building names, work telephone area codes and exchanges) are drop-down pick lists.
  8. Attributes listed in "Table B-2 Private Individual Identifying Information Associated with NIH UIDs" will not be displayed unless (a) the user is authorized, and (b) the user has indicated via a check box field that they need to see this information.
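Guidelines 4 and 8 above could be expressed along the following lines. This is a sketch only: the function name render_field and its parameters are hypothetical, and returning None is simply one way to signal that a field should be left off the screen.

```python
def render_field(value, width, can_read, is_private=False, show_private=False):
    """Return the text to display for a field, or None if it should be hidden.

    is_private marks the Table B-2 attributes; show_private reflects the
    check box by which the user indicates a need to see them.
    """
    if is_private and not (can_read and show_private):
        return None  # guideline 8: private attribute not displayed
    if not can_read:
        return "*" * width  # guideline 4: fill the field with asterisks
    return str(value)
```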

3.6 Interface to PAID

[To be supplied by Denney and Diane]

3.7 Interface to Telecommunications Database

[To be supplied by Dave]

3.8 Interface to NIH Email Directory and Forwarding Service

An AO will generally not know an individual’s @nih.gov email alias to enter into the FTRDB via the FTUS. Ideally, all email administrators will add the NIH UID to the email accounts they manage, and include the UID in the information they feed to the NIH Email Directory and Forwarding Service (PH). This would enable PH (or its replacement) to easily recognize duplicate entries, link entries and exchange attribute information with the FTRDB, and handle deregistration. Unfortunately, this is a difficult process to implement for the "Fast Track" because users will generally not know their UIDs, and there are 23 email systems that feed PH, each with its own group of administrators.

The plan for dealing with this situation is:

[Note: Not a very convincing plan.]

3.9 LDAP Directory Server

An LDAP directory server, such as Netscape's, will provide read-only access to the active and listed public information contained in the FTRDB. This will be accomplished by transferring a copy of this information from the FTRDB to the LDAP directory server daily.

3.10 Security Requirements
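One possible shape for the daily transfer is an export of active, listed entries to LDIF text for loading into the LDAP server. The sketch below is illustrative only: the base DN, the subset of public attributes, and the function names are assumptions, and it ignores multi-valued attributes and the base64 encoding that LDIF requires for some values.

```python
# Hypothetical subset of the public attributes from Appendix B.
PUBLIC_ATTRS = ("cn", "sn", "givenName", "o", "telephoneNumber",
                "buildingName", "roomNumber", "mail")
BASE_DN = "o=NIH,c=US"  # assumed base DN, not specified in this document


def is_exportable(entry):
    """Only active (A) entries that are not unlisted go to the LDAP server."""
    return (entry.get("nihPersonStatus") == "A"
            and entry.get("nihDirEntryUnlisted") != "Y")


def to_ldif(entry):
    """Render one entry as an LDIF record containing public attributes only."""
    lines = [
        f"dn: uniqueIdentifier={entry['uniqueIdentifier']},{BASE_DN}",
        "objectClass: nihInetOrgPerson",
    ]
    for attr in PUBLIC_ATTRS:
        value = entry.get(attr)
        if value:
            lines.append(f"{attr}: {value}")
    return "\n".join(lines) + "\n"


def export_ldif(entries):
    """Join the LDIF records for all exportable entries, blank-line separated."""
    return "\n".join(to_ldif(e) for e in entries if is_exportable(e))
```

Because only the attributes named in PUBLIC_ATTRS are copied, private attributes never reach the read-only server even if they are present in the source entry.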

The following information is from the DHHS handbook located at http://wwwoirm.nih.gov/policy/aissp.html.

The Central Directory would have a sensitivity level of either 2 or 3, depending on exactly what data is included in the database.

C. Security Level Requirements

  1. Level 1 Requirements

The controls required to adequately safeguard a Level 1 AIS, AIS facility, or ITU are those which would normally be considered good management practice. These include, but are not limited to:

  1. An employee AIS security awareness and training program.
  2. The assignment of sensitivity designations to every employee position.
  3. Physical access controls.
  4. A complete set of AIS documentation.
  2. Level 2 Requirements

The controls required to adequately safeguard a Level 2 AIS, AIS facility, or ITU include all of the requirements for Level 1, plus the following requirements:

  1. A detailed risk management program.
  2. A CSSP for systems processing sensitive information.
  3. Record retention procedures.
  4. A list of authorized users.
  5. Security review and certification procedures.
  6. Required background investigations for all employees.
  7. Required background investigations for all contractor personnel.
  8. A detailed fire emergency plan.
  9. A formal written contingency plan.
  10. A formal risk analysis.
  11. An automated audit trail.
  12. Authorized access and control procedures.
  13. Secure physical transportation procedures.
  14. Secure telecommunications.
  15. An emergency power program.
  3. Level 3 Requirements

The controls required to adequately safeguard a Level 3 AIS, AIS facility, or ITU include all of the requirements for Levels 1 and 2, plus the requirement for an inventory of hardware and software.

CIT will need to develop a complete security plan for the directory before it becomes operational, in which all of these issues will be addressed.

Appendix A "Fast Track" Database Entity-Relation Diagram

Appendix B "Fast Track" Attribute Descriptions

NOTE: An attribute name has the prefix nih when it does not match an X.500, LIPS, or LDAP standard attribute name. This is to avoid conflicts with future standard name usage.

Table B-1 Abbreviations for Data Sources

| Symbol | Mnemonic | Description |
| --- | --- | --- |
| A | AO | Administrative Officer for owner of entry |
| B | PAID | Parking/ID Badge/Transhare DB |
| E | PH | NIH Email Directory and Forwarding Service |
| F | FPS | Fellowship Payment System |
| H | HRDB | NIH Human Resources Database |
| J | JEFIC | J. E. Fogarty International Center DB |
| O | OWN | Owner of entry (individual identified by entry). Not implemented for "Fast Track". |
| P | TELCOM | NIH Telecommunications DB (Phone) |
| S | FTRDB | Fast Track Relational Database System |
| T | ITAS | Integrated Time and Attendance System |
| Y | ANY | Anyone |

     

Table B-2 Private Individual Identifying Information Associated with NIH UIDs

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| nihSSN | Permanent or temporary social security number (ddd-dd-dddd) | N | N | AFHP | A | P | A |
| nihDateOfBirth | Date of birth (yyyy-mm-dd) | N | N | AH | A | | A |
| nihCityOfBirth | City of birth | N | N | AJ | A | | A |
| nihStateOfBirth | State or province of birth | N | N | AJ | A | | A |
| nihCountryOfBirth | Country of birth (FIPS code via validation table) | N | N | AJ | A | | A |
| nihAliasGivenName | Other given names associated with uniqueIdentifier | N | Y | ABEFHP | A | | A |
| nihAliasMiddleName | Other middle names (or initials) associated with uniqueIdentifier | N | Y | ABEFHP | A | | A |
| nihAliasSn | Other surnames associated with uniqueIdentifier | N | Y | ABEFHP | A | | A |
| nihMothersSurname | Mother's maiden surname | N | N | A | A | | A |
| nihGender | M or F; M=male; F=female | N | N | AHJ | A | | A |

     

Table B-3 Private Home and Personal Locator Attributes

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| homePhone | Home telephone number in full international format | N | N | AJ | A | | AO |
| homeFax | Home fax number in full international format | N | N | A | A | | AO |
| homePostalAddress | Home postal address (street-address, city, state, postal-code) (RFC 2252 LDAPv3 postal address syntax, limited to 6 lines of 30 characters each) | N | N | ABFHJ | A | | AO |
| personalMobile | Personal mobile telephone number in full international format | N | N | A | A | | AO |
| personalPager | Personal pager number in full international format | N | N | A | A | | AO |
| nihHomeMail | Personal email address | N | N | A | A | | AO |
| nihEmergencyContactCn | Common name of emergency contact | N | N | A | A | | AO |
| nihEmergencyContactPhone | Telephone number of emergency contact in full international format | N | N | A | A | | AO |

     

Table B-4 Public Labeling Attributes

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| cn | (common name) System-generated from givenName, middleName, and sn. Values both with and without middleName generated if middleName attribute exists. Other values may be added by Update From sources. | Y | Y | S | A | B | Y |
| generationQualifier | e.g. Jr, III from validation table | N | N | ABHJP | A | BP | Y |
| givenName | First name | Y | N | ABFHJP | A | BP | Y |
| initials | Initial letters derived from givenName and middleName | N | N | S | | | Y |
| personalTitle | e.g. Mr., Dr. from validation table | N | N | ABJP | A | BP | Y |
| uniqueIdentifier | Assigned by system | Y | N | S | | | Y |
| middleName | Middle name or initial | N | N | ABHJP | A | BP | Y |
| sn | (surname) Last name | Y | N | ABFHJP | A | BP | Y |
| nihSuffixQualifier | e.g. MD, PhD from validation table | N | N | AJP | A | BP | Y |
| description | Free-form, multi-line | N | N | A | AO | | Y |
| nihEmailNickname | Nicknames from NIH Email Directory and Forwarding Service | N | Y | E | AO | | Y |
| nihNickname | Nicknames (givenNames only) | N | Y | ABFHJP | AO | BP | Y |
| jpegPhoto | Full size ID photo in jpeg binary format | N | N | - | B | | Y(??) |
| thumbnailPhoto | Thumbnail ID photo in jpeg binary format | N | N | - | B | | Y(??) |

     

Table B-5 Public Organizational Attributes

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| title | Job title; designated position or function within organization. (free-form) | N | N | AEH | A | | Y |
| businessCategory | Terms that identify a person’s business, technical, special interest, or functions, e.g. "scientist", "molecular biology" (free-form) | N | Y | A | A | | Y |
| secretary | Timekeeper UID | N | N | AT | A | | Y |
| manager | Supervisor UID (leave approving official from ITAS) | N | N | ATJ | A | | Y |
| o | Institute or Center (IC) abbreviation; "NIH" if no IC | Y | N | ABFHJP | | BP | Y |
| organizationalStatus | One of C, F, G, N, V: C=contract; F=fellow; G=guest; N=NIH FTE; V=volunteer | N | N | ABFHJ | A | | Y |
| ou | Name and abbreviation of organization unit, generated by system from nihSAC | N | Y | S | | | Y |
| nihOrgAbbr | Organizational path name (e.g. "/NIH/CIT/OCRS/CFB/DSS/"). Determines distinguished name of entry in organizational view. | N | N | S | | | Y |
| nihSAC | NIH administrative code of person's ou | N | N | AH | A | | Y |
| nihTelecomOu | Organizational abbreviation used by NIH Telecommunications DB without IC component | N | N | AP | A | P | Y |
| nihCompanyName | Person's primary employment affiliation if not NIH | N | N | A | A | B | Y |
| nihCompanyPhone | Company telephone number in international format | N | N | A | A | B | Y |

Table B-6 Public Locator Attributes

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| labeledURI | URL of NIH-related web site | N | Y | - | | | Y |
| mail | Preferred email address | N | N | AE | A | | Y |
| nihUniqueMail | Assigned email address @nih.gov | N | N | ET | AET | T(??) | Y |
| facsimileTelephoneNumber | Office fax number (international format) | N | N | AEP | A | P | Y |
| mobileTelephoneNumber | Office mobile number (international format) | N | N | A | A | | Y |
| pagerTelephoneNumber | Office pager number (international format) | N | N | AE | | | Y |
| telephoneNumber | Office telephone number (international format) | N | N | ABEPJ | | PB | Y |
| buildingName | Office building designator | N | N | ABEPJ | A | PB | Y |
| houseIdentifier | Same as buildingName. This attribute is not stored in the FTRDB or displayed on any forms; it exists only as a read-only, system-generated attribute in the directory. | N | N | S | | | Y |
| roomNumber | Room designator for office | N | N | ABEPJ | A | PB | Y |
| st | (state) State name for office. Generated by system from buildingName. | N | N | S | | PB | Y |
| c | (country) Always "US" | Y | N | S | | | Y |
| l | (locality) City name, or other local designator, for office. Generated by system from buildingName. | N | N | S | | PB | Y |
| nihPhysicalAddress | Physical location of office (RFC 2252 LDAPv3 postal address syntax, limited to 6 lines of 30 characters each) | N | N | S | | | Y |
| street | Street address and name for office | N | N | S | | | Y |
| nihMailstop | NIH mail stop code | N | N | AP | A | P | Y |
| nihDeliveryAddress | Delivery address for private carriers (e.g., FedEx, UPS) | N | N | A | A | | Y |
| postalAddress | Full USPS address, including street address, city, state, postal code, etc., to which mail can be sent. (RFC 2252 LDAPv3 postal address syntax, limited to 6 lines of 30 characters each) | N | N | A | A | | Y |
| postalCode | USPS ZIP | N | N | A | A | | Y |

     

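Table B-7 Security Attributes

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| userCertificate | Public Key Certificate | N | Y | S | | | Y |
| userPassword | Password to directory | N | N | S | O | | |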

     

Table B-8 Ancillary Attributes

| Attribute | Description | Req | Multi Valued | Initial Source | Update From | Update To | Read Access |
| --- | --- | --- | --- | --- | --- | --- | --- |
| creatorsName | UID of administrator creating entry | Y | N | S | | | |
| createTimestamp | Time stamp of nihInetOrgPerson creation event | Y | N | S | | | |
| modifiersName | UID of last administrator modifying entry | Y | N | S | | | |
| modifyTimestamp | Time stamp of last nihInetOrgPerson modify event | Y | N | S | | | |
| nihPersonStatus | One of A, I, T: A=active; I=inactive; T=transferring | N | N | ABFHJ | A | B | A |
| nihUidQuality | 0=not validated; 1=3rd party; 2=personal contact | Y | N | A | | | Y |
| nihUidValidator | UID of administrator validating UID | N | N | S | | | Y |
| nihUidValidationTimestamp | Time stamp of UID validation event | N | N | S | | | Y |
| nihDirEntryUnlisted | Y or N; Y=unlisted directory entry. For "Fast Track", causes entry not to be present on LDAP server. NOTE: since public directory information for Federal employees is subject to the Freedom of Information Act, an entry cannot be unlisted without justification. | N | N | A | A | | AO |
| nihDirEntryEffectiveDate | Date directory entry becomes active | N | N | A(FHJ??) | | | Y |
| nihDirEntryExpirationDate | Date directory entry expires | N | N | A(BFHJ??) | | | Y |

  4. "Fast Track" Database Creation
    1. Record Linking
  1. Create UIDs for HRDB, FPS, and JEFIC after linking to identify overlaps.

Associate the individual identifying information from these sources with UIDs, and add it to the FTRDB. Report UIDs with duplicate SSNs for review. Select and load the name attributes (personalTitle, givenName, middleName, sn, generationQualifier, and nihSuffixQualifier), homePhone, homePostalAddress, organizationalStatus, title (from job series), o (authoritative), ou, nihSAC, nihDirEntryEffectiveDate, and nihDirEntryExpirationDate attributes with the values found in the most recently modified record. Load work address, work telephone, and manager from JEFIC.

  2. Link records from ITAS to the FTRDB on SSN, and select and load the secretary (timekeeper), manager (leave approving official), and nihUniqueMail attributes.
  3. Link records from the Telecommunications database to the FTRDB on SSN.
  4. Link records from the PAID database to the FTRDB on name, organizationalStatus, home address, organization, work address(?), and work phone(?).

Select and load the name attributes, homePostalAddress, buildingName, roomNumber, and telephoneNumber attributes.

At this point, the FTRDB should contain entries for about 21,000 individuals, with UIDs associated with individual identifying information of good quality. The following steps add entries for individuals whose UIDs will have little or no associated individual identifying information.

  5. Add new UIDs for active records in PAID that have not yet been associated with a UID. These should be individuals who are contractors, visitors, and guests, and who are not JEFIC "Others". Load the name attributes, homePostalAddress, organizationalStatus, o, buildingName, roomNumber, and telephoneNumber attributes.
  6. Link records from the Telecommunications database to the FTRDB on the name, o, nihTelecomOu, telephoneNumber, buildingName, and roomNumber attributes. Report links with conflicting SSNs for clerical review. Select and load the name, nihTelecomOu, telephoneNumber, buildingName, and roomNumber attributes.
  7. Link records from the NIH Email Directory and Forwarding Service to the FTRDB on name, o, nihUniqueMail, telephoneNumber, facsimileTelephoneNumber(?), and physical office location(?).

Load the nihEmailNickName attribute. Select and load the nihUniqueMail, title, facsimileTelephoneNumber, pagerTelephoneNumber, telephoneNumber, buildingName, and roomNumber attributes.
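The SSN-based linking steps (ITAS and Telecommunications) and the duplicate-SSN report could be sketched roughly as follows. This is only an illustration: the record layout (dictionaries with "uid" and "ssn" keys) and the function names link_on_ssn and duplicate_ssns are assumptions made for the example, not part of the FTRDB design.

```python
from collections import defaultdict

def link_on_ssn(ftrdb_entries, source_records):
    """Attach each source record to the FTRDB entry that has the same SSN.

    Both arguments are lists of dicts (hypothetical layout). Returns
    (linked, unlinked): linked is a list of (uid, record) pairs, and
    unlinked holds the records with no SSN match.
    """
    uid_by_ssn = {e["ssn"]: e["uid"] for e in ftrdb_entries if e.get("ssn")}
    linked, unlinked = [], []
    for record in source_records:
        uid = uid_by_ssn.get(record.get("ssn"))
        if uid is None:
            unlinked.append(record)
        else:
            linked.append((uid, record))
    return linked, unlinked

def duplicate_ssns(ftrdb_entries):
    """Return {ssn: [uids]} for every SSN carried by more than one UID."""
    uids_by_ssn = defaultdict(set)
    for entry in ftrdb_entries:
        if entry.get("ssn"):
            uids_by_ssn[entry["ssn"]].add(entry["uid"])
    return {ssn: sorted(uids) for ssn, uids in uids_by_ssn.items() if len(uids) > 1}

# Made-up sample data for illustration only.
ftrdb = [
    {"uid": 1001, "ssn": "000-00-0001"},
    {"uid": 1002, "ssn": "000-00-0002"},
    {"uid": 1003, "ssn": "000-00-0002"},  # same SSN as 1002: report for review
]
itas = [
    {"ssn": "000-00-0001", "nihUniqueMail": "a@example.gov"},
    {"ssn": "000-00-0009", "nihUniqueMail": "b@example.gov"},
]
linked, unlinked = link_on_ssn(ftrdb, itas)
review = duplicate_ssns(ftrdb)
```

Records left in unlinked would be candidates for the later name-based linking steps, and the entries in review correspond to the UIDs reported for duplicate SSNs.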

 

C.2 Attribute Selection

When multiple records from different databases are joined or linked to the same UID, conflicting values for the same attribute may result. These conflicts will be resolved and attributes selected for loading as follows:

  1. HRDB, ITAS, FPS, JEFIC, and PAID are the authoritative sources, in that order of priority, for the following attributes: generationQualifier, givenName, personalTitle, middleName, sn, nihSuffixQualifier.
  2. Different values for givenName, middleName, and sn from other sources are not treated as conflicts, but are saved as nihAliasGivenName, nihAliasMiddleName, nihAliasSn, cn, and nihNickName attributes.
  3. HRDB, FPS, JEFIC, and PAID are the authoritative sources, in that order of priority, for the following attributes: organizationalStatus, o, and ou.
  4. Attributes will only be loaded from records with a matching o (IC) attribute.
  5. Attributes from Telecommunications will be loaded in preference to those from PAID or the NIH Email Directory.
  6. Conflicting attributes will be selected from PAID or the NIH Email Directory based on which was more recently modified.
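
The selection rules above can be illustrated with a small sketch. The data layout (a dictionary mapping a source name to its candidate value), the "Email" label for the NIH Email Directory, and all function names are assumptions made for the example; the rule requiring a matching o (IC) attribute is left out for brevity.

```python
# Source priority lists taken from the attribute selection rules.
NAME_SOURCES = ["HRDB", "ITAS", "FPS", "JEFIC", "PAID"]
ORG_SOURCES = ["HRDB", "FPS", "JEFIC", "PAID"]

def select_by_priority(values_by_source, priority):
    """Return (value, source) from the highest-priority source that has a value."""
    for source in priority:
        value = values_by_source.get(source)
        if value:
            return value, source
    return None, None

def collect_aliases(values_by_source, selected_value):
    """Differing name values are kept as aliases rather than treated as conflicts."""
    return sorted({v for v in values_by_source.values() if v and v != selected_value})

def select_locator(candidates):
    """Pick a locator attribute value.

    candidates maps a source name to (value, last_modified), where
    last_modified is an ISO date string (hypothetical layout).
    Telecommunications is preferred; otherwise the more recently
    modified of PAID and the NIH Email Directory is used.
    """
    if "Telecommunications" in candidates:
        return candidates["Telecommunications"][0]
    others = [candidates[s] for s in ("PAID", "Email") if s in candidates]
    if not others:
        return None
    return max(others, key=lambda pair: pair[1])[0]

# Made-up values for illustration only.
given_names = {"PAID": "Bob", "ITAS": "Robert", "JEFIC": "Robert"}
given_name, source = select_by_priority(given_names, NAME_SOURCES)
aliases = collect_aliases(given_names, given_name)
```

Here ITAS outranks JEFIC and PAID, so "Robert" is loaded as givenName and "Bob" is retained as an alias value.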
Appendix D Probabilistic Record Linkage

Linkage is the bringing together of information from two database records that relate to the same individual. Calculating the likelihood that the linkage is correct makes the linkage process probabilistic.

The degree of certainty that a linkage is correct depends upon the comparisons of available attributes (or fields) of the records, and the outcomes of these comparisons. Generally, agreement between the values of an attribute in a pair of records argues in favor of accepting them as a linked pair, while disagreement of attribute values is characteristic of an unlinked pair.

However, agreement on different attributes and values carries varying significance. For example:

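The odds that a linkage is correct can be calculated by measuring the frequency of the outcomes of a comparison applied to a representative set of linked pairs, and dividing that by the frequency of the outcomes of the same comparison applied to a representative set of unlinked pairs:

$$\text{odds} = \frac{\text{frequency of outcome } (x, y) \text{ among linked pairs}}{\text{frequency of outcome } (x, y) \text{ among unlinked pairs}}$$

where: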

x indicates the attribute and its value on the record from database A

y indicates the attribute and its value on the record from database B

When multiple comparisons involving various attributes and values are performed on a pair of records, the overall odds of correct linkage are calculated by simply multiplying together the odds of the individual comparisons. However, it is customary to express the odds as a binit weight:

$$\text{binit weight} = \log_2(\text{odds})$$

and to then calculate the total binit weight by summing the binit weights of the individual comparisons.

Note that the representative set of linked pairs need not be large (a few hundred is sufficient to start with), and it need not be perfect. Applying the linkage process generates more linked pairs, which can be added to the set, and the process iterated.

A representative set of unlinked pairs is not required if simple comparisons are used, because the outcome frequencies can be calculated. Care must be taken when performing complicated (and more powerful) comparisons, which involve:

Alternatively, one can estimate the outcome frequencies using any of several procedures, such as the Expectation-Maximization (EM) algorithm.
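
As a rough illustration of the EM approach, the sketch below estimates agreement frequencies for linked and unlinked pairs from unlabeled comparison outcomes. It assumes simple agree/disagree comparisons that are conditionally independent within each class, which is a common simplification and not something this document specifies. The function name em_estimate, the starting values, and the sample counts are all made up for the example.

```python
def em_estimate(pattern_counts, iterations=50, p=0.1, m_init=0.9, u_init=0.1):
    """Estimate outcome frequencies with a basic EM loop.

    pattern_counts maps a tuple of 0/1 outcomes (1 = agree), one per
    compared attribute, to the number of record pairs showing it.
    Returns (p, m, u): p is the estimated share of linked pairs, m[i]
    the agreement frequency of attribute i among linked pairs, and
    u[i] the same among unlinked pairs.
    """
    k = len(next(iter(pattern_counts)))
    m, u = [m_init] * k, [u_init] * k
    total = sum(pattern_counts.values())
    eps = 1e-6  # keeps probabilities away from exactly 0 or 1

    def likelihood(pattern, probs):
        result = 1.0
        for agree, prob in zip(pattern, probs):
            result *= prob if agree else 1.0 - prob
        return result

    for _ in range(iterations):
        # E-step: probability that each outcome pattern comes from a linked pair.
        g = {}
        for pattern in pattern_counts:
            linked = p * likelihood(pattern, m)
            unlinked = (1.0 - p) * likelihood(pattern, u)
            g[pattern] = linked / (linked + unlinked)
        # M-step: re-estimate the frequencies from the weighted counts.
        linked_mass = sum(g[pat] * n for pat, n in pattern_counts.items())
        unlinked_mass = total - linked_mass
        for i in range(k):
            m_i = sum(g[pat] * n * pat[i] for pat, n in pattern_counts.items()) / linked_mass
            u_i = sum((1.0 - g[pat]) * n * pat[i] for pat, n in pattern_counts.items()) / unlinked_mass
            m[i] = min(max(m_i, eps), 1.0 - eps)
            u[i] = min(max(u_i, eps), 1.0 - eps)
        p = min(max(linked_mass / total, eps), 1.0 - eps)
    return p, m, u

# Made-up counts for three compared attributes.
counts = {
    (1, 1, 1): 20, (1, 1, 0): 5, (0, 0, 0): 200,
    (1, 0, 0): 50, (0, 1, 0): 50, (0, 0, 1): 50,
}
p, m, u = em_estimate(counts)
```

The resulting m[i] and u[i] play the role of the linked and unlinked outcome frequencies, so the ratio m[i] / u[i] gives the odds for agreement on attribute i.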

Finally, there are a multitude of tips and tricks for handling missing values and comparing surnames, initials, given names, dates and places of birth, and geographic attributes.

D.1 Record Linking Example

As a simple example of record linking, suppose that four attributes are compared on representative sample sets of linked and unlinked pairs of records from two files, and the outcome frequencies (expressed as percentages) are obtained as shown in Table D-1.

Table D-1 Example Attribute Comparison Outcome Frequencies

Attribute        Outcome    Linked Pairs   Unlinked Pairs   Ratio      Binit Weight
Surname          Agree      96.5%           0.1%            965.0000    9.9
                 Disagree    3.5%          99.9%              0.0350   -4.8
Given Name       Agree      79.0%           0.9%             87.7778    6.5
                 Disagree   21.0%          99.1%              0.2119   -2.2
Date of Birth    Agree      93.3%           8.3%             11.2410    3.5
                 Disagree    6.7%          91.7%              0.0731   -3.8
Place of Birth   Agree      98.1%          11.7%              8.3846    3.1
                 Disagree    1.9%          88.3%              0.0215   -5.5

Now suppose that a pair of records from the two files is compared, and the records agree on surname, date of birth, and place of birth, but disagree on given name. Then the total binit weight of the link for the pair is calculated by adding the binit weights for those individual outcomes from Table D-1:

9.9 + (-2.2) + 3.5 + 3.1 = 14.3

This binit weight indicates odds of about 2^14.3 ≈ 20,000 to 1 that the pair should be linked.
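
The arithmetic of this example can be checked with a short sketch. The frequencies are those of Table D-1; the dictionary layout and function names are assumptions made for illustration.

```python
import math

# Outcome frequencies from Table D-1: attribute -> outcome -> (linked, unlinked).
FREQUENCIES = {
    "surname":        {"agree": (0.965, 0.001), "disagree": (0.035, 0.999)},
    "given_name":     {"agree": (0.790, 0.009), "disagree": (0.210, 0.991)},
    "date_of_birth":  {"agree": (0.933, 0.083), "disagree": (0.067, 0.917)},
    "place_of_birth": {"agree": (0.981, 0.117), "disagree": (0.019, 0.883)},
}

def binit_weight(attribute, outcome):
    """Base-2 logarithm of the linked/unlinked frequency ratio, to one decimal."""
    linked, unlinked = FREQUENCIES[attribute][outcome]
    return round(math.log2(linked / unlinked), 1)

def total_binit_weight(outcomes):
    """Sum the binit weights of the individual comparison outcomes."""
    return round(sum(binit_weight(a, o) for a, o in outcomes.items()), 1)

# The pair from the example: agrees on everything except given name.
pair = {
    "surname": "agree",
    "given_name": "disagree",
    "date_of_birth": "agree",
    "place_of_birth": "agree",
}
total = total_binit_weight(pair)  # 14.3
odds = 2 ** total                 # roughly 20,000 to 1
```

Summing weights and then raising 2 to the total is equivalent to multiplying the individual odds, apart from the rounding of each weight to one decimal place.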

D.2 Record Linking References
  1. Gill, L. E., and Baldwin, J. A. (1987), "Methods and technology of record linkage: some practical considerations" in J. Baldwin, E. D. Acheson, and W. Graham (ed.) Textbook of Medical Record Linkage, Oxford: Oxford University Press, 39-54.
  2. Newcombe, H. B. (1988), Handbook of Record Linkage: Methods for Health and Statistical Studies, Administration, and Business, Oxford: Oxford University Press. Classic book reference. Covers some of the theory and much of the heuristics needed for good record linkage practice. Now out of print.
  3. Newcombe, H. B. (1987), "Record linking: the design of efficient systems for linking records into individual and family histories" in J. Baldwin, E. D. Acheson, and W. Graham (ed.) Textbook of Medical Record Linkage, Oxford: Oxford University Press, 15-38.
  4. Winkler, W. E. (1994), "Advanced Methods of Record Linkage," American Statistical Association, Proceedings of the Section of Survey Research Methods, 467-472. Describes new theory and algorithms in computer science, operations research, and statistics that were developed at the Census Bureau and used in current Census system. Extends original Jaro string comparator and gives likelihood-based methods for connecting the comparators to the main decision rule of Fellegi and Sunter. Introduces a new assignment algorithm for forcing 1-1 matching that is as fast as the benchmark Burchard-Derigs algorithm and uses 1/500 as much storage; is also much faster and uses less storage than the MCF algorithm of Klingman. Gives general theory extending EM ideas of Meng and Rubin (Biometrika 1994) and shows how it is applied in estimating record linkage parameters. Gives method for estimating record linkage error rates that holds in more situations than the Belin-Rubin method, that does not require a training set as does the Belin-Rubin method, and requires an ad hoc intervention that tends to limit its application to record linkage experts.
  5. Winkler, W. E. (1995), "Matching and Record Linkage," in B. G. Cox et al. (ed.) Business Survey Methods, New York: J. Wiley, 355-384. Survey article that gives much background about record linkage. Describes available software, list acquisition and preparation, and a large number of methods for evaluating the quality of lists and the quality of matching results.
Appendix E "Fast Track" Update Service User Interface
Appendix F Outstanding Questions and Issues
  1. Propagate updates from directory update service to ph/CSO? A: No.
  2. Changes to attributes in the NIH email directory made directly by users will be overwritten by changes made by AOs. A: Not a problem--see #1.
  3. Provide for "unlisted" directory entries? A: Yes. For the "fast track" directory an "unlisted" entry will not be transferred to the LDAP directory server.
  4. Can "weak" UIDs be eliminated or upgraded to strong UIDs?
  5. What individual identifying information can AOs acquire and enter from NIH employees? From NIH contractors?
  6. Can/should we switch NIH UIDs to the ISO/IEC 7812-2 identification card standard?
  7. Is it feasible to use S/MIME to secure email notifications from the FTUS to other organizations and systems?
  8. Would it be feasible to add SOUNDEX codes for surnames contained in common name attributes added by the entry owner?
  9. Add attributes for hair color, eye color, and height?
  10. Add digitized written signature image attribute?
  11. Allow listing of other than sn, givenName, etc. in phone book?
  12. Check with Tom Boyce re: status of tracking contractors in ITAS.
  13. Privacy and practicality issues re: collecting mother’s maiden name.

 

 

 

 

Appendix G Wish List