Architectural Recommendations


Appendix C

Standards for World Wide Web-Based Applications at the NIH
Report and Recommendations

April 16, 1997
Prepared by AMG - WWW Applications Working Group

THE NEED

The NIH needs a consistent infrastructure to efficiently support development of NIH-wide Web applications. Unfortunately, this infrastructure does not currently exist.

Mandate: To address this need, a Web sub-committee of the AMG was established to formulate a set of World Wide Web standards and features that will enable the development of Web applications across the NIH.

Executive Summary

The AMG Web sub-committee, chaired by Fernando Burbano, included members from a wide cross section of NIH (see Attachment D for a list of committee members and support staff). The committee examined several aspects of Web technology -- application development, Web servers, and other crosscutting Web issues. Input from previous groups (the AMG Web Technologies Subcommittee as well as the NIH Home page Committee and its Search Engine Sub-committee) was extremely useful as the AMG Web sub-committee developed its recommendations. These recommendations include both strategic and tactical actions.

Adopt Web technologies as the "client/server architecture of choice" for new NIH applications (Tactical and Strategic)

The NIH should embrace and exploit Web technologies for research, administrative, and management purposes, as has the computer industry at large. We expect that new vendor products will incorporate the look and feel of Web browsers for their user interface. This should provide a high level of consistency across COTS applications -- thus reducing training and support requirements while increasing user productivity. NIH should join this rapidly growing movement to maximize its IT effectiveness. It would be extremely inefficient for the NIH to ignore industry trends.

Act to ensure that on and after June 1998 all NIHnet-connected desktop systems include a fully functional, standards compliant Web browser (Tactical and Strategic)

It is important to provide NIH-wide consistency for desktop clients (i.e., Web browsers). Funding is needed to build this level playing field to avoid creating pockets of "have nots" that would compromise the ubiquity and effectiveness of Web-based applications. This central funding will promote a controlled implementation -- and the future cost of not funding it may far outweigh the initial investment.

Act to provide adequate employee training and support for the use of Web browsers no later than June 1998 (Tactical)

The opportunity to have maximum impact on NIH's use of Web technologies is now, before each organizational entity has become committed to its own implementation standards.

Create a standing AMG Web Committee (Tactical)

Web technologies are evolving at an incredibly rapid rate. Thus, an algorithm was developed to identify appropriate application functionality and Web products such as browsers. A standing AMG Web Committee should apply this algorithm every six months in order to provide NIH with a current set of browsers that adhere to the NIH standards.

I. Overview

The World Wide Web is a technology that NIH should embrace and exploit for research, administrative, and management purposes. The scope and rate of change in Web technologies and associated products is so rapid that it is not practical to establish static standards for the NIH. Rather, it is the consensus of this sub-committee that specific Web standards for NIH should follow industry trends in a well-defined, controlled manner; thus, a process for identifying these trends should be adopted. Toward this goal, the sub-committee recommendations are in three areas:

* Application development issues, including browser/application compatibility, HTML syntax, applet usage, and database access.

* Server issues, including security, robustness, backup/recovery, scalability, and capacity management.

* Crosscutting issues, including choice of search engines, the use of Web robots, and applications with high bandwidth requirements.

As with many aspects of the Web, the rapid pace of technological change makes it difficult to develop rigid guidelines and "best practices," because they are often out of date by the time they are completed. This problem is best addressed through the creation of a standing AMG Web Committee to deal with Web standards and recommendations. This Web committee should update specific Web standards recommendations for NIH Web application development and review the currency and applicability of the server and ancillary issues included in this paper. The sub-committee recommends that this process take place no less frequently than every six months.

In addition to creating a Web committee, we feel that it is vital for the AMG to recommend NIH action to ensure that every NIHnet-connected desktop system have an appropriate (i.e., standards compliant) Web browser installed and functional. Further, it should recommend that training be available so that all NIH staff are able to use their browsers effectively. This sub-committee recommends that these actions have a target completion date of June 1998.

To restate, this sub-committee feels that the NIH should:

* Adopt the use of Web technologies as the "client/server architecture of choice" for new NIH applications

* Act to ensure that on and after June 1998 all NIHnet-connected desktop systems include a fully functional, standards compliant Web browser

* Act to provide adequate employee training and support for the use of Web browsers no later than June 1998

* Create a standing AMG Web Committee

This proposal further describes each of these recommendations and provides the rationale for them.

We feel that acting now will create NIH-wide architectural consistency in the early stages of the new Web computing technology -- when it is most efficient (and possible) to do so. It is rare to have the opportunity to have a major impact throughout NIH's IT community. The specific actions envisioned will, of course, require funding. We estimate that the objectives can be reached at a cost of $550,000 during FY98 with an annual expenditure of about $230,000 thereafter. Attachment B provides the basis for this cost estimate.

In addition to the recommendations above, the sub-committee suggests that the AMG Web Committee create a monitored LISTSERV List and Web page to distribute information and promote collaboration among Web Masters and Web application developers.

II. Web Policy Considerations

It has been suggested that this AMG sub-committee review other policy issues found in the current NIH WWW Guidelines document to determine whether selected items need to be updated or made consistent with the HHS WWW Guidelines and Best Practices. The majority of topics covered in these existing documents are not related to the technical aspects of making applications compatible across the Internet. Instead, they concern broader agency Web issues such as: appropriate use of federal ADP resources; adherence to existing federal laws and policies such as the Privacy Act; editorial and artistic quality control; use of standard design elements; effective design practices (such as the economic use of images and the design of fast-loading graphics); marketing; user feedback; the need for text-only pages or other techniques that provide for the accessibility of on-line information for the disabled; and content issues such as copyrights, disclaimers of endorsement, and the relevancy, accuracy, and timeliness of posted data.

We examined these NIH and DHHS policies and felt comfortable with their recommendations. We do, however, consider it highly desirable to identify the roles of the various groups and committees that currently deal with NIH Web policy and to clearly define an overall centralized structure to provide coordination and leadership for their efforts.

Two aspects of Web policy have special bearing on other architectural issues. For this reason, we feel that it is important to re-affirm our support of these policies for NIH Web documents:

a. Web content must meet DHHS standards on proper HTML, headers and footers, use of logos, etc.

b. Documents should contain descriptive, meaningful titles (commonly displayed on search engine document hit lists).

III. Application Development Issues

The scope and rate of change in Web technologies and associated products is so rapid that it is not practical to establish static standards for application development at NIH. Further, Web technologies are still in their infancy, and industry directions have not yet been determined. With this in mind, we feel that NIH will be best served by following de facto industry standards; thus, a process for identifying these standards should be adopted. The process will be based on:

* An algorithm to identify a "standard" set of browsers based on market share.

* An identified information authority used to determine market share.

* A periodic review by an AMG Web Committee that will apply the algorithm to obtain a list of products (i.e., browsers) that make up the "standard" set.

This approach will permit application developers to utilize relatively new Web technologies while still achieving the efficiency benefits of NIH-wide architectural consistency.

The sub-committee believes that different considerations are appropriate for applications that have different intended audiences. With this in mind, the sub-committee has addressed two types of application: trans-NIH applications and general public applications. Trans-NIH applications are those whose intended audience is anyone involved in the direct business of NIH (research, administration, or management). Thus, the audience is some unnamed group of NIH-related individuals whose desktop software includes Web products that adhere to the standards as defined below. General public applications are those intended for use by the public at large.

"Standard" Browsers

a. The standard for Web browsers is any one of the market share-leading browsers that together constitute 85% of the market at the time the AMG's Web committee convenes.

b. The Browser Statistics Usage Page of the University of Illinois at Urbana-Champaign should be used to identify browser market share. (See Attachment A for a discussion of this information authority.)

c. From the set of browsers identified above, an application developer may assume the following level of currency:

* Trans-NIH applications -- all production releases during the past six months.

* General Public applications -- all production releases during the past twelve months.

d. If the set of browsers and releases identified above does not address all desktop systems that must be supported by NIH policy, then additional browsers/releases will be added to accommodate this requirement.

e. The standard browsers are assumed to be in their off-the-shelf state (i.e., no additional plug-ins or extensions).

An application of this process, and a set of "standard" browsers as of March 20, 1997, is included in Attachment A. We recommend that these be used as the standard browsers until the AMG Web Committee provides a more current recommendation.
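The selection rule in items a and b can be sketched in a few lines of Python. This is only an illustration, not part of the committee's recommendation: the function name and the reading of the rule (take the market-share leaders, largest first, until their combined share reaches the threshold) are assumptions, and the figures are the weekly usage statistics quoted in Attachment A.

```python
def select_standard_browsers(shares, threshold=85.0):
    """Return (browsers, combined_share) for the market-share leaders.

    Browsers are taken from largest share to smallest until their combined
    share reaches the threshold. If the threshold is never reached, every
    browser in the input is returned.
    """
    selected = []
    total = 0.0
    for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
        if total >= threshold:
            break
        selected.append(name)
        total += share
    return selected, total


# Usage shares (percent) for the week ending March 16, 1997, from Attachment A.
shares = {"Netscape": 74.2, "MSIE": 23.0, "Lynx": 0.5, "Mosaic": 0.4}

browsers, coverage = select_standard_browsers(shares)
print(browsers, round(coverage, 1))  # ['Netscape', 'MSIE'] 97.2
```

Rerunning the same function with fresh figures every six months would give the AMG Web Committee the updated "standard" set.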

Software Design

a. Applications should be designed to use HTML, applets, and security features that are functional with the complete set of browsers identified above (for either trans-NIH or general public audiences).

b. Database access should be compatible with the Web browsers identified above.

c. Applications should be tested with all of the standard Web browsers. Note: the AMG Web Committee should consider developing a "test bed" that will allow software developers to try new applications with all of the browsers (and versions) that constitute the standard browser set.

Robot Considerations

Application developers and Web page owners should ensure that their pages provide for robot-friendly navigation. This requires that the responsible individuals:

a. Use robots.txt to exclude sensitive documents (i.e., Intranet documents) (See Attachment C.)

b. Consider robots when designing a CGI interface, and test prior to implementation.

IV. Server Issues

Web-based applications are quickly becoming critical components of the work of NIH. Thus, it is essential for the servers that run these applications to receive the same considerations that are appropriate for other production computing applications. The DHHS Automated Information Systems Security Program (AISSP) defines the security levels for application systems and data files -- and specifies the actions that are required for the computing system (platform and environment) used. These actions address the many areas that must be considered to ensure that the NIH business process has total integrity and reliability. Servers should adhere to the DHHS standards for all aspects of security, robustness, and operational management that are appropriate for the applications and data that depend upon them.

V. Crosscutting Issues

Three areas of Web technology that merit a great deal of consideration at NIH are:

* Robots

* Search Engines

* High Bandwidth Applications

Robots

Robots are a new breed of web-based applications that can have a direct effect on other NIH services and resources (see http://info.webcrawler.com/mak/projects/robots/faq.html). Unlike traditional applications that operate within a well-defined domain, usually a single computer or interconnected group of related computer systems, robots traverse across Web space examining/collecting data from systems far beyond the application developer's control. Spiders, used by search engines to index Web document collections, are a common example of a Web robot. Agents, programs that crawl Web space searching for specific information for a particular user, are another growing class of robot applications.

Robots are extremely useful in organizing and sifting through the vast amount of data available across NIH Web space. However, they can cause some unwanted side effects:

a. Rapid-fire HTTP requests, made by a robot, can overwhelm a small Web server.

b. A robot can also overwhelm Web servers that have a poorly designed CGI interface. This is particularly true of servers that generate documents on-the-fly, where multiple links to the same "page" have different uniquely generated URLs.

c. A robot that downloads a Web server's complete document collection can greatly skew that Web server's access statistics.

d. A robot, collecting documents within NIH Web space for distribution to the general public, could circumvent NIH IP based security.

Releasing Web Robots

a. At least 30 days prior to deploying a robot-based application, the application developer must post an announcement to the LISTSERV list created and managed by the AMG Web Committee. This announcement should include:

i. Brief description of the application.

ii. Deployment date.

iii. Contact information, including name(s), phone number and e-mail address.

iv. Instructions on how to identify and exclude the application (e.g., robots.txt).

b. All robots must adhere to the Robot Exclusion Standard (see Attachment C).

c. Robots should contain a built-in delay of at least 5 seconds between successive HTTP requests to the same server. If possible, the robot should time the requests to match the performance of the server (i.e., longer delays between requests to a slower server).
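The delay rule in item c can be sketched as a small per-server scheduler. This is a minimal illustration under stated assumptions, not a prescribed implementation: the class name, the `slowdown_factor` parameter (which lengthens the delay in proportion to the server's last response time), and the host names are all hypothetical. Only the 5-second minimum comes from the recommendation itself.

```python
MIN_DELAY_SECONDS = 5.0  # minimum delay from recommendation c


class PoliteScheduler:
    """Tracks, per server, the earliest time the next request may be sent."""

    def __init__(self, min_delay=MIN_DELAY_SECONDS, slowdown_factor=2.0):
        self.min_delay = min_delay
        self.slowdown_factor = slowdown_factor  # hypothetical tuning knob
        self._next_allowed = {}  # host -> earliest time for the next request

    def delay_for(self, response_seconds):
        # Never less than the minimum; longer when the server responds slowly.
        return max(self.min_delay, self.slowdown_factor * response_seconds)

    def record(self, host, finished_at, response_seconds):
        # Call after each request completes.
        self._next_allowed[host] = finished_at + self.delay_for(response_seconds)

    def wait_time(self, host, now):
        # Seconds the robot should still wait before contacting this host.
        return max(0.0, self._next_allowed.get(host, now) - now)


scheduler = PoliteScheduler()
scheduler.record("fast.example.org", finished_at=100.0, response_seconds=0.4)
scheduler.record("slow.example.org", finished_at=100.0, response_seconds=6.0)
print(scheduler.wait_time("fast.example.org", now=101.0))  # 4.0
print(scheduler.wait_time("slow.example.org", now=101.0))  # 11.0
```

In a real robot the timestamps would come from a clock such as `time.monotonic()`, and the robot would sleep for `wait_time` before issuing the next request to that host.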

Search Engines

A sub-committee of the NIH Home page Committee recently completed an evaluation of search engines. We endorse this group's approach and current recommendations that can be found at

http://bigblue.od.nih.gov/websearch/report.htm

The AMG Web Committee should periodically review and update these recommendations.

High Bandwidth Applications

Web-based applications that transmit/receive huge quantities of data (e.g., PointCast, real audio, video streaming, MBONE) can also have an adverse impact on the NIH Web community. Since NIH shares a common network, an application that has sustained high bandwidth requirements could impede access to other NIH systems.

Whenever an application that has high bandwidth requirements is being considered, the responsible individual should first identify the expected network resources likely to be impacted, then consult with those responsible for those resources. This will permit arrangements to be made for additional network capacity, or will permit re-examination of the technical approach prior to significant resource investments.

Standards And Features For World Wide Web-Based Applications

Working Group Recommendations

Recommendations, with Initial Step(s) and Proposed Agent for each step

1. Adopt Web technology as the "client/server architecture of choice."

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

2. Every NIHnet-connected desktop system should have a standards compliant browser.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

3. A training program in browser use should be implemented.

* Develop Training Program (Proposed Agent: AMG)

* Provide Resources (Proposed Agent: CIO)

4. Create a standing AMG Web committee to deal with Web standards.

* Draft committee charter for review and approval (Proposed Agent: AMG, CIO)

* Form initial committee membership (Proposed Agent: AMG)

5. Provide funding to create and maintain a Web infrastructure for NIH.

* Review resource estimates (Proposed Agent: AMG)

* Provide funds (Proposed Agent: CIO)

6. NIH should designate as its standard those off-the-shelf browsers that constitute an 85% market share. For Trans-NIH applications, all production releases within the past 6 months. For general public applications, all production releases within the past 12 months.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

7. Use only HTML and applets that are functional with the standard browsers.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

8. Use only database access methods compatible with the standard browsers.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

9. Use only those security measures compatible with the standard browsers.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

10. Applications should be thoroughly tested with the standard browsers.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

11. Ensure a Robot-friendly environment by adopting the Robot Exclusion Standard.

* Draft policy (Proposed Agent: AMG)

* Publish and mandate compliance (Proposed Agent: CIO)

12. The AMG Web committee should review these recommendations on a periodic basis.

* Draft policy (Proposed Agent: AMG, CIO)

Attachment A

Standard Web Browsers

An Application of the Recommended Algorithm

(As of March 20, 1997)

As described in the report above, the process for establishing Web application standards is based on identifying those families of Web browsers that make up 85 percent of browser usage market share at any specific time. This in turn depends on the availability of authoritative statistics for Web browser usage. For this purpose, and until an alternative source of browser statistics can be identified and justified, such statistics will be derived from information obtained from the Browser Statistics Usage Page of the University of Illinois at Urbana-Champaign (UIUC):

http://www.cen.uiuc.edu/bstats/latest.html

This site publishes browser usage statistics for "hits" to the UIUC Engineering Workstations WWW Servers, and is currently (as of March 1997) basing its information on over 1 million "hits" per week from over 83 thousand different browser hosts. Statistics are provided for browser operating system, browser vendor, and browser version. Available time increments are daily, weekly, and monthly, with historical data archives starting from April 1996.

The process by which NIH Web application developers might ensure that they are adhering to the compatibility standards set forth here is as follows:

1) Suppose that developers are preparing to implement an application for production in March 1997 and that the client audience for that application will be trans-NIH. In order to ensure that the application functionality will be synchronized with the anticipated level of NIH Web browser functionality, the developers would monitor the UIUC statistics site described above. From that site they would determine which families of Web browsers make up 85 percent (or more) of general browser usage.

For example, based on statistics from the week ending with March 16, 1997 (83,819 hosts in 102 countries making 1,046,435 accesses), they would find that:

Netscape (all OSs and versions) = 74.2 percent

Microsoft Internet Explorer (MSIE)(all OSs and versions) = 23.0 percent

Lynx = 0.5 percent

Mosaic = 0.4 percent

Therefore, since Netscape and MSIE currently account for 97.2 percent of the current general browser usage, confining compatibility testing to Netscape and MSIE is acceptable in the March 1997 time frame for application testing and deployment.

2) For a trans-NIH client audience, the lag period for browser implementation is stipulated to be six months. Therefore the developers must test their applications with all production version releases (by version number and not by platform) of the browsers from Netscape and MSIE starting with October 1996. This would then indicate that the application should be tested with Netscape versions 3+ and MSIE 3+. If the client audience were the general public, then the lag period would be 12 months, and the developers would be requested to test their application with Netscape version 2+ and all versions of MSIE.
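The lag period in step 2 reduces to simple month arithmetic, sketched below. The function name and the audience labels are illustrative. The sketch assumes that the deployment month counts as the last month of the window, which is the reading that reproduces the October 1996 start given above. The April 1996 start for the general public case follows from that assumption and is not stated in the report.

```python
LAG_MONTHS = {"trans-NIH": 6, "general public": 12}


def window_start(year, month, audience):
    """Return (year, month) of the first month in the release window.

    Assumption: the deployment month is counted as the final month
    of the lag period.
    """
    lag = LAG_MONTHS[audience]
    # Count months from year 0 so the subtraction can cross a year boundary.
    index = year * 12 + (month - 1) - (lag - 1)
    return index // 12, index % 12 + 1


print(window_start(1997, 3, "trans-NIH"))       # (1996, 10)
print(window_start(1997, 3, "general public"))  # (1996, 4)
```

Every production release of a standard browser dated on or after the returned month would then fall within the set to be tested.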

Attachment B

Cost Basis

The $550,000 for FY98 and the subsequent annual expenditure of $230,000 are specifically for maintaining currency in NIH browser support. Costs associated with hardware, operating systems, and other applications are not included in these estimates.

NIH browser costs are derived from the following assumptions:

* There are approximately 16,000 NIHnet-connected desktop computers. We estimate that 40% of the people responsible for these computers will acquire their own browsers, even if NIH provides a "free" one.

Thus, about 10,000 browsers will need to be acquired.

* Netscape browsers cost about $37/copy when purchased singly.

* Large volume, site licenses, or right-to-buy agreements can significantly reduce the cost of browser purchase.

* The Internet Explorer browser is included in the cost of Windows 95 and Windows NT.

Thus, the cost of providing 10,000 additional browsers for NIH use will be approximately $200,000 ($20/copy) plus 15% annual maintenance ($30,000).

* Since there is already an infrastructure for software support and training, we feel that the additional resources needed for browser support services could be acquired for approximately $350,000 during FY98 (the "ramp up" period), then about $200,000 annually for future years. This annual cost might become lower once the current rate of technology upgrades diminishes and new browser versions aren't released so frequently.
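The arithmetic behind these estimates can be checked with the short calculation below. The variable names are illustrative. The split of the FY98 total into browser purchase plus ramp-up support, with maintenance counted only in later years, is inferred from the totals and is not stated explicitly in the report.

```python
desktops = 16_000            # NIHnet-connected desktop computers
self_supplied_percent = 40   # share of users expected to acquire their own browser

# 9,600 browsers by this estimate; the report rounds this to about 10,000.
estimated_need = desktops * (100 - self_supplied_percent) // 100

browsers = 10_000            # rounded figure used in the report
unit_cost = 20               # dollars per copy at volume pricing
purchase = browsers * unit_cost       # 200,000
maintenance = purchase * 15 // 100    # 15% annual maintenance: 30,000

support_fy98 = 350_000       # support and training during the "ramp up" year
support_annual = 200_000     # support and training in later years

fy98_total = purchase + support_fy98         # inferred composition of the FY98 figure
annual_total = maintenance + support_annual  # inferred composition of the annual figure

print(estimated_need, fy98_total, annual_total)  # 9600 550000 230000
```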

Attachment C

Robot Exclusion Standard

This attachment has been adapted from:

http://info.webcrawler.com/mak/projects/robots/robots.html

To exclude robots from a server or specify an access policy for robots, the webmaster must create a robots.txt file accessible via HTTP as http://server.name/robots.txt.

The robots.txt file consists of one or more records separated by one or more blank lines. Each record contains lines of the form:

<field>:<optional space><value>

The field name is case insensitive. The '#' character is used to indicate that the remainder of the line is a comment and can be discarded.

The record starts with one or more User-agent lines, followed by one or more Disallow lines, as detailed below.

User-agent

The value of this field is the name of the robot whose access policy the record describes.

If more than one User-agent field is present, the record describes an identical access policy for more than one robot. At least one User-agent field needs to be present per record. If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records.

Disallow

The value of this field specifies a partial URL that is not to be visited. This can be a full path or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html. An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to be present in a record.
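The prefix rule described above can be illustrated with a few lines of Python. This is a minimal sketch of the matching rule only, not a complete robots.txt processor, and the function name is hypothetical.

```python
def is_disallowed(path, disallow_values):
    """Return True if the path starts with any non-empty Disallow value."""
    # An empty Disallow value excludes nothing, so it is skipped.
    return any(value and path.startswith(value) for value in disallow_values)


print(is_disallowed("/help.html", ["/help"]))         # True
print(is_disallowed("/help/index.html", ["/help"]))   # True
print(is_disallowed("/help/index.html", ["/help/"]))  # True
print(is_disallowed("/help.html", ["/help/"]))        # False
print(is_disallowed("/help.html", [""]))              # False
```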

Example

The following example "/robots.txt" file specifies that no robots should visit any URL starting with "/nih/intranet/" or "/cgi-bin/":

User-agent: * # Applies to all robots
Disallow: /nih/intranet/ # Internal documents
Disallow: /cgi-bin/ # CGI applications
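As a quick check, the example record can be fed to the robots.txt parser in Python's standard library (`urllib.robotparser`). The robot name "ExampleBot" and the document paths are hypothetical; the host name is the placeholder used earlier in this attachment.

```python
from urllib.robotparser import RobotFileParser

# The example record, written as one record with no blank lines inside it.
ROBOTS_TXT = """\
User-agent: * # Applies to all robots
Disallow: /nih/intranet/ # Internal documents
Disallow: /cgi-bin/ # CGI applications
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("ExampleBot", "http://server.name/nih/intranet/memo.html"))  # False
print(parser.can_fetch("ExampleBot", "http://server.name/cgi-bin/search"))          # False
print(parser.can_fetch("ExampleBot", "http://server.name/index.html"))              # True
```

A robot would normally load the file from the server with `set_url()` and `read()` rather than from a string; the string form is used here only to keep the sketch self-contained.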

Attachment D

Working Group Membership:

Dennis Burns (ORS/OAM)

Ron Edwards (NCRR)

Dr. Robert Goldschmidt (OD/OER)

Paul Logan (NHLBI)

Pete Morton (DCRT)

Dennis Rodrigues (OD/OC)

Mark Silverman (NLM)

Susan Teper (NIAAA)

Roy Standing (NLM) Vice-Chair

Fernando Burbano (NLM) Chair


Report on Interoperability at the NIH