Biodiversity Informatics
Norman F. Johnson
Department of Entomology, The Ohio State University, Columbus,
Ohio 43212-1157; email: [email protected]
Annu. Rev. Entomol. 2007.52:421-438. Downloaded from www.annualreviews.org
pogenic sources.
Although these data are nominally accessible to researchers, for all practical purposes they have been unavailable to the general community. As a result, the investments that have been made in acquiring, processing, and storing the specimens and their data often provide little in scientific knowledge. In the publication process, the link to the underlying primary data usually is broken. After publication, data from newly collected material, frequently inspired by that piece of research, often cannot be incorporated into the collective understanding. These problems are not inherent in the data themselves. Rather, they arise from the tools that have traditionally been used to manage and disseminate information, primarily paper-based publications. A new suite of tools that can effectively address these limitations is now available.
Biodiversity informatics has been defined as the application of information technologies to the management, algorithmic exploration, analysis and interpretation of primary data regarding life, particularly at the species level of organization (86). These data document primarily the occurrence of organisms in space and time. The development of this field has been driven largely by the botanical and vertebrate research communities; entomologists, as a whole, have been reluctant participants. This is so possibly be-
TECHNOLOGICAL ENVIRONMENT
The basic tool of information management is the database. Most commercially available products either are relational databases or emulate them. A wide range of books is available on relational theory (30). At its core, a relational database stores information in one or more two-dimensional arrays called tables or relations. Each row of a table should be a uniquely identifiable occurrence of the concept modeled by the table; the columns represent different attributes of that occurrence. The value of one or more attributes that uniquely define each row is that table's primary key. The values recorded in each cell of the table should be atomic, that is, broken down to represent indivisible values. Specific rows may be located either by searching through the entire table, figuratively from top to bottom, or by using an index, a map of the location of values within a table. An index occupies space in the computer but can significantly reduce search times in large tables.
Different types of information are represented in different tables. The relationships among these data are expressed by the use of foreign keys: In addition to its own primary key, a table may have a column that references the value of the primary key of another table.
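As a minimal sketch of these ideas, the following Python script uses the standard-library sqlite3 module to build two small tables linked by a foreign key and to add an index. The table and column names (taxon, specimen, taxon_id, and so on) and the sample rows are assumptions made for illustration; they are not taken from any particular collection database.

```python
import sqlite3

# Hypothetical schema for illustration only; all names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.executescript("""
CREATE TABLE taxon (
    taxon_id INTEGER PRIMARY KEY,   -- primary key: uniquely identifies each row
    genus    TEXT NOT NULL,         -- atomic values: genus and species epithet
    species  TEXT NOT NULL          -- are stored in separate columns
);
CREATE TABLE specimen (
    specimen_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER NOT NULL REFERENCES taxon(taxon_id),  -- foreign key
    state       TEXT,
    year        INTEGER
);
-- An index costs storage but avoids scanning the whole table on lookups by state.
CREATE INDEX idx_specimen_state ON specimen(state);
""")

conn.execute("INSERT INTO taxon VALUES (1, 'Musca', 'domestica')")
conn.executemany(
    "INSERT INTO specimen (specimen_id, taxon_id, state, year) VALUES (?, ?, ?, ?)",
    [(100, 1, "Ohio", 1998), (101, 1, "Texas", 2001)],
)


def specimens_in_state(state):
    """Join specimen to taxon through the foreign key and filter by state."""
    rows = conn.execute(
        """SELECT s.specimen_id, t.genus || ' ' || t.species
           FROM specimen AS s JOIN taxon AS t ON s.taxon_id = t.taxon_id
           WHERE s.state = ? ORDER BY s.specimen_id""",
        (state,),
    )
    return rows.fetchall()


def insert_orphan():
    """Try to insert a specimen that references a nonexistent taxon."""
    try:
        conn.execute("INSERT INTO specimen VALUES (102, 999, 'Ohio', 2005)")
        return True
    except sqlite3.IntegrityError:
        return False  # the foreign-key constraint rejected the row
```

Calling specimens_in_state("Ohio") follows the foreign key from each specimen row to its taxon row, and insert_orphan() shows the database refusing a row whose foreign key points at no existing primary key.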
Standard: an agreed-upon format or structure
Application: software that uses the capability of a computer to perform a task
XML: Extensible Markup Language
In a recursive relationship, the table uses its own primary key as a foreign key. This construct is useful for modeling a hierarchy, such as a taxonomic classification.
Structured query language (SQL; pronounced "s-q-l" or "sequel") is an industry standard used to add, retrieve, or update information within a database. Many commercial products offer graphical user interfaces for these same purposes or may have procedural language extensions that allow for more elaborate programming. In this context, the development of a standard is a for- [...] tually adopted by means of a vote of the standards body. Once adopted, adherence to the standard by hardware or software developers provides others with a stable base toward which to work and assures users of minimal performance.
The Internet is now one of the most important mechanisms for the dissemination of information. Most database products provide a mechanism, usually a form, for formulating a query to the database and displaying the response. The primary format for this is hypertext markup language (HTML) (71). This is a set of tags, most of which serve to indicate to a piece of software, typically a Web browser, the manner in which the information is to be displayed. For example, text that is contained between the tags <em> and </em> will be displayed in the browser in a predefined font and style to indicate that the text is emphasized, often by an italic font. The HTML standard provides a number of tags that, in the hands of talented and imaginative designers, can generate an amazing range of content. HTML has an important limitation: The text and images displayed have no inherent meaning. That is, the string <em>Musca domestica</em> when seen by an entomologist is readily recognized, both by the words and the italic text, as a scientific name of a common species. To a software application, though, this interpretation is not obvious: The text could just as easily have been italicized to indicate emotional emphasis. Thus, an application that seeks to find and extract scientific names from an HTML document must rely on formatting or contextual hints to identify items of interest. Such "page scraping" requires intelligence to be built into the application. Because each page on the Web is potentially unique in its formatting and can be changed at will by the data provider, such applications are highly unstable.
Extensible Markup Language (XML) (38) provides a mechanism by which to communicate the semantic content of items. The string <ScientificName>Musca domestica</ScientificName> indicates that the text is to be interpreted as a scientific name. There is no constraint on the number or meaning of XML tags used; these tags are defined, usually externally, by a specialized document, an XML schema. A typical Web browser does not know what to do with such a tag as <ScientificName>, and most browsers ignore such things when formatting the text for display. However, an application that understands the underlying XML schema can properly identify entities within the document and the relationships between them. XML is not really intended to be read by humans; rather, it is a medium of exchange of data between software applications. An XML style sheet can be used to process the information contained and present it in a format suitable for human consumption.
DATA CAPTURE AND STORAGE
Specimen-Occurrence Data
The term "specimen occurrence" encompasses both an individual that is captured (or subsampled) as well as observations (17): The same basic data apply to each. Observations have the disadvantage that the identity of the taxon cannot be independently verified or updated as taxonomies are revised. Specimens, on the other hand, are essentially a snapshot of an individual at one point in time in its ontogeny. For this review, "specimen occurrence" refers
to both physical specimens as well as observations.
Early formal attempts to explicitly enumerate the types of data that document specimen occurrences and the relationships between these data were developed by the Association of Systematics Collections (ASC) (3) and Colwell (19). The ASC information model is a fairly extensive general model, although originally it did not extend into the domains of literature. Colwell's biota model is a specific database implementation. Most recently, the data elements involved, and to some degree their interrelationships, have been enumerated in the form of XML [...]
[...] levels of quality assurance (62). Although the fundamental data elements are identical, differences in organization lead to significantly different protocols for retrospective data capture (i.e., of information from material already incorporated into the collection) and prospective data capture (i.e., information for new collections) (98, 99).
For new material, the specimens are typically derived from a small number of collecting events, and they share all data elements except determination, sex, or other individual characteristics that are recorded. Therefore, the data entry personnel may enter the bulk of the data only once, updating unique identi-
Protocol: the structure of messages used to communicate between computers
ECN: Entomology Collections Network
of the process of taxonomic research (98, 99), thus affording greater authority for identifications as well as taking advantage of the taxonomist's ability to more accurately decipher laconic or incomplete data, or bad handwriting. The advantages of such an approach are clear, but this effort can become a prescription for inaction. A preferable approach is to indicate the fitness for use of the data, e.g., by indicating the names of the determiners and the date at which the identification was made. The risk is that the data will be misapplied; the advantage is that the ex- [...] and corrected by the specialists, resulting in a more efficient and productive use of time and expertise.
Georeferencing
One of the critical tasks in data capture is the process of georeferencing, that is, the conversion of descriptions of locations at which specimens were collected into a common coordinate system. This typically is a value-added component in the process of retrospective data capture. Global positioning system receivers largely eliminate this step for new collections. However, georeferencing is probably the most significant bottleneck in the digitization process. Effective management of this step requires clear protocols, and a number of tools are available and under development to facilitate the process (66).
The Mammal Networked Information System (MaNIS) provides an excellent review of the issues involved in georeferencing a place name (66). The most thorough representation of a locality requires a pair of coordinates (latitude and longitude), elevation or depth (with explicit or implied units of measure), geodetic datum, and error estimate (along with units). Online digital gazetteers (see below) provide a means of finding the geographic coordinates for place names (usually populated places), as well as for other prominent landmarks. Estimates of the error arising from both low accuracy and precision are critical for users of the data to assess their fitness for use (16, 18).
Tools available to facilitate the process of georeferencing include BioGeoMancer (9) and GEOLocate (46). Geographic name servers are freely available for the United States (45), Canada (14), Mexico (58), Argentina (43), Australia (41), New Zealand (72), Germany (44), Austria (5), Italy (59), United Kingdom (70), and South Africa (88). Several servers have global coverage (2, 47, [...]
No amount of training or expertise can eliminate 100% of such errors. Therefore, it is important that data be reviewed on a regular basis to find and correct mistakes. General principles of data cleaning are discussed by Chapman (15); Guralnick & Neufeld (53) describe a protocol for monitoring the quality of georeferencing.
Authority Files
Authority files, of which the drop-down lists are one example, are useful mechanisms to decrease error rates. Despite their name, authority files are not necessarily authoritative, but provide a standard from which data can be selected or against which records may be compared. Many places are well known as classical collecting localities; a prominent example is Nova Teutonia, Brazil, from which thousands of specimens were collected by Fritz Plaumann (unpublished observations). Unfortunately, there is not yet a georeferenced listing of such localities for use by the general entomological community. The botanical community has developed a series of standards for some elements of occurrence data. Examples include geographic names (54), the authors of plant names (12), the structure of plant names (11), and abbreviations for herbaria (55). With the exception of the unofficial but widely used codens (abbreviations for the names of
DELTA: DEscriptive Language for Taxonomy
SDD: Structure of Descriptive Data
DiGIR: Distributed Generic Information Retrieval
not necessarily a formal scientific name, represents a particular authority's assertion about the circumscription of a taxon. Thus, it includes not only the name itself, but also an "according to" clause that can differentiate historical or regional concepts of a taxon. TCS does not explicitly include the taxon itself in the model: Taxonomic concepts reference only the taxon implicitly. One concern with this schema is that there is no clear criterion to indicate when one author's concept of a taxon is sufficiently different from another's to justify the recognition of a new concept. Uncontrolled concept inflation would make this standard impractical. At the time of this [...]
[...] of string such as the use of any number. But a piece of software cannot differentiate between the meaning of the use of the term "Head" in the phrase "Head longer than wide. . ." and that in the phrase "South Carolina: Hilton Head. . .". Attempts to develop XML standards that can be used to indicate the semantic elements within taxonomic documents are under way, but none have yet been formally adopted. Weitzman & Lyal (104) are developing taXMLit, using as a testbed the Biologia Centrali-Americana, which is an important baseline of information on the flora and fauna of the Americas. TaxonX (77) is under development with the objective of delimiting
UDDI: Universal Description, Discovery and Integration
inventory, and status. For the purposes of querying a biodiversity data provider, the search option asks the provider for the data fields specified in the response section for all the records of specimens that match the filter. The filter is composed of access points, e.g., the fields of the Darwin Core, logical operators (and, or, not), and comparison operators (=, <, ≤, >, ≥, ≠, like). The code snippet below requests records for specimens from Ohio of the family Acrididae (whitespace is unimportant in XML).
<filter>
  <and>
    <equals><StateProvince>Ohio</StateProvince></equals>
    <equals><Family>Acrididae</Family></equals>
  </and>
</filter>
The DiGIR protocol is the primary query interface for the GBIF data provider network. An alternative query format, the Biological Collection Access Service (BioCASe) (10), was developed in conjunction with the ABCD schema. It is widely used among data providers in the European Union. TAPIR (TDWG Access Protocol for Information Retrieval) (93) is currently being developed to unite these two basic methods of querying collection databases.
Simple Object Access Protocol (SOAP) (13) is a protocol for messaging between computers. It is basically an XML document comprising an envelope, encoding for datatypes, and encoding for remote procedure calls and responses. The generality and wide use of SOAP messaging in the Internet world suggest that it will assume a prominent place in the future of biodiversity informatics.
The existence of a Web service on the Internet is of little value if potential users do not know that it is available. Universal Description, Discovery and Integration (UDDI) provides a mechanism for data providers to register the services that they provide, to indicate their location (URL) on the Internet, and to specify how to access those services. The GBIF provides a UDDI registry for its network of biodiversity data providers (49). As of this writing, over 9 million records are available from 167 providers around the world.
Some of the database applications available for handling collection data include the software needed to provide data using DiGIR protocols and Darwin Core fields. Alternatively, data provider software is available from GBIF (42). Probably few database applications directly use either the Darwin Core or ABCD structure to store data. Rather, they either map their internal structure to one of those XML schemas or save a snapshot of their data in a single table with the necessary fields. The internal structure of the database, therefore, is hidden from outside view. The data provider has full control over the access to data and can limit, modify, or even deny access to information on sensitive species. The data provider is also free to adopt or develop database software that best suits his/her needs and programming ability.
To combine results from a number of different data providers, a user first sends a DiGIR query to each data provider. Portal software that can use DiGIR and Darwin Core to query data providers and to aggregate the results is available (42). The replies, in the form of XML messages, can then be aggregated, sorted, and processed. Data for a specimen may be held in more than one database. Thus, some process is required to recognize that two records from different data sources actually refer to the same physical object. Within one collection's database, the primary key uniquely identifies a specimen.
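As a rough sketch of how an application might interpret the filter shown earlier in this section and pool replies from several providers, the following Python code parses the filter with the standard-library xml.etree.ElementTree module and applies it to in-memory records. This is not the DiGIR protocol itself: real requests carry namespaces, headers, and a response structure that are omitted here, only the equals operator is handled, and the provider names, record dictionaries, and function names are hypothetical.

```python
import xml.etree.ElementTree as ET

FILTER_XML = """
<filter>
  <and>
    <equals><StateProvince>Ohio</StateProvince></equals>
    <equals><Family>Acrididae</Family></equals>
  </and>
</filter>
"""


def matches(node, record):
    """Evaluate one filter node against a record (a dict of field -> value)."""
    if node.tag in ("filter", "and"):
        return all(matches(child, record) for child in node)
    if node.tag == "or":
        return any(matches(child, record) for child in node)
    if node.tag == "not":
        return not matches(node[0], record)
    if node.tag == "equals":
        field = node[0]  # e.g. <StateProvince>Ohio</StateProvince>
        return record.get(field.tag) == (field.text or "").strip()
    raise ValueError(f"unsupported operator: {node.tag}")


def query_providers(filter_xml, providers):
    """Apply the same filter to every provider and pool the matching records."""
    root = ET.fromstring(filter_xml)
    pooled = []
    for name, records in providers.items():
        for record in records:
            if matches(root, record):
                pooled.append({"provider": name, **record})
    return sorted(pooled, key=lambda r: (r["provider"], r["CatalogNumber"]))


# Hypothetical providers and records, invented for this example.
PROVIDERS = {
    "museum_a": [
        {"CatalogNumber": "A-0001", "StateProvince": "Ohio", "Family": "Acrididae"},
        {"CatalogNumber": "A-0002", "StateProvince": "Ohio", "Family": "Muscidae"},
    ],
    "museum_b": [
        {"CatalogNumber": "B-0007", "StateProvince": "Ohio", "Family": "Acrididae"},
        {"CatalogNumber": "B-0008", "StateProvince": "Texas", "Family": "Acrididae"},
    ],
}

results = query_providers(FILTER_XML, PROVIDERS)
```

Each pooled record keeps the name of the provider it came from. A catalog number alone is unique only within its own collection, so the pairing is what lets the combined list be sorted and checked for records that describe the same physical object.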
In entomology, this key is usually a bar code (99); the standard recommended by the ECN is an alphabetic string, which identifies the organization producing and storing the data (such as those in Reference 37), and a sequential number. This information should be printed both as a bar code and as human-readable text. However, when data are aggregated from different collections, and potentially from different disciplines, this combination may not ensure a globally unique identifier (GUID) for the specimen. The problem of defining such GUIDs is now being addressed through TDWG and GBIF (96).
DATA ANALYSIS
[...] nity. Most collecting techniques are capable of capturing such huge numbers of individuals that identifying all of them is impractical. Many species are unidentifiable to both the beginning student and the taxonomic expert. This list could continue, but the practical import is that there are very few localities for which the entire entomofauna is confidently known. Increased collecting effort almost invariably results in the discovery of additional species, as evidenced in species-accumulation curves, which, theoretically, should approach an asymptote as the count approaches the true
GARP: Genetic Algorithm for Ruleset Production
represent the locations where individuals were either collected or observed. A large number of mechanisms, from simple image manipulation programs to full-blown geographic information systems, are available for this purpose (68). The typical weaknesses of this methodology are that the map projection is unspecified; the actual collecting data are not specified; the area covered by the dot on the map may be very large; and the absence of dots may indicate either lack of collecting or true absence of the species. Sometimes authors attempt to supplement the dots-on-maps approach by drawing a bounding line to indicate the hypothesized limits to distribution. The problem here is that the decision of where to draw that line is based at best on unspoken assumptions or inside knowledge on the part of the author. At worst, these lines are entirely arbitrary.
Just like the actual number of species in a community, a geographic distribution is dynamic. Distributions change over time, perhaps in a cyclical fashion, perhaps in response to more general, long-term factors such as human population growth or climate change. A distribution, therefore, may not be expressed best as an area delimited by solid black lines, but as a probability surface, and those probabilities may fluctuate through time. The data that go into producing a model of the distribution should be accessible and the methods used to produce the model should be clearly stated.
The methods for modeling geographic distribution typically rely on the correlation between environmental variables and the observed presence or absence of the species at particular localities. Because of insects' vagility, lack of apparency, and difficulty in identification, it is rare that one can definitively state that a taxon does not occur in a locality. Presence data are more clear-cut; vagrancy is still an issue, but ideally this can be detected by measures of relative abundance. A commonly adopted approach is to model the niche of the species of interest (87) on the basis of environmental variables. The probability of occurrence, sometimes interpreted as its geographic distribution, then is a function of the values of those variables. Some models, such as BIOCLIM (74), define a set of variables for predicting the distribution. A typical set might include a combination of temperature and precipitation values, e.g., average annual temperature, maximum monthly precipitation, precipitation in warmest quarter of the year, etc. The value for each of the variables is determined for each of the localities from which specimens have been collected or observed. Suitable habitat is then defined as those geographic areas in which the value for each of the variables falls between the maximum and minimum values observed for the specimens. A distinction between marginal and more suitable habitats may be made by defining the latter as being bounded by, for example, the ninety-fifth and fifth percentile values for each variable, and thus taking into account the relative abundance of specimens and disregarding individual extreme values.
An algorithm such as BIOCLIM is relatively simple to define and calculate when the environmental data are available. It has some important drawbacks, however. The number of variables possible is unlimited, and there is no a priori manner to determine which are best at predicting the distribution of any given species. Absence data are typically not included, even if available. The model identifies only areas in which, according to the variables used, the habitat is suitable for the species to occur. It fails to factor in whether those areas are accessible to the species in question (87).
Another class of models uses methods of artificial intelligence to predict distributions. Basically these methods divide the dataset of observations of presence or absence into two parts: a training set and a test set. A model is constructed from the data in the training set; its accuracy in predicting distribution is tested against the test set. The model is then modified, retested, modified, retested, etc., until an acceptable level of accuracy is reached.
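The envelope rule described above for BIOCLIM-style models can be sketched in a few lines of Python. This illustrates only the minimum/maximum and percentile idea and is not the BIOCLIM software: the variable names, the sample values, and the simple nearest-rank percentile are assumptions made for the example.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def fit_envelope(localities, low=5, high=95):
    """Build per-variable limits from localities where the species was recorded.

    Returns {variable: (min, low percentile, high percentile, max)}.
    """
    envelope = {}
    for var in localities[0]:
        vals = [loc[var] for loc in localities]
        envelope[var] = (min(vals), percentile(vals, low),
                         percentile(vals, high), max(vals))
    return envelope


def classify(site, envelope):
    """Label a site 'core', 'marginal', or 'unsuitable' under the envelope."""
    in_core = True
    for var, (vmin, plow, phigh, vmax) in envelope.items():
        value = site[var]
        if not vmin <= value <= vmax:
            return "unsuitable"   # outside the observed range for some variable
        if not plow <= value <= phigh:
            in_core = False       # inside the range but in the outer tails
    return "core" if in_core else "marginal"


# Hypothetical collecting localities with two environmental variables:
# a temperature value and a precipitation value. All numbers are invented.
records = [{"temp": t, "precip": p}
           for t, p in zip(range(1, 41), range(100, 900, 20))]
env = fit_envelope(records)
```

A site is rejected as soon as one variable falls outside the observed range, which mirrors the requirement that every variable lie between its observed minimum and maximum. Nothing in the sketch addresses the drawbacks noted above: it uses no absence data and says nothing about whether a suitable area is accessible to the species.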
One widely used method, Genetic Algorithm for Ruleset Production (GARP) (31), uses an algorithm that mimics the process of natural selection to modify and, ultimately, improve models. Such a methodology allows the researcher to use any set of environmental variables as predictors of distribution. For example, the network of highways within the United States might not, at first, appear to have much to do with predicting the distribution of a species, but the distance from a road may in fact be a good predictor of the presence of a weed species. Ultimately, though, even these models depend on the extent of the environmental layers against which the presence-absence data are compared. A desktop version of GARP modeling software is available on-line (31). A large number of modeling techniques are now available; Segurado & Araujo (85) provide an evaluation of several of these techniques, and Stockman et al. (91) specifically critique the accuracy of the GARP modeling approach.
One advantage of explicit models of distribution is that they allow the researcher to ask "what if" questions. How, for example, might the distribution of an endangered species be affected by an increase in average annual temperature (67, 79, and references therein)? How far might an exotic species spread if it invades the country and becomes established (80)? A caveat is necessary, though: The scale at which the primary data were recorded and georeferenced, i.e., their accuracy and precision, must be compatible with the scale at which the results are to be used. Data for which there is an error of ±1 km are inappropriate for mapping the distribution of individuals of an endangered species. This is another example of the importance of recording estimated georeferencing errors so that users can assess the fitness for use of the data (16, 18).
Discussions of endemism typically begin with the a priori definition of an area of interest; for example, how many species are endemic to the Everglades National Park? However, endemism is essentially the reciprocal of the area of a species' distribution: Taxa with limited distributions are highly endemic, and taxa with wide distributions are not (105). With explicit models of distribution, therefore, the relative endemism of a taxon can be quantified and the total level of endemism and extent of areas of endemism can be identified on biological rather than political terms.
A number of organizations have, with great fanfare, drawn the spotlight to so-called biodiversity hotspots, areas with reportedly high values of species richness or endemism. The underlying empirical basis and methodologies for such delimitations are rarely critically examined. The availability and application of objective, testable models of distribution would strengthen the scientific credibility of such demarcations and enable researchers to test the proposition that subsets of taxa (such as flowering plants, vertebrates, or butterflies) are capable of serving as proxies for overall richness and endemism.
LifeMapper (63) was a recent, now suspended, effort to build upon federated biodiversity databases to build a library of distribution models. This project included a distributed computing component in which the calculation of GARP models was spread through the world community as a screen saver computation. The lack of a sufficient body of electronically accessible data that conformed to community data standards has put the project on hold. However, the library of distributions, although currently off-line, is a community resource that could support a wide range of fruitful research.
Identifications
The classical tools for the identification of organisms are images and dichotomous keys. Optimally, these two resources are used together to enable users to tap into and apply the collective expertise of the community of taxonomic specialists. The traditional reliance on hard-copy publications, however, placed limits both on the numbers of images that could be used and on the structure of the keys. Interactive, or multiple-entry, keys provide a ready mechanism to sidestep such limitations (28).
Such tools have a relatively long history, with early ones often marketed as "expert systems." Over the past 40 years many different applications have been developed (24), and new ones continue to be released.
The DELTA project developed Intkey (22, 26, 27), one of the most widely used interactive key applications. Lucid (64) is a recent commercial application that is becoming widely used in the entomological community. Recent releases allow the key to be run from a Web browser in addition to its stand-alone version. Lucid 3 and the Electronic Field Guide project (35) can use data in SDD format, thus freeing the application from a unique data format and potentially permitting data to be freely transported to and from other SDD-compliant applications. Dallwitz (23) provides a comparison of some interactive key programs.
Images form an important part of many biodiversity studies, not only in identification keys, but also in descriptions of taxa and habitat, morphometric studies, and even documentation of data-capture from specimen labels, especially if the labels are written in non-Latin alphabets or in nonalphabetic languages. Dissemination of high-resolution images over the Internet is usually constrained by bandwidth considerations. At the very least, a user should be warned of the size of the image that is being accessed. MorphBank (69) aims to provide a secure, replicated archive of imagery used in biological studies. Metadata standards for such imagery, above and beyond that associated with the technical details of the image itself, are being developed through the TDWG/GBIF process.
THE ROAD AHEAD
A large number of initiatives are underway that will probably be felt within the entomological community in coming years, and only a few can be mentioned here. The Unified Biosciences Information Framework (102) attempts to define a common foundation for many of the TDWG/GBIF standards that have been adopted or are under development. The Universal Biological Indexer and Organizer Project (103) is working to develop tools that can contend with the practical difficulties of taxonomy and effectively use the names of organisms to find and organize the published information on every aspect of their biology. The International Commission on Zoological Nomenclature has set as one of its goals the development of an on-line registry for all animal names (82). Such a registry, optimally, would provide a mechanism for researchers to locate all newly proposed names, as well as documentation of the descriptions and specimens used to validate those names. Many of these projects, and more, are accessible primarily through the World Wide Web: through project sites, wikis (forums for interactive discussion), and blogs. Other forums for new developments include the electronic journal Biodiversity Informatics (8), the GBIF home page (48), and meetings and Web sites of groups interested in these issues, such as TDWG (95) and ECN.
Entomology collections around the world probably house more specimens and more data than any other subgroup of natural history museums. This information is important not only for the discipline of entomology itself, but for wider concerns of conservation, ecology, and evolution. Yet these data remain hidden within the collections. The data management tools of biodiversity informatics offer a powerful means to address such issues and to enable more effective stewardship of the significant investments in time and money represented by a collection. The development and broad implementation of community standards will make it possible to avoid the problem of becoming trapped in an obsolescent software application. As funding agencies emphasize digitization of collection holdings and providing reasonable open Internet access to the data (71a), both curators and their administrators can embrace these technologies as a critical enhancement of their mission and not as a secondary waste of time.
Similarly, systematic entomology in particular has much to gain by embracing these tools. Godfray (51) has offered some interesting suggestions for increasing the currency and availability of taxonomic information. Tools of biodiversity informatics make it possible for systematists to document, unconstrained by publishing costs, the material foundation upon which their work is based. This is a positive move away from argument by authority and toward a more accountable, data-driven science.
Finally, in the face of a growing human population, habitat loss, and global climatic change, the task of biodiversity discovery and conservation is an urgent imperative for the current generation. In the nearly 250 years since the publication of the tenth edition of Systema Naturae, we have collectively recognized no more than 30% of the range of organisms with which we share the planet (91a). The needed technologies are available and inexpensive and have the potential to enhance dramatically both scientific productivity and relevance to society. All that remains is for this generation of researchers and curators to grasp the opportunity and collectively deal with the inevitable bottlenecks and roadblocks that will appear.
SUMMARY POINTS
1. The rich holdings of the world's entomological collections are an irreplaceable resource of data necessary for understanding insect diversity and biology.
2. Tools and standards of biodiversity informatics can provide the means for effective dissemination of biodiversity data.
3. Entomologists, as a broad generalization, have not been extensively involved in the development and use of these standards and tools. Thus, the experience and requirements of the entomological community, both data providers and users, may be marginalized.
4. The integration of biodiversity informatics into the practice of systematics and collection curation promises to dramatically accelerate and improve the process of species discovery and description in the near future.
ACKNOWLEDGMENTS
Thanks to L. Musetti and D. Agosti for fruitful discussions. This material is based upon work
supported in part by the National Science Foundation under grant No. DEB-3044034.
LITERATURE CITED
1. ABCD schema 2.06. http://www.bgbm.org/TDWG/CODATA/Schema/
2. Alexandria Digital Library (ADL) geospatial network. http://clients.alexandria.ucsb.edu/webclient/index.jsp
3. Association of Systematics Collections. 1993. Committee on Computerization and Networking. An information model for biological collections. http://www.nscalliance.org/bioinformatics/asc%20model/Ascmodrpt.pdf
4. Australian faunal directory. http://www.deh.gov.au/biodiversity/abrs/online-resources/fauna/afd/index.html
5. Austrian map online. http://www.austrianmap.at
6. Bellinger PF, Christiansen KA, Janssens F. 2006. Checklist of the Collembola of the world. http://www.collembola.org/
7. Berendsohn W, ed. 2005. Standards, information models, and data dictionaries for biological collections. http://www.bgbm.org/TDWG/acc/Referenc.htm
8. Biodiversity Informatics. http://jbi.nhm.ku.edu/index.php/jbi
9. BioGeomancer. http://www.biogeomancer.org/
10. Biological collection access services. http://www.biocase.org/
11. Bisby F. 1995. Plant names in botanical databases. Plant taxonomic database standards No. 3. Pittsburgh: Hunt Institute Botanical Documentation. 30 pp. http://www.tdwg.org/plants.html
12. Brummitt RK, Powell CE. 1992. Authors of Plant Names. Kew, UK: Royal Botanic Gardens. 732 pp.
13. Box D, Ehnebuske D, Kakivaya G, Layman A, Mendelsohn N, et al. 2000. Simple object
15. Chapman AD. 2005. Principles and methods of data cleaning: primary species and species-occurrence data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. http://www.gbif.org/prog/digit/data_quality
16. Chapman AD. 2005. Principles of data quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. http://www.gbif.org/prog/digit/data_quality
    An excellent primer on the important issues surrounding quality control and assurance for data providers.
17. Chapman AD. 2005. Uses of primary species-occurrence data, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. http://www.gbif.org/prog/digit/data_quality
    An extensive and thorough illustration of the scientific and practical importance of data from museum specimens and observations.
18. Chrisman NR. 1983. The role of quality information in the long-term functioning of a GIS. Proc. AUTOCARTO6 2:303-21
19. Colwell RK. 1996. Biota: the biodiversity database manager. Sunderland, MA: Sinauer. 574 pp. http://viceroy.eeb.uconn.edu/Biota
20. Colwell RK. 2005. EstimateS 7.5 User's guide. http://viceroy.eeb.uconn.edu/estimates
21. Colwell RK, Coddington JA. 1994. Estimating terrestrial biodiversity through extrapolation. Philos. Trans. R. Soc. London B 345:101-18
22. Dallwitz MJ. 1980. A general system for coding taxonomic descriptions. Taxon 29:41-46
23. Dallwitz MJ. 2005. A comparison of interactive identification programs. http://www.delta-intkey.com/
24. Dallwitz MJ. 2006. Programs for interactive identification and information retrieval. http://delta-intkey.com/www/idprogs.htm
25. Dallwitz MJ, Paine TA. 2005. Definition of the DELTA format. http://www.delta-intkey.com/www/standard.pdf
26. Dallwitz MJ, Paine TA, Zurcher EJ. 1993. User's guide to the DELTA system: a general system for processing taxonomic descriptions, fourth edition. http://www.delta-intkey.com/
27. Dallwitz MJ, Paine TA, Zurcher EJ. 1995. User's guide to Intkey: a program for interactive identification and information retrieval. http://www.delta-intkey.com/
28. Dallwitz MJ, Paine TA, Zurcher EJ. 2000. Principles of interactive keys. http://www.delta-intkey.com/
29. Darwin Core. http://digir.net/schema/conceptual/darwin/2003/1.0/darwin2.xsd
30. Date CJ. 2004. An Introduction to Database Systems. Boston: Pearson/Addison Wesley. 983 pp. 8th ed.
31. DesktopGarp. http://www.lifemapper.org/desktopgarp/
solicitation. http://www.nsf.gov/pubs/2006/nsf06569/nsf06569.htm
72. New Zealand geographic placenames database. http://www.linz.govt.nz/core/placenames/searchplacenames/
73. Nichols JD, Boulinier T, Hines JE, Pollock KH, Sauer JR. 1998. Inference methods for spatial variation in species richness and community composition when not all species are detected. Conserv. Biol. 12:1390-98
74. Nix HA. 1986. A biogeographic analysis of Australian elapid snakes. In Atlas of Australian Elapid Snakes, ed. R Longmore, Australian Flora and Fauna Series 7:4-15. Canberra: Aust. Govt. Pub. Serv. 115 pp.
75. Nomina insecta nearctica. http://www.nearctica.com/nomina/main.htm
76. Noyes JS. 2005. Universal Chalcidoidea database. http://internt.nhm.ac.uk/jdsml/perth/chalcidoids/
77. NSF taxonomic literature project: treatment markup. http://research.amnh.org/informatics/taxlit/schemas
78. Penny ND. 1997. World checklist of extant Mecoptera species. http://www.calacademy.org/research/entomology/Entomology_Resources/mecoptera/index.htm
79. Peterson AT, Ortega-Huerta MA, Bartley J, Sanchez-Cordero V, Soberon J, et al. 2002. Future projections for Mexican faunas under global climate change scenarios. Nature 416:626-29
80. Peterson AT, Scachetti-Pereira R, Hargrove WW. 2004. Potential geographic distribution of Anoplophora glabripennis (Coleoptera: Cerambycidae) in North America. Am. Midl. Nat. 151:170-78
81. Phylogeny Programs. http://evolution.genetics.washington.edu/phylip/software.html
82. Polaszek AD, Agosti D, Alonso-Zarazaga M, Beccaloni G, Bjørn PP, et al. 2005. A universal register for animal names. Nature 437:477
83. Ross ES. 1999. World list of extant and fossil Embiidina (=Embioptera). http://www.calacademy.org/research/entomology/Entomology_Resources/embiilist/embiilist.html
84. Schorr M, Lindeboom M, Paulson D. 2005. World Odonata list. http://www.ups.edu/x6140.xml
85. Segurado P, Araujo MB. 2004. An evaluation of methods for modelling species distributions. J. Biogeogr. 31:1555-68
86. Soberon J, Peterson AT. 2004. Biodiversity informatics: managing and applying primary biodiversity data. Philos. Trans. R. Soc. London B 359:689-98
HomePage
95. Taxonomic database working group. http://www.tdwg.org/
    Forum for discussions and development of standards for biodiversity data.
96. TDWG: globally unique identifiers. http://wiki.gbif.org/guidwiki/wikka.php?wakka=HomePage
97. The Zoraptera database: catalog of the order Zoraptera. http://www.famu.org/zoraptera/catalog.html
98. Thompson FC, ed. 1990. Automatic Data Processing for Systematic Entomology: Promises and Problems. Washington, DC: Entomological Collections Network. 48 pp.
99. Thompson FC. 1994. Bar codes for specimen data management. Insect Collect. News 9:24
100. Thompson FC, ed. 2005. Biosystematic database of world Diptera, version 7.5. http://www.sel.barc.usda.gov/Diptera/biosys.htm
101. Trichoptera world checklist. http://entweb.clemson.edu/database/trichopt/index.html
102. Unified biosciences information framework. http://wiki.cs.umb.edu/twiki/bin/view/UBIF/WebHome
103. Universal biological indexer and organizer. http://www.ubio.org/
104. Weitzman AL, Lyal CHC. 2004. An XML schema for taxonomic literature: taXMLit. http://www.sil.si.edu/digitalcollections/bca/documentation/taXMLitv1-3Intro.pdf
105. Williams P, Gibbons D, Margules C, Rebelo A, Humphries C, Pressey R. 1996. A comparison of richness hotspots, rarity hotspots, and complementary areas for conserving diversity of British birds. Conserv. Biol. 10:155-74