iNaturalist System Architecture Final Work Product

I243 Document Engineering and Information Architecture
Prof. Bob Glushko
Final Project Proposal

Nate Agrin
Jessica Kline
Ken-ichi Ueda

Contents

  1. Introduction
  2. Document Exchange Analysis
    1. Interviews
    2. Document Analysis and Component Harvesting
    3. Identifying Codesets
  3. Process Modeling
  4. Data Modeling
    1. Component Consolidation
    2. Data Model Overview
    3. Observations
    4. People
    5. Groups
    6. Media
    7. Localities
    8. PrivateEntities
    9. Taxa
    10. Journals & Entries
    11. Comments
  5. Implementation Suggestions
    1. Services Analysis
    2. Deciding What to Build and What to Outsource
  6. Conclusion
  7. Appendices
    1. Class Presentations
    2. Interview Notes

Introduction

iNaturalist.org will be an online community where people interested in nature can record their observations, share and explore their data, and meet other naturalists. We began implementing this project as a part of I213: User Interface Design and Development, where we focused on developing the user interface for adding observations to the system. Here we describe project work for I243: Document Engineering, in which we use the methods of document engineering to model the data in iNaturalist. Our goal is to develop a robust data model and a set of implementation suggestions to guide us in the design of the back-end system architecture for iNaturalist.

Document Exchange Analysis

Interviews

We interviewed three groups of potential users in order to learn how things are identified in nature, how these findings are recorded, and how they are shared and explored. These groups included a researcher, an amateur naturalist, and the Document Engineering class. The following is a summary of our findings (please see the appendix for our complete interview notes).

The researcher we interviewed models, quantifies, and maps ecological processes on a landscape scale. These processes include the spread of plant pathogens, such as Sudden Oak Death and Pierce's Disease. This researcher also uses community-based input to expand her research. The amateur naturalist we interviewed enjoys birdwatching, taking photos, and using her outdoor experiences as inspiration for her art. She also uses Flickr to share and learn about things in nature. Finally, our Document Engineering class consisted of many students who are interested in (but have little experience with) observing, recording, and sharing things in the natural world.

From our interviews we found that researchers demand higher accuracy than amateurs and derive more from amateur observations than just the identification of the thing observed. For researchers, making an identification requires not only recognizing distinguishing features but also performing physical tests (the interviewee, for example, requires both a DNA test and a culture test to confirm the presence of a specific specimen). While researchers use the observations of amateurs to expand their research, they also benefit from the demographic data of contributing amateurs (the interviewee explained that this helps determine the depth of the contributed observations). We also learned that amateurs consider enforced structure a hindrance, because their motivations are to have fun and enjoy nature. Furthermore, amateurs often do not follow a formal identification process (the interviewee, for example, makes identifications by referring both to field guides and to other amateurs).

Document Analysis and Component Harvesting

We identified six existing document types: photo-sharing resources, blogs, listservs, survey forms, field guides, and observation journals. We performed the D-O-C-U-M-E-N-T analysis on photo-sharing resources, blogs, and listservs, and performed document component harvesting for each of the six document types.

1. Photo-sharing resource

Photo-sharing resource - Flickr
Screenshot of a Flickr observation. http://www.flickr.com/photos/ken-ichi/467809079

D-O-C-U-M-E-N-T analysis:

  • D - The document types include photos, photostreams, collections, and sets.

  • O - The organizational processes include taking a photo, uploading it to Flickr, adding information to it (title, tags, description, copyright, assigning it to a set), sharing it with others, and having others contribute additional information (comments, notes, tags).

  • C - The context of this analysis is that Flickr is a popular photo management and sharing application. It allows users to organize their photos through tags, sets, and collections, rather than physical photo albums. It also has features relevant to the iNaturalist application, by allowing users to set privacy levels, add both text-based and map-based location information, and create social networks.
  • U - The user types include photographers, Flickr contacts, non-Flickr friends and contacts, Internet searchers.
  • M - The model that applies is the "community model". It provides a community for individuals to connect with others with similar interests. It provides a basic and free service to anyone and allows users to upgrade to a premium service for a small fee.
  • E - The enterprise and ecosystem consist of the following: Flickr is part of the Yahoo! enterprise, an Internet services company that provides search, communications, commerce, and content-specific (news, sports, games, finance, etc.) services. Flickr is also part of an ecosystem of photographers, Internet searchers, social networking sites, and other photo management and storage sites.

  • N - The need that drives Flickr is the ability to provide an all-in-one photo application, specifically providing a service that organizes, stores, and shares photos.
  • T - The technology constraints include the ability to detect users that are using the service for commercial purposes and generally encourage existing Flickr users to adopt the premium service and encourage non-Flickr users to adopt the basic service.

Component harvesting table:

Component Name | Synonyms | Semantic Description
Photo | Image, Photograph | An object taken with a camera that recreates something observed in nature
Title | Image name, Name | A name that describes the image or distinguishes it from others
Description | Note, Explanation | A note that provides additional information about the image
Observation date | Date, Time | The date and time that the image was taken
Upload date | Date, Time | The date and time that the image was uploaded
Tag | Label, Keyword | A descriptive term associated with the image
Comment | Note, Feedback | Feedback provided by others about the image

2. Blogs

Blogs - mushroom blog
Screenshot of a Cornell Mushroom Blog posting. http://hosts.cce.cornell.edu/mushroom_blog/?p=193

D-O-C-U-M-E-N-T analysis:

  • D - The document types include individual descriptive posts, group discussion posts, and commentary.

  • O - The organizational processes include logging into the publishing service, writing a narrative, inserting photos or other multimedia components, and submitting the narrative to the publication service.
  • C - The context of this analysis is that many naturalists currently use self-publication methods to author their findings. It is important to understand how users currently publish their findings in order to understand which publication methods work well and which could be improved upon.
  • U - The user types include naturalists, their readers, and other self-publication participants.
  • M - The model that applies is the "self-publisher model". This model represents someone who records and publishes their findings for any number of anonymous users to read or discover. Generally this content is provided for free, and incorporates typical web-based subscription technologies such as RSS or other XML based feeds.
  • E - The enterprise and ecosystem consist of the following: blogs and community journals are generally a part of the ‘blogosphere’, which can be thought of as a loosely coupled network of self-publishing authors. Blogs are typically loosely authored and contain highly unstructured narrative content, along with photos or other imported data relevant to the author’s online identity.
  • N - The need that drives blogs is the need to provide a self-publication mechanism. This mechanism allows non-technically skilled web users to publish content and this content is purposefully unstructured so that it can accommodate a wide range of possible data.

  • T - The technology constraints include the issues of content storage and retrieval. It is difficult to tie the narrative of a nature sighting to a geographic location or event without extensive data mining and natural language processing. In addition, these self-publication methods generally do not expose users to the vast network of other like-minded users, but rather rely on users' own search-and-discovery methods to find one another.

Component harvesting table:

Component Name | Synonyms | Semantic Description
Outing title | Blog title, Post title | Title of the narrative or a brief description of its content
Date and time of post | Post time | Time at which the post was submitted
Date and time of observation | Sighting time | Time (or time ranges) when the observation occurred
Location of observation | Location, Geo position, Geo coordinates, GPS | Location of the observation, either provided as a descriptive phrase or as a set of latitude and longitude coordinates
Username | Name, User, Handle, ID, Identifier | User's identification handle, which may be a real or made up name
Blog post | Narrative, Story, Description | Narrative of the observation, describing the sequence of events or other analysis
Species name | Taxonomic name, Name, Latin name, Common name | More specific descriptor than simply 'bird' or 'mushroom' (generally naturalists will provide a common name and/or Latin name for proper identification)
Image | Photo | Image of the observation, which helps explain or describe the narrative
Link | Hyperlink, Referrer | Connection to another location within the Internet, providing a connection or placemarker for another page, entry, email, photo, etc.
Movie | Film, Video | Series of moving images which display some detailed action regarding the observation. These may be of the actual organism observed or some other relevant aspect of the outing

3. Listservs

Listservs - Example post
Screenshot of a Massachusetts Birding List posting. http://birdingonthe.net/mailinglists/MASS.html#1177208921

D-O-C-U-M-E-N-T analysis:

  • D - The document types include e-mails, photos, species (both target and ancillary), locations, times, identities of other people present, behavioral notes, and morphological anomalies.

  • O - The organizational processes include reporting sightings (rare and common organisms), reporting environmental conditions (weather, degradation, etc.), question asking and answering (ID, edibility, behavior, etc.), and moderating the forum.
  • C - Listservs usually have a geographical or topical context, like "Birding in Alameda County." By joining a listserv, naturalists can share and receive updates about recent species observations and announcements about local events (such as field trips and classes).
  • U - The user types include amateur naturalists, professional biologists, and academics.
  • M - The model that applies is the "community model". It provides a community for individuals to connect with others with similar interests. Some listservs are free for anyone to join (such as the Mount Diablo Audubon Society) while others require a membership with the host organization (such as the Mycological Society of San Francisco).
  • E - The enterprise and ecosystem consist of the following: listservs are often associated with a particular organization, pertain to specific groups of species, and often concern limited geographical areas.

  • N - The need that drives listservs is the need to learn about observed organisms in nature, find places to go observe things, and maintain awareness of rare or unusual organisms that can be observed.
  • T - The technology constraints include the ability for listservs to encourage postings related to the discussion topic as well as the ability to effectively block spam.

Component harvesting table:

Component Name | Synonyms | Semantic Description
Species name | Species, Taxon, Common name, Scientific name, Latin name | Unique identifier for a biological species, usually defined as an organism that can produce viable offspring with others of its kind
Location | Place, Spot, Region, Town, Park | Point or region in geographic space, usually expressed using an identifier from a semi-controlled vocabulary (e.g. Monterey, CA)
Person name | Name, Nickname, Handle, Nick | Human identifier
Link | Hyperlink | Hypertext link to an article or other online data source regarding the subject matter of interest
Datetime | Date, Time | Date/time or span of time at which an observation was made
Note | Comment, Description, Supplementary information | Additional information about an observation, often in narrative, unstructured form involving terminology from uncontrolled vocabularies
Morphological note | (none) | Description of the physical state of the organism, often utilizing a semi-controlled vocabulary
Behavioral note | (none) | Description of the behavior of the organism, often utilizing a semi-controlled vocabulary
Count | Number, Group size | Number of organisms observed

4. Survey form

Survey form - BFL form
Screenshot of the Birds in Forested Landscapes Field Form. http://www.birds.cornell.edu/bfl/fieldform.pdf

Component harvesting table:

Component Name | Synonyms | Semantic Description
Other species | (none) | Other species related to the observed species
Habitat | Environment | Home of a plant or animal
Observation period | Segment | Observation contains three distinct time periods: the observation, playback, and behavior watch periods
Luring technique | (none) | Technique used by the naturalist to initiate the observation of a species, such as a bird call or a fishing lure

5. Field guide

Component harvesting table:

Component Name | Synonyms | Semantic Description
Visual Index | Thumbnails | Many images displayed on a single page for the purpose of quick discovery
Common Name | Name | Name used by lay people to describe an organism, usually descriptive and often unspecific (i.e. can refer to multiple species)
Scientific Name | Latin name, Taxonomic name | Name used by scientists to describe an organism, generally in Latin or utilizing Latin word morphology
Illustration | Drawing | Visual representation of an organism or of components within an organism
Life stage | Life phase, Age | Developmental phase of the organism (e.g. larva, juvenile, adult)
Range map | Map, Range | Geographic map showing the spatial distribution of an organism
Description | Physical description | Textual description of the organism's physical state. May contain information on appearance, sound, smell, touch, or taste. Usually narrative, but may contain less narrative fields, such as length or weight
Similar species | (none) | List of similar species, often accompanied by differentiating features
Life history | (none) | Description of an organism's life history, including reproduction, habitat, and other behavioral information
Gender | Sex | Descriptive components relative to the sex of the species

6. Observation journal

Component harvesting table:

Component Name | Synonyms | Semantic Description
Title | Name, Heading | Short description of the series of observations, which may include the location or date
Date | Date range, Month, Day, Year | Date of entry, generally one day (but may be a range)
Description of species | Narrative | Description of the observation, such as what the species looked like, behavioral observations, special characteristics, gender, etc.
Illustration | Drawing | Typically a hand drawn representation of a species or a particular part of a species
Time of observation | Sighting time | The time the observation occurred, which may be specific (hour/minute/second) or more general (morning, noon, or evening)
Location | Address, Geolocation, Street, Town, City, Place | Location of the observation
Species Name | Species title, Common name, Latin name | Species' given name, common name, or scientific name
Sample | Organism | Representation of the species, such as a leaf or insect

Identifying Codesets

The biological sciences are rife with codesets and naming conventions, many of which overlap, conflict, or are ignored. The published literature is usually the authoritative source of biological taxonomic names, but even published names often conflict. Furthermore, the published literature often moves too fast for external authoritative lists to remain current. The following table lists codesets that may have adequate authority or relevance for the needs of iNaturalist.

Codeset | Source | Example | Description
IUCN Red List Categories | World Conservation Union (http://www.iucn.org) | ENDANGERED (EN) A taxon is Endangered when the best available evidence indicates that it meets any of the criteria A to E for Endangered (see Section V), and it is therefore considered to be facing a very high risk of extinction in the wild. | A set of descriptors and codes for describing the conservation status of a species. For the list of codes, see http://www.iucnredlist.org/info/categories_criteria2001#categories
International Code of Zoological Nomenclature | International Commission on Zoological Nomenclature | 29.2. Suffixes for family-group names. The suffix -OIDEA is used for a superfamily name, -IDAE for a family name, -INAE for a subfamily name, -INI for the name of a tribe, and -INA for the name of a subtribe. These suffixes must not be used at other family-group ranks. The suffixes of names for taxa at other ranks in the family-group are not regulated. | Naming conventions for inventing and interpreting scientific taxon names for animals. See http://www.iczn.org.
International Code of Botanical Nomenclature | International Botanical Congress | 16.1. The name of a taxon above the rank of family is treated as a noun in the plural and is written with an initial capital letter. Such names may be either (a) automatically typified names, formed by replacing the termination -aceae in a legitimate name of an included family based on a generic name by the termination denoting their rank (preceded by the connecting vowel -o- if the termination begins with a consonant), as specified in Rec.16A.1-3 and Art. 17.1; or (b) descriptive names, not so formed, which apply to taxa with a recognized circumscription and which may be used unchanged at different ranks. | Naming conventions for inventing and interpreting scientific taxon names for plants.
Integrated Taxonomic Information System (ITIS) | ITIS.gov | Kingdom: Animalia; Taxonomic Rank: Species; Synonym(s): Lichanura trivirgata Cope, 1861; Common Name(s): Rosy Boa [English]; Taxonomic Status: Current Standing: valid; Data Quality Indicators: Record Credibility Rating: verified - standards met | Database of authoritative taxon names for North America. See http://itis.gov
FishBase | FishBase.org | (none) | Comprehensive index of ichthyological nomenclature. Includes extensive information on common and scientific name synonymy. See http://fishbase.org
General Lepidoptera Names Index | LepIndex | (none) | Index of nomenclature for butterflies and moths. See http://www.nhm.ac.uk/research-curation/projects/lepindex/
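
The ICZN suffix rule quoted in the table above is regular enough to check mechanically. The following is a minimal sketch, not part of the project's design: the helper name family_group_rank is hypothetical, and it encodes only the five suffixes listed in Article 29.2 as quoted.

```python
from typing import Optional

# Suffix-to-rank pairs taken from ICZN Article 29.2 as quoted in the table.
FAMILY_GROUP_SUFFIXES = [
    ("oidea", "superfamily"),
    ("idae", "family"),
    ("inae", "subfamily"),
    ("ini", "tribe"),
    ("ina", "subtribe"),
]


def family_group_rank(name: str) -> Optional[str]:
    """Return the family-group rank suggested by a zoological name's suffix.

    Hypothetical helper: it looks only at the ending of the name, so the
    result is a hint and not a validation against any authoritative list.
    """
    lowered = name.strip().lower()
    for suffix, rank in FAMILY_GROUP_SUFFIXES:
        if lowered.endswith(suffix):
            return rank
    return None  # no regulated family-group suffix found


print(family_group_rank("Boidae"))      # family
print(family_group_rank("Hominoidea"))  # superfamily
```

Because the check is purely textual, names at other ranks that happen to end in one of these suffixes would be misclassified, so a real system would still need a lookup against a source such as ITIS.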

Process Modeling

We have identified two processes used by naturalists: make an identification and record an observation.

Make an identification

Naturalists identify observations either on-site or off-site. A naturalist generally begins with the on-site identification process. He or she will begin recording observational data (see the Record an observation process). At times the naturalist is familiar with the observation or its identifying features and can make an immediate identification. However, when the naturalist is unfamiliar with the observation, he or she typically consults other resources available on-site, such as field guides and other available naturalists. At times these resources will help make an identification. But if they do not, the naturalist can continue the identification process off-site.

With the off-site identification process, the naturalist refers to recorded observational data when searching his or her off-site resources, such as taxonomy lists, species databases, and other personal records. If these off-site resources do not help produce an identification, the naturalist will consult his or her nature contacts, such as individual colleagues or groups of naturalists found on listservs, websites, and photo applications. (If the naturalist does not have such contacts, he or she will search for them before proceeding.) The contacted naturalists will then use the recorded observational data and their own off-site resources to identify the observation. Once a contacted naturalist identifies the observation, he or she will notify the original naturalist (and the contacted naturalist group, if applicable) of his or her findings.

Process diagram - make an identification
Make an identification sequence diagram

Record an observation

In general, with the exception of multimedia data (photos, videos, audio files), this process begins after the observation has been made. The naturalist records narrative data such as color, size, habitat, location, and time in a notebook or checks off a species from a list. After this initial data recording, data can be annotated with additional details, taxonomic identification, and weather data. After the observation data is complete, the data is rarely shared. A notebook or annotated field guide might be shared with another naturalist incidentally, but they are more often used to aid personal recollection. However, the data might be shared with friends via mail, email (personal emails, listservs), or web sites (blogs, photo-sharing sites).

Data Modeling

Component Consolidation

The following is a composite component table for recording an observation. The 'Y' characters indicate that the component is used in the respective document type.

Component Name/Sub Components | Flickr | Blog | Mailing List | BFL Form | Observation Journal
Title of Record Y Y Y Y
Species Name
Common Name Y Y Y Y Y
Latin Name Y Y Y Y Y
Date and Time Record Created
Month Y Y Y Y
Day Y Y Y Y
Year Y Y Y Y
Observation Period
Observer Identifier
First, Last Name Y Y
Special ID Y
Username Y Y Y
Email Y
Description
General Observation Y Y Y Y Y
Morphological Note Y Y
Behavioral Note Y Y
Habitat Y Y
Luring Technique Y Y
Other Species Observed Y Y Y Y Y
Location
Country
State Y Y
Town Y Y Y Y Y
Street Y Y
Latitude Y
Longitude Y
Count
Reference
Hyperlink Y Y Y
Citation Y Y Y Y
Photo
Image Title Y Y
Camera / Camera Properties Y
Copyright Y
Image Date Y
Image Size Y
Movie
Movie Time Y

Data Model Overview

Overview of the iNaturalist data model, including higher level entities

After consolidating components from documents currently in use by naturalists, we examined existing schema definitions that might help us encapsulate these components and build our own model. The DarwinCore version 1.4 draft specification (DwC) is an XML vocabulary that closely matched the data types we required. We decided that wholesale adoption of DarwinCore was not the most appropriate design choice, and instead used it as a foundation from which we added and removed entities as we felt appropriate. The advantage of this approach is that we loosely conform to an emerging standard and can therefore share data with the scientific community more easily. In the overview model, we have mapped out the relationships between the major entities of our system. The attributes of these entities are defined in the models that follow, where the DarwinCore definitions become apparent (in some cases synonyms have been selected over the original DarwinCore attribute names).

We chose to encode our model using Entity Relationship diagrams, a visual vocabulary for data modeling usually used in the design of relational databases. We chose this convention because we envision iNaturalist.org as a database-driven web application, and indeed, our prototypes to date have all been driven by an underlying relational database.

Observations

Observations ER diagram

Observations lie at the core of iNaturalist, alongside People. Since we found such a good match between our own document component consolidation and the elements of the DwC, we let the DwC guide much of our Observation model. Attributes like valid_distribution_flag come straight from the DwC (this attribute indicates organisms that were clearly outside of their range, such as animals in zoos). However, the DwC also had shortcomings for our purposes. Primary among these were the lack of support for multimedia (see below) and the assumption of one organism per observation.

The issue of multiple organisms per observation was a matter of contention within the group. Some of us thought this was over-modeling the domain, creating a model that was too unwieldy to actually implement. Others thought this kind of complexity could be useful to our users, and that if anything we should over-model to encompass possible features rather than have to revise the model to add them later.

Ultimately, the over-modeling faction won out (by majority). While we still believe multiple organisms per Observation is an edge case, we think it important to support cases of organismal interactions, such as predation or territoriality. Given these needs, we added some substructure to the DwC model, allowing for multiple ObservedTaxa for a given Observation. The ObservedTaxa entity represents a single, physical observed organism, and includes some of the DwC Biological Elements, along with a reference to a Taxon. Note that the Taxon allows for some taxonomic ambiguity by allowing values above the level of species, but it does not accommodate ambiguity between disparate taxa. For instance, you cannot say you saw either a bird or a bat. You can only say you saw something within a group that contains the two, like vertebrates.
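
The substructure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class and attribute names (ObservedTaxon, life_stage, sex, add_organism) are hypothetical stand-ins for the entities in the ER diagram, and a taxon is referenced here by a plain name string.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObservedTaxon:
    """One physical organism seen during an observation."""
    taxon_name: str                    # reference to a Taxon, at any rank
    life_stage: Optional[str] = None   # assumed biological attribute
    sex: Optional[str] = None          # assumed biological attribute


@dataclass
class Observation:
    """A single observation event that may involve several organisms."""
    observer: str
    description: str = ""
    observed_taxa: List[ObservedTaxon] = field(default_factory=list)

    def add_organism(self, taxon_name: str, **details) -> ObservedTaxon:
        """Attach another observed organism to this observation."""
        organism = ObservedTaxon(taxon_name=taxon_name, **details)
        self.observed_taxa.append(organism)
        return organism


# A predation event: two organisms recorded under one observation.
hunt = Observation(observer="observer_1", description="Hawk catching a vole")
hunt.add_organism("Buteo jamaicensis", life_stage="adult")
hunt.add_organism("Microtus californicus")
```

The common single-organism case is simply an Observation whose observed_taxa list has one element, which keeps the edge case from complicating everyday use.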

People

People ER diagram

DarwinCore does not extensively define a person entity, aside from the ‘Collector’ attribute, which is a simple string for the Collector’s name. Our model takes into account the needs of our application by including additional People attributes that record login dates, contact information, and optional elements a person can use to describe their interests in the system.

We also loosely define Friends and Services entities. The Friends entity simply represents a social relationship between two People entities. Services define the types of external services that a user may subscribe to, which might include photo-sharing sites, listservs, or blogs. Services are meant to allow iNaturalist to work in parallel with users’ current methods of work, instead of attempting to supplant them.

Groups

Groups ER diagram

Groups represent one or more People focused on a common social interest, such as a geographic location or the types of observations they share. Each group may have its own journal to which the members post, allowing for discussion of diverse topics, and the group can associate itself with a single locality. Groups are a social component of the site, and as such there was little we could draw from the DarwinCore specification. The Groups entity must contain a name and can have an optional description.

Media

Media ER diagram

DarwinCore suggests an entity called “ImageURL” and another called “RelatedInformation”. These two entities can be used to incorporate multimedia into the observation data. We have encapsulated this information in a parent class titled “Media”. Media uses a more abstract version of ImageURL, which we call “file_url”. file_url simply allows us to use extra multimedia content like video and audio while capturing a static link. For the moment we are focused on elements of the system that may be implemented in house, and have only defined Photos and External Photos as the types of data inheriting from Media. An External Photo is a photo hosted by some other service and accessed through an abstracted API. Photos contains a single extra element, EXIF_data, which allows us to capture information generated automatically by a camera when an image is originally taken.

Localities

Localities ER diagram

The DwC locality elements were more robust than we needed, so we kept the name "Locality" and proceeded to model for our own needs. If in the future we need to provide Observation data in fully valid DwC format, we feel confident that we will be able to auto-fill most of the elements based on our own data. Localities in our model are essentially rectangular areas defined by latitude and longitude, with points represented as zero-width, zero-height rectangles (minimum latitude equals maximum latitude, and likewise for longitude). Distinguishing between public and personal localities was another point of contention. While we all agreed that users should be able to create their own Localities to customize their maps and personally engage with their areas of interest, some of us felt that distinguishing between public places and personal places was important enough to be modeled. For example, your backyard might be of enormous importance to you, and you would want to see it on your own map all the time. However, your backyard is not meaningful to most other users, and should therefore be hidden if necessary. Others thought this was another case of over-modeling, but, again, the argument for erring on the side of over-modeling won out.
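
The rectangle-with-degenerate-points idea can be illustrated with a small sketch. The field names, the point constructor, and the coordinates below are assumptions made for illustration, and the containment test ignores rectangles that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locality:
    """A rectangular area bounded by latitude and longitude (hypothetical fields)."""
    name: str
    min_lat: float
    max_lat: float
    min_lon: float
    max_lon: float

    @classmethod
    def point(cls, name: str, lat: float, lon: float) -> "Locality":
        """Build a point as a rectangle with zero width and zero height."""
        return cls(name, lat, lat, lon, lon)

    @property
    def is_point(self) -> bool:
        return self.min_lat == self.max_lat and self.min_lon == self.max_lon

    def contains(self, lat: float, lon: float) -> bool:
        """True if the coordinate falls inside or on the edge of the rectangle."""
        return (self.min_lat <= lat <= self.max_lat
                and self.min_lon <= lon <= self.max_lon)


# Illustrative, approximate coordinates only.
park = Locality("Example park", 37.88, 37.92, -122.27, -122.22)
trailhead = Locality.point("Example trailhead", 37.90, -122.25)
```

Treating points as degenerate rectangles means a single containment query can serve both kinds of Locality.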

Another argument for the PersonalLocalities relates to privacy. While our primary PrivateEntity model covers binary states of privacy (see below), such as whether or not a group of users can access an Observation, it does not broach the issue of spatial privacy, which could have multiple states. For example, a mushroom hunter might have a Locality for her favorite chanterelle spot. She would want to hide that spot from the general public to avoid competition, but she might want to share the fact that the chanterelles were up at a certain time of year at a broader spatial scale, like the county level. We added the "obscurity" attribute to the PersonalLocality entity to handle this situation. An "obscure" locality will be abstracted to some broader spatial scale without completely denying access to other users.
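
One way the "obscurity" attribute might work is to snap a private position to a coarse grid cell before showing it to other users. The sketch below assumes obscurity is stored as a cell size in degrees, which the model does not specify; the class layout and the public_bounds method are likewise hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PersonalLocality:
    name: str
    lat: float
    lon: float
    obscurity: float = 0.0  # assumed: grid cell size in degrees, 0 = exact

    def public_bounds(self) -> Tuple[float, float, float, float]:
        """Return (min_lat, max_lat, min_lon, max_lon) as shown to other users."""
        if self.obscurity <= 0:
            # Not obscured: expose the exact point as a zero-size rectangle.
            return (self.lat, self.lat, self.lon, self.lon)
        cell = self.obscurity
        # Snap to the south-west corner of the grid cell holding the point.
        min_lat = math.floor(self.lat / cell) * cell
        min_lon = math.floor(self.lon / cell) * cell
        return (min_lat, min_lat + cell, min_lon, min_lon + cell)


spot = PersonalLocality("Chanterelle spot", 37.9, -122.25, obscurity=0.5)
print(spot.public_bounds())  # (37.5, 38.0, -122.5, -122.0)
```

Other users would see only that the chanterelles are somewhere within the half-degree cell, while the owner continues to see the exact position.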

PrivateEntities

PrivateEntities ER diagram

Privacy is very important for some naturalists, especially those who harvest resources rather than simply observing them. We see the PrivateEntity as a sort of meta-entity, akin to an Interface in the Java programming language. Essentially, entities that declare themselves as "private" will inherit the properties of the PrivateEntity. The PrivateEntity basically establishes what People and/or Groups have exclusive access to an entity. It also includes a "relationship_type" attribute, to allow access restriction based on classes of users (e.g. "only my friends can see this").
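
As a rough illustration of the meta-entity idea, the sketch below models PrivateEntity as a base class that other entities inherit from. Apart from relationship_type, which the text names, the field names and the visible_to method are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, Optional, Set


@dataclass
class PrivateEntity:
    """Access rules shared by any entity that declares itself private."""
    owner: str
    allowed_people: Set[str] = field(default_factory=set)
    allowed_groups: Set[str] = field(default_factory=set)
    relationship_type: Optional[str] = None  # e.g. "friend"

    def visible_to(self, person: str, groups: Iterable[str] = (),
                   relationship: Optional[str] = None) -> bool:
        """Decide whether a person may see this entity."""
        if person == self.owner or person in self.allowed_people:
            return True
        if self.allowed_groups & set(groups):
            return True
        # Class-based access, such as "only my friends can see this".
        return (self.relationship_type is not None
                and relationship == self.relationship_type)


@dataclass
class PrivateObservation(PrivateEntity):
    """An observation that inherits the privacy rules."""
    description: str = ""


obs = PrivateObservation(owner="alice", allowed_groups={"mycology"},
                         relationship_type="friend",
                         description="Chanterelles are up")
```

Here the caller supplies the viewer's group memberships and relationship to the owner; in a database-backed application those would be looked up from the Groups and Friends entities.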

Taxa

Taxa ER diagram

Our taxonomy model is quite simple. A taxonomy is simply a tree-like data structure, so its entities are essentially names and referrals to their containing (parent) entities. There was also some debate over the Taxa, as some of us wanted to add more life history data to the model, such as a "physical_characteristics" attribute, or a "reproductive_behavior" attribute. However, the reductionists won this argument, largely because if we plan to support user-generated life history information to the taxonomy, the "description" attribute will probably suffice. This design was largely based on that of the Integrated Taxonomic Information System (ITIS) database.
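
A tree of names with parent references is simple to sketch, and it also shows how an ambiguous sighting can be resolved to the nearest group that contains both candidates. The class layout and the common_ancestor helper below are illustrative assumptions, not the project's schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Taxon:
    name: str
    rank: str
    parent: Optional["Taxon"] = None
    description: str = ""  # free-form life history text

    def lineage(self) -> List["Taxon"]:
        """Return this taxon followed by its ancestors up to the root."""
        chain, node = [], self
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain


def common_ancestor(a: Taxon, b: Taxon) -> Optional[Taxon]:
    """Return the most specific taxon that contains both a and b."""
    ancestors_of_a = {id(t) for t in a.lineage()}
    for candidate in b.lineage():
        if id(candidate) in ancestors_of_a:
            return candidate
    return None  # the two taxa are not in the same tree


vertebrata = Taxon("Vertebrata", "subphylum")
aves = Taxon("Aves", "class", parent=vertebrata)
mammalia = Taxon("Mammalia", "class", parent=vertebrata)
chiroptera = Taxon("Chiroptera", "order", parent=mammalia)

print(common_ancestor(aves, chiroptera).name)  # Vertebrata
```

If an observer cannot tell a bird from a bat, the record can point at the taxon returned by common_ancestor, which is the most specific value the model can store for that sighting.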

Journals & Entries

Journals & Entries ER diagram

Journals and Entries essentially support blogging and forum functionality within iNaturalist. While most of the attributes are fairly self-explanatory, the complex set of relations requires some annotation. Every Person in the system has one Journal. This acts as a sort of personal nature journal and supports more narrative, free-form kinds of data recording. Additionally, every Group has a Journal, which acts as a sort of community forum. Users can write Entries just as they would for their own Journal, but post them to a Group Journal. Entries can belong to many Journals because we decided it was important to be able to cross-post Entries to several Journals. For example, you might want to keep an Entry about your morning hike in Tilden Park in your personal Journal, but also post it to the Tilden Group, and maybe the Berkeley Birding Group if you saw some interesting birds.

Entries can also be associated with Observations. An Entry might describe several Observations. We imagine allowing users to create Observations separately and associate them with Entries, and also extracting Observations automatically from the text of Entries.
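The many-to-many relation between Entries and Journals can be illustrated with a small Python sketch. The class layout and the `post_to` method are assumptions made for this example; in a relational schema the same relation would typically live in a join table.

```python
class Journal:
    """Belongs to one Person or one Group and collects Entries."""

    def __init__(self, owner):
        self.owner = owner
        self.entries = []


class Entry:
    """A narrative post that may appear in several Journals."""

    def __init__(self, author, title, body):
        self.author = author
        self.title = title
        self.body = body
        self.journals = []      # many-to-many: Journals this Entry is posted to
        self.observations = []  # Observations described by this Entry

    def post_to(self, journal):
        """Cross-post this Entry to a Journal, ignoring repeat postings."""
        if journal not in self.journals:
            self.journals.append(journal)
            journal.entries.append(self)


personal = Journal(owner="alice")
tilden = Journal(owner="Tilden Group")
birding = Journal(owner="Berkeley Birding Group")

hike = Entry("alice", "Morning hike in Tilden Park", "Saw some interesting birds.")
for journal in (personal, tilden, birding):
    hike.post_to(journal)

print([j.owner for j in hike.journals])
```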

Comments

ER diagram for Comments

Comments are a standard component of online participatory media, and we intend to support them across several different types of content. Again, the attributes are fairly self-explanatory. The main feature of note is that a Comment must have a parent entity, like an Observation or a Locality. All comments must be in reference to something. Comments can also refer to other Comments, to allow for threaded discussions.
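As a rough illustration of the parent requirement and of threading, consider the following Python sketch. The class layout, the `replies` list, and the `root` method are assumptions for the example; the `Observation` class here is a bare stand-in for any commentable entity.

```python
class Observation:
    """Stand-in for any entity that can be commented on."""

    def __init__(self, title):
        self.title = title


class Comment:
    """A comment that must refer to a parent entity or another Comment."""

    def __init__(self, author, body, parent):
        if parent is None:
            raise ValueError("a Comment must refer to a parent entity")
        self.author = author
        self.body = body
        self.parent = parent
        self.replies = []
        # A Comment whose parent is a Comment is a reply in a thread.
        if isinstance(parent, Comment):
            parent.replies.append(self)

    def root(self):
        """Follow parent links up to the non-Comment entity under discussion."""
        node = self
        while isinstance(node, Comment):
            node = node.parent
        return node


obs = Observation("Red-tailed hawk over Tilden")
first = Comment("bob", "Are you sure it was a red-tail?", parent=obs)
reply = Comment("alice", "Yes, I saw the tail clearly.", parent=first)

print(reply.root().title)  # Red-tailed hawk over Tilden
```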

Implementation Suggestions

Services Analysis

The following describes the specific services that we anticipate reusing, redistributing, and creating or recreating when building the iNaturalist system.

Services to reuse

The services we anticipate reusing include online maps, taxonomic data, weather data, photo and blog applications, and listserv communities. Specific examples include the Google Maps API, the Universal Biological Indexer and Organizer (uBio), the National Climatic Data Center (NCDC), the Flickr API, the Blogger API, and listservs focused on specific groups of animals or plants. Because these services are widely used, comprehensive, and meaningful, we plan to reuse them, both to support the communities that already exist around them and to free us to focus on creating other important services.

We have also considered two auxiliary services not integral to the iNaturalist system: recipe resources (such as Epicurious.com) and book-finding resources (such as Amazon.com or libraries). The former would allow users to find recipes for edible observations, and the latter would allow users to locate supplementary resources related to specific observations.

Services to redistribute

The services we anticipate redistributing include adding taxonomic tags back into services that support them (like Flickr), cross-posting to external blogs and lists, and publishing annotated spatial data (such as names and observations) via data standards such as KML and WMS. Because iNaturalist will aggregate and collect data about observations in nature, we hope to share this data with other groups and services. We envision these groups and services including teachers, researchers, map services, ecotourism services, and public land managers.
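To make the KML option concrete, here is a minimal Python sketch that turns observation records into a KML document using only the standard library. It assumes observations arrive as plain dictionaries with the hypothetical keys `taxon`, `notes`, `lat`, and `lon`, and it uses the KML 2.2 namespace; the function name and sample record are illustrative only.

```python
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def observations_to_kml(observations):
    """Serialize observation dicts as KML Placemarks (one per observation)."""
    ET.register_namespace("", KML_NS)  # emit KML as the default namespace
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for obs in observations:
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = obs["taxon"]
        ET.SubElement(placemark, f"{{{KML_NS}}}description").text = obs["notes"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinates are written longitude first, then latitude.
        coords = f"{obs['lon']},{obs['lat']}"
        ET.SubElement(point, f"{{{KML_NS}}}coordinates").text = coords
    return ET.tostring(kml, encoding="unicode")


# Sample record with made-up values, for illustration only.
sample = [
    {"taxon": "Buteo jamaicensis", "notes": "Soaring over the meadow",
     "lat": 37.9, "lon": -122.24},
]
kml_text = observations_to_kml(sample)
print(kml_text)
```

A file produced this way could be opened in a map viewer that reads KML. Observations marked private would need to be filtered out, or have their coordinates coarsened, before export.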

Deciding What to Build and What to Outsource

There are many services to exploit on the Web, and as we have shown, many of them are relevant to iNaturalist. However, some systems are too important to us to rely on an external service that might fail or change its API at any time. The services we anticipate developing internally include annotated spatial data (such as names and observations), photo implementation, journal implementation, and taxonomic search and discovery. Most importantly, we hope to provide a community where naturalists can search and share observations in nature. These observations are represented by various media formats (such as photos, audio files, videos, illustrations, and text) and include additional spatial, weather, and taxonomic data. While services such as photo and blog applications can be provided externally, we also plan to implement them internally so that iNaturalist users have a robust application in which these integral services are built in.

Conclusion

Data modeling is difficult, and the many problems involved in modeling a domain are often difficult to predict before actually designing a system. Nevertheless, we feel that we gained a great deal of perspective on iNaturalist by utilizing the document engineering approach, starting with basic needs assessment and document harvesting, consolidating data types and identifying relevant processes, and finally creating a data model for almost all the information in our system. To be sure, points of contention remain. Our model is by no means perfect and will no doubt undergo further refinement as we begin implementation. Nonetheless, both the process and the results have been extremely useful.

The next step in the process is to design a database schema based on our ER diagrams, and determine how our database entities will manifest as classes and objects in our application code. We will also use our process modeling to guide our implementation of the same processes in iNaturalist (e.g. making an identification). Given that we found the DarwinCore specification so useful, it would also be beneficial to begin a dialog with the authors of the specification, some of whom work here at UC Berkeley and across the Bay at the California Academy of Sciences in San Francisco. Their input on the departures we made from their model would no doubt be invaluable, and perhaps some of our own modeling experiences might aid them in further refining the DarwinCore.

Appendices

Class Presentations

  • First presentation (4/9/07) - This presentation included an introduction to iNaturalist and initial D-O-C-U-M-E-N-T analysis. (see presentation slides)

  • Second presentation (4/22/07) - This presentation consisted of our interview results, component and document harvesting tables, codeset analysis, and process descriptions. (see presentation slides)

Interview Notes

Interview #1: Researcher

The interviewee is a researcher who models, quantifies, and maps ecological processes on a landscape scale, mostly the spread of plant pathogens (e.g. Sudden Oak Death, Pierce's Disease). This researcher also utilizes community-based input to expand this research. This interview was particularly useful in discovering the processes specific to the research community. Our findings include the processes of making an identification and reporting data as well as information about the data itself.

Task Scenarios

  • IDs rarely made on the ground, but there are diff. contexts
  • learning about new organisms and phenomena mostly from colleagues in academia and government
  • generally does not record observations in person

Contexts of Use

  • Make an identification (researcher):
    • Need to know characteristics of Sudden Oak Death: dead oak trees, rat homes, crown change, bark bleeding, increased beetle activity, wood decay fungi
    • Learned about these characteristics by going into the field with other researchers (foresters, oak biologists, pathologists, entomologists)
    • Official identification involves DNA tests for a particular pathogen and a culture test using a Petri dish
  • Make an identification (general public):
    • Provide identification workshops and field trips
    • Establish identification documents: checklist, reporting form, example photos
  • Report data
    • Qualified samplers perform culture and genetic test
    • Take GPS location (but this can have inaccuracies – people will take location in an arbitrary location)

Requirements / Constraints

  • Funding discrepancies: currently funding only includes public lands (and there is no difference between different public land jurisdictions)
  • Privacy concerns exist: public doesn’t want specific geographic location used because they don’t want their neighbors or friends to identify their property as the specific location
  • Differences in vocabulary: the public will use terminology such as “that thing” while scientists refer to the scientific name
  • General data requirements:
    • Data doesn’t need to be exact, and exact data information is taken with a grain of salt because precise spatial information is often inaccurate (as described in the report data section).
    • Want demographic data of those reporting
    • An initial filter is required: was the observation actually seen (or speculated), does this observation make sense

Interview #2: Amateur naturalist

Participant is a professional artist / artisan who has been interested in nature since childhood. She credits her interest in birding to her grandmother, who fed birds and possessed encyclopedic knowledge of her local birds. Interestingly, her partner, whose parents did not have a specific interest in natural history, credits some of his current interest in the topic to artifacts of duck hunting in his house while growing up, remnants of his grandfather's hobby.

The participant enjoys observing birds for fun, and also uses her observations and interest in some of her artwork. Was very insistent on distinguishing herself from more "serious" birders and assiduous list-keepers. Found it interesting that some people express an intrinsic collecting instinct through naturalism, but stated that she does not.

Task Scenarios

  • IDs made with field guides, with the help of other amateur naturalists, and occasionally from professionals, i.e. pro. naturalists at public facilities or scientists and veterinarians at animal rehab. centers
  • learning thru
    • in-person questions of other naturalists
    • on-line questions of naturalists (e.g. on Flickr)
    • omnivorous reading of nature guides and online info
  • rarely writes down observations
    • occasionally makes small lists, leaves in field guides
    • but takes a lot of pictures, posts to Flickr w/ lots of tags

Contexts of Use

  • Only context was casual, naturalism for fun.
  • Actually very unwilling to use the words "naturalist" or "birder" to self-describe
    • Thought these terms implied more authority, or more "seriousness" than she felt about her own behavior
    • Said "birder" means really hardcore, while "birdwatcher" refers to more casual people
    • Doesn't consider it a scientific or rigorous endeavor: "It's very visual for me." and "Sort of like a puzzle."
  • has also worked in an animal rehabilitation center
    • ID, learning, and recording largely from co-workers, managers of the facility

Requirements / Constraints

  • very few formal constraints for offline tasks
    • politeness when talking to ppl?
    • time / fun: don't want to spend time doing things that aren't fun, or are tedious; actually professed that she likes blogs specifically because there are few constraints, posting is quick and easy and doesn't require editing; there might be a time req. when hiking with non-naturalists who might not like stopping all the time to look at birds; on the other hand, the participant's partner needs to stop frequently on the trail, so making observations actually paces them
  • online there are technical constraints, but when asked, participant said she did not feel her tools limited her capacity for self-expression
    • expectations may be preset by the tools, though
    • for example, during our ride, we stopped in a meadow and remarked on the quiet but persistent buzzing of bees. Later I asked if the participant ever thought about recording such sounds and sharing them, but she had not considered it.
  • concerned about aesthetic quality of photos
  • somewhat concerned about accuracy and form of naming
    • expressed fear that using scientific names was "pretentious" for her
    • ultimately more willing to make a mistake and be corrected

Applications

  • pen and paper (tedious)
  • digital camera
  • Flickr
  • Blog

Again, I asked about technological constraints, but she didn't feel any of them constrained her ability to record or express.

Interview #3: Document Engineering class

The Document Engineering class consisted of many students who are interested in (but have little experience with) observing, recording, and sharing things in the natural world.

When asked how they identify observations in nature, some noted that they ask experienced naturalists for identification help, while others noted that they refer to field guides and other accessible databases (such as Calflora).

When asked how they record observations in nature, some noted that they take photos and others noted that they use pen and paper to either sketch the observation or record distinguishable features.
