Organizing research data
© Sestoft; licensee BioMed Central Ltd. 2011
Published: 20 June 2011
Research relies on ever larger amounts of data from experiments, automated production equipment, questionnaires, time series such as weather records, and so on. A major task in science is to combine, process and analyse such data to obtain evidence of patterns and correlations.
Most research data are in digital form, which in principle ensures easy processing and analysis, easy long-term preservation, and easy reuse in future research, perhaps in entirely unanticipated ways. However, in practice, obstacles such as incompatible or undocumented data formats, poor data quality and lack of familiarity with current technology prevent researchers from making full use of available data.
This paper argues that relational databases are excellent tools for veterinary research and animal production; provides a small example to introduce basic database concepts; and points out some concerns that must be addressed when organizing data for research purposes.
A database is an organized collection of data. This section presents the most common tool for storing and processing data in modern society: the relational database.
One could record all the data in a single flat list, as in Figure 1, with one row per milking event. However, this is a poor solution for several reasons:
- The address of a farm is repeated for every cow, and the birth date of a cow is repeated for every milking event belonging to that cow. Such redundancy typically leads to inconsistency (e.g. two different addresses recorded for the same farm) and to update problems (e.g. if the street name of a farm is changed).
- If one needs to register a farm before it has a cow, or register a cow before it has a milking event, one must leave some fields blank, which is likely to confuse later processing and analysis.
A better solution is to use a relational database, which since 1985 has been the dominant technology for organizing and handling large data sets in production, commerce, finance, research and so on.
Tables in relational databases
In a relational database the example from Figure 1 would be broken into three separate tables called Farm, Cow and Milk, as shown below. The tables would all be stored in the same database inside a database system. The database system may simply be Microsoft Access, which is part of the Microsoft Office suite, or it may be the SAS statistical analysis system, and hence the database may reside on the researcher's normal computer. However, if the database is to be shared with others, it is more sensible to keep it on a separate server.
Missing observations, such as those in the cellCount column of the Milk table, are said to be null. We may require, and the database system may enforce, that all values must be non-null, except possibly in the cellCount column. This requirement would not work in the original flat list in Figure 1, because it would prevent us from creating a farm record before the farm has a cow, which is illogical. Furthermore, the splitting of the flat list into separate Farm, Cow and Milk tables means that there is no redundancy and hence less risk of inconsistency: the address of a farm is stated only once per farm, and the farm to which a cow belongs is stated only once per cow.
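As an illustration of what such a schema might look like, the three tables can be declared in SQL; here they are created with Python's built-in SQLite engine. The column names and types below are assumptions for illustration (the actual columns appear in Figures 2 through 4), but the key point carries over: only cellCount may be null, and a farm can be registered with no blank fields before it has any cows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Farm (
    farmId   INTEGER PRIMARY KEY,
    address  TEXT NOT NULL,
    postcode TEXT NOT NULL
);
CREATE TABLE Cow (
    cowId     INTEGER PRIMARY KEY,
    farmId    INTEGER NOT NULL REFERENCES Farm(farmId),
    birthDate TEXT NOT NULL
);
CREATE TABLE Milk (
    cowId     INTEGER NOT NULL REFERENCES Cow(cowId),
    milkDate  TEXT NOT NULL,
    kg        REAL NOT NULL,
    cellCount INTEGER          -- the one column where null is allowed
);
""")
# A farm can be registered before it has any cows, with no blank fields:
conn.execute("INSERT INTO Farm VALUES (1, 'Hypothetical address', '5591')")
```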
Queries in relational databases
The beneficial splitting of the flat list of farm, cow and milk data into three separate tables introduces a challenge, though: How does one combine the tables to obtain useful information, such as the total milk production in each postcode? In a relational database this is done using queries, expressed in SQL, the Structured Query Language. All modern database systems, including the open source systems MySQL and PostgreSQL and the commercial systems DB2, Oracle, Microsoft SQL Server and Microsoft Access, understand some variant of SQL and can execute queries involving millions of records in a few seconds. Although the complete SQL language is rather complex, an introduction can be found in any database book, such as . Here we shall just consider some examples of SQL queries, from very simple to moderately complex.
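For instance, the total-milk-production-per-postcode query mentioned above can be expressed by joining the three tables and aggregating. The sample data and column names below are invented for illustration; the query structure is the point:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Farm (farmId INTEGER PRIMARY KEY, postcode TEXT);
CREATE TABLE Cow  (cowId INTEGER PRIMARY KEY, farmId INTEGER);
CREATE TABLE Milk (cowId INTEGER, milkDate TEXT, kg REAL);
INSERT INTO Farm VALUES (1, '5591'), (2, '7171');
INSERT INTO Cow  VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO Milk VALUES (10, '2010-07-30', 21.5),
                        (11, '2010-07-30', 18.0),
                        (12, '2010-07-30', 23.0);
""")
# Join Farm to Cow to Milk, then sum the milk yield per postcode
rows = conn.execute("""
    SELECT f.postcode, SUM(m.kg) AS totalKg
    FROM Farm f
    JOIN Cow  c ON c.farmId = f.farmId
    JOIN Milk m ON m.cowId  = c.cowId
    GROUP BY f.postcode
    ORDER BY f.postcode
""").fetchall()
print(rows)  # [('5591', 39.5), ('7171', 23.0)]
```

Note that the query never mentions how the tables are stored; it only states which columns must match (the join conditions) and what to aggregate.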
The above small examples give a taste of some common SELECT queries. They should convey that SQL becomes a very powerful language once one understands how to combine its operations into larger queries. Moreover, relational databases and SQL can be used from inside standard desktop tools such as Excel spreadsheets or the statistical packages R and SAS. Thus large data sets may be stored in a relational database, extracted and preprocessed using SQL, and then visualization, statistical analysis and data mining or pattern discovery may be performed using tools that researchers are already familiar with.
Database design and documentation
The result of a database design is a database schema: a list of the database's tables; and for each table, a list of its columns, the type (e.g. number or text) of values in each column, information about which column holds the table's key, which columns are allowed to hold null values, and so on.
The database schema is part of the metadata, that is, data about the data. Other kinds of metadata that are often neglected, but that are very important for scientific use, are the units of measurements (e.g. liter, kilogram, gram, percentage by volume, percentage by weight), the precision of measurements, time zone information (local time, universal time, daylight savings time), and the exact interpretation of "codes" such as clinical observations (see the section on terminology and ontology) or answer categories of questionnaires. All of this must be documented and the documentation preserved and kept up-to-date for the data to be of any future value.
A central concept in database design is normal form, which basically stipulates that tables must not contain certain kinds of redundancy. We shall not go into further details here, except to note that the Farm, Cow and Milk tables shown in Figure 2 through Figure 4 are in the so-called Boyce-Codd normal form. Normalization is amply covered in any database book, such as .
Temporal and spatial databases
Our farm-cow-milk database example is highly simplified. In particular, it assumes that a cow belongs forever to the same farm, whereas in reality it may be sold from one farm to another. To solve this problem the Cow table could be made temporal, by adding a validFrom and a validTo column. Then each record describes the period in which a given cow belongs to a given farm, which allows for much more detailed queries, such as what is the number of cows for each farm on 30 July 2010, or what is the total milk production per postcode in each of the months of 2010. Unfortunately, the SQL queries become a good deal more complex. The theory of temporal databases is well-developed; a good introduction is provided by .
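A minimal sketch of such a temporal Cow table follows; the data are invented, and a far-future validTo date is one common convention (not the only one) for "still valid". The cows-per-farm-on-a-given-date query then restricts to records whose validity period covers that date:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cow (
    cowId     INTEGER,
    farmId    INTEGER,
    validFrom TEXT NOT NULL,
    validTo   TEXT NOT NULL    -- '9999-12-31' stands for "still valid"
);
-- cow 10 was sold from farm 1 to farm 2 on 2010-09-01
INSERT INTO Cow VALUES (10, 1, '2008-04-01', '2010-09-01'),
                       (10, 2, '2010-09-01', '9999-12-31'),
                       (11, 1, '2009-01-15', '9999-12-31');
""")
# Number of cows per farm on 30 July 2010: keep only records whose
# half-open validity interval [validFrom, validTo) contains that date
rows = conn.execute("""
    SELECT farmId, COUNT(*) AS nCows
    FROM Cow
    WHERE validFrom <= '2010-07-30' AND '2010-07-30' < validTo
    GROUP BY farmId
    ORDER BY farmId
""").fetchall()
print(rows)  # [(1, 2)] -- both cows belonged to farm 1 on that date
```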
Moreover, much data is spatial: a farm or field is located at a particular place, which may be described by UTM coordinates or by longitude and latitude. Knowing where objects are, and when, allows for queries such as at what times was this cow near Gelsted or find all pairs of cows that were within 8 km of each other at some time, as well as epidemiological analyses and easy visualization.
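Even without a dedicated spatial database, the within-8-km query can be approximated from latitude and longitude using the haversine great-circle distance. The cow positions below are invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))   # mean Earth radius about 6371 km

# Hypothetical positions: (cowId, latitude, longitude)
positions = [(10, 55.40, 9.98), (11, 55.43, 10.05), (12, 56.10, 10.20)]

# All pairs of cows within 8 km of each other
pairs = [(a[0], b[0])
         for i, a in enumerate(positions)
         for b in positions[i + 1:]
         if distance_km(a[1], a[2], b[1], b[2]) <= 8.0]
print(pairs)  # [(10, 11)]
```

A real spatial database system would use spatial indexes to avoid comparing every pair, which matters once there are many thousands of positions.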
Terminology and ontology
However, there are some potential problems with the clinical term codes in Figure 14. First of all, codes 81 and 140 appear to have the same meaning, so there is a risk that two people may use different codes for the same observation, which may later produce misleading results (e.g. statistics) when queries are made to the database. Second, no distinction is made between findings (e.g. 88 will not drink), diagnoses (e.g. 11 udder infection) and procedures (e.g. 80 hoof trimming); whether or not this leads to problems depends on the discipline and consistency with which veterinarians register clinical observations. Finally, some codes correspond to subcategories or specializations of others; for instance 11 udder infection and 38 joint infection are both special cases of 42 infection; should one then always use the most specific code available (e.g. 11 or 38) or alternatively always register a more general code (e.g. 42) along with more specific ones (e.g. 11 or 38)? In the former case, will somebody who queries the Clinical table in Figure 13 for all cases of infection remember to also query for the more specific ones (e.g. 11 and 38)? This example illustrates some problems with designing category codes for use in databases, and in classifying observations in general.
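One way to make the all-cases-of-infection query robust is to record the specialization relation between codes explicitly and let the query walk it. The parent column and recursive query below are one possible design sketch, not the actual coding scheme of Figure 14:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical code table: parent records that 11 (udder infection)
-- and 38 (joint infection) are special cases of 42 (infection)
CREATE TABLE Code (code INTEGER PRIMARY KEY, meaning TEXT, parent INTEGER);
INSERT INTO Code VALUES (42, 'infection', NULL),
                        (11, 'udder infection', 42),
                        (38, 'joint infection', 42),
                        (80, 'hoof trimming', NULL);
CREATE TABLE Clinical (cowId INTEGER, obsDate TEXT, code INTEGER);
INSERT INTO Clinical VALUES (10, '2010-07-01', 11),
                            (11, '2010-07-02', 80),
                            (12, '2010-07-03', 42);
""")
# Collect code 42 and all its (transitive) specializations with a
# recursive query, then find the clinical records using any of them
rows = conn.execute("""
    WITH RECURSIVE infection(code) AS (
        SELECT 42
        UNION
        SELECT c.code FROM Code c JOIN infection i ON c.parent = i.code
    )
    SELECT cowId, code FROM Clinical
    WHERE code IN (SELECT code FROM infection)
    ORDER BY cowId
""").fetchall()
print(rows)  # [(10, 11), (12, 42)] -- both infection cases found
```

The querier then no longer needs to remember every specific code, at the price of maintaining the hierarchy table carefully.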
A suitable system of "codes", including a consideration about how "codes" relate to each other, is often called a terminology, a controlled vocabulary, or an ontology.
An ontology reflects the domain that it describes, such as the domain of animal disease symptoms discussed above. One must first decide what parts of reality to model (for instance, that this cow has an infection) and what parts to ignore (such as where the infection is located). Similarly, in a database of clinical observations one must make clear whether one records symptoms (e.g. diarrhea), diagnoses (e.g. enteritis), causes (Salmonella), or all of these. One must also decide how to relate the various parts of reality to each other. For instance, pneumonia is a special case of infection. Moreover, it affects the lungs, which are part of the anatomy. A good domain model should be able to express both forms of hierarchical relationship.
It takes domain experts, technological understanding, and good taste to arrive at adequate domain models that are not too complex.
An example of a well-designed (but complex) domain model is SNOMED/CT, which stands for Systematized Nomenclature of Medicine, Clinical Terms. This is a set of standard terms for use in hospitals, electronic patient records, and so on . There are three components of SNOMED/CT:
- Concepts, used to describe disorders (e.g. 128139000 Inflammatory disorder and 233604007 Pneumonia), procedures (e.g. 11466000 Cesarean section), findings (e.g. 62315008 Diarrhea and 55184003 Infectious enteritis), causative organisms (e.g. 110378009 Salmonella enterica), anatomy, and more.
- Descriptions, used primarily for synonyms, e.g. 497137013 Infective enteritis (synonym for concept 55184003 Infectious enteritis).
- Relationships, used to describe how concepts relate to each other, e.g. Pneumonia IS_A Inflammatory disorder and Pneumonia FINDING SITE Lung structure.
Note how each concept and each description has a unique numeric key. Also note how relationships can be used to relate one concept (pneumonia) both to a disease category and to anatomy, that is, to place the concept in different hierarchies.
SNOMED/CT is maintained by an international organization whose member countries include the United States, United Kingdom, Germany, The Netherlands, Spain, Sweden, Denmark, and many more. In Denmark and most other places, electronic patient records are still based on older and less powerful classification systems, but SNOMED/CT is expected to replace those in the future .
Full SNOMED/CT is very complex, with 311,000 concepts, 800,000 descriptions and 1,360,000 relations as of April 2010. A smaller subset for veterinary use is being maintained by the Veterinary Terminology Services Laboratory at Virginia Tech .
Data stewardship, standards, and sharing
Sometimes a whole discipline manages to agree on an ontology, as in the case of SNOMED/CT. Such standardization requires considerable effort, but also offers huge synergistic benefits, especially when databases are made available to all interested parties in a standard format. For instance, within bioinformatics this has led to tremendous advances in research on animals, microorganisms, plants and medicine. Important steps were the development in the 1980s of standard formats that enable free interchange of DNA sequence data between US, Japanese and European institutions, and the requirement that any sequence data used as the basis for a scientific publication must be published, free of any restrictions on further research, in the joint international databases .
While the development of standard formats and ontologies is important and enables much better utilization of research investments, it looks more like infrastructure development than research, which means that it appears less exciting and that it may be difficult to obtain funding for it. As a consequence, it may be more tempting to propose new organizations, web sites and portals than to lay the foundation for them, which caused a Nature editorial to admonish that "Initiatives for digital research infrastructure should focus more on making standardized data openly available, and less on developing new portals" .
Thanks to lab automation, sensor development and computerized instruments, research produces new data on a scale never seen before. Yet in many cases the required efforts to document, check and preserve all these data lag behind researchers' ability to generate the data in the first place .
This problem is the subject of a report from the US National Academies  on integrity, accessibility and stewardship of digital data, encouraged and sponsored in part by leading journals [12, 13]. The report's three main concerns are integrity of data (preventing accidental or willful tampering), sharing of data (to allow others to check accuracy, verify analyses and build on previous work), and stewardship (long-term preservation) of data. Some of the problems have simple technological solutions; for instance, fingerprinting with cryptographic checksums promotes integrity by proving that data has not been tampered with. For the most part however, solutions are organizational and come down to policies and proper documentation. Neither sharing nor long-term preservation is very useful if there is confusion about the meaning of code 114, or if some recordings in the same column are in kilograms, others in liters.
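The checksum idea can be sketched in a few lines; the record contents below are invented. Publishing or archiving the fingerprint alongside the data lets anyone later recompute it and detect alteration:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """Return a SHA-256 checksum of the data; recomputing it later and
    comparing against the stored value detects any tampering."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical example: fingerprint the raw bytes of a data record
record = b"cowId=10;date=2010-07-30;kg=21.5"
checksum = fingerprint(record)

# Later, anyone holding the data and the published checksum can verify:
assert fingerprint(record) == checksum
```

In practice one would fingerprint whole data files (read in binary mode) and keep the checksums under separate, trusted custody, since a checksum stored next to the data can be altered along with it.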
To give a further flavour of the report, we quote a few of its recommendations:
Recommendation 1: Researchers should design and manage their projects so as to ensure the integrity of research data, adhering to the professional standards [...]
Recommendation 6: In research fields that currently lack standards for sharing research data, such standards should be developed [...]
Recommendation 9: Researchers should establish data management plans at the beginning of each research project that include appropriate provisions for the stewardship of research data.
In short, modeling the domain of one's research and designing a database is only the beginning. Researchers must also consider how to preserve and eventually share raw data to enable replication of experiments and statistical analyses as well as future research that may use the data in unanticipated ways.
This article has been published as part of Acta Veterinaria Scandinavica Volume 53 Supplement 1, 2011: Databases in veterinary medicine: validation, harmonisation and application. Proceedings of the 24th Symposium of the Nordic Committee for Veterinary Scientific Cooperation (NKVet). The full contents of the supplement are available online at http://www.actavetscand.com/supplements/53/S1.
- Codd EF: A Relational Model of Data for Large Shared Data Banks. Communications of the ACM. 1970, 13 (6): 377-387. doi:10.1145/362384.362685
- Churcher C: Beginning Database Design. 2007, Apress
- Snodgrass R: Developing Time-Oriented Database Applications in SQL. 1999, Morgan Kaufmann. Full text at http://www.cs.arizona.edu/people/rts/tdbbook.pdf
- SNOMED Clinical Terms User Guide. 2009, International Health Terminology Standards Development Organisation. At http://www.ihtsdo.org/snomed-ct/
- Lippert S: IT University of Copenhagen, personal communication. 2010
- Wilcke R: Veterinary adaptation of SNOMED-CT. Presentation at Talbot Symposium, AVMA Convention. 2009. At http://snomed.vetmed.vt.edu/
- International Nucleotide Sequence Database Collaboration: The DDBJ/EMBL/GenBank Feature Table, version 8.3. 2010. At http://www.insdc.org/
- Brunak S, Danchin A, Hattori M, Nakamura H, Shinozaki K, Matise T, Preusset D: Nucleotide Sequence Database Policies. Science. 2002, 298: 1333
- Data for the masses. Editorial. Nature. 2009, 457: 129. doi:10.1038/457129a
- Gray J, Liu DT, Nieto-Santisteban M, Szalay AS: Scientific Data Management in the Coming Decade. Microsoft Research Technical Report MSR-TR-2005-10. 2005. At http://arxiv.org/pdf/cs/0502008
- National Academy of Sciences: Ensuring the integrity, accessibility, and stewardship of research data in the digital age. 2009, National Academies Press
- Information overload. Editorial. Nature. 2009, 460: 551. doi:10.1038/460551a
- Kleppner D, Sharp PA: Research Data in the Digital Age. Editorial. Science. 2009, 325: 368. doi:10.1126/science.1178927
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.