I’m frequently asked the question in the title of this post, usually in one of the following contexts:
- People have used one of these terms in various circumstances to mean different things
- People have pulled together a list or model of terms and have named it one of these things and are seeking to justify their choice of name
- People are truly confused by the difference and are asking because they just want to be told which one fits their use cases
- People who enjoy grammar pedantry and have nothing better to do
Before I answer this question, a few points:
- These are the definitions that I have used in data architecture – yes, I know there are other ways to define these things but this is how I will define them within this site
- Precision of meaning is the important thing here – terms defined by these artifacts are the pieces of intellectual property underlying what you’re building or defining
- When we say ‘use’ one of these things, we mean applying terms to physically implemented things, eg messages, databases, queries, etc as defined in these artifacts
- If a term has not been signed off by a business owner, it is open to reasonable challenges but if it has been signed off, the owner can decide if they want to entertain challenges or not
- There’s no such thing as a ‘complete’ dictionary; terms will be added or retired just as they are in any living language
- Maturity of terms is a factor – things get drafted and go through several iterations before they’re fit for purpose (just like every language does, as you may have guessed)
Ok, so let’s get started.
Glossary
What is a glossary?
A glossary, also known as a vocabulary, is an alphabetical list of terms in a particular domain of knowledge with the definitions for those terms. Traditionally, a glossary appears at the end of a paper or a book and includes terms within the text that are either newly introduced, uncommon, or specialized. While glossaries are often associated with educational or non-fiction books, in some cases fictional novels may come with a glossary for unfamiliar terms (I’m sure there’s a glossary in Lord of the Rings or other books with an invented language; very similar thing). A Data Glossary is intentionally structure-agnostic and relatively unconstrained in the terms it includes (i.e. if the business wishes this to be a commonly understood term, then include it). It’s intentionally free of any organization of content, so by its very nature it is faster to produce and wider reaching than most other deliverables produced by Data Architecture.
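To make that concrete, here is a rough sketch of my own (purely illustrative; the terms, definitions and function names are invented, and any real glossary would live in a governed tool rather than a script) showing how minimal a glossary really is: just terms and their agreed definitions, presented alphabetically.

```python
# Minimal sketch of a glossary: terms and agreed definitions only,
# deliberately free of any structural metadata. All content is illustrative.
glossary = {
    "Counterparty": "A party to a financial transaction other than ourselves.",
    "Account": "An arrangement under which we hold funds or positions for a party.",
}

def add_term(term: str, definition: str) -> None:
    """Add or update a term; a real tool would also track versions and sign-off."""
    glossary[term] = definition

# Traditionally presented alphabetically, like the back of a book.
for term in sorted(glossary):
    print(f"{term}: {glossary[term]}")
```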
Why do we need one?
The objective of a glossary is to articulate the definitions of terms that define the language of whatever task it is we’re working on, be it a process, an entity, etc. It’s a useful artifact for people who have no interest in the anatomy of what something is or where it sits within the wider Data Architecture, but simply wish to know that the term exists and is defined. We frequently have issues where the same name is used to mean different things, so having an agreed meaning for a term reduces the misunderstandings this causes. Lots of global organizations face challenges when they conduct business in one language, say English, and employ many people for whom this is not a native tongue. Because English can differ between locations (eg USA vs UK vs Australia), it is important to a) have a list of all the agreed terms in English that we have designated as the ones we want to use globally, and b) have those terms linked to terms in the local language where a local regulator or local business requirements necessitate articulating the corresponding local term. Don’t assume that English (or Spanish or whatever the local language is) is all the same.
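One way to capture that global-to-local linkage, sketched in Python with invented field names (this is not prescribing any particular tool or schema), is to let each entry carry the agreed global term plus any local-language terms that have been defined:

```python
# Sketch: a glossary entry carrying the agreed global (English) term plus
# links to local-language terms where regulators or local business need them.
# The structure, field names and translations are illustrative assumptions.
glossary_entry = {
    "term": "Counterparty",
    "definition": "A party to a financial transaction other than ourselves.",
    "local_terms": {
        "fr": "Contrepartie",
        "de": "Gegenpartei",
    },
}

def local_term(entry: dict, locale: str) -> str:
    """Fall back to the agreed global term if no local term has been defined."""
    return entry["local_terms"].get(locale, entry["term"])

print(local_term(glossary_entry, "fr"))  # Contrepartie
print(local_term(glossary_entry, "es"))  # Counterparty (no Spanish term defined yet)
```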
How many do we need and how do we control them?
Any institution or organization should have only one enterprise glossary (the clue is in the title!). The terms and their definitions that make up an enterprise-wide glossary should be mastered in a tool for this purpose (some people use commercial tools, some use hand-coded ones; it doesn’t matter as long as you have something that tracks changes and manages published versions). There are different schools of thought on how to build up a glossary. You can crowdsource it and have everyone enter their own terms. You can have a central team define all the terms and publish them. You can purchase a data model and use those terms across your enterprise. All of these have pros and cons, and the best thing to do is to find the method that works for your organization. You will need some owners, some stewards and some governance, so think about how you want to manage changes.
Dictionary
What is a dictionary?
A Data Dictionary is an extract of structured data elements and their metadata, taken from a given Data Model or Data Architecture scope. A Data Dictionary is essentially a by-product of the data modelling process, and can be thought of as a data model in narrative form. It will include semantic name and definition detail like a Data Glossary, but will also include other structural characteristics over and above this, e.g. of what other elements it is a composite part (i.e. attribute x belongs to entity y), the length, data type, or allowable value constraints that dictate valid content, or the relationships that exist between given dictionary elements.
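As an illustration only (the field names below are my assumptions, not a standard), a dictionary entry might look something like this: the glossary’s name and definition plus the structural metadata produced by data modelling.

```python
# Sketch of a data dictionary entry; all names and metadata are illustrative.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DictionaryEntry:
    name: str                   # semantic name, as it appears in the glossary
    definition: str
    entity: str                 # of what other element is it a composite part
    data_type: str
    length: Optional[int] = None
    allowed_values: list[str] = field(default_factory=list)
    related_to: list[str] = field(default_factory=list)  # links to other entries

account_id = DictionaryEntry(
    name="Account ID",
    definition="Unique identifier for an account.",
    entity="Account",
    data_type="CHAR",
    length=12,
)
print(account_id)
```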
Why do we need one?
The objective of a dictionary is to extend the glossary with additional information and to add other items that are not business terms. If the glossary is the ‘business relevant’ or ‘human readable’ articulation of our data, then the dictionary is the ‘technology relevant’ or ‘machine readable’ articulation of our data. The dictionary bridges the gap between the content as it’s defined and how that content is articulated in an implementation, be it a message, data flow, data structure, file, table, etc. When you want to know if ‘Account ID’ is a character field or a numerical field, you look it up in the data dictionary.
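Continuing the illustration, with invented names and metadata values, that ‘look it up’ step really is just a lookup keyed by the semantic name:

```python
# Sketch: the data dictionary as a lookup keyed by semantic name, so the
# question "is Account ID a character or a numeric field?" becomes a lookup.
# Names and metadata values are illustrative assumptions.
data_dictionary = {
    "Account ID": {"data_type": "CHAR", "length": 12, "entity": "Account"},
    "Account Balance": {"data_type": "DECIMAL", "length": 18, "entity": "Account"},
}

entry = data_dictionary["Account ID"]
print(entry["data_type"], entry["length"])  # CHAR 12
```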
How many do we need and how do we control them?
Since the dictionary is built from the glossary, there will only be one enterprise-wide dictionary. We typically partition dictionaries down from enterprise-wide to a smaller set of attributes that cover the context/scope of the content specific to an implementation. The semantic/logical terms in the dictionary will be the same, but the physical names might vary depending on the limitations of the technologies they’re implemented in. For example, we might have limits on physical name length or the level of verbosity permitted within a protocol or message type (older relational databases that capped column names at around 30 characters are one example), but the requirement is that all physical attributes have a dictionary entry. If they are attributes that exist to support a business requirement (eg counterparty name), they will link back to a logical attribute (and probably a term). If they are physical-only attributes (eg last_update_time, or prev_cpty_id and current_cpty_id) then they might only exist in the dictionary.
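A small sketch of that linkage, again with invented names and structure, just to show the distinction between business-driven and physical-only attributes:

```python
# Sketch: every physical attribute gets a dictionary entry; business-driven
# ones link back to a logical attribute (and usually a glossary term), while
# physical-only attributes (audit columns, surrogate keys) may not.
# Field names and values are illustrative assumptions.
physical_attributes = [
    {"physical_name": "cpty_nm", "logical_attribute": "Counterparty Name", "term": "Counterparty"},
    {"physical_name": "last_update_time", "logical_attribute": None, "term": None},
]

for attr in physical_attributes:
    linkage = attr["logical_attribute"] or "physical-only (dictionary entry only)"
    print(f'{attr["physical_name"]}: {linkage}')
```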
Taxonomy
What is a taxonomy?
Taxonomy is the practice and science of classification. A taxonomy, or taxonomic scheme, is a particular classification. The word finds its roots in the Greek τάξις, taxis (meaning ‘order’, ‘arrangement’) and νόμος, nomos (‘law’ or ‘science’). Originally taxonomy referred only to the classifying of organisms or a particular classification of organisms. In a wider, more general sense, it may refer to a classification of things or concepts, as well as to the principles underlying such a classification. Specifically, a Data Taxonomy is a defined classification of terms, organized hierarchically into as many levels of category and sub-category as required, to serve a given purpose.
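To illustrate, here is a toy taxonomy sketched in Python; the categories are invented and the nested-dictionary structure is just one possible representation, not a recommendation for how to master one.

```python
# Sketch of a data taxonomy: categories and sub-categories nested to whatever
# depth the purpose requires. Category names are illustrative only.
taxonomy = {
    "Party": {
        "Individual": {},
        "Organisation": {
            "Counterparty": {},
            "Vendor": {},
        },
    },
}

def path_to(node: dict, target: str, path: tuple = ()) -> tuple | None:
    """Return the classification path to a category, i.e. where it sits in the hierarchy."""
    for name, children in node.items():
        if name == target:
            return path + (name,)
        found = path_to(children, target, path + (name,))
        if found:
            return found
    return None

print(" > ".join(path_to(taxonomy, "Counterparty")))  # Party > Organisation > Counterparty
```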
Why do we need one?
In the most basic sense, taxonomies organize things into a system or hierarchy that makes it easier to find those things. The most famous taxonomy is the one designed by Swedish botanist Carolus Linnaeus in the 18th century, now known as the ‘Linnaean system’ of biological classification, using the familiar binomial nomenclature of species identification (hello fellow Homo sapiens). In the simplest terms, we need to know the classifications of things so we can decide what we’re looking for and be able to ask others for those same things. If we continued to name things each as a distinct instance, we wouldn’t have Homo sapiens, we’d have a set of beings each known by a distinct word (eg Shannon or John Smith or Superman) and lots of those words would repeat (eg how many John Smiths would we have?). What makes taxonomies challenging is that each project or program or regulation seems to start from a blank sheet of paper each time they try to classify things. What people may be unaware of is that there are standard taxonomies published by industry bodies, ISO and other standards providers that can be used instead of inventing their own.
Ontology
What is an ontology?
An ontology is a formal naming and definition of the types, properties and interrelationships of the entities that fundamentally exist for a particular domain of discourse. It is a practical application of philosophical ontology, with a taxonomy. An ontology compartmentalizes the variables needed for some set of computations and establishes the relationships between them. In the semantic web, ontologies can then be used for inference. From the EDM Council: “The primary application of the business conceptual ontology is for data harmonization and for the unambiguous sharing of meaning across data repositories. This common language (or Rosetta stone) for the financial industry supports business process automation and facilitates risk analysis.”
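As a rough illustration using the open-source rdflib library (the namespace, classes and properties below are invented for the sketch; a real financial ontology such as the EDM Council’s is far richer), a tiny ontology might be expressed like this:

```python
# Sketch of a tiny ontology using rdflib (pip install rdflib). The namespace,
# classes, properties and instance data are all illustrative assumptions.
from rdflib import Graph, Namespace, RDF, RDFS, Literal

EX = Namespace("http://example.org/ontology#")
g = Graph()

# Types and their interrelationships
g.add((EX.Counterparty, RDF.type, RDFS.Class))
g.add((EX.LegalEntity, RDF.type, RDFS.Class))
g.add((EX.Counterparty, RDFS.subClassOf, EX.LegalEntity))
g.add((EX.hasLegalName, RDF.type, RDF.Property))
g.add((EX.hasLegalName, RDFS.domain, EX.LegalEntity))
g.add((EX.hasLegalName, RDFS.range, RDFS.Literal))

# An instance described in terms of the ontology
g.add((EX.acme, RDF.type, EX.Counterparty))
g.add((EX.acme, EX.hasLegalName, Literal("ACME Trading Ltd")))

print(g.serialize(format="turtle"))
```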
Why do we need one?
We need an ontology to bridge the gap between interpreting data that is easily machine readable (think tabular data) and interpreting data that is textual (think web pages). We also use it for inference, meaning that if something has a bill, webbed feet, feathers and quacks, we can infer from those characteristics that it’s a duck. An ontology makes any content you have (tables, documents, messages, logfiles) readable by a machine, but more importantly it makes that content interpretable by a machine without additional programming (think machine learning).
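In a semantic web stack this kind of rule would normally be expressed in OWL and handed to a reasoner; the plain-Python sketch below (with invented rules and property names) just shows the idea of classifying something by its characteristics rather than by its name.

```python
# Sketch of characteristic-based inference ("if it quacks..."): classify a
# thing by the properties it exhibits, not by what it's called.
# The rules and property names are illustrative only.
RULES = {
    "Duck": {"has bill", "has webbed feet", "has feathers", "quacks"},
}

def infer_types(observed: set[str]) -> list[str]:
    """Return every class whose defining characteristics are all present."""
    return [cls for cls, required in RULES.items() if required <= observed]

thing = {"has bill", "has webbed feet", "has feathers", "quacks", "swims"}
print(infer_types(thing))  # ['Duck']
```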
Once we’ve linked these data elements to the common language or ontology, we can use inference to determine what things are. The advantage of inference is that the thing is defined by its characteristics, not what it’s called or what structure someone’s bent it into to get it into a system. Inference is a key capability that powers linked data – once we know what things are and what they’re called, we can link things together to form a living picture of our landscape. This keeps all this information live and fresh; no looking at mapping diagrams or data flows that are so out of date no one knows by how much. The ontology becomes the lingua franca, the vernacular, the common set of words by which we understand what our data means. Technologies like graph databases are excellent for ontologies and inference, as they use the language of things to link them together instead of pointers that the underlying system has fixed (eg like a relational database or row-based structure).
How many do we need and how do we control them?
Again, there should be one enterprise-level ontology, as by definition it’s the common language for the organization. We will control changes through our change control process and in sync with our other artifacts. Mastering it is a bit tricky as there are not a lot of tools that support ontologies, but the best ones I’ve used are TopBraid and Protégé (which is open source).
Here is a paper that has a similar comparison between these objects. Here is a paper that includes thesauri and knowledge graphs in the comparison.