What’s the Difference: Glossary, Dictionary, Taxonomy, Ontology?

I’m frequently asked the question in the title of this post, usually in one of the following contexts:

  1. People have used one of these terms in various circumstances to mean different things
  2. People have pulled together a list or model of terms and have named it one of these things and are seeking to justify their choice of name
  3. People are truly confused by the difference and are asking because they just want to be told which one fits their use cases
  4. People enjoy grammar pedantry and have nothing better to do

Before I answer this question, a few points:

  • These are the definitions that I have used in data architecture – yes, I know there are other ways to define these things, but this is how I will define them within this site
  • Precision of meaning is the important thing here – terms defined by these artifacts are the pieces of intellectual property underlying what you’re building or defining
  • When we say ‘use’ one of these things, we mean applying terms to physically implemented things, eg messages, databases, queries, etc as defined in these artifacts
  • If a term has not been signed off by a business owner, it is open to reasonable challenges but if it has been signed off, the owner can decide if they want to entertain challenges or not
  • There’s no such thing as a ‘complete’ dictionary; terms will be added or retired just as they are in any living language
  • Maturity of terms is a factor – things get drafted and go through several iterations before they’re fit for purpose (just like every language does, as you may have guessed)

Ok, so let’s get started.


What is a glossary?

A glossary, also known as a vocabulary, is an alphabetical list of terms in a particular domain of knowledge with the definitions for those terms. Traditionally, a glossary appears at the end of a paper or a book and includes terms within the text that are either newly introduced, uncommon, or specialized. While glossaries are often associated with educational or non-fiction books, in some cases fictional novels may come with a glossary for unfamiliar terms (I’m sure there’s a glossary in Lord of the Rings or other books with an invented language; very similar thing). A Data Glossary is intentionally structure-agnostic and relatively unconstrained in the terms it includes (i.e. if the business wishes this to be a commonly understood term, then include it). It’s intentionally free of any organization of content, so, by its very nature, it is faster to produce and wider reaching than most other Data Architecture deliverables.
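As a toy sketch (the terms and definitions here are invented for illustration, not drawn from any real glossary), a glossary is little more than an alphabetized mapping of terms to their agreed definitions:

```python
# A minimal sketch of a glossary: terms and their agreed definitions,
# with no structure beyond the terms themselves. Hypothetical content.
glossary = {
    "Trade": "An agreement between two parties to exchange assets.",
    "Account": "An arrangement under which a customer holds funds or positions.",
    "Counterparty": "The other party participating in a financial transaction.",
}

# A glossary is traditionally presented alphabetically:
for term in sorted(glossary):
    print(f"{term}: {glossary[term]}")
```

Note there is no hierarchy, no data types, no relationships: just names and meanings, which is exactly why it can be produced quickly.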

Why do we need one?

The objective of a glossary is to articulate the definitions of terms that define the language of whatever task it is we’re working on, be it a process, an entity, etc. It’s a useful artefact for people who have no interest in the anatomy of what something is or where it sits within the wider Data Architecture, but simply wish to know that the term exists and is defined. We frequently have issues where the same name is used to mean different things, so having an agreed meaning for a term reduces the problems in understanding that this causes. Lots of global organizations face challenges when they conduct business in one language, say English, and employ many people for whom this is not a native tongue. Because English can differ between locations (eg USA vs UK vs Australia), it is important to a) have a list of all the agreed terms in English that we have designated as the ones we want to use globally, and b) have those terms linked to terms in the local language where a local regulator or local business requirements necessitate articulating the corresponding local term. Don’t assume that all English (or Spanish or whatever the local language is) is the same.

How many do we need and how do we control them?

Any institution or organization should have only one enterprise glossary (the clue is in the title!). The terms and their definitions that make up an enterprise-wide glossary should be mastered in a tool for this purpose (some people use commercial tools, some use hand-coded tools; it doesn’t matter as long as you have something that tracks changes and manages published versions). There are different schools of thought on how to build up a glossary. You can crowdsource it and have everyone enter their own terms. You can have a central team define all the terms and publish them. You can purchase a data model and use those terms across your enterprise. All of these have pros and cons, and the best thing to do is to find the method that works for your organization. You will need some owners, some stewards and some governance, so think about how you want to manage changes.


What is a dictionary?

A Data Dictionary is an extract of structured data elements and their metadata, taken from a given Data Model or Data Architecture scope. A Data Dictionary is essentially a by-product of the data modelling process, and can be thought of as a data model in narrative form. It will include semantic name and definition detail like a Data Glossary, but will also include other structural characteristics over and above this, e.g. of which other elements it is a composite part (i.e. attribute x belongs to entity y), the length, data type, or allowable value constraints that dictate valid content, or the relationships that exist between given dictionary elements.

Why do we need one?

The objective of a dictionary is to extend the glossary with additional information and to add other items that are not business terms. If the glossary is the ‘business relevant’ or ‘human readable’ articulation of our data, then the dictionary is the ‘technology relevant’ or ‘machine readable’ articulation of our data. The dictionary bridges the gap between the content as it’s defined and how that content is articulated in an implementation, be it a message, data flow, data structure, file, table, etc. When you want to know if ‘Account ID’ is a character field or a numerical field, you look it up in the data dictionary.
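To make the distinction concrete, here is a hypothetical sketch in Python: a dictionary entry carries the semantic name and definition (as the glossary does) plus the structural metadata (entity, data type, length) that answers implementation questions. The field names and values are my own illustrative assumptions, not a standard.

```python
from dataclasses import dataclass, field
from typing import Optional

# A sketch of a data dictionary entry: semantic detail shared with the
# glossary, plus structural characteristics from the data model.
@dataclass
class DictionaryEntry:
    name: str                      # semantic/logical name
    definition: str                # the definition, as in the glossary
    entity: str                    # the entity this attribute belongs to
    data_type: str                 # e.g. 'VARCHAR', 'DECIMAL'
    length: Optional[int] = None   # maximum length, where applicable
    allowed_values: list = field(default_factory=list)

account_id = DictionaryEntry(
    name="Account ID",
    definition="The unique identifier assigned to an account.",
    entity="Account",
    data_type="VARCHAR",
    length=20,
)

# "Is Account ID a character field or a numerical field?" becomes a lookup:
print(account_id.data_type)  # VARCHAR
```

The glossary would hold only `name` and `definition`; everything else is the dictionary’s added structural metadata.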

How many do we need and how do we control them?

Since the dictionary is built from the glossary, there will only be one enterprise-wide dictionary. We typically partition dictionaries down from enterprise-wide to a smaller set of attributes that cover the context/scope of the content specific to an implementation. The semantic/logical terms in the dictionary will be the same, but the physical names might vary depending on the limitations of the technologies they’re implemented in. For example, we might have limitations on physical names or levels of verbosity permitted within a protocol or message type (I can’t think of an example but I’m sure there will be some), but the requirement is that all physical attributes have a dictionary entry. If they are attributes that exist to support a business requirement (eg counterparty name), they will link back to a logical attribute (and probably a term). If they are a physical-only attribute (eg last_update_time or prev_cpty_id and current_cpty_id), then they might only exist in the dictionary.


What is a taxonomy?

Taxonomy is the practice and science of classification. A taxonomy, or taxonomic scheme, is a particular classification. The word finds its roots in the Greek τάξις, taxis (meaning ‘order’, ‘arrangement’) and νόμος, nomos (‘law’ or ‘science’). Originally taxonomy referred only to the classifying of organisms or a particular classification of organisms. In a wider, more general sense, it may refer to a classification of things or concepts, as well as to the principles underlying such a classification. Specifically, a Data Taxonomy is a defined classification of terms, organized hierarchically into any number of levels of category and sub-category as required, to serve a given purpose.

Why do we need one?

In the most basic sense, taxonomies organize things into a system or hierarchy that makes it easier to find those things. The most famous taxonomy is the one designed by Swedish botanist Carolus Linnaeus in the 18th century, known now as the ‘Linnaean system’ of biological classification, using the familiar binomial nomenclature of species identification (hello fellow Homo sapiens). In the simplest terms, we need to know the classifications of things so we can decide what we’re looking for and be able to ask others for those same things. If we continued to name things each as a distinct instance, we wouldn’t have Homo sapiens, we’d have a set of beings each known by a distinct word (eg Shannon or John Smith or Superman) and lots of those words would repeat (eg how many John Smiths would we have?). What makes taxonomies challenging is that each project or program or regulation seems to start from a blank sheet of paper each time they try to classify things. What people may be unaware of is that there are standard taxonomies published via industry bodies, ISO and other standards providers that can be used instead of inventing their own.
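As a toy illustration of how a taxonomy works, here is a tiny (and much simplified) fragment of the Linnaean hierarchy in Python, with a helper that recovers the chain of categories leading to a term:

```python
# A toy taxonomy: categories and sub-categories arranged hierarchically.
# This fragment of the Linnaean classification is simplified for illustration.
taxonomy = {
    "Animalia": {
        "Chordata": {
            "Mammalia": {
                "Primates": {"Hominidae": {"Homo": ["Homo sapiens"]}},
                "Carnivora": {"Canidae": {"Canis": ["Canis lupus"]}},
            }
        }
    }
}

def path_to(node, target, trail=()):
    """Return the chain of categories leading to a given leaf term."""
    if isinstance(node, list):
        return trail if target in node else None
    for name, child in node.items():
        found = path_to(child, target, trail + (name,))
        if found:
            return found
    return None

print(" > ".join(path_to(taxonomy, "Homo sapiens")))
# Animalia > Chordata > Mammalia > Primates > Hominidae > Homo
```

The value is exactly that chain: two people who share the taxonomy can locate and request the same thing without ever having named the individual instance.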


What is an ontology?

An ontology is a formal naming and definition of the types, properties and interrelationships of the entities that fundamentally exist for a particular domain of discourse. It is a practical application of philosophical ontology, combined with a taxonomy. An ontology compartmentalizes the variables needed for some set of computations and establishes the relationships between them. In the semantic web, ontologies can then be used for inference. From the EDM Council: “The primary application of the business conceptual ontology is for data harmonization and for the unambiguous sharing of meaning across data repositories. This common language (or Rosetta stone) for the financial industry supports business process automation and facilitates risk analysis.”

Why do we need one?

We need an ontology to bridge the gap between interpreting data that is easily machine readable (think tabular data) and interpreting data that is textual (think web pages). We also use it for inference, meaning if something has a bill, webbed feet, feathers and quacks, we can infer from those characteristics that it’s a duck. An ontology makes any content you have (tables, documents, messages, logfiles) readable by a machine and, more importantly, interpretable by a machine without additional programming (think machine learning).
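The duck example can be sketched in a few lines of plain Python. This is a deliberately naive stand-in for real ontology-based inference (the rule set is invented for illustration), but it shows the principle: classification by characteristics rather than by name.

```python
# A toy sketch of inference by characteristics: a thing is classified by
# what it has and does, not by what it is called. Invented rule set.
rules = {
    "Duck": {"bill", "webbed feet", "feathers", "quacks"},
    "Dog": {"fur", "four legs", "barks"},
}

def infer(characteristics):
    """Return every class whose defining characteristics are all present."""
    observed = set(characteristics)
    return [cls for cls, required in rules.items() if required <= observed]

observed = {"bill", "webbed feet", "feathers", "quacks", "swims"}
print(infer(observed))  # ['Duck']
```

A real reasoner (e.g. over OWL) is far more sophisticated, but the payoff is the same: the thing is recognized from its properties, whatever someone happened to call it.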

Once we’ve linked these data elements to the common language or ontology, we can use inference to determine what things are. The advantage of inference is that the thing is defined by its characteristics, not what it’s called or what structure someone’s bent it into to get it into a system. Inference is a key capability that powers linked data – once we know what things are and what they’re called, we can link things together to form a living picture of our landscape. This keeps all this information live and fresh; no looking at mapping diagrams or data flows that are so out of date no one knows by how much. The ontology becomes the lingua franca, the vernacular, the common set of words by which we understand what our data means. Technologies like graph databases are excellent for ontologies and inference, as they use the language of things to link them together instead of pointers that the underlying system has fixed (eg like a relational database or row-based structure).

How many do we need and how do we control them?

Again, there should be one enterprise-level ontology, as by definition it’s the common language for the organization. We will control changes through our change control process and in sync with our other artifacts. Mastering it is a bit tricky as there are not a lot of tools that support ontologies, but the best ones I’ve used are TopBraid and Protégé (which is open source).

Here is a paper that has a similar comparison between these objects. Here is a paper that includes thesauri and knowledge graphs in the comparison.

The List – Topics To Be Covered

Whilst looking at ontologies, data and how to make data a ‘code-able’ thing, I jumped in and started by reading a book on provenance (more on that later). Reading it was more of a challenge than I had originally anticipated (provenance? how hard can it be?) and it highlighted a list of things that I will also need to learn in order to make use of any of the technologies or concepts I’m interested in learning and applying.

Here’s a list of topics that I’m starting with and I’ll keep this updated as I go.

  • The Semantic Web – cited in this paper here by authors that include Tim Berners-Lee. It’s the idea that data creates a ‘web of knowledge’ and it is the thing that I’m trying to learn how to ‘do’.
  • Ontology – tellingly there are two entries for ontology in wikipedia: this one for the nature of being, as in philosophy and this one for ontologies for information science. While I would like to spend my time looking into both, I’ll concentrate more on the latter here.
  • OWL or Web Ontology Language (more on the misleading acronym later) is how ontologies are expressed.
  • RDF or Resource Description Framework and defined here is the W3C recommendation for semantic web data models. The Schema definition for RDF is called RDFS and is defined here.
  • I’ll need to refer to our old friends XML and XSDs here too. I haven’t done a lot of defining or designing in XML but it seems handy to use for some of the enumerated data formats. There might need to be some HTML and XHTML that needs to be referenced but we’ll cross that bridge when we come to it.
  • SKOS (Simple Knowledge Organization System) is a new and interesting thing I’ve come across. It’s used as a format for the FIBO vocabulary. I’ll see if it’s just for FIBO or if there are other vocabularies that use it.
  • Things like SPARQL will also be important and I haven’t tried working with it yet.
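Several of the items above (RDF, SPARQL, graph databases later on) revolve around the same core idea: facts stored as subject-predicate-object triples that can be pattern-matched. Here is a toy, stdlib-only Python sketch of that idea; the data and predicate names are invented, and real work would use a proper RDF library and SPARQL engine.

```python
# A toy triple store: facts as (subject, predicate, object) triples, the
# shape RDF uses, with simple pattern matching standing in for SPARQL.
triples = [
    ("AccountID", "isAttributeOf", "Account"),
    ("Account", "isHeldBy", "Counterparty"),
    ("Counterparty", "hasName", "CounterpartyName"),
]

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

# 'What is Account linked to?' -- roughly SELECT ?p ?o WHERE { :Account ?p ?o }
print(query(s="Account"))
```

The point of the sketch is only the shape of the data: everything is a triple, and queries are patterns with holes in them.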

Tools, software, etc:

  • Ontology tools: Protege, TopBraid
  • Data storage technologies which make a long list to be explored later
  • Data science-y things like R and Python
  • Wolfram Alpha and Mathematica look interesting
  • SQL PowerArchitect looks like it could be a useful data modelling tool

and possibly many others…

The Basics – Principles and Guidelines

Before I got too far into posting things and writing things, I wanted to lay out the boundaries and parameters under which things would appear on this blog. I didn’t want to assume that we were all thinking the same things (of course that’s what you were thinking!). There are principles and guidelines I wanted to follow when posting and writing, and I wanted those to be very clear to any reader. A list of assumptions or parameters written down is always good when starting any type of experiment.

Some points to remember when reading the posts under ‘The Basics’:

  1. The definitions of ‘The Basics’ are meant in the context of this site. I could have just redirected you to wikipedia or wherever but I defined things here so I can include an interpretation as well as how things link to other things (the context of which may not be relevant for wikipedia and/or others).
  2. Sometimes definitions and examples are both art and science. As you know with data models or any other types of design, there are many ways to develop something to achieve the same end result. Anything I design or define here is the way of doing it that worked for me. I welcome other ideas as someone else may have seen something that I did not see and therefore come up with a better process or result.
  3. Open source is the preferred option. We all know why so let’s just not be proprietary about it, OK?
  4. I reserve the right to change my mind about assumptions and opinions. Facts will remain facts (preferably with citations, wikipedia-style). Let’s try and keep it scientific method here people.
  5. Share and share alike, credit where credit is due. I will be open about where I found something, who I heard something from and which thoughts I believe to be my own. I will also update posts when I find that I haven’t given proper credit, because that’s how things should be.

All appropriate caveats and exemptions apply, so let’s get on with it.

Standard form of contract 🙂


The Experiment – What are we trying to prove?

What am I trying to learn here? I don’t want this to be a fishing expedition with no results or just another blog of data tourism. We know that in order to learn any practical skill we must start as we mean to continue, or at least just start.

I’ve done some thinking about what is the goal I’m trying to achieve. What will I have when I’m finished? How will I know if I’ve succeeded or failed? I want to state a direction but I also want to be flexible enough to shift if I find that an assumption I’ve made at the beginning is wrong or not where I want to go. I want to keep track of all the things I had to do and learn to go from what I am now (i.e. a data architect who knows how to design data structures – glossaries, protocols, data models, databases, ETL functions) to someone who knows how to ‘codify’ data in the semantic web (i.e. how data is an object that is a thing in its own right instead of things that are held by other things).

If I restate that thinking as the question, then:

How can a person who’s learned traditional data architecture (defined as third normal form models, interface file formats, relational database tables, ETL, mapping, etc) achieve expertise in new semantic web technologies including ontologies, inference and data codification? What does that person need to learn, what are the steps, what are the languages and tools?

I know I could take a data science course or a programming course but I think it’s more than that. There’s a psychological shift that I need to make from thinking about data being decanted into containers to data being a thing or an object on its own. I’d like to see how inference works in a practical sense, do it for myself and understand the practical way to get it going. The end result might be more of a continuum of skills rather than a thing that I’ve built, but I suspect I’ll have to build something to get skilled in the technologies that I want to investigate.

The audience for my experiment would be people who have an interest in data and the future of data but don’t know where to start or what to do next (in a sense, the audience is me).

I welcome comments and suggestions. Starting off might be slow as I work out what I need to look at and then try to catch up to all the people who’ve been working on this for years before me. I want to make this a practical how-to so I’ll have to hit all the basics before I get fancy. Time to get started!

hello world

I am a data person. I love knowing the meaning of things and how they fit together. One of my favourite things to do in school was to diagram sentences to show how words form communication. Data is not just models and diagrams of structure, but also the meaning of the content within those structures. Containers for data do not in themselves mean anything. Remember the old days when switching phones meant re-entering all your contacts? So what if your new shiny phone could hold more entries, you’d still have to get them all in there somehow. What’s more important: more memory or more things to remember?

Obviously data itself changes all the time. For years data was incidental to the function and form of applications and software. Hardware was built to process it faster and hold more of it but no one really discussed what ‘it’ represented. Data was not the important thing on its own. Software and hardware were the important things because you can buy and sell them, less so with data. Now data is if not THE thing, at least it’s held in much higher regard than it was previously.  What’s noticeable now is the huge shift in how we talk about data, how we treat it and what value it has to us.

What I want to learn is how do we shift from focusing on the containers that hold data and the wires that move it around to looking at the data itself as part of the code, part of the thing that makes up our universe of information. Tim Berners-Lee’s TED talk about the Semantic Web talks about how linking things to other things builds the Semantic Web. Making data an object in that universe is key to unlocking what data can do for us.

I don’t know how this will turn out. I don’t know where it will take me. I chose ‘hello world’ for the title of this post because anyone and everyone who’s programmed anything has written their first program to print ‘hello world’ on screen or wherever. This was true a few months ago when I was learning python – yes, python! the cool new language! We wrote ‘hello world’ and then proceeded to do some data science-y things that I can’t remember now as I’ve not touched it for 3 months, but there we are.

So here is my inaugural ‘hello world’ post, like so many programmers and bloggers before me. I hope you’ll find it as interesting as I do.

Creative Destruction

I started this site out with noble intentions and it’s sat here empty for months. I let myself believe I have been too busy living my life and keeping my head above water to do anything with it until this week. This week I was slapped in the face with a hard lesson about being complacent, trusting others and believing that I could exert some sort of control over my future. It was a very painful experience.

I am a movie person, which to me means that when confronted with various events in life I replay a clip in my head of something that reminds me of what I’m experiencing at that time. This clip from that old favourite The Fifth Element, up until the 1:07 mark, was how I felt that slap went this week: Zorg vs Bastiat.  I didn’t get Bastiat’s chance to explain how I felt about this particular theory, as I had to run away before I started sobbing on what felt like one of the worst days of my career.

So what am I going to do about it? Complain here on my blog? Well yes, that’s what I’m doing now but what else? First thing I did was try to get it into perspective. Yes, I was being told I had failed and yes, some of that was in my control but the majority of it wasn’t and I can’t blame myself for that. People made their choices, I wasn’t one of them, fine, let’s move on. Do I feel crushed and hurt and utterly devastated? Yes. Does anyone but me care? No.

After letting myself feel bad about it I realised that in my head I’d thought of all these things to do that would have made a difference but I had done none of them. I’ve read all those articles about what successful people do, blah, blah, blah and all of that is very interesting but none of it matters unless you do it. I haven’t done anything about it because I couldn’t rouse myself from a paralysis that has slowly crept up on me these past few years. I knew it was happening and I could feel it progressing, freezing me in place but I felt helpless to stop it. Inertia has seeped into my very bones and I find it so hard to start anything, let alone do or finish anything. This week’s slap was the hardest one, but I have been getting three or four of those a year and I fear making any moves because of it. My defence mechanism against feeling the pain must have manifested as paralytic inertia. In the future if I’m going to get slapped for doing nothing, it must be more productive to get slapped for doing something, and hopefully less painful.

So if I want to make something of myself and find something interesting to do, I will have to change and take risks. Now that I know how it feels to hit the bottom I’m sure that any risk-taking I decide to do can’t feel as bad as this does.