Whilst looking at ontologies, data and how to make data a ‘code-able’ thing, I jumped in and started by reading a book on provenance (more on that later). It was more of a challenge than I had anticipated (provenance? how hard can it be?) and it highlighted a list of things I will also need to learn in order to make use of any of the technologies or concepts I’m interested in learning and applying.
Here’s a list of topics that I’m starting with and I’ll keep this updated as I go.
- The Semantic Web – cited in this paper here by authors that include Tim Berners-Lee. It’s the idea that data creates a ‘web of knowledge’ and it is the thing that I’m trying to learn how to ‘do’.
- Ontology – tellingly, there are two entries for ontology in Wikipedia: this one for the nature of being, as in philosophy, and this one for ontologies in information science. While I would like to spend my time looking into both, I’ll concentrate more on the latter here.
- OWL or Web Ontology Language (more on the misleading acronym later) is a W3C language for expressing ontologies.
- RDF or Resource Description Framework, defined here, is the W3C recommendation for semantic web data models. The schema language for RDF is called RDFS (RDF Schema) and is defined here.
- I’ll need to refer to our old friends XML and XSDs here too. I haven’t done a lot of defining or designing in XML but it seems handy to use for some of the enumerated data formats. There might need to be some HTML and XHTML that needs to be referenced but we’ll cross that bridge when we come to it.
- SKOS is a new and interesting thing I’ve come across. It’s used as a format for the FIBO vocabulary. I’ll see if it’s just for FIBO or if there are other vocabularies that use it.
- Things like SPARQL will also be important and I haven’t tried working with it yet.
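To make the list above a little more concrete for myself, here is a minimal sketch in plain Python of the idea underneath RDF and SPARQL: data held as subject-predicate-object triples, queried by pattern. This is not RDF or SPARQL syntax and it uses no real library; the `ex:` names and the `match` helper are made up for illustration.

```python
# Each fact is a (subject, predicate, object) triple.
# The "ex:" names are hypothetical, invented for this sketch.
triples = {
    ("ex:Alice", "rdf:type", "ex:Person"),
    ("ex:Alice", "ex:worksFor", "ex:Acme"),
    ("ex:Acme", "rdf:type", "ex:Company"),
    ("ex:Bob", "ex:worksFor", "ex:Acme"),
}


def match(store, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly the question "who works for Acme?" expressed as a triple pattern.
employees = [s for s, _, _ in match(triples, p="ex:worksFor", o="ex:Acme")]
print(employees)
```

The point of the sketch is that there is no table or container to design first: each fact stands on its own and the query is just a pattern with blanks in it.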
Tools, software, etc:
- Ontology tools: Protege, TopBraid
- Data storage technologies which make a long list to be explored later
- Data science-y things like R and Python
- Wolfram Alpha and Mathematica look interesting
- SQL Power Architect looks like it could be a useful data modelling tool
and possibly many others…
What am I trying to learn here? I don’t want this to be a fishing expedition with no results or just another blog of data tourism. We know that in order to learn any practical skill we must start as we mean to continue, or at least just start.
I’ve done some thinking about the goal I’m trying to achieve. What will I have when I’m finished? How will I know if I’ve succeeded or failed? I want to state a direction but I also want to be flexible enough to shift if I find that an assumption I’ve made at the beginning is wrong or not where I want to go. I want to keep track of all the things I had to do and learn to go from where I am now (i.e. a data architect who knows how to design data structures – glossaries, protocols, data models, databases, ETL functions) to someone who knows how to ‘codify’ data in the semantic web (i.e. how data is an object that is a thing in its own right instead of things that are held by other things).
If I restate that thinking as a question, then:
How can a person who’s learned traditional data architecture (defined as third normal form models, interface file formats, relational database tables, ETL, mapping, etc) achieve expertise in new semantic web technologies including ontologies, inference and data codification? What does that person need to learn, what are the steps, what are the languages and tools?
I know I could take a data science course or a programming course but I think it’s more than that. There’s a psychological shift that I need to make from thinking about data being decanted into containers to data being a thing or an object on its own. I’d like to see how inference works in a practical sense, do it for myself and understand the practical way to get it going. The end result might be more of a continuum of skills rather than a thing that I’ve built, but I suspect I’ll have to build something to get skilled in the technologies that I want to investigate.
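As a first taste of what inference might look like in practice, here is a small sketch, again in plain Python rather than a real reasoner. It applies two rules in the spirit of RDFS subclass reasoning until nothing new can be derived. The `infer` function and the `ex:` class names are my own assumptions for illustration; a real reasoner supports many more rules.

```python
def infer(store):
    """Forward-chain two subclass rules until no new triples appear."""
    inferred = set(store)
    while True:
        new = set()
        for (a, p1, b) in inferred:
            for (c, p2, d) in inferred:
                # Only pair a triple with a subclass statement about its object.
                if b != c or p2 != "rdfs:subClassOf":
                    continue
                if p1 == "rdfs:subClassOf":
                    # A subClassOf B and B subClassOf D gives A subClassOf D.
                    new.add((a, "rdfs:subClassOf", d))
                elif p1 == "rdf:type":
                    # x type B and B subClassOf D gives x type D.
                    new.add((a, "rdf:type", d))
        if new <= inferred:
            return inferred
        inferred |= new


# Hypothetical facts: only the most specific class of ex:Me is stated.
facts = {
    ("ex:DataArchitect", "rdfs:subClassOf", "ex:Architect"),
    ("ex:Architect", "rdfs:subClassOf", "ex:Person"),
    ("ex:Me", "rdf:type", "ex:DataArchitect"),
}

closure = infer(facts)
print(("ex:Me", "rdf:type", "ex:Person") in closure)
```

Nobody stored the fact that `ex:Me` is a person; it falls out of the class hierarchy. That is the shift I’m after: the data carries enough meaning for new facts to be derived from it.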
The audience for my experiment would be people who have an interest in data and the future of data but don’t know where to start or what to do next (in a sense, the audience is me).
I welcome comments and suggestions. Starting off might be slow as I work out what I need to look at and then try to catch up to all the people who’ve been working on this for years before me. I want to make this a practical how-to so I’ll have to hit all the basics before I get fancy. Time to get started!