Database Refactoring and RDF Triples
One of the aspects of agile software development that may lead to significant angst is the database. Unlike refactoring code, refactoring the database schema involves a key constraint – state! A developer may rearrange code to his or her heart’s content with little worry, since the program starts with a blank slate each time it runs. However, the database “remembers.” If one accepts that each iteration of an agile process produces a production release, then the stored data can’t simply be deleted as part of the next iteration.
Refactoring a database becomes less and less trivial as project development continues. While developers have IDEs to refactor code, change packages, and alter build targets, there are few tools for refactoring databases.
My definition of a database refactoring tool is one that assists the database developer by remembering the database transformation steps and storing them as part of the project – e.g. part of the build process. This includes both the schema changes and data transformations. Remember that the entire team will need to reproduce these steps on local copies of the database. It must be as easy to incorporate a peer’s database schema changes, without losing data, as it is to incorporate the code changes.
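As a purely illustrative sketch of what “remembering the steps as part of the project” could look like, the transformation steps might live in source control as versioned entries that a small runner replays in order on any copy of the database. The table names, step contents, and file name below are invented for the example, not a prescribed tool.

```python
# A minimal sketch of a migration runner; the tables and steps are hypothetical.
import sqlite3

# Each step is (version, list of SQL statements): schema changes and data
# transformations live together, in source control, alongside the code.
STEPS = [
    (1, ["CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, quantity INTEGER)"]),
    (2, ["ALTER TABLE orders ADD COLUMN status TEXT",
         "UPDATE orders SET status = 'COMPLETE' WHERE status IS NULL"]),
]

def migrate(conn: sqlite3.Connection) -> None:
    # Track which steps have already been applied to this copy of the database.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    current = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()[0] or 0
    for version, statements in STEPS:
        if version > current:
            for sql in statements:
                conn.execute(sql)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
    conn.commit()

if __name__ == "__main__":
    # Every developer (and every environment) replays the same steps locally.
    migrate(sqlite3.connect("store.db"))
```

The point is not the specific code but that a teammate pulling the latest changes can bring a local database forward with one command, keeping its data, the same way a build brings the code forward.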
These same data-centric complexities exist in waterfall approaches when going from one version to the next. Whenever the database structure needs to change, a path to migrate the data has to be defined. That transformation definition must become part of the project’s artifacts so that the data migration for the new version is supported as the program moves between environments (test, QA, load test, integrated test, and production). Also, the database transformation steps must be automated and reversible!
That last point, the ability to roll back, is a key part of any rollout plan. We must be able to back out changes. The rollback approach may be as simple as taking a full database backup before applying the update, but that assumption must be documented and vetted (a full backup may not be a reasonable rollback strategy in every case).
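One hedged way to make reversibility concrete is to pair every “up” step with a matching “down” step, and to treat restoring the pre-upgrade backup as the documented fallback when a clean reverse is not possible. The column and table names below are again hypothetical.

```python
# A hypothetical reversible step: every "up" has a matching "down".
import sqlite3

def upgrade(conn: sqlite3.Connection) -> None:
    # Forward change: add a new flag with a safe default for existing rows.
    conn.execute("ALTER TABLE orders ADD COLUMN gift_wrap INTEGER DEFAULT 0")
    conn.commit()

def downgrade(conn: sqlite3.Connection) -> None:
    # DROP COLUMN requires SQLite 3.35+.  If the data lost on the way down
    # matters, the documented fallback is restoring the pre-upgrade backup.
    conn.execute("ALTER TABLE orders DROP COLUMN gift_wrap")
    conn.commit()
```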
This database refactoring issue becomes very tricky when dealing with multiple versions of an application. The transformation of the database schema and data must be done in a defined order. As more and more data is stored, the process consumes more storage and processing resources. This is the ETL side-effect of any system upgrade. Its impact is simply felt more often (e.g. potentially during each iteration) in an agile project.
As part of exploring semantic technology, I am interested in contrasting this to a database that consists of RDF triples. The semantic relationships of data do not change as often (if at all) as the relational constructs. Many times we refactor a relational database as we discover concepts that require one-to-many or many-to-many relationships.
Is an RDF triple-based database easier to refactor than a relational database? Is there something about the use of RDF triples that reduces the likelihood of a multiplicity change leading to a structural change in the data? If so, using RDF as the data format could be a technique that simplifies the development of applications. For now, let’s take a high-level look at a refactoring use case.
Imagine we are in the first iteration of a web-based online store and we decide to support ordering only a single item. We opt to store that item in a table along with the order header data. In the next iteration we decide to add support for multiple items in the shopping basket.
We’ll refactor the database to support a one-to-many relationship between the order header and the shopping basket items. There is nothing wrong with this approach; it is simply a part of refactoring the design. Beyond altering the database schema (adding the order_item table and removing columns from the order table), this change will necessitate a transformation of the existing order data into the two-table structure.
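A sketch of what that iteration-2 step might look like follows; the table and column names are invented for the example (I use orders rather than order to sidestep the SQL keyword), and dropping columns in place assumes a database that supports it, such as SQLite 3.35+.

```python
# Hypothetical iteration-2 migration: split the item columns out of orders
# into a new order_item table, carrying the existing data along.
import sqlite3

def upgrade(conn: sqlite3.Connection) -> None:
    # New child table for the one-to-many relationship.
    conn.execute("""
        CREATE TABLE order_item (
            id       INTEGER PRIMARY KEY,
            order_id INTEGER NOT NULL REFERENCES orders(id),
            item     TEXT    NOT NULL,
            quantity INTEGER NOT NULL
        )""")
    # Data transformation: every existing order becomes one order_item row.
    conn.execute("""
        INSERT INTO order_item (order_id, item, quantity)
        SELECT id, item, quantity FROM orders""")
    # Schema change: the moved columns come off the order header table.
    conn.execute("ALTER TABLE orders DROP COLUMN item")
    conn.execute("ALTER TABLE orders DROP COLUMN quantity")
    conn.commit()
```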
What is interesting to me is to look at the same situation when the data is modeled using RDF triples. What data structure, if any, changes between these two iterations?
The semantic relationships don’t change. What does change? Very little, depending on the implementation. If the first iteration’s RDF triple for the order was an order (subject), itemOrdered (predicate) and item (object), then the data structure is unchanged for iteration 2. We will simply have to allow multiple itemOrdered predicates on the order instance.
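To make that concrete, here is a small sketch using rdflib; the namespace and resource names are invented. The shape of the iteration-1 triple does not change – iteration 2 simply adds more triples with the same predicate, and the existing data is left alone.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.com/store#")  # hypothetical vocabulary
g = Graph()

# Iteration 1: one order, one itemOrdered triple.
g.add((EX.order42, EX.itemOrdered, EX.widget))

# Iteration 2: multiple items are just additional triples on the same subject;
# the original triple is untouched -- no data transformation is required.
g.add((EX.order42, EX.itemOrdered, EX.gadget))
g.add((EX.order42, EX.itemOrdered, EX.gizmo))

for item in g.objects(EX.order42, EX.itemOrdered):
    print(item)
```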
Based on this conjecture we are off to a promising start in terms of simplifying data refactoring. In this case no transformation work was needed to the data itself. If many of our refactoring use cases look like this (changes to the restrictions and not the relationships) then the use of RDF triples as the data storage format offers an attractive alternative to relational databases when dealing with changing data needs.
Of course that concept (changing data needs) is a key aspect of the semantic web. The ability to add new data and structures without breaking old ones is a requirement for an extensible and decentralized web database. It makes sense that looking at a use case within a smaller scope would expose the same benefit.
Will database refactoring of RDF triple-based structures always be this simple? I don’t believe so. If the semantic relationship itself changes, because we have learned more about the domain, then actual structural refactoring will be needed. If we start out modeling a value as a datatype property and later convert it to an object property, we will have to restructure any existing instances.
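For instance (again a sketch with invented names), suppose a shippingAddress property started life as a datatype property holding a string, and we later decide it should be an object property pointing at an Address resource. The existing instance data has to be rewritten, much like a relational transformation:

```python
from rdflib import Graph, Namespace, Literal, BNode, RDF

EX = Namespace("http://example.com/store#")  # hypothetical vocabulary
g = Graph()
g.add((EX.order42, EX.shippingAddress, Literal("221B Baker Street")))

# Restructure: replace each literal value with an Address resource that
# carries the old string, so the property can become an object property.
for order, text in list(g.subject_objects(EX.shippingAddress)):
    if isinstance(text, Literal):
        address = BNode()
        g.remove((order, EX.shippingAddress, text))
        g.add((order, EX.shippingAddress, address))
        g.add((address, RDF.type, EX.Address))
        g.add((address, EX.addressText, text))
```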
To defend against these types of issues we need to focus on the correct ontology for our domain first, before building applications. Any application then built for that domain (company, industry) will benefit, regardless of the development process (e.g. agile, waterfall).
Have you worked with RDF triple-based data structures as part of an agile project? If so, do you have thoughts on whether the use of triples simplified data storage refactoring from iteration to iteration?
I’ll be trying these techniques on a POC and hope to have more concrete examples of the impact of this alternate data storage approach. It is just another example of where semantic technologies are positioned to significantly impact the ways that we design, develop and test software-based solutions.
Tags: agile development, efficient coding, enterprise applications, enterprise systems, Information Systems, linkedin, ontology, refactoring, semantic web, semantics, system integration