Impetus for Our Semantics and NoSQL Workshop at the 2015 SmartData Conference
I’m looking forward to being one of the presenters for infuzIT’s hands-on data integration and analysis workshop at this year’s SmartData Conference in San Jose. Giving people the opportunity to see the amazing power of semantics combined with NoSQL to quickly integrate and analyze data makes my day.
My background includes significant work with data, both as an application developer and data warehouse architect. The acceleration of data-centric hardware and software capabilities over the past 10 years now supports a very different paradigm for exploring, reporting and analyzing data. Processes and procedures for creating a data warehouse or mart, the accepted rules of the road for creating integrated data repositories, are no longer clear cut. The data federation debate is no longer simply a choice between the Inmon and Kimball approaches.
A significant shift in data integration revolves around the required lifespan of the integrated data. This lifespan has two key aspects whose evolution now allows us to rethink our approach to data federation. This permits us to be much more agile when bringing heterogeneous data sources together. The two aspects are reflected in these design questions: 1) what data, if any, will be rehosted; and 2) what relationships will be supported within the integrated data?
Rehosting Data
In a traditional data warehouse the data must be rehosted. The new repository is the target where transformed data (cleaned-up, standardized) exists. The queries that will be retrieving data from multiple sources are really pulling data from a single source that has been populated from multiple sources. It represents a heavyweight process, driven by Extract-Transform-Load (ETL) scripts and requiring space to host redundant information.
Relationships Between Data Elements
The target warehouse schema determines what relationships are defined between the data elements being combined. Getting this “right” requires careful planning and coordination between the various groups that will use the warehouse. Given the significant effort, represented as cost, organizations tend to design data warehouses to support broad constituencies as a way to amortize the investment across departments and projects.
Paradigm Shift
Semantics and NoSQL allow us to reduce the effort of integrating data by orders of magnitude. They support a completely different mindset for bringing data together. Instead of carefully designing a model that works well in the general sense (reducing the value in specific cases) we have environments that allow us to experiment, adjust and focus on each case.
Below are several drivers which allow us to approach data federation differently using semantics and NoSQL.
Query Data in Place
Semantic technology is designed to query data that is located across heterogeneous systems. The query language itself, SPARQL, includes constructs for querying multiple systems and federating the information in real time (in SPARQL 1.1, the SERVICE keyword directs part of a query to a remote endpoint). Leveraging this capability is feasible due to advances in network bandwidth, server hardware and database tuning. Further, commercial offerings which wrap relational data sources as semantic endpoints continue to advance, allowing our existing databases to participate as semantic data sources.
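As a rough sketch of what querying in place can look like, the Python below assembles a SPARQL 1.1 federated query in which each remote source is wrapped in its own SERVICE clause. The endpoint URLs, the ex: vocabulary and the helper name build_federated_query are hypothetical, chosen only for illustration; they are not the sources or tools used in the workshop.

```python
from textwrap import indent


def build_federated_query(select_vars, services, prefixes=None):
    """Assemble a SPARQL 1.1 query with one SERVICE block per remote endpoint.

    `services` maps an endpoint URL to the graph pattern evaluated there.
    Variables shared between patterns (here ?customer) act as the join keys.
    """
    prefix_lines = [f"PREFIX {name}: <{iri}>" for name, iri in (prefixes or {}).items()]
    service_blocks = [
        f"SERVICE <{endpoint}> {{\n{indent(pattern.strip(), '  ')}\n}}"
        for endpoint, pattern in services.items()
    ]
    body = indent("\n".join(service_blocks), "  ")
    select = " ".join(f"?{var}" for var in select_vars)
    return "\n".join(prefix_lines + [f"SELECT {select}", "WHERE {", body, "}"])


# Hypothetical endpoints: a CRM database behind a relational-to-RDF wrapper
# and a billing triple store. Neither URL refers to a real service.
query = build_federated_query(
    select_vars=["customer", "name", "total"],
    services={
        "http://crm.example.org/sparql": "?customer ex:name ?name .",
        "http://billing.example.org/sparql": "?invoice ex:customer ?customer ;\n         ex:total ?total .",
    },
    prefixes={"ex": "http://example.org/schema#"},
)
print(query)
```

The resulting text could be submitted to any SPARQL 1.1 endpoint that supports federation; the data stays in the two source systems and is combined only when the query runs.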
Semantic Models for Agile Relationships
When modeling relationships in a relational data warehouse we use concepts such as foreign keys, join tables, fact tables and dimension tables. These are physical manifestations which require tables to be created and data to be loaded into them before we may benefit from the relationships being modeled. Changing these relationships requires redefining the tables and reloading the data.
When using semantics we create ontologies, logical relationship definitions. The ontology is able to express relationships between heterogeneous data sources, much like the ETL process would do when loading a warehouse. The advantage of being a logical model is that we don’t need to extract and load the data into a new physical model. We can use the federating capability of SPARQL along with the ontology to virtually integrate the data.
The federated relationship is ephemeral – existing as a logical construct and as a point-in-time result set. Because the ontology is a logical expression of relationships, and the federated set of data is determined by a query, we can change the relationships and federate different subsets of data very quickly. This means that we may adjust our model in minutes. We are able to rapidly prototype and enrich integrated datasets with very low risk.
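The idea of a logical, easily changed model can be illustrated with a small Python sketch. Here a plain dictionary stands in for the ontology (a real ontology would be expressed in a language such as RDFS or OWL), and the source names, field names and sample rows are all invented for the example. The point is only that the relationship lives in the mapping, not in a loaded physical table.

```python
# Two heterogeneous sources that name the same concepts differently (made-up data).
crm_rows = [
    {"cust_id": 1, "full_name": "Ada"},
    {"cust_id": 2, "full_name": "Grace"},
]
billing_rows = [
    {"customerNo": 1, "amount": 120.0},
    {"customerNo": 1, "amount": 30.0},
    {"customerNo": 2, "amount": 75.5},
]

# Stand-in for an ontology: which source field denotes which shared concept.
ontology = {
    "crm": {"cust_id": "customer", "full_name": "name"},
    "billing": {"customerNo": "customer", "amount": "total"},
}


def to_shared_terms(source, rows, ontology):
    """Re-express source rows using the shared concept names; unmapped fields are dropped."""
    mapping = ontology[source]
    return [{mapping[k]: v for k, v in row.items() if k in mapping} for row in rows]


def federate(left, right, key):
    """Join two record lists on a shared concept, producing a point-in-time result set."""
    by_key = {}
    for rec in left:
        by_key.setdefault(rec[key], []).append(rec)
    return [{**l, **r} for r in right for l in by_key.get(r[key], [])]


result = federate(
    to_shared_terms("crm", crm_rows, ontology),
    to_shared_terms("billing", billing_rows, ontology),
    key="customer",
)

# Adjusting the model is an edit to the mapping; no source data is reloaded.
ontology["billing"]["amount"] = "invoice_amount"
revised = federate(
    to_shared_terms("crm", crm_rows, ontology),
    to_shared_terms("billing", billing_rows, ontology),
    key="customer",
)
print(result[0])
print(revised[0])
```

Neither source list is modified at any point; the second result differs from the first only because the logical mapping changed.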
NoSQL for Lightweight Warehouses
The transient nature of a semantically-derived federated result set may seem of limited value. However, this is where NoSQL data stores excel. Part of the agility afforded us by NoSQL environments is their support for schema-less data storage.
A schema-less database is able to host data with an arbitrary set of elements and relationships without requiring that those concepts be predefined. Even the data being loaded may vary from record to record. Fundamentally, NoSQL environments allow heterogeneous data structures to co-exist. This means that we may take the output of our semantically-driven queries and place it directly in a NoSQL data store without any penalty or upfront setup.
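A minimal in-memory sketch of schema-less storage follows. The DocumentStore class is a hypothetical stand-in written for this example, not the API of any particular NoSQL product; it shows only that records with different shapes can be stored and queried side by side without declaring a schema first.

```python
import json


class DocumentStore:
    """Toy schema-less store: any JSON-serializable dict is accepted as-is."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Round-trip through JSON to store an independent copy and to reject
        # values that could not be persisted as a JSON document.
        self._docs.append(json.loads(json.dumps(doc)))
        return len(self._docs) - 1  # position doubles as a simple document id

    def find(self, **criteria):
        # A document matches when every requested top-level field is present
        # and equal; documents lacking the field simply do not match.
        return [d for d in self._docs if all(d.get(k) == v for k, v in criteria.items())]

    def __len__(self):
        return len(self._docs)


store = DocumentStore()
# Three records with three different shapes; no table definition is needed.
store.insert({"customer": 1, "name": "Ada", "total": 120.0})
store.insert({"customer": 2, "name": "Grace", "region": "West",
              "orders": [{"sku": "A-1", "qty": 3}]})
store.insert({"sensor": "t-17", "reading": 21.5})

west = store.find(region="West")
print(len(store), west[0]["name"])
```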
Although the schema need not be predefined, the environment is still able to optimize interactions with the data. For example, a NoSQL document store is able to parse the structure of the document and create indexes to speed query performance.
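One way to picture this kind of optimization is an index built by walking whatever structure each document happens to have. The sketch below is an illustrative assumption about the general technique, not a description of how any specific document store implements its indexes; the function names and sample documents are invented.

```python
from collections import defaultdict


def iter_paths(doc, prefix=""):
    """Yield (dotted_path, scalar_value) for every leaf of a nested document."""
    if isinstance(doc, dict):
        for key, value in doc.items():
            yield from iter_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(doc, list):
        # List items share the path of the list itself.
        for item in doc:
            yield from iter_paths(item, prefix)
    else:
        yield prefix, doc


def build_index(docs):
    """Map each path, then each value seen at that path, to the ids of matching documents."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, doc in enumerate(docs):
        for path, value in iter_paths(doc):
            index[path][value].add(doc_id)
    return index


docs = [
    {"customer": 1, "name": "Ada", "address": {"city": "San Jose"}},
    {"customer": 2, "tags": ["vip", "west"], "address": {"city": "San Jose"}},
    {"sensor": "t-17", "reading": 21.5},
]
index = build_index(docs)

# The indexed paths were discovered from the data; nothing was declared up front.
indexed_paths = sorted(index)
print(indexed_paths)
print(index["address.city"]["San Jose"])
```

A lookup such as index["address.city"]["San Jose"] then returns the matching document ids without scanning every document.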
NoSQL for Analytics
If you’ve heard about NoSQL platforms, it has likely been in the context of data analytics. This is for good reason; these platforms are built to work with very large volumes of data (and to cope with the other data “Vs”, such as velocity and variety, as well). Support for analyzing very large data sets is a key value proposition for NoSQL. This is a great reason to target these environments as the repository for lightweight federated data.
Standards and Tools
Several years ago using these technologies in tandem would have required significant effort. Querying relational databases as if they were semantic endpoints, persisting SPARQL query results within a NoSQL data store and building analytic processes in a NoSQL database would have required the IT team to build a lot of specialized integration software.
Because these are well-understood and shared challenges, a set of relevant standards and tools has evolved and matured. The query language and protocol used for semantic federation (SPARQL) as well as the common data representation used by NoSQL document stores (JSON) are standardized. Vendors, using those standards, have created tools to integrate between these technologies. This puts us in a powerful position where we may leverage these tools within our own infrastructures and begin to gain experience with a new approach to data integration.
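To make the role of the standards concrete, here is a small sketch that flattens a result in the standard SPARQL 1.1 Query Results JSON Format into plain JSON documents of the kind a document store accepts. The sample result, the example.org URIs and the function name bindings_to_documents are assumptions made for the example; vendor tools handle this step in their own ways.

```python
import json

XSD = "http://www.w3.org/2001/XMLSchema#"
# Only a few numeric datatypes are converted; everything else stays a string.
# Mapping xsd:decimal to float is a simplification that can lose precision.
CASTS = {XSD + "integer": int, XSD + "decimal": float, XSD + "double": float}


def bindings_to_documents(sparql_results):
    """Turn SPARQL JSON results (dict or JSON text) into one flat dict per row."""
    if isinstance(sparql_results, str):
        sparql_results = json.loads(sparql_results)
    docs = []
    for binding in sparql_results["results"]["bindings"]:
        # Unbound variables are absent from a binding, so rows may differ in
        # shape, which a schema-less store tolerates.
        docs.append({
            var: CASTS.get(term.get("datatype"), str)(term["value"])
            for var, term in binding.items()
        })
    return docs


# Hand-written sample in the standard result format (not real data).
sample = {
    "head": {"vars": ["customer", "name", "total"]},
    "results": {"bindings": [
        {"customer": {"type": "uri", "value": "http://example.org/customer/1"},
         "name": {"type": "literal", "value": "Ada"},
         "total": {"type": "literal", "datatype": XSD + "decimal", "value": "120.50"}},
        {"customer": {"type": "uri", "value": "http://example.org/customer/2"},
         "name": {"type": "literal", "value": "Grace"}},
    ]},
}

documents = bindings_to_documents(sample)
for doc in documents:
    print(json.dumps(doc))
```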
Our Hands-on Workshop
Attendees of our workshop will apply the agile data integration process described in this article. We will use relational, semantic and NoSQL platforms to integrate heterogeneous data sources and create analytic pipelines. This will allow participants to experience first-hand how these technologies work and, more importantly, provide them with a baseline for exploring these options further within their own organizations.
The infuzIT team looks forward to meeting you at the SmartData Conference in San Jose this August.