
Archive for the ‘infuzIT’ Category

Successful Machine Learning: Part 2 (What is Being Learned?)

Thursday, June 10th, 2021


For background on this post, please see my last entry, Part 1: Questions and Baselining.

What separates today’s machine learning from human learning? One word: concepts.

“How so?” you might ask. To see what I mean, let’s start by looking at standard machine learning inputs and outputs. I’ll focus on supervised learning.

Supervised Machine Learning

Supervised machine learning is an approach where we start with a set of records. In each record, one field contains the correct answer, known as the target attribute. The other fields in the record contain related information, formally known as descriptive attributes. For example, we might have a set of petal measurements, each labeled with the name of the flower it was taken from, and we want the computer to learn how to identify different types of flowers. Those of you with machine learning experience will recognize the Iris data set as the inspiration for my example.
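In code, such a record might look like the following minimal sketch (the class and field names are my own illustration, not taken from any particular data set):

```java
// One labeled training example: descriptive attributes plus the target attribute.
public class FlowerRecord {

    private final double petalLength; // descriptive attribute
    private final double petalWidth;  // descriptive attribute
    private final String flower;      // target attribute: the "correct answer"

    public FlowerRecord(double petalLength, double petalWidth, String flower) {
        this.petalLength = petalLength;
        this.petalWidth = petalWidth;
        this.flower = flower;
    }

    public double getPetalLength() { return petalLength; }
    public double getPetalWidth()  { return petalWidth; }
    public String getFlower()      { return flower; }
}
```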

Supervised machine learning is similar to how we might teach children some set of math facts. We give them many examples of addition problems and answers. Over time we would like them to understand the mechanics of addition and solve novel problems. We have a similar goal with supervised learning. We want to give the computer lots of examples with the correct answers and have it figure out how to answer new problems.

Decision Trees

Table 1: Flower Data

Petal Length   Petal Width   Flower (Answer)
2              1.7           Rose
2.5            2.1           Rose
3.2            0.5           Daisy
3.6            0.6           Daisy

Figure 1: Example Decision Tree

We’ll begin looking at what the machine is learning using a basic supervised approach: decision trees. In this case, the computer looks at the correct answer, the target attribute, and uses the descriptive attributes in the record to create a decision that would arrive at the correct result for that record’s data. Table 1 contains four records, each holding two petal measurements (length and width) for either a rose or a daisy. The resulting decision tree might look like Figure 1.
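To make that concrete, the tree in Figure 1 might reduce to a single threshold test. The sketch below is my own hand-coded rendering, not output from a learner, and the 1.0 cutoff is illustrative: any petal-width threshold between 0.6 and 1.7 separates the daisies from the roses in Table 1.

```java
public class FlowerClassifier {

    // The decision uses only the attributes present in the record. Note that
    // petalLength goes unused: petal width alone separates this data.
    public static String classify(double petalLength, double petalWidth) {
        if (petalWidth >= 1.0) {
            return "Rose";
        }
        return "Daisy";
    }

    public static void main(String[] args) {
        System.out.println(classify(2.0, 1.7)); // Rose
        System.out.println(classify(3.2, 0.5)); // Daisy
    }
}
```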

This is a simple example, but the interesting point is that the computer is limited to making a decision using the data in the record. As discussed in my previous post, the text “Rose” doesn’t mean anything to the system. We could add additional data to the record, such as details about petal color and whether the stem has thorns. But the machine learning process won’t have that information available without explicitly adding it to the data. Since the computer doesn’t know what the text “Rose” means, it can’t incorporate other knowledge about roses into its decision tree.

This is a considerable hurdle in machine learning. As people learn new information, they build a knowledge base and apply it to new learning. That isn’t how these discrete learning processes work. And that limitation is imposed chiefly because the computer isn’t using concepts.


Successful Machine Learning: Part 1 (Questions and Baselining)

Monday, May 17th, 2021


In this series of posts, I’m delving into the limitations that current techniques impose on machine learning and AI, while considering technologies and practices that can transform business intelligence efforts beyond the status quo.

Question of Intelligence

What is intelligence? What underlies intelligence? What aspects of intelligence do we want machine learning to demonstrate? What is artificial intelligence as opposed to intelligence? What capabilities does a computer need to achieve intelligence? Can programs be written to derive intelligence within a modern computer?

Questions delving into intelligent systems go on and on. I’m going to spend a few blog entries exploring machine learning and our quest to create and benefit from intelligent computer systems, and through that discussion I’ll address these questions.

Framing the Discussion

Note that my focus is business automation: what organizations are seeking to gain from machine learning and intelligent systems. I am purposefully avoiding a philosophical discussion of intelligence. To that end, a primary assumption is that we are interested in applying human-style intelligence to advance business or operational success. Put another way, animals and plants demonstrate intelligence of differing types; however, mimicking these is not an organization’s goal when employing machine learning.

Key Terms

To begin, I need working definitions for learning and intelligence. These will serve as touchstones for exploring computer-based learning and intelligence. Merriam-Webster’s dictionary provides helpful initial entries for each. The definition for “learn” is “to gain knowledge or understanding of or skill in by study, instruction, or experience.” “Intelligence” is defined in two parts: “the ability to learn or understand or to deal with new or trying situations” and “the ability to apply knowledge to manipulate one’s environment or to think abstractly as measured by objective criteria (such as tests).”

The terms knowledge and understanding appear in both definitions and are vital to successful machine learning applications. Knowledge and understanding are based on information.

MongoDB and Java – Powerful Complementary Platforms

Tuesday, May 31st, 2016

I have found that including MongoDB in the design of Java applications allows me a valuable level of flexibility in meeting client objectives. I have created an initial open source project on GitHub, JavaMongo, with the goal of providing working examples of Java and MongoDB integration. A secondary goal is to include development best practices, such as using testing frameworks and good coding style.

This posting is intended to give a little background on why I find Java and MongoDB to be useful tools in my software development arsenal and then to introduce the JavaMongo project. Future postings will include videos walking developers through the examples as well as the frameworks being used (like JUnit, Cobertura and Checkstyle).

Background

Java is a ubiquitous platform for creating business applications. It has proven itself across a wide range of use cases, from small point solutions to large generalized solution stacks. The variety of libraries, frameworks and tools for designing, building, testing and managing Java applications provides significant benefits to companies building solutions with Java. However, an application without ready access to data isn’t particularly useful. As enterprise-scale database options have broadened to include NoSQL, those creating Java-based solutions must take advantage of these new data options in order to benefit from their strengths.

MongoDB is a great NoSQL platform that can be used to provide additional capabilities to your applications. MongoDB is a document store that has proven its reliability, scalability and integrability across numerous small and large-scale applications. Its value and focus complement the way we use relational databases for online transaction processing (OLTP) and offer advantages over the way we use relational databases for data marts and warehouses.
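To make “document store” concrete, here is a minimal sketch using the Java driver’s Document class. The field names are invented for illustration; the point is that two records in the same collection can carry different shapes, with no schema change required.

```java
import org.bson.Document;

public class DocumentShapes {
    public static void main(String[] args) {
        // Two documents in one collection need not share a schema.
        Document basic = new Document("host", "10.0.0.5")
                .append("eventCount", 42);

        Document detailed = new Document("host", "10.0.0.9")
                .append("eventCount", 7)
                .append("geo", new Document("country", "US")
                        .append("city", "Denver"));

        System.out.println(basic.toJson());
        System.out.println(detailed.toJson());
    }
}
```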

A point of clarification before proceeding: I’m not here to say that MongoDB is better than some other data product, or, more generally, that document stores are better than relational databases. I find such arguments meaningless without a specific use case or project goal. These technologies are different and have individual strengths and weaknesses in the face of a specific set of project objectives.

I have found that MongoDB plugs in well when I need a place to federate data (structured, semi-structured and unstructured). Given a common platform, it simplifies the work required to build and alter connections between attributes. If you’ve looked at other information about my background you’ll see that I find the use of semantic technology to be incredibly valuable for data federation and classification. MongoDB as a flexible repository plays well with semantics. At the end of this post I’ll give you a small example of that.

JavaMongo Project

The JavaMongo project is intended to provide Java developers with working examples of Java and MongoDB integrations. Over time I expect a variety of common situations to be demonstrated, with associated documentation explaining the use case and the resulting implementation.

To have some interesting data to work with, I’m using data sets that my company releases to the public domain. To work with the JavaMongo examples, you’ll need to import that data into your MongoDB instance. For more information about downloading and importing the sample data, see the discussion of the MongoDB Collection of Honeypot Data on my NoSQL topic page.

The initial JavaMongo project contains a basic README file with information on running the example code. Instead of rehashing that information in this post, I’d like to walk through the basic operations being demonstrated in the example code. The main class we’ll explore is BasicStatistics (us.daveread.education.mongo.honeypot.BasicStatistics).

As you know, a Java program starts execution with the main() method. We see that the first step the BasicStatistics main() method takes is to create an instance of the BasicStatistics class.

BasicStatistics Constructor

The constructor code goes through the entire process of connecting to a MongoDB database, accessing a collection and running a query on data in the collection.

First, an instance of MongoClientOptions is created. This class allows us to configure certain client-side options related to the connection; I’ll get into more detail on these in future examples. In this case, the program simply sets the connection timeout to 2000 milliseconds (2 seconds) so that the program won’t hang for long if the instance is not available. You wouldn’t make the timeout this short in a production environment, but it helps when debugging a local environment by failing fast if something is wrong.
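Here is a condensed sketch of that flow using the 3.x Java driver. This is not the actual JavaMongo source; the database and collection names below are placeholders I’ve invented, so check the project README for the real ones.

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientOptions;
import com.mongodb.ServerAddress;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

public class ConnectionSketch {
    public static void main(String[] args) {
        // Fail fast: give up on connecting after 2 seconds instead of hanging.
        MongoClientOptions options = MongoClientOptions.builder()
                .connectTimeout(2000)
                .build();

        MongoClient client = new MongoClient(
                new ServerAddress("localhost", 27017), options);

        // Placeholder names; the real ones come with the honeypot sample data.
        MongoDatabase db = client.getDatabase("education");
        MongoCollection<Document> collection = db.getCollection("honeypot");

        // A trivial query to confirm the connection and collection are live.
        System.out.println("Documents in collection: " + collection.count());

        client.close();
    }
}
```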


Impetus for Our Semantics and NoSQL Workshop at the 2015 SmartData Conference

Friday, May 15th, 2015

I’m looking forward to being one of the presenters for infuzIT’s hands-on data integration and analysis workshop at this year’s SmartData Conference in San Jose. Giving people the opportunity to see the amazing power of semantics combined with NoSQL to quickly integrate and analyze data makes my day.

My background includes significant work with data, both as an application developer and as a data warehouse architect. The acceleration of data-centric hardware and software capabilities over the past 10 years now supports a very different paradigm for exploring, reporting on and analyzing data. The processes and procedures for creating a data warehouse or mart, the accepted rules of the road for creating integrated data repositories, are no longer clear cut. The data federation debate is no longer Inmon versus Kimball.

A significant shift in data integration revolves around the required lifespan of the integrated data. This lifespan has two key aspects whose evolution now allows us to rethink our approach to data federation. This permits us to be much more agile when bringing heterogeneous data sources together. The two aspects are reflected in these design questions: 1) what data, if any, will be rehosted; and 2) what relationships will be supported within the integrated data?

Rehosting Data

In a traditional data warehouse, the data must be rehosted. The new repository is the target where the transformed (cleaned-up, standardized) data lives. Queries that appear to retrieve data from multiple sources are really pulling it from a single repository that has been populated from those sources. This represents a heavyweight process, driven by Extract-Transform-Load (ETL) scripts and requiring space to host redundant information.

Relationships Between Data Elements

The target warehouse schema determines what relationships are defined between the data elements being combined. Getting this “right” requires careful planning and coordination between the various groups that will use the warehouse. Given the significant effort, represented as cost, organizations tend to design data warehouses to support broad constituencies as a way to amortize the investment across departments and projects.

Paradigm Shift

Semantics and NoSQL allow us to reduce the effort of integrating data by orders of magnitude. They support a completely different mindset for bringing data together. Instead of carefully designing a model that works well in the general sense (reducing the value in specific cases) we have environments that allow us to experiment, adjust and focus on each case.

Several drivers allow us to approach data federation differently using semantics and NoSQL.
