Submitted to NASA

Tuesday, July 06, 2004

Abstract

 

Our access to NASA Earth Observational data, and to some of the earth scientists working with these data, will deliver an anticipatory technology to the academic and science communities early.  Considerable effort is being made at NASA and in academic centers on various types of cyber infrastructure for the storage and use of Earth Observational Data.  Our work brings a new set of tools that can duplicate some of the existing research processes without interfering with those processes.  Our conjecture is that tasks attempted by others using data mining, sub-setting, reformatting, and projection tools can be accomplished better if an Ontology referential base (Orb) encoding process is made widely and freely available.  New foundational mathematics is being developed, based on Orb notation and the Evolving Transformation System (ETS) notational construction being developed by Lev Goldfarb. 

 

 


 

Local Encoding of Earth Science Data,

Convolutional Organization of that Data,

and Ontology Development

 

OntologyStream Inc, Fairfax, Virginia

Contact person and Principal Investigator: Dr. Paul S. Prueitt

703-981-2676  Paul@ontologystream.com

 

 

Our concept is that NASA earth science data should be encoded as a set, or sets, carrying only local relational information:

 

{ < a, r, b > } .

 

Sets of this type have been developed in various experimental applications by Prueitt and are called Ontology referential bases (Orbs).  Subsets can be expressed as XML. Like XML, an Orb set can encode any type of relational database.  Orbs have a natural notational compatibility with RDF and OWL. Orbs are more properly thought of as a type of binary XML that encodes in a regular way, so that hash-table-type data can be located with offsets. The search function over Orb sets is resolved in a surprisingly simple fashion, using a key-less hash table and a class of transforms defined on sets of locally defined referential information. 
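As an illustration of the local character of the encoding, an Orb can be sketched as nothing more than a set of triples queried by local matching. The identifiers below are invented for the example; this minimal sketch says nothing about the actual Orb implementation.

```python
# A minimal sketch of an Orb as a plain set of local relational triples
# <a, r, b>.  All identifiers are hypothetical examples.

orb = {
    ("sensor_17", "measures", "sea_surface_temp"),
    ("sea_surface_temp", "unit", "kelvin"),
    ("sensor_17", "platform", "aqua"),
}

def related(orb, subject):
    """Everything locally known about `subject`: no schema, only triples."""
    return {(r, b) for (a, r, b) in orb if a == subject}

# Local retrieval: relations whose first element is sensor_17.
links = related(orb, "sensor_17")
```

Note that the query uses nothing beyond set membership and local element comparison, consistent with the claim that only local relational information is stored.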

 

A mature, commercially available software system, Primentia’s Hilbert Engine™, is to be used to make the “key-less hash” encoding of this relational information.  Experience with the Hilbert Engine™ goes back several years.  The engine is an enabling technology for mathematical processes defined in the Principal Investigator’s work (2001–2004) on the temporal thematic analysis of linguistic variation in unstructured text databases. 

 

OntologyStream Inc, which owns the core Orb technology, is proposed as the prime contractor.  One very qualified full-time programmer at OntologyStream Inc will be dedicated to the project.  Dr. Paul Prueitt will dedicate 50% of his time to the project.  In addition, there will be four distinguished consultants: one from the University of Texas at Arlington, one from the University of New Brunswick (Canada), and two independent consultants.  The group has begun reaching out to members of the community involved in the development of the various methods for Earth Observational Data (EOD).  A list of discussions can be submitted. 

 

The Orb construction represents a simplification of data encoding, in RAM and in virtual memory spaces, in a way that allows linear convolutions to be defined over sets, not databases; thus a simple match exists between the fundamentals of mathematics and the data encoding.  The encoding and the convolutional theory create a significant formal and operational bypass of processing and memory limitations.  All aspects of this new methodology are expressed in a simple notation consistent with elementary topology, real analysis, and set theory.  A new correspondence is made between the way data is encoded in computer memory and the most elementary notions of mathematics.  A further evolution of the notational system is being undertaken in the development of Professor Lev Goldfarb’s notational systems for inductive informatics.

http://www.cs.unb.ca/profs/goldfarb/index.html

Professor Goldfarb’s Evolving Transformation System (ETS) notation will extend the Orb notational system by defining transforms over data sets, using techniques that identify transformation primitives and compositions of these primitives.  The ETS notation fixes, or reifies, human-supplied semantic interpretation of structural invariance as attempts are made to discover the physical causes of observations that are incomplete or noisy. As ETS notation develops in relation to Earth Observational Data, we may be able to preserve the intuitive judgments of Earth scientists in a precise formalism.  This work is thus deeply revolutionary, and exactly what one expects if one is looking for a startling new mathematics-like formalism for studying empirical observations made with earth observation instruments.  The historical roots are well delineated by the research team.  The discussions within our group go back over a decade. 

 

The Orb notational system exists, has been published, and has been developed into a software system, and several tutorials have been developed on the fundamental concepts associated with the identification of invariance in raw data and the creation of symbol systems that then control data processing and access.  Our initial work did not use the Hilbert Engine, but the data encoding and the notation are consistent with how the Hilbert technology works.  The Hilbert Engine is a commercial system whose essential market differentiator is consistent with our deep work in the foundations of mathematics and computer inference.  Though the market differentiator is not well understood, the primary advantages are a compression of invariance into a key-less hash table and the use of a Forth-language mini operating system that is not dependent on any of the large multiple-purpose operating systems. 

 

The Hilbert technology is embedded in a complete development environment and is available as off-the-shelf software.  Our team estimates that the Hilbert technology is entitled to a positive TRLevel evaluation by NASA, perhaps at TRLevel 4.  The Orb construction and the mathematics of Goldfarb’s notational system will be built as an application that uses this commercial off-the-shelf software system from Primentia.  The TRLevel for the Orb technology is likely to be TRLevel 2.  We expect to move both technologies separately to TRLevel 7 (system prototype demonstration in an operational environment) by the end of the first twelve months.  We are therefore willing to consider this proposal to be for either twelve months or twenty-four months, depending on the interest of NASA scientists. 

 

The scientific background to our approach

Our research group is moving ahead on related projects.  In the next year, a complete distributed research environment will be developed for our small-group collaboration on Orb technologies, and a methodology for scientific inquiry over Orb encoding will be made available as a not-for-profit activity.  Distance learning capability will be developed as part of the distributed research environment.  The purpose of the research and distance learning environments is to engage Earth Science professionals and students in a new type of agile analysis of various types of emergent structure.  The structure is found using Orb encoding and mathematically defined convolutions over data sets.  The convolutions that can be defined form a semi-group, and functions can be defined similar to those that form Lie algebras in Hilbert space.  Thus it is natural to apply various transforms to numerical data that has been parceled off by some type of filter. 

 

We conjecture that in all cases, results that can be achieved using custom coding, filtering, and iterated data mining can be achieved using Orbs.  A formal proof of this may be available within six months, and we will certainly meet any specific challenge to this conjecture.  One part of the proofs we have considered is that there are optimal ways of doing certain types of mathematics on a serial computer.  In these arguments, there is no need to distinguish between parallelizable processes and single-processor ones. The key is the complete conceptual separation between structure and metadata indicating the possible function of that structure.  As is the case with natural processes, a separation exists between the invariances of sub-structure and the function that an aggregation of elements of substructure can develop within an environment.  The primary issue is the nature of emergence, and the non-observation of many of the causes involved in producing what one wants to observe.  By separating observed structure from implied function, one does not create complexity in the data encoding.  A series of theorems in information theory are available on this from Dr. Richard Ballard, who is not listed as a consultant but who is part of our extended science community. 

 

The issues discussed above are known within the complexity community. Some individuals in the computational emergence community also take a position similar to ours, at least privately.  But even in these communities our position has been controversial.  Our work builds on Soviet-era work in semiotics and the foundations of computational theory in ways that are not well known.  The way to set aside this controversy is to create a technology that works competitively with the existing technology and does some new things.  Several other projects are planned or underway; these are designed to produce commercial-grade but Open Source software, as well as curriculum support for the elementary mathematics and computer science needed to understand the work. 

 

Our group is developing a complete literature review on time series analysis, correlation query, stochastic methods, pattern extraction and detection, and forecasting methodology used within the EOD community. Our purpose is to create Orb-based processes that produce the same results as the current methods, and produce these results at a faster rate, using less memory, with a clearer conceptual foundation.  This work will not be comprehensive, of course, but will address several specific data streams. Our group intends to make the Orb code base and methodology available as open source intellectual property, and to tie this system into several academic research environments.  Investors are contemplating significant commercial activity, but this work will be in the area of an Anticipatory Commerce (AC) system. 

 

The Orb anticipatory technology has been peer reviewed, by NIMA (2003), as being fundable. However, the philosophical foundations of Orbs are resisted by many experts in knowledge representational theory and ontology, and by database vendors.  This resistance is understandable, because the Orb technology radically simplifies data encoding, using a key-less hash table, while also correcting the notion that a computer can demonstrate intelligence of the nature and type that human beings demonstrate. The core research team has identified a number of the key procedural issues related to the use of Orb encoding in the measurement and analysis of massive unstructured data.  Initial encoding of NASA data, and research on these data, can start within the first month. 

 

Technical Summary

We propose that Orb triples be encoded into “Ontology referential bases”, Orbs, using the Hilbert Engine (interpreting the ASCII string as a base-64 number); the compact “key-less hash” encoding of terabits of data is thus quite possible.  In previous results, the encoding is shown to be formally “fractal” in the sense that, after an initial period of encoding, new data becomes less costly to encode than data already encoded.  Fractal encoding has been demonstrated in large Orb data sets derived from cyber security log files.  The key-less hash table is a simple but deep innovation whose value becomes apparent in the context of semiotic interactions involving human cognitive acuity during the retrieval of data from an Orb construction.  The semiotic interaction involves the development of visual iconic structure that renders the containers of the key-less hash tables as small topological neighborhoods defined over a set of graph constructions.  The neighborhoods are differentially defined using variations on the topology derived from similarity analysis.  When a human makes annotations, the result is a formal construction in which logical atoms are “reified” subject indicators.  A loose correspondence is then made between data invariance, the temporal evolution of invariance, and the scientist’s tacit knowledge about the physical process being observed.
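The base-64 reading of an ASCII token can be sketched as follows. This is only our illustration of the general idea; the alphabet, the table size, and the function itself are our own assumptions, not Primentia’s patented scheme.

```python
# Illustrative sketch of a "key-less" offset: a token is read as a base-64
# number, so its table slot follows from the token itself and no key needs
# to be stored alongside the data.  Alphabet and modulus are assumptions.

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-"

def keyless_offset(token: str, table_size: int) -> int:
    value = 0
    for ch in token:
        value = value * 64 + ALPHABET.index(ch)  # base-64 positional reading
    return value % table_size                    # fold into a fixed table
```

Because the slot is computed from the token alone, two encoders that agree on the alphabet and table size locate the same data at the same offset without exchanging keys.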

 

Time series analysis methods, and methods related to the theory of boundaries, are to be used to identify significant beginning and ending times and locational focus over relatively small parts of Earth Observational data.  Neural architectures known to be involved in human memory retrieval and selective attention (Levine and Prueitt) are to be used in the architectural design of the human/computer interface.  The organization of raw data will then be quite unlike any data mining that can come from data first organized into highly structured relational databases.  Since the objects of investigation related to highly structured databases are real objects in the real world, we expect to measure phenomena that have a stream of instrumented data collection into these highly structured databases.  Comparison of the Orb and relational data outcomes is then anticipated. 

 

The Orb technology is fully available as completed code with tutorials.  The first applications of this technology were in studies of the structure of cyber events.  The second area of application has been the detection and inventory of linguistic variation in free-form, unstructured text data.  Of course, text data is not truly unstructured; with this term, the data mining literature refers to data that is not directly encoded into a specific relational database schema.  The key to the new Orb-based technologies is the schema-independent encoding structure.  This notion is not philosophically consistent with most information technologies, which depend on the pre-organizational structure of schema. 

 

Real-time organization of Orb data into "information" is formative.  Sub-structural organization is developed via some additional constraints on what is otherwise an underconstrained logical entailment.  The notational indication of an underconstrained logical entailment implies that no purely deductive inference can occur without placing additional logical constraints on the computation.  This constraint occurs holonomically, by affecting all data elements “at the same time”.  The formative organization is a true convolution operator that picks up some, but not all, of the localized data from the Orb set.  Given the encoding structure, this convolution is very fast.  Retrieval is not made using SQL, but using convolutional operators.  The convolution operators act over a data encoding defined by the Hilbert Engine, and use a patent awarded by the PTO to Primentia in 2003. 
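In the simplest reading, retrieval by convolution rather than SQL might look like the sketch below, where the kernel is reduced to a bare predicate over triples; the operational kernels are of course richer. All names are illustrative.

```python
# Sketch: retrieval as a convolution that "picks up some but not all" of
# the localized data from an Orb set.  The kernel here is a bare predicate;
# triples and names are hypothetical examples.

def convolve(orb, kernel):
    """Return the smaller Orb of triples the kernel picks up."""
    return {t for t in orb if kernel(t)}

orb = {
    ("storm_a", "precedes", "flood_b"),
    ("storm_a", "observed_by", "sensor_17"),
    ("flood_b", "observed_by", "sensor_09"),
}

# Retrieval without SQL: select by relation type.
observations = convolve(orb, lambda t: t[1] == "observed_by")
```

The result of the convolution is itself an Orb set, so convolutions compose, which is consistent with the semi-group claim made for them elsewhere in this proposal.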

 

In addition to schema-independent encoding and fast convolutional organization and retrieval, the Orb technology is anticipatory in nature and can depend strongly on real-time perturbation of the formative process by human cognitive acuity.  A visual interface to Orb constructions is available in the form of the SLIP (Shallow Link analysis, Iterated scatter-gather and Parcelation) technology, first developed in 2001 for use in distributed cyber security event analysis and detection.  The result of Orb convolutional organization is similar to what some in our group refer to as conceptual blending.  The blending can be accomplished either with or without human intervention.

 

The use of this standard, very simple structure, to encode local relational data has been taught to graduate students in a scientific database course at George Washington University in 2003. 

 

On the nature of natural ontology, supplemental note

Clearly, the implied completeness of a solution to the issue of semantics is not achieved if all one has is an OWL-type representation of human knowledge.  Presentations of OWL do indicate issues that need to be addressed through reconciliation of terminology methods, but the criticality of symbol-semantics reconciliation, and the non-stability of the meaning of structure as environmental conditions change, is stressed less than it could be. For example, in Raskin et al [1]:

"Semantic understanding of text by automated tools is enabled through the combined use of i) ontologies and ii) software tools that can interpret the ontologies.  An ontology is a formal representation of technical concepts and their interrelations in a form that supports domain knowledge.  Generally an ontology is hierarchical with child concepts having explicit properties to specialize their parent concept(s)."

There is a sense given that standard OWL ontology is a rather completely satisfactory solution.  An upper ontology for NASA is what Raskin et al are proposing, and there is value in any upper ontology that is reasonably developed. But the notion of ontology is not as developed within the Semantic Web community as one might suspect from the rhetoric. Orbs are capable of producing hierarchical ontology, but this is not what is most natural for Orbs.  What is most natural, not merely for Orbs but for a neurologically grounded model of mental event formation, is not to impose an inheritance theory simply because hierarchical structures allow the common types of OIL (Ontology Inference Language).  Other types of inference wait in the wings, as indicated in our work on mutual induction involving humans and computer representations of the invariance in data structure. Mutual induction occurs when deductive processes in a computer and the inductive capability of a human are taken together to create informational structure.  An encoding of this type of mutual-induction information structure is notationally captured by Lev Goldfarb’s notation and encoded using either the Hilbert Engine or a derivative of the Hilbert Engine, i.e., as Orb constructions defined as inductive informatics. 

 

Orbs allow a more complex, as in underdetermined, structure to be defined when the data is first acquired.  Later, when analysis is desired, a convolution is selected and removes complexity to produce a single unambiguous informational statement.  The primitive structure is a “simple graph” that is underconstrained in its set form.  Aggregation into a specific graph requires some type of convolution.  Various classes of convolutions exist, and can be applied over data that has been preserved in the more complex form, i.e., as sets of n-tuples.  The convolution we have worked with most is stochastic in nature, i.e., the scatter-gather in a topological space.  This produces emergent feature patterns that are then encoded in our notation and in the data encoding accessible by the Hilbert Engine.  Details are to be given about this technique.  Given a different stochastic convolution, the limiting distribution, or the "retrieval", is quite likely to be different in major and minor ways.  Formative and differential ontology follows, as discussed in various presentations the group has made on the use of Orbs in intelligence gathering.  Differential ontology is a new development in pure mathematics, with application to computer science and to the modeling of complex natural systems.  This work’s greatest development lies in the near future. 
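A toy version of the stochastic scatter-gather convolution can be sketched as follows; the actual SLIP procedure differs in detail, and the similarity function, data, and parameters here are invented for the example.

```python
import random

def scatter_gather(items, n_bins, rounds, similar, seed=0):
    """Toy iterated stochastic scatter-gather: items are scattered randomly
    into bins, then each item migrates to the bin holding the most items it
    is `similar` to.  Illustrative only; not the actual SLIP algorithm."""
    rng = random.Random(seed)
    bins = [rng.randrange(n_bins) for _ in items]   # scatter
    for _ in range(rounds):                          # iterate
        for i, x in enumerate(items):                # gather
            scores = [0] * n_bins
            for j, y in enumerate(items):
                if i != j and similar(x, y):
                    scores[bins[j]] += 1
            if max(scores) > 0:
                bins[i] = scores.index(max(scores))
    return bins

# Triples sharing a subject atom gather into the same bin.
triples = [("a", "r", "b"), ("a", "r", "c"), ("x", "r", "y"), ("x", "r", "z")]
bins = scatter_gather(triples, 4, 3, lambda s, t: s[0] == t[0])
```

With a different seed or a different similarity function, the limiting arrangement can differ, which is the point made above about distinct stochastic convolutions yielding distinct retrievals.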

 

The convolutions can be as simple as the often-used subsetting procedures over NASA data.  In the subset transform, a set of data is routed into two or more bins, while maintaining the same data format and data encoding.  The purpose may be to get a random sample of massive data, so that data-warehouse-type data mining techniques might be performed on the smaller data set.  A standard convolution would use the subsetting process to make decisions about the final designation of a single data element.  In this way the kernel of the convolution acts as a routing mechanism.  This kernel can have, and in many text-mining operations does have, knowledge in the form of an OWL-type ontology.  The point about the maturity of the Semantic Web concept of ontology is that more can easily be done with the underconstrained Orb sets.  These Orb sets are more reasonable constructions than OWL with OIL in the context of modeling the nature of human knowledge processes.  We do not see this as a philosophical issue, but one where natural scientists are simply trying to get the Semantic Web community to see that the current standards processes are not aligned with what natural science knows about human language and knowledge sharing.
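The subsetting transform described above can be sketched as a routing kernel; the instrument names and the routing rule below are illustrative only.

```python
# Sketch: subsetting as a convolution whose kernel routes each element to a
# bin, leaving format and encoding untouched.  All names are hypothetical.

def subset_route(data, route):
    """Route each element to the bin chosen by the `route` kernel."""
    bins = {}
    for element in data:
        bins.setdefault(route(element), set()).add(element)
    return bins

orb = {
    ("granule_1", "instrument", "MODIS"),
    ("granule_2", "instrument", "MODIS"),
    ("granule_3", "instrument", "ASTER"),
}

# Route by the value slot: each instrument gets its own subset.
by_instrument = subset_route(orb, lambda t: t[2])
```

A more knowledgeable kernel could consult an ontology before choosing a bin; the routing pattern itself is unchanged.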

 

The Team

 

An association between six individuals has led to a supporting Anticipatory Theory of Information.  The individuals in this research association are:

Dr. Paul Prueitt (linguistics, semiotics, logic, computer science and mathematics )

Dr. Lev Goldfarb, (inductive informatics, computer science )

Dr. William Benzon ( semiotics, theory of information, neuroscience )

Dr. Alex Citkin ( semiotics, logic, computer science and mathematics )

Dr. Daniel Levine ( mathematics, cognitive neuroscience )

Nathan Einwechter (cyber security, computer science )

Daniel Levine and Paul Prueitt bring expert experience regarding conceptual coherence.  Both researchers know the works of Karl Pribram well.  Dr. Citkin brings experiences with what is called “open logic” and semiotics, as well as expert knowledge of distributed computational systems and system security.  Nathan Einwechter is also an expert on security, collaborative systems and distributed peer-to-peer systems. 

 

 

Dr. Prueitt’s contribution

Dr. Prueitt completed a PhD on the mathematical models of biological intelligence in 1989. Models were derived from artificial neural network and genetic algorithm literatures with focus on biologically feasible reaction and network models; and from theoretical immunology with focus on recognition of novelty and structural connections between recognition and response mechanisms found in plant and animal immunological systems. Postdoctoral work at Georgetown University engaged him in distributed computer processes, using transputers, and information theory related to a separation of a theory of function from a theory of structure.  Quantum mechanical models were explored in the context of biological mechanisms involved in perception and reaction. After post-doctoral work, he worked in industry in the role of senior scientist for several DoD consulting software companies.  Highlights of this work involved computational and algorithmic similarity analysis and ontology construction.  Machine taxonomy and knowledge management architectures were reviewed and designed. A significant project was engaged on Intrusion Detection System event analysis.  Computational linguistics and text and image understanding technologies were reviewed and several patentable mechanisms were discovered and published. Dr. Prueitt is a Research Professor at George Washington University and has served on four doctoral committees at George Washington University, in computer science.

 

We regard the anticipatory technology as having a theory of information that is grounded in the fundamental laws of physics, properly understood.  As in quantum field theory, there are paradoxes.  All things are local, except the holonomic constraints that act as non-local entailments.   Using a stratified theory, where physical processes are organized into levels of interaction, we conjecture that all causes can be seen at all times using appropriate abstractions and notation.  The elements of an Orb set are encoded to capture the local information that is known precisely, yet without a full contextual interpretation.  At a different scale of observation, non-local entailments are modeled with convolution operators defined over Orb sets.  A convolution operator then becomes a repository of “machine learning” due to reinforcement theory, as seen in the field of artificial neural networks (Levine and Prueitt). The mathematics and the computer science fit well because of our use of the key-less hash table innovation.  (A normal hash table will work in all cases, differing from a key-less hash table only in performance.) The answer to a question is then dependent on memory stores, the Orb constructions, as well as the formulation of the question, i.e., the anticipatory convolution.  The result is a smaller Orb construction and a binding together, or conceptual blending, of the individual Orb “atoms” into a compound.  This compound is then represented as a small graph structure and can be made interoperable with Ontology Web Language (OWL) standards.  The tags of this small graph can be used to generate alerts and automated responses, or to serve as cognitive primers for individual human experience.

 

Global organization occurs due to the addition of constraints, as realized in the specific kernel of a mathematically defined convolution operator.  Thus the mathematics is firm.  The key-less hash table encoding allows both the fractal scalability we need and microsecond performance of convolutional operations, even over very large data sets. 

 

The notion of convolution is what the patent of an intelligence community contractor, Applied Technical Systems Inc, is developed around.  The Orbs generalize that patent.  The notation of the Orb generalization is given in a notational paper posted at:

http://www.bcngroup.org/area2/KSF/Notation/notation.htm

Dr. Prueitt and his student, Nathan Einwechter, developed this paper in 2003. We feel that the ATS patent is but one of several early views of Orb technology, and that Orb technology will be the information basis of the near future. The case has been made mathematically, in cognitive neuroscience terms, and as a matter of machine encoding and retrieval processes, that an anticipatory technology can be created that allows data to be encoded into local information, much as with XML, and then organized opportunistically at a later time.  Dr. Prueitt’s research group’s work is consistent with neuroscientist Karl Pribram's notion of how the brain forms concepts. 

 

 

Dr. Benzon’s contribution

Dr. Benzon will introduce formal structure designed to create various standard format ontologies.  The Orbs are designed to find structure in raw data.   Metadata over the contextualization and grouping of raw data is to be supplied by humans.  There is, then, a two-step process,

(1)                instrumentation/measurement of raw data feeds,

(2)                encoding/interpretation of the data and structures identified using derivatives of text and image algorithms.

The two-step process can be applied to Earth Observational data acquisition design and scheduling.  It, and the ontology developed using it, can help develop new instruments and new types of data measurement, for example with Fourier transforms and image understanding using scatter-gather and other algorithms.  We do not want to limit the anticipated uses, and will find one or two specific areas of inquiry where two or three domain experts agree to work with the new methodologies to produce new types of informational resources. But the focus of our effort is on developing the complete life cycle for Orb encoding of “local” structure and the formative processes that allow differential interrogation of this structure using global convolutions.

 

On the one hand we have a community of physical scientists who are trying to understand earth processes.  Members of the community have a "list" of the phenomena that interest them, some substantial understanding of those phenomena, and knowledge of physical indicators of those phenomena. Further, at least some of these investigators may be looking for new phenomena. On the other hand, we have a collection of instruments that measure various aspects of earth processes and are transmitting piles and piles of data to earth. What bit strings in what data streams, AND combinations of data streams, map onto phenomena of interest, known and unknown?

 

Presumably the investigators who have designed these instruments have done so with specific phenomena in mind. They know how the data in their instruments serves as indicators of those specific phenomena.  But there is always the possibility that the same data will serve to indicate other known phenomena or that, in combination with data from some other instrument, it will do so.

 

How do we discover these new possibilities?  How do we design the data structure to allow for this?  How do we allow people to experiment with fitting various ontologies over the raw data?  These are the fundamental questions that our group will address.

 

Note that these are social-organizational and intellectual problems, for different people and institutions have ownership of different instruments and datasets. How can they interact so that everyone gets maximum benefit both from the data and from one another's expertise? Dr. Benzon will play the central role in sorting out some of the knowledge representation issues.  He will identify and communicate with domain experts who are interested in deriving new types of empirical knowledge from the data.

 

Dr. Benzon has written a primer on writing SQL queries against geographic databases.  Others in the team have worked on massive structured databases and are very familiar with how they work.   We will appreciate and review this work, but our efforts will concentrate on demonstrating a human-centric information production (HIP) approach that does not assume the conventional highly structured relational databases, but which is completely compatible with such databases.  We will clarify how the Orbs allow for fuzzy or soft chunking of the data, since this is the key to allowing for flexible ontology construction.

 

Budget

Equipment needs are for a total of nine dual-processor G5 machines, each with 8 GB of RAM, which will be expensed from project overhead.  These will be developed into a single distributed processing core, based on a combination of a Linux kernel and the Forth-based Hilbert Engine operating environment.  Grid security and standard practices will be mirrored in this distributed processing core, thus providing a secure interface to external NASA data sets, such as at the various Distributed Active Archive Centers (DAACs).  Our processing core will be separated from other types of Internet operations by a number of mechanisms, one being an internal binary “core talk” that reflects highly proprietary work on a type of binary XML virtual machine.  This work is proprietary only because of investment discussions being carried on by one of our group, Sandy Klausner.  Klausner is the founder of CoreTalk Inc and will not be compensated via this proposal.  However, the principles of data regularity in context and design-time binary encoding decisions are foundational to both the Orb encoding and a simple distributed operating system that we have developed for secure peer-to-peer Orb transactions.  A long-term relationship exists between Klausner and our research group. 

 

A graduate student stipend may support one doctoral student at George Washington University.  We anticipate funding this stipend using other funds.  Each of the six principal researchers will have one machine, dedicated to real-time distributed processing of NASA data.  New research nodes can be added by anyone willing to assign a similarly equipped Macintosh, running Linux, as a dedicated node in what will be a small virtual computing grid. 

 

We will also license an inexpensive collaborative software system, called the Manor, and work with the owner of this system to create a stand-alone research environment based on open source code written in C, Python, and PHP.  The existing client code base is available on all operating systems, and a well-defined server API has been developed within Linux.  This research environment will link to both distance-learning curriculum modules about Orbs and NASA data analysis, and dedicated research platforms for Orb-based analysis of Earth observation data.  Funding for this license is estimated for two years. In all cases but one, the personnel have full-time support already in place.  Yearly personnel support is estimated at this time to be:

 

Project management

Full time (OntologyStream Inc)

40K   Nathan Einwechter (cyber security, computer science)

Part time (OntologyStream Inc)

60K   Dr. Paul Prueitt (linguistics, semiotics, logic, computer science and mathematics)

 

Consultants (OntologyStream Inc)

30K   Dr. William Benzon ( semiotics, theory of information, neuroscience )

10K   Dr. Alex Citkin ( semiotics, logic, computer science and mathematics )

10K   Dr. Daniel Levine ( mathematics, cognitive neuroscience )

10K   Dr. Lev Goldfarb ( inductive informatics )

 

The estimated total for personnel is 160K per year, plus a 25% administrative overhead (40K) on personnel costs.  Primentia software costs are 25K per year.  Conference participation and travel is 10K. 

 

Estimated total is 235K per year.  However, these figures are adjustable depending on the support level that NASA may choose. 

 

The core group is extending the Orb capabilities with or without NASA funding support.  Our participation in AIST-NRA 2004 Mini-Solicitation (NNG04ZY4001N) may be at a much smaller funding level than suggested above (270K per year).  Our interest is in applying TRLevel 2 technology to a broad range of data mining problems using Earth Observational Science data.  The support we request may enable the development of TRLevel 7 systems within the first year.  A full project would seem to justify the requested funding, but we are also prepared to engage on a one-year demonstration project at a figure of around 100K. 



[1]  Enabling Semantic Interoperability for Earth Science Data: http://esto.nasa.gov/conferences/estc2004/papers/a5p1.pdf