Saturday, July 10, 2004
Manhattan Project to Integrate Human-centric Information Production
Ontology referential base (Orb) basic Notational Paper (2003)
Key questions on Common Upper Ontology.
Note by Paul Prueitt on a related issue.
Discussion on: a common language underlying Readware and InOrb Technology
Ken,
As you stated, Letter Semantics corresponds to lexical representation through a direct “instrumented” measurement [1] of the co-occurrence of special letter triples.
It is vital that the analyst understand the origin and justification for each of the Readware letter triples. Questions like "why three letters?" need to be answered. The answer is likely to involve some demonstration of a bell curve where comparisons are made between different processes, each discovering a set of structural primitives. Then some numerical answer can be given that the Readware construction is optimal in some specific sense.
Perhaps the proper answer is: "we found a process of producing three-letter tokens that measured, perhaps imperfectly but still quite usefully, the invariances in human communication needs."
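Readware's curated inventory of letter triples and its parsing rules are not spelled out here, so the following is only a minimal sketch of the general idea under stated assumptions: treat every contiguous three-letter substring of a word as a token (a stand-in for the hand-selected Readware triples) and measure co-occurrence of tokens within documents.

```python
from collections import Counter
from itertools import combinations

def letter_triples(word):
    """All contiguous three-letter substrings of a word (a hypothetical
    stand-in for Readware's curated, hand-selected triple inventory)."""
    w = word.lower()
    return [w[i:i + 3] for i in range(len(w) - 2)]

def cooccurrence(documents):
    """Count how often pairs of letter triples appear in the same document."""
    counts = Counter()
    for doc in documents:
        triples = set(t for word in doc.split() for t in letter_triples(word))
        for a, b in combinations(sorted(triples), 2):
            counts[(a, b)] += 1
    return counts

docs = ["write writing written", "reading writing arithmetic"]
counts = cooccurrence(docs)
# 'wri' and 'rit' co-occur in both documents
print(counts[("rit", "wri")])  # → 2
```

The point of the sketch is only the shape of the instrument: a fixed token inventory, plus a co-occurrence count that can be checked against text, is a measurement device in the sense discussed above.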
At this point, it might be useful to use the language of J. J. Gibson and others in the ecological psychology community. They talk about linguistic affordances. As you and Tom have said, the nature of the need to communicate, the physical properties of the human mouth and brain, and other factors are all involved in creating a very stable set of substructural elements from which human thought, in each case, must arise.
Of course, the expression of individuality occurs as a perturbation from what is otherwise a very predictable situation. The predictability and the perturbation from what would otherwise be a purely deterministic flow of events play against each other. Over the long term the stability of the substructural elements is established. One need only look at the periodic table of the elements to see this substructural stability and the great variability that occurs in the chemical compositions we enjoy.
In any case, the reason why we can talk about a national Manhattan-type project to produce Human-centric Information Production (HIP) is the possibility that a simple computer-based methodology may map the substructure of any type of complex phenomenon. The possibility of mapping is due to an "ontological claim" that a fixed set of affordances shapes the behavior of any type of complex system. The formation and presence of specific types of terrorism cells is then brought under the light of new information tools. The War on Terrorism, the War on Drugs and several other types of social-political conflicts would be provided a science that measures complex phenomena, and would therefore provide a PUBLIC viewing of relevant information about many types of threats to democracy.
The presence of HIP technology tools, as open source software, will allow the application of HIP techniques to complex manufacturing processes, such as the production of foods and medicines. The primary purpose of the National Project is thus to create the HIP technology as public domain technology and to provide a new K-12 curriculum in mathematics and computer science.
The answer to the question "why three-letter tokens?" is then placed into the context of creating a measurement device. A different set of tokens might have been discovered, but the ontological claim is that there is something "external" to the measurement that is being measured.
As I will propose in a later bead, [45], we can also provide to the public a notational system and theory as to why our theory of stratified semantics has a correspondence to how the real world works.
Before any notational system and underlying explicit theory is developed, scientists work with intuitions. My work and Tom's work are very rigidly grounded in a "personal metaphysics" and a specific way of thinking. Both he and I will give up favorite terminology as we get some theoretical integration settled. We need to justify stratification theory, and create some way to expose the specific validation exercises that he has developed.
You say that "Letter Semantics is the measure," but saying this does not expose the detail. Our tutorials will explain precisely what this means. In the tutorials we will develop more detail about what is measured and how the interpretations about function are made.
Stratified theory suggests that the set of letter triples is a substructure for "compounds" of these letter triples. The total set of these compounds then creates a measurement device. It is appropriate to make a comparison between how this measurement is made and the well-known algorithms related to latent semantic indexing.
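To make the comparison with latent semantic indexing concrete, here is a toy illustration; the matrix values are invented solely to show the pipeline, and nothing here reflects how Readware actually computes its compounds. LSI takes a term-document count matrix (the rows could just as well be letter-triple counts), truncates its SVD to k dimensions, and compares documents in that reduced space.

```python
import numpy as np

# Toy term-document count matrix (rows: terms or letter triples,
# columns: documents). Values invented for illustration only.
A = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 2.0, 1.0],
])

# LSI: truncate the SVD to k latent dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # one k-dim vector per document

def cosine(u, v):
    """Cosine similarity between two document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_01 = cosine(doc_vectors[0], doc_vectors[1])
print(sim_01)
```

The contrast with the stratified view is then easy to state: LSI's "latent dimensions" are artifacts of the matrix decomposition, whereas the compounds discussed above are claimed to correspond to something external that is being measured.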
The correspondence between compounds and human concepts is then something that can be justified based on some set of objective metrics. I will speak to this some more in bead [45].
Ballard’s work is grounded in an ordered n-tuple,
< r, a(1), a(2), ..., a(n) >
It is unfortunate that I often talk about the ordered triple < a, r, b > as the most elementary construction. Ballard is right, but I am dealing with the pragmatics of encoding relationships that are NOT semantic relationships. There is more to say on this.
The letter triples are not in themselves a relationship. You do not have the middle, or first, element being a "relationship". The "relationships" in the Readware ConceptBase are established via the human empirical work that Tom and you did over the past two decades. The ConceptBase ties together several of the letter triples, in a way that is analogous to the way atoms are composed into chemical compounds.
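The three constructions named above can be contrasted as data structures. The sketch below is hypothetical (the names and example values are mine, not Ballard's or Readware's): an n-tuple relation with one relation slot over ordered arguments, the elementary < a, r, b > triple, and a ConceptBase-style "compound" that groups letter triples with no relation slot at all.

```python
from dataclasses import dataclass
from typing import Tuple, FrozenSet

@dataclass(frozen=True)
class NTupleRelation:
    """Ballard-style ordered n-tuple: one relation r over n arguments."""
    r: str
    args: Tuple[str, ...]

@dataclass(frozen=True)
class Triple:
    """The elementary < a, r, b > construction."""
    a: str
    r: str
    b: str

@dataclass(frozen=True)
class Compound:
    """ConceptBase-style grouping: letter triples tied together with
    NO relation slot -- the triples are substructure, not relations."""
    triples: FrozenSet[str]

# Hypothetical instances for illustration only.
gives = NTupleRelation(r="gives", args=("ann", "bob", "book"))
loves = Triple(a="ann", r="loves", b="bob")
write_concept = Compound(triples=frozenset({"wri", "rit", "ite"}))

print(len(gives.args), loves.r, sorted(write_concept.triples))
```

The design point is the one made in the text: the compound has no slot that could be read as "a relationship", which is exactly what distinguishes substructure from the relational constructions.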
Examples in the tutorials are therefore required to allow the expert user to understand the way our conceptual rollup engines work. These examples are the only way that the technology will find a market, because the conventional wisdom marginalizes both the notion of substructural semantics and the need for human-centric information production about the function of observed aggregations of substructure (see Figure 6, from Dmitri Pospelov's unpublished book, in Section 6 of my Chapter 2 in "Knowledge Foundations").
Some formal description of the set of letter triples needs to be made and justified. A precise number is important in this sense, because we need to reduce the uncertainty as to what one is talking about. We have to separate the details of description so that nothing is left ambiguous in our description.
We are in agreement that our two companies may now, if funding is found, lay out an objective notation; such a notation is vital to our designing applications based on the merging of the technologies. The Readware Provenance™ product is the first of a number of vertical applications that we can promise to investors.
But unless expert humans perform tasks related to the Provenance ontology services, the pollsters will never be able to use the pre-poll results that we make possible. So the educational process has to occur, and before this educational process can take hold there has to be a common language being used by Ballard, Sowa, you, others and me.
You said: “Letter Semantics gave us the semantic distance between variable signs that make use of these representations for their encoding.”
I have a principled argument that this is not the
correct language for description.
I do not see what any concept of “semantic distance” can relate to. This is because the concept of distance does not apply to the native notions of relationships between concepts. In fact the only concept of “semantic distance” that does apply is that relationships, as specific relationship types, are members of a set of specific relationships that, when aggregated together, create an experience of the concept, a specific concept. So there is both a theory of type and a theory of substructure.
Meaning does not have a distance. Not in reality.
But the phrase “semantic distance” is attempting, poorly, to signify something. I think that this something can be described better using Maturana’s and also Stuart Kauffman’s terminology. I will talk more about this in bead [45].
The naturally occurring concept has to be internally “related by co-occurrence” since this is something that can be measured as being there or not. This is why InOrb notation uses the language “subject matter indicator”. Of course, co-occurrence is a crude measure, but with visualization and human reification this measure may be the next best step. Ballard’s work may be beyond this step.
And we need to make clear what “variable signs” means. Ballard and others need to know what you mean. Why does a sign vary? What is a sign?
You said: “The measurements for all known concepts (words in the lexicon) are pre-computed and indexed to the general ConceptBase. The distance for a new concept is computed as it is detected in a system start and the indexes to the (new) conceptbase is re-compiled.”
We need to give a full explanation of this concept, one that is clear and with which people can become comfortable.
The pre-computing of the “known concepts” is why this
all works, and we have a lot here to talk about. There are some formal constructions that will make this
“landscape of concept representations” very clear.
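One way to picture the pre-compute/re-compile cycle described in the quoted passage is the following sketch. It is an assumption-laden toy, not the Readware implementation: concept "measurements" are reduced here to letter-triple signatures, and "re-compiling" to extending a dictionary when a new concept is detected.

```python
class ConceptIndex:
    """Minimal sketch: compute each known concept's letter-triple
    'signature' once, up front; re-compile when a new concept appears.
    (Signatures stand in for whatever Readware actually pre-computes.)"""

    def __init__(self, lexicon):
        self.signatures = {}
        for word in lexicon:
            self._add(word)

    def _add(self, word):
        w = word.lower()
        self.signatures[w] = {w[i:i + 3] for i in range(len(w) - 2)}

    def detect(self, word):
        """On a previously unseen concept, extend ('re-compile') the index."""
        if word.lower() not in self.signatures:
            self._add(word)
        return self.signatures[word.lower()]

index = ConceptIndex(["write", "read"])
sig = index.detect("writing")   # new concept triggers indexing
print(sorted(sig))
```

The sketch makes one point of the passage explicit: the expensive work is done once over the lexicon, and a new concept costs only an incremental update to the index.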
As we get the vagueness out of the language being used,
we will find new science that everyone should be looking for.
Clearly again, the Precision/Recall metrics that you
have demonstrated are impressive, and yet the observation of good
Precision/Recall metrics does not prove an underlying theoretical
construction. Finding a common
theoretical construction and making it available to others has to be the
measure of success for our project in the IC.
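For readers unfamiliar with the metrics leaned on above, the standard Precision/Recall computation is as follows; these are the textbook information-retrieval definitions, not anything specific to Readware's evaluations.

```python
def precision_recall(retrieved, relevant):
    """Standard IR metrics:
    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|"""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented document IDs, for illustration only.
p, r = precision_recall(retrieved={"d1", "d2", "d3", "d4"},
                        relevant={"d1", "d2", "d5"})
print(p, r)   # → 0.5 and 2/3
```

As the text argues, good values of these numbers show that a system works on a test collection; they do not by themselves validate the theoretical construction behind it.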
[1] By this we mean the total process of parsing and applying rules to the parsing of text. In some sense, this is natural language processing by computer programs, even if how Readware does this is not exactly the same as traditional natural language parsing as found in the literature.