Humans are classification machines. We love to categorize and segment our lives into neat little piles. Have you ever sighed in joy after finally getting around to cleaning up that cluttered garage? There are even games where the object is to build order from disorder. We solve Rubik's cubes and prune our shrubs and pick up our socks and put them in the hamper. As the father of two boys, I'm generalizing of course.
People are much better at seeing semantic relationships than any machine. We can tell latte from onion rings. We know how Paris Hilton is related to Larry King. We understand the concept and purpose of underpants. Despite decades of research, and many, many smart person-hours dedicated to cracking that nut, computers still have no clue.
Computers lack subtlety. Sure, they can match patterns all day long and generate statistical relationships among words, but they cannot understand meaning. Let me say that again: computers do not understand meaning. No amount of marketing hype changes that fact.
However, what computers can do, they do really fast. They are fantastic arithmeticians. They don't understand the difference between salad tongs and cigars. Maybe they never will.
Since we are so far away from sentient algorithms, I propose that the best place to put our energy is in making it easier for humans to impart their wisdom into machines. This means tools that allow people to see how the information is distributed in an application. Visualization tools integrated with authoring tools that give instant feedback to let a real live person judge whether that document is a clinical study report or a movie review.
This is not a software review, so I'll leave the googling as an exercise.
Does that mean we give up all hope on extracting meaning from text? Of course not! Some day it will happen, and right now all the fun is in the learning. There is some wonderful research going on in reverse engineering the brain, and we may have a breakthrough sooner rather than later. I hope so. Have a look at Numenta for some of the exciting new approaches that are emerging for artificial intelligence.
This is good news for librarians and their ilk. If there is any group that obsesses over organizing documents, it is librarians. Organizations will do well to get some good librarians and supply them with cool tools, none of which even attempt to do the librarians' job. Nothing beats a good, well-thought-out taxonomy that fits the subject matter at hand.
Music is as good an analogy as any. Programmers can write algorithms that will generate entire, fully scored symphonies. Would you really want to listen to two hours of completely machine-generated music? No. You would not. Really, you wouldn't. While there are some really cool digital musical tools available, you need that human touch to do the actual composition. That's what librarians do. They apply their human knowledge to a well-explored domain and orchestrate access to that domain through a series of instruments.
Musicians have Pro Tools and GarageBand. Are there similar tools for librarians? Someday. We got rid of the card catalog, but the cello is still around, so it's a long road.
[The author, George Everitt, is President of Applied Relevance, LLC, a "boutique consulting firm concentrating on enterprise search".]
Dear George,
Chris Brockmann of brox pointed me to your blog post, as my company, Connexor, specializes in language technology.
After a quick read, it seems to me that the article does not argue against automatic text analysis and classification in general, but against one method of doing it, namely the statistical one.
Connexor works in the (minority) knowledge and AI based paradigm.
".. I propose that the best place to put our energy is in making it easier for humans to impart their wisdom into machines."
This is very much what Connexor (and its founders in their academic careers) have done since the late 1980s:
- design a linguistic representation and description language to describe and analyse natural language in a formal computer-operable way
- design & implement an algorithm to execute a knowledge-based language model (hand-crafted lexicons, morphologies, grammars) on input text
- do this in a language-generic way: the same algorithm applies English language models to English texts, German language models to German texts, etc., and produces its analyses in a highly uniform way for easier deployment, e.g. in a multilingual application
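The three points above can be sketched in a few lines of Python. This is only a toy illustration of the general idea (one analysis algorithm, interchangeable hand-written language models), not Connexor's actual software; the `LanguageModel` class, the `analyse` function, the tag names, and the tiny lexicons are all assumptions invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class LanguageModel:
    """Hand-written knowledge for one language (all entries are illustrative)."""
    lexicon: dict[str, str]  # word form -> part-of-speech tag
    suffix_rules: list[tuple[str, str]] = field(default_factory=list)  # (suffix, tag)


def analyse(text: str, model: LanguageModel) -> list[tuple[str, str]]:
    """Tag each token: lexicon lookup first, then suffix rules, else 'UNKNOWN'."""
    result = []
    for token in text.lower().split():
        tag = model.lexicon.get(token)
        if tag is None:
            # Fall back to the first suffix rule that matches the token.
            tag = next(
                (t for suffix, t in model.suffix_rules if token.endswith(suffix)),
                "UNKNOWN",
            )
        result.append((token, tag))
    return result


# Two hypothetical models; the algorithm above is identical for both.
ENGLISH = LanguageModel(
    lexicon={"the": "DET", "dog": "NOUN", "barks": "VERB"},
    suffix_rules=[("ly", "ADV"), ("ing", "VERB")],
)
GERMAN = LanguageModel(
    lexicon={"der": "DET", "hund": "NOUN", "bellt": "VERB"},
    suffix_rules=[("ung", "NOUN"), ("lich", "ADJ")],
)

english_analysis = analyse("The dog barks loudly", ENGLISH)
german_analysis = analyse("Der Hund bellt freundlich", GERMAN)
```

Both calls return the same uniform structure, a list of (token, tag) pairs, which is what makes the output easy to consume in a multilingual application. A real system would of course need far richer lexicons, morphology, and grammar rules.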
(It is also possible that the author does not know about the knowledge-based approach to language analysis by computer, so s/he may think his/her argument is against computer analysis in general.)
In the 1990s we published a lot on this topic. Here is one of the best-known papers.
Christer Samuelsson; Atro Voutilainen (1997), Comparing a Linguistic and a Stochastic Tagger http://www.aclweb.org/anthology-new/P/P97/
In short: one could consider using the negative review to put down most of the (statistical) competition and to differentiate the brox/Connexor approach to text analysis. But not too much hype; semantic classification is a very demanding problem, though quite a lot of useful stuff can be done with what we already have.
Best,
Atro
Dear Atro,
Thank you for your comment.
My perspective is that of an implementer of Autonomy and Verity software, and an extrapolation of their technology to the industry as a whole. I can say that I've never seen a real-world implementation of automatic classification that worked as well as a human-centric approach in the long run. This is not an academic statement based on exhaustive research; it is an anecdote based on personal experience. I've also never seen a Yeti. This does not mean that it doesn't exist.
"(It is also possible that the author does not know about the knowledge-based approach to language analysis by computer, so s/he may think his/her argument is against computer analysis in general.)"
I am aware of the knowledge-based approach to language analysis by computer. However, I have no academic credentials regarding it, so I would defer to those who do. Clearly this blog post is not intended to be a well-researched thesis on comparative merits of various linguistic processing approaches. It is meant to be a shoot-from-the-hip provocation of some really smart people.
Like George Bush on the deck of the USS Lincoln, I say "Mission Accomplished".
In fact, with a quick look at Connexor, it is more the solution than the problem. It very much depends on human input to define the rules. That's the approach that I'm talking about. Most humans do not have extensive linguistic credentials; the ordinary corporate librarians I've met will struggle to understand the difference among "knowledge based", "rule-based" and "statistical" classification methods. Mention Bayes or Shannon's algorithm and their eyes glaze over. Yet they are dazzled by the magic.
If I'm railing against anything, it is the black-box approach to automatic classification. I'm not ready to trust any "magic" black boxes.
The "time flies like an arrow" problem has not yet been solved in the general sense.
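To show why that sentence is hard, here is a minimal Python sketch that enumerates the part-of-speech readings a toy lexicon allows. The lexicon, the tag names, and the crude "exactly one verb" filter are assumptions made up for this illustration; no real parser works this simply.

```python
from itertools import product

# Toy lexicon: every tag a word could carry (illustrative, not exhaustive).
LEXICON = {
    "time": ["NOUN", "VERB"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "PREP"],
    "an": ["DET"],
    "arrow": ["NOUN"],
}


def candidate_readings(sentence):
    """Return every tag sequence the lexicon allows for the sentence."""
    words = sentence.lower().split()
    tag_choices = (LEXICON[word] for word in words)
    return [list(zip(words, tags)) for tags in product(*tag_choices)]


def has_one_verb(reading):
    """Crude plausibility filter: keep readings with exactly one verb."""
    return sum(tag == "VERB" for _, tag in reading) == 1


all_readings = candidate_readings("Time flies like an arrow")
plausible = [reading for reading in all_readings if has_one_verb(reading)]

for reading in plausible:
    print(" ".join(f"{word}/{tag}" for word, tag in reading))
```

Of the eight possible tag sequences, three survive the filter: time passing quickly, insects called "time flies" that are fond of an arrow, and an instruction to time the flies. The machine can list them; picking the intended one takes knowledge of the world.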
George definitely has a point when he states that we should use machines for machine-oriented tasks and that we should support people in people-oriented tasks. And yes: extracting meaning by machines without any human input at all (not even as a rule set that has to be followed) is definitely a bridge too far.
However, that doesn’t mean that we have to limit ourselves to the restrictions of search, discovery and visualization tools. It also doesn’t mean that the best solution is to hire librarians to organize documents. Creating and maintaining thesauri as a way to merely classify documents is a very labor-intensive process that will only deliver suboptimal results. The meaning added to the document is limited by the small set of concepts and relations and by the time factor. The greatest obstacle, however, is that the support it delivers to the user is quite rudimentary. It says “you don’t need to seek in this complete warehouse, all you need you can find in corridor 75, shelf 22-29; good luck”.
If we want to make it easier for humans to impart their wisdom into machines, as George rightfully suggests, we need a shift in focus from document-centric classification and retrieval towards case- and event-based interaction support. Let us put the human effort where it is really valuable and where it is valued: where the business relevance is high and where the business is supported in achieving the goals your organization exists for.
That is possible by deriving concepts, relations and rules from data, documents, code, processes and humans and storing them in a model layer consisting of ontologies. We can use semantic technology to add meaning to information and to reason over this meaning by computer. It helps us to answer questions like “How can I do this?”, “What do I need to do?”, “Which product should I choose?” and “How do I apply this policy?” in my specific context and situation.
Above this separate model layer we position the instruments that support all kinds of interactions, like navigating, calculating, checking, comparing and deciding. Of course the support delivered must be dynamic: based upon the currently available information and input, the behavior of your support service (or knowledge service, if you prefer) will change, triggered by the rules in your model layer. In this way we can reuse the knowledge in models and functions for various purposes.
This doesn’t mean that you will never need a link to a basic document; in many situations you will, for example in cases where you extract rules from procedures and regulations. You can also extend that document with a set of related documents for research purposes. Whether you use human classification for these purposes should be driven by a realistic cost- and risk-benefit analysis. And of course you should then use this human capacity as much as possible at a meta level instead of at the document level.
So in my opinion and experience it is more worthwhile to focus on making information usable and directly applicable. Or, in other words, on supporting humans in “knowing how to do” above “knowing what is available where”. It is not about access; it is about use.
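As a rough illustration of that separation, here is a minimal Python sketch in which a small rule set plays the part of the model layer and a single function plays the part of an instrument on top of it. The `Rule` class, the `advise` function, and the example rules about team size and regulation are all hypothetical; a real model layer would be built on ontologies and a proper reasoner.

```python
from dataclasses import dataclass
from typing import Any, Callable

Context = dict[str, Any]


@dataclass
class Rule:
    """One piece of captured knowledge: when it applies and what it advises."""
    name: str
    applies: Callable[[Context], bool]
    advice: str


# The "model layer": knowledge kept separate from the tools that use it.
MODEL_LAYER = [
    Rule("small-team", lambda c: c.get("team_size", 0) <= 10, "Choose the basic plan."),
    Rule("large-team", lambda c: c.get("team_size", 0) > 10, "Choose the enterprise plan."),
    Rule("regulated", lambda c: c.get("regulated", False), "Apply the data-retention policy."),
]


def advise(context: Context, rules: list[Rule] = MODEL_LAYER) -> list[str]:
    """An 'instrument': answer 'what should I do?' for the given situation."""
    return [rule.advice for rule in rules if rule.applies(context)]


small_team_advice = advise({"team_size": 4})
regulated_advice = advise({"team_size": 40, "regulated": True})
```

The answer changes with the context passed in, while the rules themselves stay in one place and could be reused by other instruments, such as a checker or a comparison tool.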
Kind regards,
Thei Geurts
Business Consultant
Ordina Holding BV