So I started a conversation with the clever and oh-so helpful Mike Steckel from International SEMATECH about thesauri and their kinfolk. It seems he learned a ton from the argus seminar, and was kind enough to share some of that learning with me.
It proved to be tremendously helpful. You would not believe how little about organization tools is in English for ordinary people. Kudos to Mike and the former argonauts!
I reproduce it below in hopes it helps some other poor lost fool
Despite having read this, and having followed multiple links from Google, I can't really sort out how a controlled vocabulary is so different from a thesaurus (they seem to be used almost interchangeably) and why it is useful. It seems to me-- please please correct me-- that a controlled vocabulary could hinder information retrieval if used without a thesaurus.. so is it just the basis for one? or?
Use "CONGO, Democratic Republic of" rather than ZAIRE
A thesaurus is a very advanced way of controlling vocabulary and in general shows:
1. Equivalence - variants and preferred terms
2. Hierarchical - Broader and Narrower
3. Associative - "see also" references
Hope that helps.
I find it interesting that a thesaurus is a subset of controlled vocabularies.. what are other sorts of controlled vocabularies? would a misspelling list be one?
An article: http://www.dlib.org/dlib/november98/11batty.html
Peter and Lou have written a lot on this.
A huge series of links on the subject.
I'm still trying to figure out relationships between controlled vocabs, thesauri, taxonomies, and keywords...
sigh. messy, innit?
Normally a taxonomy does add in hierarchy, but it does not attempt to be a complete representation of something.
Check out the ASIS thesaurus:
http://www.asis.org/Publications/Thesaurus/isframe.htm
or the Art and Architecture Thesaurus (faceted! Cool!):
http://www.getty.edu/research/tools/vocabulary/aat/hierarchies.html
These attempt to be fully descriptive of the activities of their field. People often think of Roget's when they think of a thesaurus; this is basically something different.
A taxonomy is smaller, but usually does contain hierarchies. "Taxonomy" is often thrown around in KM circles and is the most abused word in your list.
Both of these (taxonomy and thesaurus) are controlled vocabularies. In my case, I have a thesaurus of semiconductor manufacturing terms that I assign to documents for information retrieval. When I take a term from the thesaurus and put it on a document I am giving the document a keyword. When the user searches for a variant of the keyword, like, he calls it a reticle when we use the term MASK (they generally mean the same thing), we can pull the documents with MASK assigned and give them to him. The thesaurus tells us "when you see reticle, pretend it is MASK." A taxonomy would do the same thing.
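To make the reticle/MASK example concrete, here is a minimal Python sketch of "when you see reticle, pretend it is MASK." Only the reticle/MASK pair comes from Mike's description; the document ids, the ETCH keyword, and the function names are made up for illustration, and a real thesaurus tool would surely look different.

```python
# Variant -> preferred term. Only reticle/MASK is from the discussion;
# everything else here is a hypothetical stand-in.
PREFERRED = {
    "reticle": "MASK",
    "mask": "MASK",
}

# Hypothetical documents, each "indexed" with preferred terms (keywords).
INDEX = {
    "doc-001": {"MASK"},
    "doc-002": {"ETCH"},
    "doc-003": {"MASK", "ETCH"},
}


def normalize(word):
    """Map a user's word to its preferred term; unknown words are upper-cased."""
    return PREFERRED.get(word.lower(), word.upper())


def search(word):
    """Return ids of documents whose keywords include the preferred form of `word`."""
    term = normalize(word)
    return sorted(doc_id for doc_id, keywords in INDEX.items() if term in keywords)


print(search("reticle"))  # ['doc-001', 'doc-003']
```

The user never has to know that MASK is the house term; the lookup table does the translating before the keyword match happens.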
I would say that this is a taxonomy that would be familiar:
http://www.usableweb.com/
Keith assigns keywords to the documents he puts here. The thing he draws from is the taxonomy. The terms are taken more or less from the material and organized, hard to do this without some hierarchy involved.
is yahoo a place where we can start to talk about relationships.. like I search, right? well, my keyword is matched against what...? I get pages, but I also get categories..... what's going on here....
Equivalence - variants and preferred terms
Hierarchical - Broader and Narrower
Associative - "see also" references
I don't know what you mean by "taxonomy of organization tools"
          controlled vocabulary
                    ^
         thesaurus      taxonomy
             ^              ^
  spelling  synonyms  category  keywords
or some such... you know, a diagram demonstrating the hierarchical relations among the organization tools....
Authority list -- lowest level -- no hierarchy, just preferred terms, a way to tell the system "CA" is the same as "California"
Taxonomy -- middle level -- hierarchy, pulled from material, may have gaps if there is no content. You would be able to tell that San Francisco is a narrower term and California is a broader term. If there is no content relating to Santa Clara, then Santa Clara would not be a term. This is the highest level necessary for most websites.
Thesaurus -- Highest level -- Peter Morville called this the "Rolls-Royce of controlled vocabularies" at a seminar I went to. It would attempt to include all California cities as a subset of California. In other words each city would have California as a broader term. It would also show related terms such as cities that are near each other, or something like that. Generally useful only to very large sites. By the way there are two kinds -- pre-enumerative and post-enumerative (faceted), but don't worry about that yet.
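One way to see what the hierarchy buys you: a search on a broader term can pull in everything filed under its narrower terms, and related terms can be followed when asked. The Python sketch below is only an illustration under assumptions. San Francisco and California come from the example above; Oakland, the page ids, and the function names are hypothetical.

```python
# Broader term -> narrower terms. Santa Clara is absent on purpose:
# with no content about it, it never became a term.
NARROWER = {"California": ["San Francisco", "Oakland"]}

# "See also" links, the part a full thesaurus adds (hypothetical pair).
RELATED = {"San Francisco": ["Oakland"], "Oakland": ["San Francisco"]}

# Hypothetical pages, each indexed with one preferred term.
DOCS = {"page-1": "San Francisco", "page-2": "Oakland", "page-3": "California"}


def expand(term, include_related=False):
    """Return `term` plus all of its narrower terms, and optionally related terms."""
    terms = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in NARROWER.get(current, []):
            if child not in terms:
                terms.add(child)
                stack.append(child)
    if include_related:
        for t in list(terms):
            terms.update(RELATED.get(t, []))
    return terms


def search(term, include_related=False):
    """Return ids of pages indexed with any term in the expanded set."""
    wanted = expand(term, include_related)
    return sorted(doc for doc, keyword in DOCS.items() if keyword in wanted)


print(search("California"))                           # ['page-1', 'page-2', 'page-3']
print(search("San Francisco", include_related=True))  # ['page-1', 'page-2']
```

With only an authority list you would get neither behaviour; with a taxonomy you get the first; the second needs the associative links of a thesaurus.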
"Big Blue" is the same as IBM. This says "when you see 'Big Blue' it means IBM"
If you had a medical site used by both doctors who might use "Nyctalopia" and consumers who might look for "Night Blindness," this would be a way to link them together.
I have a book here in my office that says "thesaurus" is a Latin form of a Greek word meaning "Treasure Store." I like that!
Also, FYI...for me, the process of finding a keyword from my thesaurus and applying it to some content is "indexing."
at which point I asked if I could reproduce this in the blog, and he graciously agreed. Thanks Mike!
This is how I understand how the different components (taxonomies, thesauri, keywords, metadata, controlled vocabularies) relate to each other:
Taxonomies are classifications (an arrangement in groups or categories with established criteria). Taxonomies can become navigation in a web site or tables and fields in databases.
Thesauri are controlled vocabularies. This is the framework that is often built from a combination of the metadata categories (or keywords) and the taxonomies. This is what I have put in screen and content decks to be entered by the engineers. Usually gleaned from the content in an assessment.
Keywords are words that are essentially repeated often or have a particular emphasis on a site or in a document or a database. These can be what librarians call access points to the information.
Metadata uses keywords and taxonomies to create a high-level view of the information on a site or in a database. The metadata is the information framework which everything on the site relates to. These can be what librarians call access points to the information. They can follow the Dublin Core element set (if you need a copy of the latest DC, I would be happy to send it to you).
Controlled vocabularies are thesauri. These are all the variants of a word or phrase the way it is found in the site or database (e.g. misspellings, words with similar meaning and inter-related terms).
This is the way I understand them. I have been doing quite a bit of research in this area since I have been a woman-of-leisure. I have been accepted into graduate school in the LIS program and imagine that I will get a fuller understanding as the next 2 years pass!
I hope that this has helped you. If I can clarify any more let me know. I have some docs to use and send if necessary. This has been straight from my head, so if you need documented quotes I may be able to scrounge them up. :>
Tess
How I understand it: (essentially the same)
- a controlled vocabulary is simply anything that says: "Use this word instead of this". It doesn't have structure. If you want to group your controlled vocabularies (like in: cities, things to do, ...) you need to make separate ones for each group.
- a thesaurus is a controlled vocabulary with some added stuff: it doesn't just say use this instead of this, but also gives a way of showing (roughly) hierarchy (broader and narrower terms), related terms and variants. However, it has the same problem that you can't group your terms into blobs.
- a topic map is yet more advanced than a simple controlled vocabulary or a thesaurus: it allows for a fantastic complexity of organising your metadata (things like scope, or relationships), and it's very strong in the way the data can be manipulated by a computer (You can automatically MERGE topic maps for example, try that with a thesaurus). (see http://easytopicmaps.com for more info)
- a taxonomy is anything that organises things into categories.
- metadata is just the generic term for data about data. All the above are forms of metadata.
So if you use a controlled vocabulary, or even a thesaurus, you will probably also need to build a taxonomy. A topic map can include all of that (and more)
I couldn't help myself, I elaborated a bit here:
http://easytopicmaps.com/index.php?page=TopicmapThesaurusOrControlledVocabulary
I love this conversation. I just want to clarify.
Tess:
"Taxonomies are classifications (an arrangement in groups or categories with established criteria)"
Peter:
"- a taxonomy is anything that organises things into categories"
What are you organizing into categories here? The controlled terms. It is difficult to organize
your terms without establishing some sort of hierarchy eventually. The extent to which you
decide to do this determines whether you have a taxonomy or a thesaurus. The thesauri I
keep in my cube are huge, like textbooks. Taxonomies work for most sites perfectly well.
I have just started reading about topic maps and they look thrilling. Thanks for setting up
the site Peter!
Excellent comments. I'm with Christina on this -- still on the beginner level with taxonomy vs. controlled vocab vs. thesaurus -- but the one thing that I read a while ago that really helped clarify everything was this Argus presentation (PDF) by Chris Farnum on IA for Intranets. There's lots of great info there on general IA and applying it to Intranets, but pages 11-23 hit on all the stuff we're talking about here. It isn't as in-depth but it's a nice overview that helped sort out all of the terms for me.
"I'm a little late to this discussion...
Can anyone direct me to info about how Controlled Vocabs/Thesauri/ etc. are actually implemented in search engines? How do I get it from a spreadsheet onto the server--what does my database team need to do? Part of it is associating content items one-by-one with preferred terms, but how do I, for example, implement "correct spellings"? (i.e. "tilenol" is "Tylenol")"
Thanks
Re: tilenol.
I'm struggling with the same problem and have come to the conclusion that brute force is probably the only way to catch most mis-spellings or slight differences in terminology.
For smaller vocabularies, I had some success comparing search words with what was in the database using a string distance algorithm (edit distance), but this wouldn't scale.
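For anyone wondering what that comparison might look like, here is a small Python sketch. It assumes the distance in question is something like Levenshtein edit distance (the comment doesn't say which one was used). The vocabulary, the threshold of 2, and the function names are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical preferred terms; only Tylenol appears in the discussion.
VOCABULARY = ["Tylenol", "Aspirin", "Ibuprofen"]


def suggest(word, max_distance=2):
    """Return the closest vocabulary term within `max_distance` edits, else None."""
    best = min(VOCABULARY, key=lambda term: edit_distance(word.lower(), term.lower()))
    if edit_distance(word.lower(), best.lower()) <= max_distance:
        return best
    return None


print(suggest("tilenol"))  # Tylenol
```

It also shows why this doesn't scale: every search word is compared against every term in the vocabulary, so the work grows with the size of the list.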
I can see the potential benefits of having a comprehensive vocabulary, but at the moment they are proving to be laborious to author, and having done the hard work, quite difficult to integrate with a search engine.
For example, I have a taxonomy which might have "Usability" and "Human Computer Interface" and "HCI" in it...and someone searches for "Human Computer Interface HCI Tools Software", how do I pick out which lumps of the search query are terms and which are just words?
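One possible approach, offered as a sketch and not a known-good answer, is greedy longest-match: walk the query left to right and, at each position, try the longest phrase that could be a vocabulary term before falling back to shorter ones. The three terms are from the example above; the function and variable names are hypothetical.

```python
# Lower-cased lookup key -> preferred display form.
VOCABULARY = {
    "usability": "Usability",
    "human computer interface": "Human Computer Interface",
    "hci": "HCI",
}

# Longest term, counted in words, so we know how far ahead to look.
MAX_TERM_WORDS = max(len(key.split()) for key in VOCABULARY)


def segment(query):
    """Split a query into (vocabulary terms found, leftover free-text words)."""
    words = query.split()
    terms, leftovers = [], []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shrink it.
        for size in range(min(MAX_TERM_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + size]).lower()
            if phrase in VOCABULARY:
                terms.append(VOCABULARY[phrase])
                i += size
                break
        else:
            # No phrase starting here is a term, so it is just a word.
            leftovers.append(words[i])
            i += 1
    return terms, leftovers


print(segment("Human Computer Interface HCI Tools Software"))
# (['Human Computer Interface', 'HCI'], ['Tools', 'Software'])
```

Greedy matching can guess wrong when terms overlap, so treat it as a starting point; the leftover words would still go to the ordinary full-text search.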
Tough
What a bunch of self-important crap. How brilliant to just decide to redefine what common understanding and dictionaries say the meanings of words are, such that any educated person has to fight continual cognitive dissonance to glean any understanding of the bafflegab you all are talking about.
Controlled vocabulary? Thesauri are controlled vocabularies with added hierarchical information? Puh-leaze.
A taxonomy is a hierarchy, yes. But it consists of two different things: the decisions that classify (taxons), and the things being classified (infons). There are the terms and there is some logic.
The taxons define the categories. The infons occupy the categories. Taxons are the branches of the decision tree. Infons are the leaves of the decision tree.
        Root
          |
    Color? [taxon]
     /    |    \
  Red   Green   Blue   [infons]
Infons can become taxons by adding resolution to the infon:
          Root
            |
      Color? [taxon]
       /    |    \
      /   Green   Blue   [infons]
  Red [taxon]
    /       \
Light Red   Dark Red   [infons]
And, taxons can become infons:
    Root
      |
Color [infon]
The roles depend on focus or granularity. Am I shopping for clothes or shirts?
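Read as a data structure, the three trees say that a node's role is not fixed: it is a taxon while it has branches under it and an infon while it is a leaf. A rough Python sketch of that reading follows. The `Node` class and `roles` helper are hypothetical, and the Root node from the diagrams is left out for brevity.

```python
class Node:
    """A tree node whose role depends on whether it currently has children."""

    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

    @property
    def role(self):
        # Branch of the decision tree -> taxon; leaf -> infon.
        return "taxon" if self.children else "infon"


def roles(node):
    """Return {label: role} for `node` and everything below it."""
    result = {node.label: node.role}
    for child in node.children:
        result.update(roles(child))
    return result


red = Node("Red")
color = Node("Color", [red, Node("Green"), Node("Blue")])
print(roles(color)["Red"])   # infon

# Adding resolution to Red turns the infon into a taxon.
red.children = [Node("Light Red"), Node("Dark Red")]
print(roles(color)["Red"])   # taxon

# With nothing beneath it, Color itself is only an infon.
print(Node("Color").role)    # infon
```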
Knowledge grows by converting infons to qualitative and later quantitative taxons. As this knowledge is explicated, new terminology gets created.
On Topic Maps, scope on a Topic Map is context. A word like demo means different things depending on context. It's descriptive data in marketing. It is a tape or CD if you are a band.
Well, dah.
A few more points on taxonomy.
Shopping is said to be navigation. And, navigation is said to be taxonomy. So it's only about terms when we think about keywords and metadata. There is a reality about decisions we make in real time lining out a taxonomy.
Life is a taxonomy, a navigation. Make the good decisions and you end up, hopefully and not macroeconomically, where you wanted to go.
That taxon/infon terminology didn't come from me. I found them on the EDS website where they were evolving an ecommerce practice.
I've used a variant of them when I think about ontologies. Ontologies are related to taxonomies, but no, not the same thing. Ontologies are fuzzier. :)
Here is something I ran across. Sorry, don't know where.
Controlled Vocabulary - All of the following:
Authority - Preference
Authority = Preferred Term + Variant
Taxonomy - Hierarchical
Taxonomy = Authority + Parent + Children
Thesaurus - Associative
Thesaurus = Taxonomy + Associated Terms
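Those three equations translate almost directly into record types, each one extending the last. The Python sketch below is just one way to spell that out. The class and field names are hypothetical; the California values echo the example earlier in the post, except for the associated term, which is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Authority:
    """Authority = Preferred Term + Variant"""
    preferred: str
    variants: List[str] = field(default_factory=list)


@dataclass
class TaxonomyEntry(Authority):
    """Taxonomy = Authority + Parent + Children"""
    parent: Optional[str] = None
    children: List[str] = field(default_factory=list)


@dataclass
class ThesaurusEntry(TaxonomyEntry):
    """Thesaurus = Taxonomy + Associated Terms"""
    associated: List[str] = field(default_factory=list)


entry = ThesaurusEntry(
    preferred="California",
    variants=["CA"],
    children=["San Francisco"],
    associated=["Nevada"],  # hypothetical "see also"
)
print(isinstance(entry, Authority))  # True
```

Every thesaurus entry is still a valid authority record, which matches the idea that each level only adds to the one below it.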
------
As to why anyone goes to the trouble? It is essential to find it before you can use it. Ambiguity abounds and the machines can't handle it. For example,
I looked at my demos today.
What the heck does that mean? Demos as in multimedia presentations. Or, Demos as in demographics.
Different scopes, different classifications, would have been needed to make the meaning clear. Read more on computational linguistics. They focus on ambiguity. They are not there yet.
This stuff is for machines. One of the advantages of SGML was the ability to do semantic-based post processing of text. Text is still being treated as a blob. We do not have any understanding of the content of strings. All of this stuff is extrinsic. The next revolution will be the intrinsic understanding of text. But, to get there ambiguity will have to be conquered.
Controlled Vocabularies in all their forms are extrinsic.
Topic Maps are still extrinsic. They associate keywords to specific strings in the text using a link anchor. Many technologies today are about moving closer and closer to the intrinsic. The tag will, however, always be extrinsic.
Topic Map = Thesaurus + RDF (or weaker URL)
Scope is hierarchical. Scope would let us decide what demo meant in that particular context.
If Scope(Marcom), then demo is unknown
If Scope(Marketing Research), then demo is demographics
If Scope(Selling), then demo is demonstration
If Scope(Promotion), then demo is demonstration
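The four rules above amount to a lookup keyed on both the word and the scope. Here is a tiny Python sketch of that idea. It is not how XTM or any real topic map engine represents scope; the table layout and the `resolve` function are hypothetical, and the hierarchy among scopes is left out.

```python
# word -> {scope: meaning}. Marcom is deliberately missing for "demo",
# which is what makes the meaning unknown in that scope.
MEANINGS = {
    "demo": {
        "Marketing Research": "demographics",
        "Selling": "demonstration",
        "Promotion": "demonstration",
    },
}


def resolve(word, scope):
    """Return what `word` means within `scope`, or 'unknown' if no rule applies."""
    return MEANINGS.get(word, {}).get(scope, "unknown")


print(resolve("demo", "Marketing Research"))  # demographics
print(resolve("demo", "Marcom"))              # unknown
```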
Now, what I said about RDF above is complicated by the fact that XTM and RDF are complementary. Some members in the TopicMap community want to use XTM, instead of RDF.
Ultimately, TopicMaps are meant for machine consumption.
------
On getting dictionaries into databases, you need a license to do that unless you are writing your definitions from scratch. The dictionary publishers already have terminology databases. They should be selling their content as a webservice. Try WordNet.
One point on the maintenance of context:
Content management systems (CMS) deliver structured text. The text is stored in a blob in the database. Within that blob the content can be structured (tagged) or unstructured (untagged). If it is untagged, the CMS will still deliver it in a tagged form at the level of the container holding the text.
If the CMS stores paragraphs, then you get paragraphs. If the CMS stores pages, then you get pages. The CMS delivers textual blobs at some container-based resolution.
When you write structured text, you cannot assume that a specific container will follow or precede your text. There are no transitions in structured or non-linear text. So making the assumption that you can determine the context of the text from adjacent text doesn't work. This is another reason why controlled vocabularies, taxonomies, and thesauri are important. It's not just selling or purchasing or discoverability or wayfinding. It is meaning itself.
Analog-to-digital processes take linear things like music and turn them into non-linear simulations of the real thing. The same is true of text on the web if it is delivered by a CMS.
The larger the storage granularity of that text, the more likely you will be able to disambiguate, but at the paragraph, sentence, or sub-sentence level clarity becomes more difficult.
The scope of a keyword in a Topic Map is trying to put back what the digitization process removed.