Text-mining vs./and/or Linked Data?
Reading Jonathan Rochkind’s musings on using Wikipedia as an authority file (something I’m all for), I was struck by his comment that:
I think wikipedia-miner, by applying statistical analysis text-mining ‘best guess’ type techniques, provides more relationships than dbpedia alone does. I know that wikipedia-miner’s XML interface is more comprehensible and easily usable by me than dbpedia’s (sorry linked data folks).
XML-over-REST vs. SPARQL debates aside, I think there is an interesting issue here regarding the kind of relationships that statistical text-mining produces vs. the kind typically found in Linked Data. Linked Data favors “factoids” like date-and-place-of-birth, while statistical text-mining produces (at least in this case) distributions interpretable as “relationship strength”. The wikipedia-miner results aren’t “facts” in any normal sense, but as Rochkind suggests they may be more useful. Now sure, you could represent the wikipedia-miner results as Linked Data, but what I’m trying to get at here isn’t a question of data models or syntax. It’s about how and when we choose to treat the patterns in our data as facts, and when we are content to treat them as patterns. Thoughts?
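To make the distinction a bit more concrete, here is a minimal Python sketch of what I mean. Everything in it is made up for illustration: the `Fact` and `Relatedness` types, the `relatedTo` predicate, the `promote_to_facts` helper, and the scores themselves are all hypothetical, and none of it reflects the actual wikipedia-miner or DBpedia data or APIs. The point is only that a pattern becomes a "fact" the moment someone picks a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A Linked Data style assertion: it is either stated or it is not."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class Relatedness:
    """A text-mining style result: a score rather than an assertion."""
    a: str
    b: str
    strength: float  # assumed here to lie in [0, 1]


def promote_to_facts(scores, threshold):
    """Turn patterns into facts by thresholding; the cutoff is a judgment call."""
    return [Fact(s.a, "relatedTo", s.b) for s in scores if s.strength >= threshold]


# Invented values, purely for illustration.
scores = [
    Relatedness("Text mining", "Data mining", 0.82),
    Relatedness("Text mining", "Linked data", 0.41),
    Relatedness("Text mining", "Tim Berners-Lee", 0.12),
]

facts = promote_to_facts(scores, threshold=0.4)
for fact in facts:
    print(fact.subject, fact.predicate, fact.obj)
```

Move the threshold and the set of "facts" changes, even though the underlying data has not. Keeping the `Relatedness` records around instead is the "content to treat them as patterns" option.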