Wednesday, February 25, 2009

Deep Web exploration (blog post #4)

The New York Times recently published an article called "Exploring a 'Deep Web' That Google Can't Grasp," dealing with new technologies that are attempting to break into the "Web of hidden data."  Last year Google added its one trillionth Web page, but, as the NYT points out, it still can't satisfactorily answer a question like "What's the best fare from New York to London next Thursday?"  The Web does contain this kind of data, but search engines often have trouble locating the answers efficiently.

Most search engines use spider/crawler programs to find information, following trails of hyperlinks across the never-ending Web.  But this leaves an almost infinite amount of data, lying below the surface, unexplored.  Deep Web search start-up companies are trying to develop programs that analyze search terms and then broker each query to the relevant databases, as in the sketch below.
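To make the "query brokering" idea concrete, here is a minimal toy sketch in Python. Everything in it is hypothetical: the database names, the keyword lists, and the overlap scoring are invented for illustration and are not how any particular vendor's system works.

```python
# Hypothetical sketch of query brokering: route a search query to the
# deep-web databases whose topics best match it. All names and keyword
# lists below are invented for illustration.

from dataclasses import dataclass, field


@dataclass
class DeepWebSource:
    name: str
    keywords: set = field(default_factory=set)


# Toy registry of deep-web databases and the topics they cover.
SOURCES = [
    DeepWebSource("flight_fares_db", {"fare", "flight", "airline", "ticket"}),
    DeepWebSource("medical_journals_db", {"disease", "clinical", "trial", "drug"}),
    DeepWebSource("public_records_db", {"court", "license", "property", "records"}),
]


def broker_query(query: str, top_n: int = 2):
    """Score each registered database by keyword overlap with the query
    and return the most relevant ones to forward the query to."""
    terms = set(query.lower().split())
    scored = [(len(terms & s.keywords), s.name) for s in SOURCES]
    scored.sort(reverse=True)
    return [name for score, name in scored[:top_n] if score > 0]


if __name__ == "__main__":
    # "What's the best fare from New York to London next Thursday?"
    print(broker_query("best fare flight new york london thursday"))
    # -> ['flight_fares_db']
```

A real broker would obviously need far richer models of each database than a keyword set, but the basic shape is the same: analyze the query first, then send it only where an answer is likely to live.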

Google's strategy sends a program to analyze every database's content, "define" it, and then hit it with related search terms to develop a predictive model of what the database contains (see the sketch after this paragraph).  Another start-up is attempting to index every public database, hitting each one with automated search terms to "dislodge" the information.  The goal is interconnected data: a cross-referencing of pre-analyzed information to best answer a specific query.  It's almost as if these Deep Web programs are alive, reasoning and thinking for themselves.
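Here is a hedged toy illustration of that "probe and profile" idea. The search_database function stands in for a real database's search form, and the probe vocabulary and fake hit counts are invented; the point is only to show how firing sample terms at a database and tallying the responses can yield a rough model of what it contains.

```python
# Toy "probe and profile" sketch: hit a database with sample terms from
# several topics and total the hit counts to guess what it contains.
# The probe terms and the fake index below are invented for illustration.

from collections import Counter

PROBE_TERMS = {
    "travel":  ["fare", "flight", "hotel", "itinerary"],
    "medical": ["dosage", "clinical trial", "symptom", "diagnosis"],
    "legal":   ["statute", "plaintiff", "patent", "verdict"],
}


def search_database(term: str) -> int:
    """Stand-in for submitting a term to a database's search form and
    reading back the hit count. Here we fake a travel-heavy database."""
    fake_index = {"fare": 1200, "flight": 950, "hotel": 430, "itinerary": 77,
                  "dosage": 2, "patent": 0}
    return fake_index.get(term, 0)


def profile_database() -> Counter:
    """Probe the database with terms from each topic and total the hits,
    yielding a rough predictive model of the database's contents."""
    profile = Counter()
    for topic, terms in PROBE_TERMS.items():
        profile[topic] = sum(search_database(t) for t in terms)
    return profile


if __name__ == "__main__":
    print(profile_database().most_common())
    # -> [('travel', 2657), ('medical', 2), ('legal', 0)]
```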

The article mentions that in the future, Google may have problems implementing a "change."  There is a fear of overcomplicating the search experience and driving away faithful users.  I also wonder whether all of this will end up making things more efficient.  On paper it seems very promising, but how does a program sift through an almost infinite amount of information to find, link, and cross-reference reliable and accurate information?  How does it link seemingly unrelated information?  How does it stay up to date?  If we are going to answer questions about flight fares, for example, that data changes minute by minute.

Finally, the article mentions that the long-term implications of something like this are aimed more at businesses than at individual Web surfers.  But I think it is also very important for libraries.  The article mentions health sites cross-referencing pharmaceutical companies' data with medical research, or news sites cross-referencing public records in government databases.  This could have a large impact on the kinds of information that libraries could provide.

5 comments:

  1. Good article!

    We've been providing deep web technologies for years (http://www.deepwebtech.com), and were a bit disappointed in the lack of depth to the NYT article.

    Providing useful access to the deep web is quite challenging, and the real value is found for researchers looking for high-value needle-in-the-haystack content. For example, a medical researcher looking into a rare disease will not find what they are looking for in Google. They want to search their esoteric databases and journals (i.e. fee-based content).

    The deep web means different things to different people. That same medical researcher, during a different search, may be interested in business information on the weekend. An attorney prosecuting a semiconductor intellectual property case will be interested in completely different deep web content from our medical researcher.

    The trick, then, is to provide adaptive technologies that work for individual deep web researchers, making the deep web easily available and accessible.

    Check out www.mednar.com and www.biznar.com for examples of what I'm talking about!

    Take care. Larry.

  2. But, how does a program sift through an infinite amount of information to find, link, and cross-reference reliable and accurate information?

    I think this is the key here. A program can't know what "reliable and accurate information" is. People, however, can! Better yet, librarians can! This is why I don't think search engines will ever be the best way to find information--there just isn't enough manpower to check out a trillion sites.

  3. Good article and thoughts. Your mention of cross-referencing medical info sites with pharmaceutical companies and medical research sites made me wonder where the money is coming from for all of these new deep-web developments. The cost of programming and infrastructure development for this kind of thing has got to be pretty high.

    My thought is that the funding is probably coming from businesses with tons of money, and it may be that they are approaching this as an investment in future "advertising." That's not to say that we're not already inundated with advertising all of the time anyway, but it wears one down after a while.

  4. Right, Becki, librarians are all over the Deep Web and have been for a long time. Witness Larry Donahue jumping right into your post.

  5. Could any of this be used in a next-gen catalog? I wonder if we will see it in our lifetime? It is fun to think about!
