Tuesday, February 24, 2009

Exploring a ‘Deep Web’ That Google Can’t Grasp

One day last summer, Google’s search engine trundled quietly past a milestone. It added the one trillionth address to the list of Web pages it knows about. But as impossibly big as that number may seem, it represents only a fraction of the entire Web.
Beyond those trillion pages lies an even vaster Web of hidden data: financial information, shopping catalogs, flight schedules, medical research and all kinds of other material stored in databases that remain largely invisible to search engines.

Now a new breed of technologies is taking shape that will extend the reach of search engines into the Web’s hidden corners. When that happens, it will do more than just improve the quality of search results — it may ultimately reshape the way many companies do business online.

Search engines rely on programs known as crawlers (or spiders) that gather information by following the trails of hyperlinks that tie the Web together. While that approach works well for the pages that make up the surface Web, these programs have a harder time penetrating databases that are set up to respond to typed queries.

To extract meaningful data from the Deep Web, search engines have to analyze users’ search terms and figure out how to broker those queries to particular databases. Google’s Deep Web search strategy involves sending out a program to analyze the contents of every database it encounters.

As the major search engines start to experiment with incorporating Deep Web content into their search results, they must figure out how to present different kinds of data without overcomplicating their pages. This poses a particular quandary for Google, which has long resisted the temptation to make significant changes to its tried-and-true search results format.
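The crawling limitation described above can be sketched in a few lines. This is a toy model, not any real search engine's crawler: the pages, links, and the query-backed results page are all invented for illustration. A crawler that only follows hyperlinks never discovers a page that exists solely as the response to a submitted form.

```python
from collections import deque

# A toy "web": each page maps to the hyperlinks it contains.
# The results page exists only as a response to a query form,
# so no static page links to it (an assumed example URL).
WEB = {
    "home": ["about", "search-form"],
    "about": ["home"],
    "search-form": [],            # the form page carries no outgoing links
    "results?flight=BA123": [],   # hidden behind the form: never linked
}

def crawl(start):
    """Breadth-first crawl that discovers pages only via hyperlinks."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in WEB.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

surface = crawl("home")
# The query-backed results page stays invisible to the crawler.
print("results?flight=BA123" in surface)  # prints False
```

Reaching that hidden page requires the extra step the article describes: recognizing the form and brokering a typed query to the database behind it, rather than waiting for a hyperlink that will never appear.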

Beyond the realm of consumer searches, Deep Web technologies may eventually let businesses use data in new ways. This level of data integration could eventually point the way toward something like the Semantic Web, the much-promoted — but so far unrealized — vision of a Web of interconnected data. Deep Web technologies hold the promise of achieving similar benefits at a much lower cost, by automating the process of analyzing database structures and cross-referencing the results.
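The cross-referencing idea above can be illustrated with a minimal sketch. The two record sets and their field names are entirely hypothetical; the point is only that two databases describing the same entities under different schemas can be joined automatically once their matching fields are identified.

```python
# Hypothetical data from two independent sources that describe the
# same items under different field names (invented for illustration).
trials = [
    {"compound": "aspirin", "outcome": "positive"},
    {"compound": "ibuprofen", "outcome": "mixed"},
]
prices = [
    {"drug_name": "aspirin", "price": 3.50},
    {"drug_name": "naproxen", "price": 5.00},
]

def cross_reference(a, a_key, b, b_key):
    """Join two record lists on fields inferred to mean the same thing."""
    index = {rec[b_key]: rec for rec in b}
    return [
        {**rec, **index[rec[a_key]]}  # merge matching records
        for rec in a
        if rec[a_key] in index
    ]

# Only "aspirin" appears in both sources, so one merged record results.
merged = cross_reference(trials, "compound", prices, "drug_name")
```

In practice, the hard part the article alludes to is the first step, deciding that `compound` and `drug_name` refer to the same thing; the join itself is straightforward once that mapping is known.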

1 comment:

  1. This was an interesting article, although Deep Web technologies have been around for years. My company has been surfacing the deep web with sites like Science.gov, WorldWideScience.org and Mednar.com since 2002. Our twist: we don't index the web but search all the sources in real time, so the data is not stale.

    ReplyDelete