Decentralized Meta-Data Strategies: Effective Peer-to-Peer Search

Publication Type: 
Journal: 
IEICE Transactions on Communications
Volume: 
E86-B
Number: 
6
Pages: 
1740-1753
Year: 
2003
Keywords: 
Abstract: 
Gnutella's service announcement in March 2000 drew worldwide attention to the peer-to-peer (P2P) model. Basically, the P2P model does not need the broker (the centralized management server) that until now has figured so prominently in prevailing business models, and it offers a new approach that enables peers such as end terminals to discover and locate other suitable peers on their own without going through an intermediary server. It seems clear that the wealth of content made available by peer-to-peer systems like Gnutella and Freenet has spurred many authors into considering how meta-data might be used to support more effective search in a distributed environment. This paper reviews a number of these systems and attempts to identify some common themes. At this time the major division between the different approaches is whether or not they use a hash-based routing scheme.
Notes: 

Freenet forwards queries according to beliefs about the contents of other nodes, treating file similarity as closeness in a "key-space" generated by a cryptographic hash. Users must know a file's key in order to retrieve it from the network. Files are inserted into particular locations (as opposed to just being shared in place, as in the Gnutella network); combined with aggressive caching, this means the arrangement of files ends up reflecting that of the key-space.
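
The forwarding decision described above can be illustrated with a minimal sketch. It is not Freenet's actual implementation: the function name `next_hop`, the routing-table layout, and the use of absolute integer difference as the measure of "closeness" are all assumptions made for illustration.

```python
def next_hop(routing_table: dict, target_key: int) -> str:
    """Choose the neighbour to forward a request to.

    routing_table maps a key this node has seen before to the neighbour
    it believes holds that key (hypothetical layout). The request goes to
    the neighbour associated with the known key closest to target_key.
    """
    closest_known = min(routing_table, key=lambda k: abs(k - target_key))
    return routing_table[closest_known]


# Toy key-space with small integers instead of real hash values.
table = {100: "node-a", 5000: "node-b", 90000: "node-c"}
choice = next_hop(table, 4200)
```

Because every node forwards toward the neighbour it believes is closest to the requested key, and files are cached along the way, files with nearby keys tend to accumulate on the same nodes.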
Types of meta-data:

- Document Hash: id generated from the document contents via some hashing algorithm that ideally will be unique to each document;
- Document Id: id assigned arbitrarily to a document according to some scheme - different from a hash in that it must be generated by some authority;
- Statistical Representation: representation generated by performing a statistical operation on a document, which may involve statistics relating to a larger document collection, e.g. TFIDF;
- Human assigned: keywords or more complex statements such as RDF.
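
As a small illustration of the first item in the list above, a document hash can be derived from the content alone, with no central authority involved. The choice of SHA-256 and the function name `document_hash` are assumptions for this sketch; the systems reviewed may use other algorithms.

```python
import hashlib


def document_hash(content: bytes) -> str:
    """Return a hex identifier computed only from the document's bytes."""
    return hashlib.sha256(content).hexdigest()


same_1 = document_hash(b"peer-to-peer search")
same_2 = document_hash(b"peer-to-peer search")
other = document_hash(b"peer-to-peer search!")
```

Identical content always yields the same id, whichever peer computes it, while even a one-character change yields a different id. A document id, by contrast, would have to be looked up from or issued by some authority.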

Approaches to Markup:

- TFIDF (Term Frequency Inverse Document Frequency), an Information Retrieval approach due to Salton and Yang that rates the degree to which words are representative of a document, together with VSM (Vector Space Model) and LSI (Latent Semantic Indexing);
- XML and RDF.
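
To make the TFIDF item in the list above concrete, here is a minimal sketch of one common weighting variant, tf × log(N/df). The exact formula, the function name `tfidf_weights`, and the assumption that every document is a non-empty list of tokens are choices made for this example and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of non-empty token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: (count / len(doc)) * math.log(n_docs / df[t]) for t, count in tf.items()}
        )
    return weights


collection = [
    ["gnutella", "search", "peer"],
    ["freenet", "search", "key"],
]
w = tfidf_weights(collection)
```

In this toy collection "search" occurs in every document and so receives weight zero, while "gnutella" and "freenet" each occur in only one document and receive positive weights, which is the sense in which TFIDF rates how representative a word is of a document.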

Strategies:

- Semantic Routing: a query is routed according to the meta-data contained in that query.
- DHT (Distributed Hash Table): specifies a relation between entities (files, documents, etc.) and a position in a distributed network.
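
The two strategies above can be contrasted in a short sketch. Everything here is hypothetical: the names `semantic_next_hop`, `ring_position`, and `ToyDHT`, the keyword-overlap score used for semantic routing, and the ring-with-successor layout used for the DHT are illustrative assumptions, not the design of any particular system reviewed in the paper.

```python
import bisect
import hashlib


def semantic_next_hop(query_terms, neighbour_profiles):
    """Semantic routing: forward to the neighbour whose advertised
    meta-data terms overlap most with the terms in the query."""
    terms = set(query_terms)
    return max(neighbour_profiles, key=lambda n: len(terms & neighbour_profiles[n]))


def ring_position(name: str, bits: int = 16) -> int:
    """Hash a name to a position on a small circular key-space."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class ToyDHT:
    """DHT: each entity maps to the first node at or after its ring position."""

    def __init__(self, node_names):
        self.ring = sorted((ring_position(n), n) for n in node_names)
        self.positions = [p for p, _ in self.ring]

    def responsible_node(self, key: str) -> str:
        i = bisect.bisect_left(self.positions, ring_position(key))
        return self.ring[i % len(self.ring)][1]  # wrap around the ring


profiles = {"node-a": {"jazz", "mp3"}, "node-b": {"rdf", "xml", "metadata"}}
hop = semantic_next_hop(["rdf", "metadata"], profiles)

dht = ToyDHT(["node-a", "node-b", "node-c"])
owner = dht.responsible_node("song.mp3")
```

In the semantic case the destination depends on what the query says and on what neighbours advertise, so different nodes may route the same query differently. In the DHT case the destination is a pure function of the key and the set of nodes, so every peer computes the same owner for "song.mp3".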