- Length: 34 pages
- Subject: Education - Computers
- Type: Term Paper
- Paper: #27856246

The final deliverable will be the search-time trial results and the conclusions drawn from them regarding optimal algorithm design. A definitive direction for future design work is considered a desirable outcome.

Quality assurance will be implemented through systematic review of the experimental procedures and analysis of the test results. Meeting the stated goals by their delivery dates, carrying out the tests, and completing the project to the satisfaction of the committee members will provide quality assurance for the research outcomes.

CHAPTER 2

Background and a Review of Literature

Data clustering is a technique for analyzing statistical data sets. Clustering is the classification of similar objects into groups: the data are partitioned into groups, known as clusters, so that the elements in each cluster share some common trait, usually proximity according to a defined distance measure. Essentially, the goal of clustering is to identify distinct groups within a dataset and then place the data within those groups according to their relationships with each other.

One of the main points of document clustering is that it is not primarily about finding documents on the web; that is what search engines are for. A clustering algorithm is much more concerned with how that information is displayed: in what order the results appear, what is considered relevant to the query, and how broad categories can be narrowed. Search engines do a fantastic job on their own when the user has a specific query and knows exactly what words will yield the desired results. But the user receives a ranked list of questionable relevance, because it includes anything containing the query terms, weighted chiefly by how often those terms appear in each document. With a good clustering program, the user is instead presented with multiple avenues of inquiry, organized into broad groups that become more specific as selections are made.

Clustering exploits similarities between the documents to be clustered. The similarity of two documents is computed as a function of distance between the corresponding term vectors for those documents. Of the various measures used to compute this distance, the cosine-measure has proved the most reliable and accurate. [10]
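The cosine measure can be sketched in a few lines. The sketch below is illustrative rather than taken from the paper: it assumes a term-frequency representation in which each document is a dictionary mapping terms to counts, and it computes the cosine of the angle between the two term vectors.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two term vectors.

    a, b: term-frequency vectors as {term: count} dicts.
    """
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)

# Two toy documents represented by their term frequencies.
doc1 = {"cluster": 2, "search": 1}
doc2 = {"cluster": 1, "query": 1}
```

A document compared with itself scores 1.0, documents with no terms in common score 0.0, and everything else falls in between, which is why the measure works well as a notion of distance between documents.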

Data clustering algorithms come in two basic types: hierarchical and partitional. In partitional clustering, queries are matched against the designated clusters, and the documents in the highest-scoring clusters are returned as the result.

When a hierarchy processes a query, it moves down the tree along the highest-scoring branches until it reaches a predetermined stopping condition. The sub-tree where the stopping condition is satisfied is then returned as the result of the search.

Both strategies rely on variations of the near-neighbor search, in which nearness is determined by the similarity measure used to generate the clusters. Cluster search techniques are comparable to direct near-neighbor searches, and both are evaluated in terms of precision and recall.

The evidence suggests that cluster search techniques are only slightly better than direct near-neighbor searches, and the former can even be less effective than the latter in certain circumstances. Because of their quadratic running times, clustering algorithms are often slow and unsuitable for tasks involving extremely large volumes of data.

Hierarchical algorithms build a hierarchy of clusters, creating new clusters based upon the relationships of the data within the set. They can do this in one of two ways, from the bottom up or from the top down; these two methods are known as agglomerative and divisive, respectively. Agglomerative algorithms start with the individual elements of the set as clusters, then merge them into successively larger clusters. Divisive algorithms start with the entire dataset in one cluster and break it into successively smaller clusters. Because hierarchical algorithms must analyze all the relationships inherent in the dataset, they tend to be costly in terms of time and processing power.
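As a concrete sketch of the agglomerative (bottom-up) case, the following toy example merges 1-D points using single-link distance. The use of scalar points and the function names are illustrative assumptions; a real document clusterer would work on term vectors with a similarity measure such as the cosine. The nested pairwise scan also makes the quadratic cost described above visible.

```python
def single_link_distance(c1, c2):
    # distance between clusters = distance between their closest members
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, target_clusters):
    # bottom-up: start with each element as its own cluster
    clusters = [[p] for p in points]
    while len(clusters) > target_clusters:
        # quadratic scan for the closest pair of clusters
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return clusters
```

Run on the points 1, 2, 10, 11, 50 with a target of three clusters, the nearby values pair off and the outlier is left alone, which is the behavior a divisive method would reach from the opposite direction.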

Partitional algorithms determine the clusters all at once, at the beginning of the clustering process. Once the clusters have been created, each element of the dataset is analyzed and placed in the cluster it is closest to. Partitional algorithms run much faster than hierarchical ones, which allows them to be used on large datasets, but they have disadvantages as well. Generally, the initial choice of clusters is arbitrary and does not necessarily capture all of the actual groups that exist within a dataset. Therefore, if a particular group is missed in the initial clustering decision, the members of that group will be placed in the clusters closest to them, according to the predetermined parameters of the algorithm. In addition, partitional algorithms can yield inconsistent results: the clusters the algorithm determines on one run will probably not match the clusters it generates the next time it is used on the same dataset.

Of the five algorithms under scrutiny in this paper, two are hierarchical and three are partitional. The two hierarchical methods are suffix tree and single pass. The suffix tree is "a compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents."[1]

It is a divisive method; it begins with the dataset as a whole and divides it into progressively smaller clusters, each composed of a node with suffixes branching off of it like leaves. Single-pass clustering, on the other hand, is an agglomerative, or bottom-up, method. It begins with a single cluster, then analyzes each element in turn, placing it within an existing cluster or in a new cluster of its own, depending on the similarity threshold set by the analyst.
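The single-pass procedure can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes 1-D points, judges each cluster by its centroid, and takes the similarity function and threshold as parameters in the way the analyst-set threshold above suggests.

```python
def single_pass(elements, threshold, similarity):
    """One pass over the data; clusters are lists of 1-D points (illustrative)."""
    clusters = []
    for x in elements:
        # find the most similar existing cluster, judged by its centroid
        best = None
        for c in clusters:
            centroid = sum(c) / len(c)
            sim = similarity(x, centroid)
            if sim >= threshold and (best is None or sim > best[0]):
                best = (sim, c)
        if best is None:
            clusters.append([x])   # nothing close enough: start a new cluster
        else:
            best[1].append(x)      # fold x into the most similar cluster
    return clusters

# A toy similarity: closer points score higher, with a maximum of 1.0.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))
```

Because each element is examined exactly once and never revisited, the result depends on the order in which the elements arrive, which is the usual caveat attached to single-pass methods.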

The three partitional algorithms are k-means, buckshot, and fractionation. K-means derives its clusters by repeatedly assigning each element of the dataset to the nearest of k centroids and then recomputing each centroid from the elements assigned to it. Buckshot partitioning starts with a random sampling of the dataset, derives the centers from that sample, and then places the remaining elements within the resulting clusters. Fractionation is a more careful clustering algorithm which divides the dataset into smaller and smaller groups through successive iterations of the clustering subroutine. [2] Fractionation requires more processing power, and therefore more time.
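The assign-and-recompute loop of k-means can be sketched as below. The sketch is an illustrative assumption on 1-D points, seeded with the first k points; practical implementations choose seeds more carefully and stop when the assignments no longer change.

```python
def kmeans(points, k, iterations=20):
    """Illustrative 1-D k-means; the first k points seed the centroids."""
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assignment step: each element goes to its nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # update step: recompute each centroid as the mean of its cluster
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters
```

Each iteration costs time proportional to k times the dataset size, which is why k-means scales to collections that quadratic hierarchical methods cannot handle.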

Clustering is a methodology for more effective search and retrieval over datasets, and it has been investigated in great depth in the literature. The principle is simple enough: documents with a high degree of similarity will tend to be sought by the same query. By automatically placing the documents in groups based upon similarity (i.e., clusters), the search is effectively broadened.

The buckshot algorithm is simple in design and intent. It chooses a small random sampling of the documents in the database and then applies the cluster subroutine to them. The centers of the clusters generated by the subroutine are returned. Because the quadratic-time subroutine is applied only to the small sample, the overall cost is held to rectangular time, which makes the buckshot method fast. The tradeoff is that buckshot is not deterministic, since it initially relies on a random sampling process. Repeated use of this algorithm can return different clusters than previous searches, although it is maintained that repeated trials generate clusters that are similar in quality to the previous set of clusters. [2]
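A minimal sketch of the method, again on illustrative 1-D points: sample roughly the square root of k times n elements, run a quadratic agglomerative subroutine on the sample only, take the resulting centroids as centers, and assign the whole collection to the nearest center. The function names and the single-link subroutine are assumptions for the sketch, not the paper's code.

```python
import math
import random

def single_link_merge(sample, k):
    # the quadratic agglomerative subroutine, run only on the sample
    clusters = [[p] for p in sample]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

def buckshot(points, k):
    # sample about sqrt(k * n) elements so the quadratic subroutine
    # costs only about O(k * n) -- rectangular time -- overall
    n = len(points)
    size = min(n, max(k, int(math.sqrt(k * n))))
    sample = random.sample(points, size)
    seeds = single_link_merge(sample, k)
    centers = [sum(c) / len(c) for c in seeds]   # 1-D centroids
    # assign every element of the full dataset to its nearest center
    clusters = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    return clusters
```

The call to random.sample is the source of the non-determinism discussed above: two runs over the same collection can draw different samples and therefore return different clusters.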

Fractionation algorithms find centers by initially breaking the corpus of documents into a set number of buckets of predetermined size. The cluster subroutine is then applied to each bucket individually, breaking the contents of the bucket into yet smaller groups within it. This process is repeated until a set number of groups is found, and these are the k centers. The method is like building a branching tree from the bottom up, with individual documents as the leaves and the centers (clusters) as the roots. The best fractionation methods sort the dataset based on a word index key (e.g. based on words in common between two documents).
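The bucket-by-bucket reduction can be sketched as follows. This is an illustrative assumption on 1-D points: sorting the values stands in for the word-index key mentioned above, each fixed-size bucket is clustered down to about half as many groups per round, and a final merge produces the k centers. The helper names are hypothetical.

```python
def centroid(c):
    return sum(c) / len(c)

def merge_bucket(bucket, goal):
    # agglomerative merging confined to one bucket, by centroid distance
    bucket = [list(c) for c in bucket]
    while len(bucket) > goal:
        best = None
        for i in range(len(bucket)):
            for j in range(i + 1, len(bucket)):
                d = abs(centroid(bucket[i]) - centroid(bucket[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        bucket[i] += bucket[j]
        del bucket[j]
    return bucket

def fractionation(points, k, bucket_size=4):
    # sorting the 1-D values stands in for the word-index key, so that
    # each fixed-size bucket holds similar elements
    clusters = [[p] for p in sorted(points)]
    while len(clusters) > bucket_size:
        next_round = []
        for s in range(0, len(clusters), bucket_size):
            bucket = clusters[s:s + bucket_size]
            # each bucket is clustered down to about half as many groups
            next_round += merge_bucket(bucket, max(1, len(bucket) // 2))
        clusters = next_round
    return merge_bucket(clusters, k)
```

Because the subroutine runs again in every round rather than once on a small sample, the extra work is visible in the code, which matches the observation that fractionation trades speed for care.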

It is important to note that both buckshot and fractionation rely on a clustering subroutine. Both algorithms are designed to find the initial centers, but rely on a separate algorithm to do the actual clustering of individual documents. This subroutine can itself be agglomerative, or divisive. However, in practice, the subroutine tends to be an agglomerative hierarchical algorithm.

Buckshot applies the cluster subroutine to a random sampling of the dataset to determine the centers, while the fractionation algorithm uses repeated applications of the subroutine over groups of fixed size in order to find the centers. Fractionation is considered more accurate, while buckshot is much faster, making it more suitable for searching in real time on the Web. [2]

After the centers have been determined, each algorithm proceeds to place the documents with their nearest center. After this step is completed, the resulting…