Managing Gigabytes: Compressing and Indexing Documents and Images
Ian H. Witten, Alistair Moffat, Timothy C. Bell

In this fully updated second edition of the highly acclaimed Managing Gigabytes, authors Witten, Moffat, and Bell continue to provide unparalleled coverage of state-of-the-art techniques for compressing and indexing data. Whatever your field, if you work with large quantities of information, this book is essential reading: an authoritative theoretical resource and a practical guide to meeting the toughest storage and access challenges. It covers the latest developments in compression and indexing and their application on the Web and in digital libraries. Managing Gigabytes is a treasure trove of theory, practical illustration, and general discussion of this fascinating technical subject.
Published (Last): 3 August 2007
Manufactured in The Netherlands.

Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Books on general information retrieval (IR) have a variety of foci. Some cover the conceptual aspects of IR (Baeza-Yates and Ribeiro-Neto; Lancaster and Warner; Meadow), others focus on algorithms and code for implementing systems (Frakes and Baeza-Yates), and still others focus on specific domains (Hersh). This book by Witten, Moffat, and Bell is algorithm-oriented, but provides a nice conceptual overview of IR as well. The title of the book implies that its focus is on compression, and that is certainly a major theme running through the chapters. However, the book also covers general IR, describing automated indexing, querying, basic principles of evaluation, and even a modestly visionary view of the information explosion.
The book begins with an introductory overview chapter. This is followed by a chapter on text compression, which presents a comprehensive discussion of the different basic approaches.
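Huffman coding is one of the basic approaches that the text-compression chapter treats. As a purely illustrative sketch (not the book's own implementation), code construction can be done with a priority queue of partial trees, always merging the two least frequent:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for the symbols in `text`.

    Returns a dict mapping each symbol to its bit string; more
    frequent symbols receive shorter codes.
    """
    freq = Counter(text)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (frequency, tie-breaker, {symbol: code-so-far});
    # the tie-breaker keeps the dicts from ever being compared.
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        # Prefix the two subtrees' codes with 0 and 1 respectively
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

codes = huffman_codes("abracadabra")
# 'a' occurs 5 of 11 times, so it gets the shortest code
assert all(len(codes["a"]) <= len(codes[s]) for s in codes)
```

The resulting code is prefix-free, so the encoded bit stream can be decoded unambiguously without separators.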
The subsequent chapter covers indexing, beginning with a general discussion of inverted indexes and then turning to the compression of such indexes. This is followed by a chapter on querying, which discusses general retrieval principles and efficient means for carrying them out, including through the use of the compressed indexes described in the previous chapter.
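The central index-compression idea is to store each inverted list as d-gaps, the differences between successive document numbers, which are small and therefore cheap to code. A minimal sketch using variable-byte coding (one of several possible gap codes; the book also treats bit-level codes such as Elias gamma and Golomb codes):

```python
def vbyte_encode(numbers):
    """Encode positive integers as variable-length byte sequences.

    Each byte carries 7 payload bits; the high bit marks the final
    byte of a number, so small numbers take a single byte.
    """
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80  # terminator flag on the final byte
        out.extend(chunk)
    return bytes(out)

def compress_postings(doc_ids):
    """Turn a sorted list of document IDs into encoded d-gaps."""
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return vbyte_encode(gaps)

postings = [3, 7, 11, 23, 29, 127, 128]
encoded = compress_postings(postings)
# All seven gaps are below 128, so each fits in one byte:
# 7 bytes instead of seven fixed-width integers.
assert len(encoded) == 7
```

Because gaps in a long list are on average small, such codes shrink the index dramatically while still allowing sequential decoding at query time.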
The next four chapters cover a variety of compression methods for different information types: inverted indexes, pictures, text images, and mixed picture-and-text images. The discussion covers not only the major standards but also cutting-edge research from the authors' own labs. These are followed by two summary chapters, the first of which brings all of the indexing, compression, and query methods together to describe overall system implementation.
The book ends with two appendices, one describing the MG system which implements many of the techniques described in the book with a pointer to source code on an FTP site and the other describing the various collections publicly available in the New Zealand Digital Library Project. The book is generally well-written and produced. Most technical discussions are easy to read and comprehend. Both the text font and the images are visually pleasing.
With the source code freely available, and this book describing the rationale behind it, the bar is lowered for learning about IR and performing research. Those who design and implement IR systems and wish to know more about algorithms and coding will find it valuable.
And certainly those interested in compression of text and images will find it desirable to have as well. Those who teach IR courses will also find the book useful, especially those who teach about IR algorithms and coding. In all, this is a well-written book that describes how to build IR systems, with a strong focus on compression methods. Its coverage of general IR should lead others to take a look at it too.
Christopher D.

Natural language processing has long seemed to be the magic bullet that will bring information retrieval much closer to human capabilities. It is rather frustrating that 20 years of work along these lines has not produced the best information retrieval systems.
Generally speaking, systems based entirely on natural language concepts are not at all competitive with systems based on statistical analysis of texts. In addition, although adding natural language features appears to improve the performance of poor systems, no one has yet shown a way to add these features to the best systems and generate any further improvement.
There is a growing suspicion that in fact the river is currently flowing the other way, and that ways of thinking about text that have been developed for the purposes of information retrieval have more to contribute to the problem of natural language processing than vice-versa.
The existence of the large and rapidly growing body of material on the application of numerical methods to the analysis of natural language texts is the motivation for this excellent book. The book is intended to serve as a reference manual for researchers, supplemented by a bibliography, and as a textbook for advanced students in computer science. In spite of the direction of influence mentioned above, the authors are open-minded and point out, for example, that a non-quantitative tagger developed at the University of …
Key Features

- Up-to-date coverage of new text compression algorithms such as block sorting, approximate arithmetic coding, and fast Huffman coding
- New sections on content-based index compression and distributed querying, with two new data structures for fast indexing
- New coverage of image coding, including descriptions of the de facto standards in use on the Web (GIF and PNG), information on CALIC, the newly proposed lossless JPEG standard, and JBIG2
- New information on the Internet and WWW, digital libraries, web search engines, and agent-based retrieval
- Accompanied by a public-domain system called MG, a fully worked-out operational example of the advanced techniques developed and explained in the book
- New appendix on an existing digital library system that uses the MG software
Managing Gigabytes—Compressing and Indexing Documents and Images (Second Edition)
This edition adds recent developments such as block sorting, new indexing techniques, and new lossless compression strategies to the mix. In short, this work is a comprehensive summary of text and image compression, indexing, and querying techniques. The history of relevant algorithm development is woven well with a practical discussion of challenges, pitfalls, and specific solutions. In addition to diagrams and example text transformations, the authors use "pseudo-code" to present algorithms in a language-independent manner wherever possible.
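The block sorting mentioned above is the Burrows-Wheeler transform: all rotations of a text block are sorted and the last column of the sorted rotations is kept, which clusters similar characters and so compresses well under simple follow-on coders. A minimal sketch of the forward transform (an illustration only, not the book's or MG's own code):

```python
def bwt(text, sentinel="\0"):
    """Burrows-Wheeler (block-sorting) transform of `text`.

    Appends a unique sentinel so rotations sort unambiguously,
    sorts all rotations, and returns the last column.
    """
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

out = bwt("banana")
# The output is a permutation of the input plus the sentinel,
# with repeated characters grouped together: "annb\0aa"
```

The transform is reversible given the sentinel, which is what makes it usable as the first stage of a lossless compressor; production implementations sort suffixes rather than materializing every rotation.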