Prof. Shmuel Tomi Klein is an expert in lossless data compression and Information Retrieval (IR), and in particular, the intersection of these two areas.
He is the author of the compression tool incorporated into MS-DOS, and recently co-invented a deduplication method for IBM, a technology that allows clients to store significantly more backup data on disk and has the added benefit of reducing network bandwidth usage.
Klein is a former Chief Scientist of Bar-Ilan’s Responsa Project, an initiative that was awarded the Israel Prize in 2007. One of his ongoing areas of research is dedicated to improving the project’s software.
Klein and his team have been working on the compression of major files found in large full-text IR systems.
This involves, first and foremost, the text itself. They have made many contributions to Huffman coding, one of the major algorithms for text compression. Moreover, table-driven decoding was first mentioned in one of Klein’s early works.
He has also suggested several novel applications of an alternative coding method known as Fibonacci coding. In addition to texts, Klein’s group has demonstrated efficient compression of more structured files in IR systems, such as dictionaries, concordances, and various kinds of long sparse bit-vectors referred to as bitmaps.
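For readers unfamiliar with Fibonacci coding, the following is a minimal sketch of the standard textbook scheme, not of Klein’s specific applications. A positive integer is written as a sum of non-consecutive Fibonacci numbers (1, 2, 3, 5, 8, ...), with the smallest term first, and a final 1 bit is appended, so every codeword ends in "11" and that pair never occurs earlier in the codeword. The function names `fib_encode` and `fib_decode_stream` are illustrative choices only.

```python
def fib_encode(n):
    """Return the Fibonacci codeword of a positive integer as a bit string."""
    if n < 1:
        raise ValueError("Fibonacci coding is defined for positive integers")
    # Build the Fibonacci numbers 1, 2, 3, 5, ... that do not exceed n.
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number larger than n
    # Greedy choice from the largest term down never picks two adjacent terms.
    bits = ["0"] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1):
        if fibs[i] <= n:
            bits[i] = "1"
            n -= fibs[i]
    # The appended 1 creates the terminating "11" pair.
    return "".join(bits) + "1"


def fib_decode_stream(bitstring):
    """Decode a concatenation of Fibonacci codewords into a list of integers."""
    values = []
    total, a, b, prev = 0, 1, 2, "0"
    for bit in bitstring:
        if bit == "1" and prev == "1":
            # Second 1 of a "11" pair: it is the terminator, not a data bit.
            values.append(total)
            total, a, b, prev = 0, 1, 2, "0"
            continue
        if bit == "1":
            total += a
        a, b = b, a + b
        prev = bit
    return values


if __name__ == "__main__":
    numbers = [1, 2, 4, 11]
    stream = "".join(fib_encode(n) for n in numbers)
    print(stream)
    print(fib_decode_stream(stream))
```

Because the "11" pair marks the end of each codeword, a decoder can find codeword boundaries in a concatenated stream without a separate table, which is one reason the code is attractive as an alternative to Huffman coding.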
They have also made many contributions in the areas of pattern matching algorithms and the emerging field of compressed matching.
In compressed matching, certain patterns may be searched for, or certain processing tasks may be directly performed within a compressed file, without first decompressing. Klein’s works encompass different compression methods such as Huffman or Lempel-Ziv, as well as different file types such as natural language text or dictionaries.
Klein’s work on Information Retrieval has been influenced by his involvement with two of the world’s largest full-text IR systems – the Bar-Ilan Responsa Project in Hebrew, and the University of Chicago’s Trésor de la Langue Française in French.
His work focuses primarily on the technical part of the algorithms required for an efficient retrieval process, in addition to the methods dealing directly with the definition of an appropriate query language and the automatic extraction of good query terms.
Specifically, he has suggested methods for the treatment of metrical constraints in XML files and has conducted a comprehensive study of the use of various aspects of negation in IR queries.
In addition to the above, current research topics in Klein’s laboratory include lossless image compression, compression of traffic flow data, and improvements of deduplication techniques for large-scale storage systems.