Prof. Amir's Lab
Tel (Assistant): 972-3-531-8866
Prof. Amihood Amir is a Professor in the Department of Computer Science. Together with his team, which has produced 20 Ph.D. students and 18 M.Sc. students, he researches algorithm design and analysis (specifically, pattern matching algorithms), computational molecular biology, and knowledge discovery in databases.
Computer Science: Pattern Matching
Pattern matching is one of the most widely studied problems in computer science: every person who uses a text editor, and every user of a digital library or search engine, needs to find patterns in a text.
Currently, the Boyer-Moore algorithm is directly implemented in the search command of practically all text editors, and the longest common subsequence dynamic programming algorithm is implemented in system commands that test differences between files.
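To make the second example concrete, here is a minimal sketch of the longest common subsequence dynamic program that underlies diff-style file comparison. The function name is illustrative, not taken from any particular system command.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b,
    via the classic O(len(a) * len(b)) dynamic program."""
    # prev[j] holds the LCS length of the processed prefix of a and b[:j].
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, 1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # drop one character
        prev = curr
    return prev[len(b)]
```

A diff tool reports the characters (or lines) outside the common subsequence as the edits separating the two files.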
However, advances in multimedia, digital libraries and computational biology have shown that a much more generalized theoretical basis of pattern matching could be of tremendous benefit.
Amir and his team, in collaboration with other computer scientists at Bar-Ilan University, work on generalized pattern matching problems. In such pattern matching, the input is a text, a pattern and a “matching” relation.
The output is all locations in the text where the pattern “matches” under the given definition of a match. Each application defines its own matching relation, and under a given matching relation there is still a distinction between exact matching and approximate matching. In the latter case, a metric is defined on the text, and a text location is considered a match if the distance between it and the pattern, under that metric, is within the tolerated bounds.
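As a small illustration of approximate matching, the following sketch uses Hamming distance (number of mismatching symbols) as the metric and reports every location within a tolerance of k mismatches. It is a naive quadratic scan for clarity, not one of the group's algorithms.

```python
def mismatch_positions(text: str, pattern: str, k: int) -> list:
    """All text locations where the pattern matches with at most k
    mismatches, i.e. Hamming distance <= k. Naive O(n * m) scan."""
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        # Count symbol-by-symbol disagreements at this alignment.
        dist = sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        if dist <= k:
            hits.append(i)
    return hits
```

Setting k = 0 recovers exact matching; larger k widens the tolerated bounds.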
As technology develops in diverse areas, from medicine to multimedia, there is a continuous increase in the amount of stored digital data.
This increase has made it critically important to store and transmit files in a compressed form. The need to quickly access this data has given rise to a new paradigm in searching, that of compressed matching.
In traditional pattern matching, the pattern and text are explicitly given, and all occurrences of the pattern in the text are sought. In compressed pattern matching the goal is the same; however, the pattern and text are given in a compressed form. This is necessary in the case of searches done by small isolated units (e.g. communication satellites in space) and in cases where local searches need to be conducted on the web.
Amir and his group defined the model and tools of compressed search over a decade ago, and the field has since expanded into an extremely active international field of research.
Recently, Amir and his team have focused on research in compressed images. If searching for appearances of a pattern in an image can be done without decompressing, then compression becomes a time saving tool in addition to being a space-saving device.
Amir and his group have developed algorithms for searching images compressed by run-length encoding (such as fax transmissions) and by LZ78 (used by the GIF standard). The group is now working on search in JPEG images, a task that has not yet been successfully accomplished by anyone.
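The flavor of compressed matching can be shown with a minimal one-dimensional sketch: searching a run-length encoded string by comparing runs directly, never decompressing the text. This toy version is an assumption-laden illustration of the idea, far simpler than the group's two-dimensional image algorithms.

```python
def rle(s: str) -> list:
    """Run-length encode a string, e.g. 'aaab' -> [('a', 3), ('b', 1)]."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def count_rle_matches(text_runs: list, pat_runs: list) -> int:
    """Count occurrences of the pattern in the text, working purely on
    the run-length representations of both strings."""
    k = len(pat_runs)
    if k == 1:
        ch, m = pat_runs[0]
        # Each long-enough run of the same symbol yields len - m + 1 starts.
        return sum(l - m + 1 for c, l in text_runs if c == ch and l >= m)
    count = 0
    for i in range(len(text_runs) - k + 1):
        # Interior runs must match exactly; the two border runs of the
        # pattern only need to fit inside the corresponding text runs.
        first_ok = (text_runs[i][0] == pat_runs[0][0]
                    and text_runs[i][1] >= pat_runs[0][1])
        middle_ok = text_runs[i + 1:i + k - 1] == pat_runs[1:k - 1]
        last_ok = (text_runs[i + k - 1][0] == pat_runs[-1][0]
                   and text_runs[i + k - 1][1] >= pat_runs[-1][1])
        if first_ok and middle_ok and last_ok:
            count += 1
    return count
```

When runs are long, this touches far fewer items than a decompress-then-search approach would.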
Amir and his team have been at the forefront of work on the theory of multidimensional matching. Their efforts have led to seminal work on scaled matching, rotated matching, indexing multidimensional images, and efficient matching against a vast, pre-processed dictionary of images. The latter models biological vision, since all biological beings maintain a huge dictionary of the objects they recognize.
The search through this dictionary is extremely rapid. One important application of multidimensional matching lies in computer vision, whose possible applications could be revolutionary (e.g., a car capable of automatically analyzing the road conditions and driving autonomously could significantly reduce car accident fatalities).
Looking to the Future
The main modern challenge to searching is the huge amounts of heterogeneous data that need to be approximately searched under a variety of measures. Size does matter, and the staggering amount of data renders many known algorithms almost ineffectual.
The difficulties presented by huge data can be divided into two realms — Space and Time.
Traditional solutions to the space problems can be grouped into three main directions:
(1) Compression (of both data and indexes), (2) Sampling (only a sample of the data is actually stored), and (3) Hierarchical structures (efficient data distribution and access across cache, memory, external memory, and distributed devices).
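Direction (2) can be illustrated by reservoir sampling, a standard way to keep a fixed-size uniform sample of a stream whose total length is unknown. This is a generic textbook sketch (Algorithm R), offered as an example of the sampling direction rather than any specific method of the group.

```python
import random

def reservoir_sample(stream, k: int, rng=None) -> list:
    """Maintain a uniform random sample of k items from a stream of
    unknown length, using only O(k) memory (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, 1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # item n is kept with prob. k/n
            if j < k:
                sample[j] = item         # evict a uniformly chosen resident
    return sample
```

Only the k-item reservoir is ever stored, no matter how large the stream grows.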
The solutions to the time bottleneck can also be divided into a number of broad directions:
(1) Approximation (quickly finding approximate solutions to problems whose exact solution is too expensive, e.g. edit distance or approximate periodicity), (2) Streaming (analyzing data without storing it), and (3) working efficiently within all three of the space-oriented directions described above.
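Direction (2) can be sketched with a Karp–Rabin-style streaming matcher: the text arrives one symbol at a time and is never stored, apart from a rolling fingerprint and a window of the last m symbols used to confirm candidate matches. The class and its interface are illustrative inventions, not an API from the group's work.

```python
import collections

class StreamingMatcher:
    """Report pattern occurrences while scanning the text one symbol at
    a time, keeping only a rolling hash and the last m = len(pattern)
    symbols -- a Karp-Rabin-style sketch."""

    def __init__(self, pattern: str, base: int = 256, mod: int = 1_000_000_007):
        self.pattern = pattern
        self.base, self.mod = base, mod
        self.p_hash = 0
        for ch in pattern:                      # fingerprint of the pattern
            self.p_hash = (self.p_hash * base + ord(ch)) % mod
        self.high = pow(base, len(pattern) - 1, mod)
        self.window = collections.deque()
        self.t_hash = 0                         # fingerprint of the window
        self.pos = 0
        self.hits = []

    def feed(self, ch: str) -> None:
        m = len(self.pattern)
        self.window.append(ch)
        self.t_hash = (self.t_hash * self.base + ord(ch)) % self.mod
        if len(self.window) > m:
            # Slide: remove the oldest symbol's contribution to the hash.
            old = self.window.popleft()
            self.t_hash = (self.t_hash - ord(old) * self.high * self.base) % self.mod
        if (len(self.window) == m and self.t_hash == self.p_hash
                and ''.join(self.window) == self.pattern):  # rule out collisions
            self.hits.append(self.pos - m + 1)
        self.pos += 1
```

A fully fingerprint-only variant would drop the verification window at the price of a small false-positive probability.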
Within these models, there are two types of problems that need to be addressed: search and discovery. They are, in essence, two sides of the same coin.
All these issues will be addressed by Amir's group, which will continue to spearhead pattern matching research into the next generation of computer technology.