Making wildcard queries scalable
By Kirk J Krauss
Most general-purpose databases and search engines don’t handle wildcards very well. In a lot of the big name search engines, databases, and analytics tools, there’s no wildcard query feature at all. And in some popular current search engines that do offer wildcard support, the results of a wildcard search aren’t always as useful as could be.
Enabling wildcard queries across a large data set commonly involves techniques known as permuterm indexing or k-gram indexing. A permuterm index requires a dictionary that tends to become large, because it includes rotations of each term that can be queried. A k-gram index is smaller, but it can all too readily allow for invalid matches. Using either technique at scale can call for significant memory overhead and fairly sophisticated filtering steps to arrive at useful results.
Another technique involves stemming of alternate search terms from a given set of terms. The stemming technique can be used for both wildcard and regular expression queries. But useful wildcard searches based on stemming can be dependent on explicitly setting up “generic” arrangements corresponding to particular types of wildcards, such as entity wildcards, physical object wildcards, or verb wildcards. With these systems, wildcard queries rely on an ontology of the sort used in natural language processing.
In other words, there aren’t any established indexing concepts, for use with wildcards, that are prepared to easily scale into general-purpose solutions that work well in any situation. Yet wildcard search is increasingly necessary as document-oriented databases and search engine techniques are brought to bear on semi-structured data at scale. Some kinds of data that can benefit from reliable wildcard queries include lists, such as address lists, as well as geospatial or temporal ranges, log file content, or registries of devices or of real-world items in the Internet of Things. What’s called for is a scalable, general-purpose way to index terms broken down for use with wildcard search, with minimal memory overhead, and with none of the extra filtering steps required to weed out false positives.
Two kinds of tracking
Here are some thoughts on how to make wildcard search manageable at scale. One thought is to break down the indexing system of conventional search without wildcards, to do two kinds of tracking. Adding support for wildcards brings us into either of two situations: where wildcard search applies to whole word (delimited) terms, and where a wildcard can match any number of characters including spaces. Though the implementation details might be different in those two situations, an overall indexing system might involve an index of whole-word terms and another index of character sequences. The character sequences can include substrings, as with an ordinary k-gram index.
Another thought is to use ranking to limit the growth of at least one index. So when data is ingested, an index of whole-word terms, or references to those terms, can be built and ranked for use with an index of character sequences. The terms can be delimited by spaces or other tokens. The character sequence index is a likely candidate for pruning based on ranking.
The index of whole-word terms can represent some or all of the terms in a set of documents, such as web pages, and can be created via the inverted index strategy typical of conventional search engines. That is, an index of complete search terms can be stored in a search tree, skip list, or other structure organized for quick term lookup. For instance, it can be organized according to the alphabetical order of the terms. The index of character sequences can include references to characters within the terms (i.e. offsets, or substrings delimited by non-space characters). The character sequence index may or may not be organized based on the same structure as the term index. For instance, it might make sense to use two skip lists to handle the two kinds of tracking, both sorted in alphabetical order.
Entries in the character sequence index can be limited — to 4 characters in length, for example — so the index can fit into a memory footprint no larger than that of the term index. Entries in the term index, or references to those entries, can be ranked according to their relevance to entries in the character sequence index. The number of whole-word terms referenced by a character sequence can be limited to those references of the highest rankings, for scalability. That way, a search query that includes wildcards can return the results of the highest rankings associated with the relevant entry (or entries) in the character sequence index.
At wildcard query time, using the character sequence index and term index, there are at least a couple of ways the query can be handled:
- Each entry in the character sequence index can refer directly to the document(s) containing the corresponding sequence. As shown in Fig. 1, the term “*sissi*” can refer to a document containing the word “mississippi” based on finding at least part of the sequence in the index. A wildcard match for the entire 5-character query sequence can be done either using the index itself or by performing a wildcard match across a cached copy of the document. In case queries are anticipated to involve only matching wildcards, this may be the most straightforward thing to do.
- Each entry in the character sequence index can refer to one or more entries in the term index. For example, as shown in Fig. 2, the term “*sissi*” can refer to a document containing the word “mississippi” based on these steps:
- Tokenize the search string by ‘*’ characters.
- For search string tokens longer than the character sequence index length limit (e.g. 4), break down the search string into sub-sequences of that length, including at least a first and last sub-sequence.
- For a first sub-sequence, e.g. “siss”, find all the matching terms in the term index, and in case the search token is longer, keep a reference to each location in the term index at which further matching will continue. So, for example, if there are at least 4 more characters, then matching continues from the next character in each matching term; otherwise a last match can be performed on the last 4 characters of both the query token and each term in the index that has matched so far. But there will first be an intermediate result set that includes matches on just those first 4 characters.
- For each remaining sub-sequence, e.g. “issi”, eliminate any terms from the result set, in case they don’t match; otherwise update the references from which matching will continue for each term in the index that still is a match so far.
- Return references to the documents that contain the remaining terms, as with an ordinary search engine query.
Because the search string is tokenized by the wildcard character (‘*’), wildcards that appear in the midst of the search string result in a narrowing of search results to those that match both the sequence before the wildcard and those that match the sequence after it. As shown in Fig. 3, a search on “miss*ippi*” could break down into sub-searches for terms matching the first token (“miss”) and the second token (“ippi”). The first sub-search, based on the first token, would find matches on both “mississippi” and “missouri”, but the second sub-search, based on the second token (“ippi”), would eliminate the match on “missouri”.
What about single-character wildcard cases, e.g. where the ‘?’ character serves in place of any one other character? This can be handled based on the index structures and wildcard tokenization described above, plus a matching rule that requires exactly one character to occur between the portions of a matching string in the term index.
The data structures we’ve been discussing aren’t too different from those used for garden-variety k-gram indexing, but we’ve arranged references between the character sequence index and the term index with scalability in mind. As with conventional search engines, the number of references per any item in any index must necessarily be limited for scalability. Conventional limits can involve relevance factors such as page ranks (assuming a web search engine) or event significance (assuming a log file use case). But for matching wildcards it’s not only the indexed terms themselves, but also the references from the character sequences to the terms or documents that contain them, that can be ranked. The number of references associated with a character sequence index entry can be limited to a fixed cap. Or there can be any number of references that have at least a sufficient ranking. Either way, the most highly ranked references can be stored, and the rest thrown away for scalability.
Terms can be ranked in a way reminiscent of a conventional search engine. When a wildcard-enabled search engine crawls the web, the terms that appear in the most highly ranked pages, or on the greatest number of pages, can get the highest rankings. If “mississippi” appears in more pages, and/or in more highly ranked pages, than “clarion”, then “mississippi” gets a higher ranking than “clarion” in the term index.
A sequence such as “ario” might be associated with “lariot” and “mario” but not necessarily “clarion”, unless a result set is left short or empty because of failure to find matches on a search query such as “the *ario* river”. If one or more wildcard searches is failing, then the rankings and/or references can be recomputed to allow for more matches. This need not involve newly crawling to create a new index, but merely rediscovering the set of references associated with each character sequence in the index. The process can take place while the user waits, or it can happen offline based on analytics involving queries that return few or no results.
Keeping the character sequence index scaled down might involve ranking not only terms but also references to them. This ranking of references to terms can be arranged in several ways. The references to terms that have high rankings, or in which a sequence occurs at the beginning, end, or in multiple places, can be preferred. So the reference associating the sequence “issi” with “mississippi” can get a high ranking based on the fact that it appears twice in that term. This can allow for the most relevant results to be shown for a wildcard search, even independently of other relevancy scoring techniques that might be used in search engines.
The rankings can apply either to references to terms in the term index (which would fit with the query handling of Fig. 2), or to references to terms in cached copies of the original pages (which would fit with the query handling of Fig. 1). If a short set of references is kept for each cached page among a bunch of indexed web pages, then matching wildcards in a search query can be as fast and direct as matching space-delimited terms and can achieve results ranked by relevance, just as with a conventional search without wildcards. It’s only the combination of both matching wildcards and matching space-delimited terms, in a single query, that would have to involve both forms of indexing described here. That’s why the query handling of Fig. 2 is relatively general-purpose.
Either form of scalable wildcard query handling can be used in a web search engine, a file system search mechanism, as part of a database, or in an analytics engine applied to log files, data streams, etc. In most any case, a maximum sequence length can be imposed on the character sequence index to limit its size.
What about content that isn’t space-delimited, like some log files or spreadsheets full of identifiers and codes? If an ordinary search engine was to index a corpus like that, the resulting term index might look like a list of long cryptic strings. To enable wildcard queries based on the concept envisioned here, the character sequence index can be built up based on an algorithm that counts occurrences of each short sequence among all the identifiers and codes. Then it can be culled, so that only relatively common sequences are represented in it, before it gets deployed to handle wildcard queries. In this situation and possibly others, creating the index to enable wildcard queries at scale might involve a memory-intensive process, but once it’s set up and put to practical use, the overhead might be about the same as if the corpus contained ordinary text.
Control flow in query processing
We’ve already introduced a brief set of steps for matching a search query containing wildcards to the character sequence index. Those steps are workable for search queries where the wildcard character (‘*’) appears at the beginning or end of a query term and can be used with the query handling mechanics of either Figs. 1 or 2. We originally left it at that, so as to move on to the primary concept of ranking terms for use with wildcard-based queries. But there are a few details left to consider.
The handling of wildcards appearing in the middle of white-space delimited search query content can be done in several ways. Which ways? That partly depends on whether we’ve got the query handling of Fig. 1 or Fig. 2. It also might depend on the meaning of the wildcard character in the midst of a set of non-white-space characters in a search query.
That is, does the ‘*’ incorporate white space? If so, then the query may specify those documents in which the character sequence before the wildcard appears somewhere before the character sequence after the wildcard – even though these sequences may appear with any number of intervening characters, or potentially other intervening non-character content, in a given document. Or does the occurrence of the ‘*’ in the midst of a term indicate that the term itself, and nothing more, is variable? If so, then the query indicates a range of variability by which indexed terms may match the set of white-space-delimited query terms.
Four styles of implementation come to mind.
- Implementation #1: Perform the query handling of Fig. 1, where the ‘*’ may incorporate white space. Because the character sequence index refers directly to some corpus of documents or other content we’re going to search, a reference data structure can be set up to describe not only the existence of the reference itself, but also the relative position (e.g. offset) of any character sequence in a document in which it appears. In this example, if a search term with wildcards is entered, such as “miss*ippi”, the results can include (among one or maybe two other possibilities) any document where there’s a word beginning with “miss”, followed by a word ending with “ippi”, and where any number of other words may appear between the two. A search query containing leading and trailing wildcards can be even more inclusive.
- Implementation #2: Perform the query handling of Fig. 1, where the ‘*’ may not incorporate white space. The space-delimited search term “*miss*i*” would allow for matches on “mississippi”, “missouri”, and “commission”, but not “miss by an inch” (or even by a centimeter). Unlike implementation #1, a reference data structure need not include an offset into a document.
- Implementation #3: Perform the query handling of Fig. 2, where the ‘*’ may incorporate white space. There’s no one data structure by which an entry in the character sequence index, which refers to a term index, can reflect offsets into documents in the corpus. One way to implement this is to build both implementations #1 and #4, for use with various queries, so that every possible query is covered one way or another.
- Implementation #4: Perform the query handling of Fig. 2, where the ‘*’ may not incorporate white space. The brief set of steps we’ve introduced, for matching a search query containing wildcards to the character sequence index, falls a little short in some scenarios. One of them: how to find matches when wildcard-delimited tokens have more characters than the character sequence limit (e.g. on “missi*ippi”)? Here’s a flow diagram describing a reasonably speedy, general purpose algorithm for use with the query handling of Fig. 2:
The procedure flowcharted above handles wildcards anywhere in a query term, and it covers the case where a query term contains more characters (not including spaces or wildcards) than the index’s maximum character sequence length.
How might an incomplete character sequence index may be automatically kept up to date, in the face of a changing corpus and/or changing query trends? For one thing, the index can be updated when queries get few or no results. The index also can be updated based on trends, or on time. Tracking search query trends, such as with the Yahoo! “Trending Now” listings, is nothing new. A more thoroughgoing arrangement can reflect not only the most frequently used terms, which can be increasingly ranked and prospectively represented in the index, but also the decreasing use of terms, which can be decreasingly ranked and prospectively discarded from wildcard-oriented referencing.
For example, if a once-trendy search query term was only occasionally input yesterday and has not been input today, then references to it can be removed from the character sequence index. If a character sequence (i.e. the index entry itself) has no references to any terms, or no references to any frequently indexed terms, then it can be removed. On the other hand, if a term is newly being queried by the users, then a new ranking for it can be established, references to it can be added, and the character sequence index can be updated accordingly. There are many prospective scenarios; for instance, a user might specifically request an update, either if expected search results aren’t showing up, or if unexpected results are.
If the user has to wait for that update, it may take some time, in case the entire term index is traversed in search of character sequences that match the user’s request. But a way could be provided for a user to suggest a term and its wildcard counterparts, or these may be derived via natural language processing or other techniques. Then only a small subset of the term index need be traversed in order to add an entry to the character sequence index.
But probably a more typical case would involve automatic updates based on trends. The update frequency may be established differently, for different terms, based on the nature of the documents in which they most often are found. For example, a term found specifically in breaking news, such as Habibzai (the name of a police officer killed near Kabul in mid-2017), might suddenly turn up in many searches, in a trend that peaks over a matter of days, whereas a term used in a profession such as medicine, e.g. Cipro, might lose interest only over a period of years (e.g. as antibiotics with less side effects come onto the market). Then there is the all-too-common case of the uncommon term, such as duhSIVILEYEzum, that appears in only one immortal written work that may be “recognized” as such — that is, not re-ranked based on any identifiable trends at all. (That term, by the way, comes from an amusingly rendered fictional conversation in a dialect totally familiar to those of us who come from a certain northeastern part of the United States.) Machine recognition of the nature of documents of various types may prove no match for a wildcard search in any of the cases just mentioned.
The implication of these trends for matching wildcards seems clear: people may wonder how to spell the ending of Cipro (which by its sound might as well have ended with an ‘h’ or a ‘w’). A search on “cipr*” will eliminate the need to flush out all these variants, which is helpful when natural language techniques can’t make sense of the term. The character sequence index of Fig. 2 need only track one or two references between a “cipr” entry and corresponding terms in the term index (e.g. for Cipro and Ciprofloxacin, which is another name for the same thing). Similarly, it would be helpful for a search engine to allow “duhs*” to match duhSIVILEYEzum (and thus the work that contains this term). Otherwise this term would match on little other than the Dow University for Health Sciences — and that’s another term also unlikely to be involved in trending.
So there can be some terms that get updated rankings regularly, even daily, others possibly yearly, and yet others not at all. For implementation #1, above, the reference data structure (shown in the middle of the figure accompanying the example) can be extended to include a ranking frequency member. This, as well as each of the offset members, can be a few bits in size, and can be handled with bit masking logic that accesses this field as quickly as an unmasked memory access on modern processors. The total size of this update frequency field and the two offset fields can be, say, 16 or 32 bits. Even if the reference also includes an entire URL, the scalability impact still can be minimal. Similar reference extensions can be applied to implementations #2 – #4, to enable the variable ranking update scheme with a similarly modest overhead.
An entry in the character sequence index can be updated in place at any time, such as during an update established according to a ranking update frequency. In other implementations, an entirely new index can be built, for example when the corpus is newly re-crawled. The “hot swap” of a search index, in a live scenario, is something the existing web search engines already do; the same techniques can be used for the character sequence index.
Natural language processing is surely useful, but it doesn’t entirely fill the role of wildcards, which are becoming only more relevant as big data gets bigger. Wildcard search can be especially helpful not only when unusual natural language terms come into play, but also when log files or specialized streaming data need to be parsed, or in any case when people don’t want to leave all the possible matches up to machines to figure out. For wildcard queries to work at a big data scale, something more than a simple index arrangement is in order. A ranking scheme for the index entries might be just what’s needed to make wildcard search a bigger thing.
Copyright © 2018 developforperformance.com.
Develop for Performance