What is a symbol that stands for one or more unspecified characters in the search criterion?

How to use Advanced Search

Unlike the Quick Search, the Advanced Search lists only those terms that satisfy the search criteria exactly. For example, if you search for smoked, you will find the term smoked, but you will not find a link to the term smoke – although it is in the database. This search does not disregard punctuation. If you search for rain-forest (with a hyphen), you will not find it as the database only contains rain forest (without a hyphen). Neither of the searches is case-sensitive, however – so whether you enter téarma or TÉARMA (or even TéArMa) you will get the same results.

The following criteria can be used individually or in combination to narrow your search results:

Length (single-word/multi-word)

This criterion allows you to limit your search to either single-word or multi-word terms. The default option includes both.

Extent

This criterion allows you to limit your search to terms beginning with, ending with or incorporating a particular word or string of characters.

Language

This criterion allows you to limit your search results to a particular language. The default option includes terms from all languages.

Part of speech

This criterion allows you to limit your search results to a particular part of speech. For example, you can request a list containing nouns only.

Domain

This criterion allows you to limit your search results to a particular domain. For example, you could search for the word bat and restrict your results to the ‘Sports’ domain, thus filtering out results relating to nocturnal flying mammals.

Combining Criteria

You can select any combination of the above criteria for your search. The search engine will return terms which satisfy all of those criteria selected.

If your search yields no result, you may have selected criteria that are too narrow. In that case, it is worth looking back on your criteria and broadening them in some way.

Wildcards in Advanced Search

A clever way to make your Advanced Search more flexible is to use wildcards in the 'Foclaíocht/Wording' box. A wildcard is a symbol which stands for one or more unspecified characters. These are the available wildcards:

Underscore: _

This represents any single character. For example, if you search for l_w, you will find law and low, as well as the abbreviations LBW and LLW.

Percent sign: %

This represents any number (including zero) of any characters. For example, if you search for met%rology, you will find metamorphic petrology and meteorology, as well as metrology. This wildcard, when used on its own, will produce a list of every entry that satisfies the selected search criteria.
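The behaviour of the two wildcards can be sketched in Python by translating them into a regular expression. This is only an illustration of the matching rules described above, not the database's actual implementation; the function names and the sample term list are hypothetical.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate '_' and '%' wildcards into a case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")        # exactly one character
        elif ch == "%":
            parts.append(".*")       # zero or more characters
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE | re.DOTALL)

def search(pattern, terms):
    """Return the terms that match the wildcard pattern as a whole."""
    rx = wildcard_to_regex(pattern)
    return [t for t in terms if rx.fullmatch(t)]

# Hypothetical sample of database terms
terms = ["law", "low", "LBW", "LLW", "lawn",
         "meteorology", "metrology", "metamorphic petrology", "method"]

print(search("l_w", terms))         # law, low, LBW, LLW
print(search("met%rology", terms))  # meteorology, metrology, metamorphic petrology
```

Matching is done against the whole term, which is why lawn is not returned for l_w.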

Regular expressions are a convention of using certain characters to stand in for unspecified letters or numbers. They are used to set criteria for strings of characters, e.g. words or tags, which have a common pattern, e.g. start the same way, finish the same way or contain certain characters.

Regular expressions are used mainly inside CQL, in word lists and n-grams.

This page gives only a few basic examples. For more detail, please refer to Wikipedia, try our regular expressions exercises or this interactive course.

Wild cards

Wild cards are not regular expressions but users know them from other software. They are only supported in the simple concordance search.

Using wild cards in simple concordance search

Only in simple concordance search, the asterisk (*), question mark (?), double dashes (--) and vertical bar (|) can be used like this:

asterisk (*) stands for zero or more characters
test* will find
test, tests, tested, testing

c*t will find
CT, cut, cat, craft, construct

question mark (?) stands for exactly 1 character
test? will find
tests, Testa, testy
but will not find
test

c?t will find these lemmas
cat, cut, cot
Note that simple search always treats each search word as a lemma, so c?t searches for the lemmas cut, cat and cot. These lemmas produce results that include all their word forms, so the final concordance will show cut, cutting, cat, cats, cot, cots, etc.

To search for the asterisk and question mark, use backslash (\) such as \* and \?

double dashes (--) stand for a dash, a space or no character at all
multi--million will find
multi-million, multi million, multimillion

vertical bar (|) stands for OR

cat|dog|horse will find

cat, dog, horse
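As a rough illustration of the wildcard rules above, the following Python sketch converts a simple-search query into a regular expression. It is an assumption-based model, not the tool's own code: the function names are hypothetical, matching is assumed to be case-insensitive, and the lemma expansion described above is not modelled.

```python
import re

def simple_wildcard_to_regex(query: str) -> str:
    """Convert *, ?, -- and | wildcards into regular expression syntax."""
    out = []
    i = 0
    while i < len(query):
        ch = query[i]
        if ch == "\\" and i + 1 < len(query):
            out.append(re.escape(query[i + 1]))  # \* or \? is a literal
            i += 2
            continue
        if query.startswith("--", i):
            out.append("[- ]?")                  # dash, space or nothing
            i += 2
            continue
        if ch == "*":
            out.append(".*")                     # zero or more characters
        elif ch == "?":
            out.append(".")                      # exactly one character
        elif ch == "|":
            out.append("|")                      # OR between alternatives
        else:
            out.append(re.escape(ch))
        i += 1
    return "(?:" + "".join(out) + ")"

def matches(query: str, text: str) -> bool:
    """True if the whole text matches the wildcard query."""
    return re.fullmatch(simple_wildcard_to_regex(query), text, re.IGNORECASE) is not None

print(matches("test*", "testing"))                # True
print(matches("test?", "test"))                   # False
print(matches("multi--million", "multimillion"))  # True
print(matches("cat|dog|horse", "dog"))            # True
```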

Regular expressions

Regular expressions (not wildcards!) are used in all the other concordance searches, in CQL to specify patterns for values and with wordlists to only include/exclude certain types of items.

Regular expressions and CQL

Regular expressions are used in CQL to specify patterns for values.

[word = "dis.*"] [tag = "V.*"] finds a word beginning with dis- followed by a verb

[tag = "J.*"] [word = "[[:upper:]]*"] finds an adjective followed by an acronym (a word in capitals)

When typing or pasting these queries, use straight double quotes ("), not typographic ones.

Spaces in CQL and regular expressions

Spaces are used in CQL to make the code easier to read for the human eye. The use of spaces in CQL does not have any effect on the result.

In regular expressions, a space refers to a real space, e.g. space between two words. Since CQL criteria are set for individual tokens separately, the use of a space is generally a mistake and will not produce the required result.
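The token-by-token behaviour can be sketched in Python. This is a simplified model written for illustration, not how the corpus engine is implemented: the `find` function, the query format (a list of attribute-to-pattern dictionaries) and the tagged sample tokens are all hypothetical.

```python
import re

# Hypothetical tagged tokens; the tags only imitate a typical tagset.
tokens = [
    {"word": "His", "tag": "PP$"},
    {"word": "dismay", "tag": "NN"},
    {"word": "grew", "tag": "VVD"},
    {"word": "in", "tag": "IN"},
    {"word": "New", "tag": "NP"},
    {"word": "York", "tag": "NP"},
]

def find(query, tokens):
    """Each query item holds attribute -> regex conditions for ONE token."""
    hits = []
    n = len(query)
    for start in range(len(tokens) - n + 1):
        window = tokens[start:start + n]
        if all(
            re.fullmatch(rx, tok[attr])
            for cond, tok in zip(query, window)
            for attr, rx in cond.items()
        ):
            hits.append(" ".join(tok["word"] for tok in window))
    return hits

# Equivalent of [word="dis.*"] [tag="V.*"]
print(find([{"word": "dis.*"}, {"tag": "V.*"}], tokens))

# A space inside one pattern never matches, because no single token contains one
print(find([{"word": "New York"}], tokens))

# Two separate token conditions are needed instead
print(find([{"word": "New"}, {"word": "York"}], tokens))
```

The second query returns nothing, which mirrors why a space inside a regular expression is generally a mistake in CQL.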

CQL tutorial – introduction to corpus query language

dot ‘ . ‘

A dot stands for a single unspecified character.

regular expression	matching result(s)
w.n	win won wen wun wan
ca.	cat car cap cab can

question mark ‘ ? ‘

A question mark stands for zero or one occurrence of the preceding character.

regular expression	matching result(s)
be?t	bt bet (but will not find beet beeet beeeet)
bet?	be bet (but will not find bets betting)
.?at	at hat bat cat mat (zero or one unspecified character at the beginning)

asterisk ‘ * ‘

An asterisk stands for zero or more occurrences of the preceding character.

regular expression	matching result(s)
co*l	cl col cool coool cooool
hallo*	hall hallo halloo hallooo halloooo
c.*ing	words starting with c- and ending with -ing (i.e. having any number of unspecified characters between c and ing) cycling camping cutting cooking contemplating
*ool	produces error, no character precedes the asterisk
c.*	word beginning with the letter c (c is followed by any number of any character)
.*ed	word ending with -ed (the word starts with any number of any character)
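The dot, question mark and asterisk behave the same way in Python's `re` module, so the examples above can be tried out with a small sketch. It assumes, as the tables do, that a pattern must match the whole word; the helper names and word lists are made up for the illustration.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

def compiles(pattern):
    """True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(matching("w.n", ["win", "won", "wn", "when"]))
print(matching("be?t", ["bt", "bet", "beet"]))
print(matching("co*l", ["cl", "col", "cool", "call"]))
print(matching("c.*ing", ["cycling", "cooking", "king", "cings"]))
print(compiles("*ool"))  # False: nothing precedes the asterisk
```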

range ‘ [ ] ‘

use square brackets to specify a list or range
[bmpg] stands for b OR m OR p OR g
[a-d] stands for a letter between a and d
[3-5] stands for a digit between 3 and 5

regular expression	matching result(s)
[mpgb]et	met pet get bet
m[2-5]	m2 m3 m4 m5
m[2-5]*	m m22 m52 m3425 m23453234 m222345 (m followed by zero or more digits between 2 and 5)

not ‘ ^ ‘

use ^ to indicate that the character(s) should not be included. The characters have to be enclosed in square brackets, with ^ placed first inside them.

regular expression	matching result(s)
[^m]et	pet get bet let (but will not find met)
[^mpg]et	set let (but will not find met pet get)
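A short Python sketch of ranges and negated ranges follows, again assuming whole-word matching; the `matching` helper and the word lists are hypothetical. Note that a negated set such as [^m] still requires one character to be present, so et on its own is not matched.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching("[mpgb]et", ["met", "pet", "get", "bet", "set"]))
print(matching("m[2-5]", ["m2", "m5", "m6", "m22"]))
print(matching("m[2-5]*", ["m", "m22", "m3425", "m26"]))
print(matching("[^m]et", ["met", "pet", "let", "et"]))
print(matching("[^mpg]et", ["met", "pet", "get", "set", "let"]))
```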

letters and digits

letters can be specified by a range or by character class

regular expression	matching result(s)
[A-Z]	finds any upper-case character (of the English alphabet, not characters such as é í č ß etc.)
[a-z]	finds any lowercase character (of the English alphabet)
[A-Za-z]*	finds any word consisting of upper-case and lowercase characters (of the English alphabet)
[[:alpha:]]*	finds a word consisting of letters of any alphabet, including accented and special characters; see character classes further below

\d stands for a digit, i.e. characters 0-9, \D stands for any non-digit character

regular expression	matching result(s)
b\d	b1 b2 b3 b4
b\d*	b b1 b12 b89 b43958 (zero or more digits after b)
\d\db	58b 46b 89b (b preceded by two digits)

character classes

Character classes are special codes used to refer to a group of characters.

character class	meaning
[[:alpha:]]	any letter including accented and special characters, equivalent only for English is [A-Za-z]
[[:digit:]]	any digit, equivalent to [0-9] or \d
[[:alnum:]]	any alphanumeric character, equivalent only for English is [0-9A-Za-z]
[[:lower:]]	all lower case characters [a-z]
[[:upper:]]	all upper case characters
[[:punct:]]	punctuation, i.e. ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
[[:space:]]	whitespace character (space, new line, tab, carriage return)

Example:

[[:alpha:]]* finds all words composed of letters
[[:alpha:]][[:alnum:]]* finds all words starting with a letter and then composed of letters and numbers, e.g. H2SO4 but not 4you
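Python's built-in `re` module does not support POSIX character classes such as [[:alpha:]], so the sketch below uses approximate stand-ins to illustrate the two examples. The substitutions are assumptions made for this illustration: `[^\W\d_]` is only roughly equivalent to [[:alpha:]] and `[^\W_]` only roughly equivalent to [[:alnum:]].

```python
import re

# Approximate stand-ins for POSIX classes (an assumption, not exact equivalents)
LETTER = r"[^\W\d_]"   # roughly [[:alpha:]]: word characters minus digits and _
ALNUM = r"[^\W_]"      # roughly [[:alnum:]]: word characters minus _

def is_letters_only(word: str) -> bool:
    """Counterpart of [[:alpha:]]* applied to the whole word."""
    return re.fullmatch(f"{LETTER}*", word) is not None

def is_letter_then_alnum(word: str) -> bool:
    """Counterpart of [[:alpha:]][[:alnum:]]* applied to the whole word."""
    return re.fullmatch(f"{LETTER}{ALNUM}*", word) is not None

print(is_letters_only("téarma"))      # accented letters count as letters
print(is_letters_only("H2SO4"))       # contains digits
print(is_letter_then_alnum("H2SO4"))  # starts with a letter
print(is_letter_then_alnum("4you"))   # starts with a digit
```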

or ‘ | ‘

the pipe | is used to indicate OR

regular expression	matching result(s)
get|met	will find lines which contain the word get OR the word met

plus ‘+’

the plus stands for ‘one or more repetitions of the preceding character’

regular expression	matching result(s)
hallo+	hallo halloo hallooo hallooooooooo (but not hall)
.+at	bat, great, format, cat (but not ‘at’, to include ‘at’, use .*at)

case sensitivity switch (?i)

regular expressions are always case sensitive, i.e. Bill is different from bill. To make the whole regular expression case insensitive, put the four characters (?i) at the beginning.

regular expression	matching result(s)
(?i)monday	Monday monday MONDAY
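Python's `re` module accepts the same (?i) switch at the start of a pattern, so the difference can be shown with a minimal sketch. The `matching` helper is hypothetical and assumes whole-word matching.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

forms = ["Monday", "monday", "MONDAY"]

print(matching("monday", forms))      # case sensitive: only the exact form
print(matching("(?i)monday", forms))  # case insensitive: all three forms
```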

repetition { }

use curly brackets to indicate repetition of the preceding character

regular expression	matching result(s)
halo{3}	halooo (exactly 3 repetitions of the letter o)
hallo{2,4}	halloo hallooo halloooo (from 2 to 4 repetitions of the letter o)
.{6}	anyone bottle (words consisting of any 6 characters; it is equivalent to typing 6 dots ......)
[a-z]{4,}	bake mother corporation (words consisting of 4 or more letters)
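The counting rules for curly brackets, and for the plus sign described earlier, can be checked with a small Python sketch. It assumes whole-word matching, and the `matching` helper and word lists are made up for the illustration.

```python
import re

def matching(pattern, words):
    """Return the words that the pattern matches in full."""
    return [w for w in words if re.fullmatch(pattern, w)]

print(matching("halo{3}", ["halo", "haloo", "halooo", "haloooo"]))
print(matching("hallo{2,4}", ["hallo", "halloo", "hallooo", "halloooo", "hallooooo"]))
print(matching(".{6}", ["anyone", "bottle", "playmat", "cat"]))
print(matching("[a-z]{4,}", ["bake", "mother", "cat", "Bake"]))
print(matching("hallo+", ["hall", "hallo", "halloo"]))
```

Here playmat is rejected by .{6} because it has seven characters, and Bake is rejected by [a-z]{4,} because of its capital letter.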

grouping ( )

any part of a regular expression can be surrounded by parentheses to make it a single unit onto which other regular expressions can be applied

regular expression	matching result(s)
(dis)?connect	connect disconnect (question mark makes the preceding element ‘(dis)’ optional)
(bla){3,4}	blablabla blablablabla

escaping

to search for characters such as . ? * which already have a special function in regular expressions, you have to put a backslash in front of them. This is called escaping (e.g. you have to escape a question mark). The characters $ and # in part-of-speech tags also have to be escaped.

regular expression	matching result
.	a b c d e f g h etc. (all alphanumeric characters)
ok?	o ok (question mark makes the preceding character optional)
ok\?	ok?
\	produces error, backslash escapes the following character but no such character exists
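Escaping works the same way in Python's `re` module, which also offers `re.escape` to add the backslashes automatically. The sketch below is only an illustration with hypothetical helper names; it assumes whole-word matching.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

def compiles(pattern: str) -> bool:
    """True if the pattern is a valid regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(full("ok?", "o"), full("ok?", "ok"), full("ok?", "ok?"))  # ? makes k optional
print(full(r"ok\?", "ok?"), full(r"ok\?", "ok"))                # \? is a literal ?
print(re.escape("ok?"))                                         # adds the backslash
print(compiles("\\"))                                           # lone backslash is invalid
```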

not starting with ‘ ?! ‘

Use ?! to say “not starting with”, also called negative lookahead. The brackets are required. The brackets have to be followed by a regular expression defining what the token should consist of. Use .* for any token. Use ... for 3-letter tokens. Use [[:upper:]]* for tokens consisting of uppercase characters, etc.

regular expression	matching result(s)
(?!NP).*	all POS tags not starting with NP
(?!th)...	all 3-character words not starting with “th”
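Negative lookahead has the same syntax in Python's `re` module, so both table rows can be tried with a minimal sketch. The tag and word lists are hypothetical, and whole-token matching is assumed.

```python
import re

def matching(pattern, items):
    """Return the items that the pattern matches in full."""
    return [s for s in items if re.fullmatch(pattern, s)]

tags = ["NP", "NPS", "NN", "VVD"]
words = ["the", "thy", "cat", "them", "dog"]

print(matching("(?!NP).*", tags))    # tags not starting with NP
print(matching("(?!th)...", words))  # 3-character words not starting with th
```

The word them is excluded twice over: it starts with th and it has four characters.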

backreferences

Since manatee 2.65, it is possible to place brackets around one or several parts of a regular expression and refer to those parts later. The first part in brackets is referred to with number 1, the second with number 2, etc. This only works within one token, e.g. [word="(ba).*\1.*"] finds baseball, basketball, etc. The N-grams tool also supports backreferences across different tokens, e.g. (.*) or \1 finds occurrences such as may or may, do or do, etc.

regular expression	matching result(s)
(abra)kad\1 (the number must be escaped)	abrakadabra
(a)(b)(c)\3\2\1	abccba
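Backreferences use the same \1, \2 notation in Python's `re` module. The sketch below checks the two table rows and then imitates a cross-token case by searching a plain string of space-separated words; that last part is only an approximation of what an n-gram tool does, and the helper names are hypothetical.

```python
import re

def full(pattern: str, text: str) -> bool:
    """True if the pattern matches the whole text."""
    return re.fullmatch(pattern, text) is not None

print(full(r"(abra)kad\1", "abrakadabra"))  # \1 repeats "abra"
print(full(r"(a)(b)(c)\3\2\1", "abccba"))   # groups replayed in reverse order
print(full(r"(a)(b)(c)\3\2\1", "abccab"))   # wrong order, no match

def repeated_around_or(text: str):
    """Find 'X or X' where both sides are the same word (string-based approximation)."""
    m = re.search(r"\b(\w+) or \1\b", text)
    return m.group(1) if m else None

print(repeated_around_or("you may or may not"))
print(repeated_around_or("do or die"))
```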

What is a symbol that stands for one or more unspecified characters in the search criterion?

A wildcard character is used as a symbol that stands for one or more unspecified characters.

Which wildcard character replaces a single character?

Alternatively referred to as a wild character or wildcard character, a wildcard is a symbol used to replace or represent one or more characters. The most common wildcards are the asterisk (*), which typically represents zero or more characters, and the question mark (?), which represents a single character.