Linux regular expressions

Linux Text File Search with Regular Expressions for Matching Text Patterns

You can use grep on the command line to search for words and phrases. You can also use it to search for complex text patterns called regular expressions. A regular expression, or “regexp”, is a text string containing special characters that specifies a set of patterns to match.

In a regular expression, most characters—including letters and numbers—represent themselves. For example, the regexp pattern 2 matches the string ‘2’, and the pattern apple matches the string ‘apple’.
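As a minimal sketch of literal matching, the commands below pipe a few made-up sample lines into grep. The sample text and the variable names are assumptions chosen for illustration.

```shell
# Hypothetical sample input: three short lines.
sample='apple pie
2 pears
plum'

# A literal pattern selects every line that contains that exact string.
apple_lines=$(printf '%s\n' "$sample" | grep 'apple')
two_lines=$(printf '%s\n' "$sample" | grep '2')

printf '%s\n' "$apple_lines"   # apple pie
printf '%s\n' "$two_lines"     # 2 pears
```

grep prints whole lines, not just the matched text, so each result is the full line in which the pattern was found.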

A number of reserved characters, called metacharacters, do not represent themselves in a regular expression; instead they have special meanings that are used to build complex patterns. These metacharacters are: ., *, [, ], ^, $, and \. They behave the same way in grep across nearly all Linux distributions.

The following table describes the special meanings of the metacharacters and gives examples of their usage.

Metacharacter  Meaning
.   Matches any one character, with the exception of the newline character. For example, . matches ‘a’, ‘1’, ‘?’, ‘.’ (a literal period character), and so forth.
*   Matches the preceding regexp zero or more times. For example, -* matches ‘-’, ‘--’, ‘---’, ‘------’, and so forth. Now imagine a line of text with a million ‘-’ characters somewhere in it, all marching off across the horizon, up into the blue sky, and through the clouds. A million ‘-’ characters in a row. This pattern would match it. Now think of the same long parade, but with a million and one ‘-’ characters: it matches that, too.
[]  Encloses a character set, and matches any member of the set; for example, [abc] matches either ‘a’, ‘b’, or ‘c’. In addition, the hyphen (‘-’) and caret (‘^’) characters have special meanings when used inside brackets:
      • The hyphen specifies a range of characters, ordered according to their ASCII value. For example, [0-9] is synonymous with [0123456789]; [A-Za-z] matches one uppercase or lowercase letter. To include a literal ‘-’ in a list, specify it as the last character in the list: so [0-9-] matches either a single digit character or a ‘-’.
      • As the first character of a list, the caret means that any character except those in the list should be matched. For example, [^a] matches any character except ‘a’, and [^0-9] matches any character except a numeric digit.
^   Matches the beginning of the line. So ^a matches ‘a’ only when it is the first character on a line.
$   Matches the end of the line. So a$ matches ‘a’ only when it is the last character on a line.
\   Use \ before a metacharacter when you want to specify that literal character. So \$ matches a dollar sign character (‘$’), and \\ matches a single backslash character (‘\’). In addition, \ placed before certain other characters builds new metacharacters.
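The basic metacharacters can be tried out with a short script such as the one below. It is only a sketch: the sample text, the `count` helper, and the variable names are hypothetical, and `grep -c` is used to count matching lines rather than print them.

```shell
# Hypothetical sample text, one item per line.
text='a
abc
b-52
cost: $5
-----
banana'

# count PATTERN: number of lines in $text that match PATTERN.
# -e keeps grep from treating a pattern that starts with '-' as an option.
count() { printf '%s\n' "$text" | grep -c -e "$1"; }

dot=$(count 'b.n')        # 'b', any one character, 'n'    -> 1 (banana)
star=$(count '^-*$')      # lines made only of hyphens     -> 1
digit=$(count '[0-9]')    # lines containing a digit       -> 2
negated=$(count '^[^a]')  # lines not starting with 'a'    -> 4
start=$(count '^a')       # lines starting with 'a'        -> 2
end=$(count 'a$')         # lines ending with 'a'          -> 2
dollar=$(count '\$')      # lines with a literal '$'       -> 1

printf '%s %s %s %s %s %s %s\n' \
  "$dot" "$star" "$digit" "$negated" "$start" "$end" "$dollar"
```

The patterns are wrapped in single quotes so that the shell passes characters such as `$`, `*`, and `\` to grep unchanged.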

 

Here are the metacharacters built by placing \ before another character:

  • \|
    Called the ‘alternation operator’; it matches either regexp it is between—use it to join two separate regexps to match either of them. For example, a\|b matches either ‘a’ or ‘b’.
  • \+
    Matches the preceding regexp as many times as possible, but at least once. So a\+ matches one or more adjacent ‘a’ characters, such as ‘aaa’, ‘aa’, and ‘a’.
  • \?
    Matches the regexp preceding it either zero or one times. So a\? matches ‘a’ or an empty string—which matches every line.
  • \{number\}
    Matches the previous regexp (the one specified to the left of this construction) that number of times, so a\{4\} matches ‘aaaa’. Use

      • \{number,\} to match the preceding regexp number or more times,
      • \{,number\} to match the preceding regexp zero to number times, and
      • \{number1,number2\} to match the preceding regexp from number1 to number2 times.
  • \(regexp\)
    Groups regexp together so that an operator or alternative applies to the whole group; useful for combination regexps. For example, while moo\? matches ‘mo’ or ‘moo’,

      • \(moo\)\? matches ‘moo’ or the empty string.
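The backslash constructs above can be checked one at a time with a sketch like the following. It assumes GNU grep, where `\|`, `\+`, and `\?` are available in basic regular expressions as GNU extensions; the `check` helper is a hypothetical name introduced here. The `-x` option makes grep require that the pattern match the entire line, which makes the effect of each operator easier to see.

```shell
# check PATTERN STRING: report whether PATTERN matches all of STRING.
# -x = whole-line match, -q = no output (only the exit status is used).
check() {
  if printf '%s\n' "$2" | grep -qx -e "$1"; then
    printf '%s matches "%s"\n' "$1" "$2"
  else
    printf '%s does not match "%s"\n' "$1" "$2"
  fi
}

check 'a\|b' 'b'          # a\|b matches "b"
check 'a\+' 'aaa'         # a\+ matches "aaa"
check 'a\+' ''            # a\+ does not match ""
check 'a\?' ''            # a\? matches ""
check 'a\{4\}' 'aaaa'     # a\{4\} matches "aaaa"
check 'a\{2,3\}' 'aaaa'   # a\{2,3\} does not match "aaaa"
check 'moo\?' 'mo'        # moo\? matches "mo"
check '\(moo\)\?' 'mo'    # \(moo\)\? does not match "mo"
```

The last two lines show why grouping matters: without the parentheses, `\?` applies only to the final ‘o’; with them, it applies to the whole of ‘moo’.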

Regexps for Common Situations

The following lists sample regexps and describes what they match. You can use these regexps as templates when building your own regular expressions for searching text. Remember to enclose regexps in quotes.

To Match Lines That=> Use This Regexp

contain nine zeroes in a row=> 0\{9\}
are exactly four characters long=> ^....$ or ^.\{4\}$
are exactly seventy characters long=> ^.\{70\}$
begin with an asterisk character=> ^\*
begin with ‘tow’ and end with ‘ing’=> ^tow.*ing$
contain a number=> [0-9]
do not contain a number=> ^[^0-9]*$
contain a year from 2011 through 2017=> 201[1-7]
contain a year from 1957 through 1969=> \(195[7-9]\)\|\(196[0-9]\)
contain either ‘.txt’ or ‘.text’=> \.te\?xt
contain ‘cat’ then ‘gory’ in the same word=> cat[A-Za-z]\+gory
contain ‘cat’ then ‘gory’ in the same line=> cat.*gory
contain a ‘q’ not followed by a ‘u’=> q[^u]
contain any ftp, gopher, or ‘http’ URLs=> \(ftp\|gopher\|http\)://.*\..*
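A few of these templates can be tried against sample input with a sketch like the one below. The sample lines and the `pick` helper are hypothetical, and GNU grep is assumed for the `\?` operator.

```shell
# Hypothetical sample input, one entry per line.
data='towing
* starred
budget 2015
notes.text
abcd
Iraq war
quit'

# pick PATTERN: print the lines of $data that match PATTERN.
pick() { printf '%s\n' "$data" | grep -e "$1"; }

pick '^tow.*ing$'   # towing
pick '^\*'          # * starred
pick '201[1-7]'     # budget 2015
pick '\.te\?xt'     # notes.text
pick '^.\{4\}$'     # abcd and quit
pick 'q[^u]'        # Iraq war
```

Each pattern is enclosed in single quotes, as the text recommends, so that the shell does not expand `*` or `$` or strip the backslashes before grep sees them.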

 
