Useful regular expressions

Regex:	\{\d*\}
What it does:	This matches Déjà Vu X3 tags, that is, a pair of curly brackets around a number, such as {1} or {215}. You can use it, for example, to quickly find and replace Déjà Vu X3 tags in an alignment project before sending it to a TM.
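
To try the pattern outside Déjà Vu, here is a minimal Python sketch. The sample segment is made up, and Python's re module may differ in places from the engine Déjà Vu uses.

```python
import re

# Pattern from above: an opening brace, zero or more digits, a closing brace.
TAG = re.compile(r"\{\d*\}")

# Made-up sample segment containing four tags.
segment = "{1}Click {2}Save{3} to continue.{4}"

tags_found = TAG.findall(segment)   # every tag, in order of appearance
without_tags = TAG.sub("", segment) # the segment with all tags removed

print(tags_found)
print(without_tags)
```

Because the pattern uses \d* rather than \d+, it also matches an empty pair of braces, {}.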

Regex:	<[^>]+>
What it does:	This finds any HTML tag, such as <a>, <b>, <img /> or <br />. You can use it to find segments containing HTML tags that you need to deal with, or to remove all HTML tags from a text.
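
A short Python sketch of both uses, listing the tags and stripping them. The sample text is invented for illustration.

```python
import re

# Pattern from above: "<", then one or more characters that are not ">", then ">".
HTML_TAG = re.compile(r"<[^>]+>")

# Invented sample text with inline HTML.
text = 'Press <b>Enter</b> to open the <a href="help.html">help page</a>.<br />'

tags = HTML_TAG.findall(text)    # all tags, including attributes
plain = HTML_TAG.sub("", text)   # text with every tag removed

print(tags)
print(plain)
```

The pattern does not check that a match is real HTML: any text that sits between < and >, such as the middle of "a < b > c", is matched as well.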

Regex:	https?:\/\/[\w\.\/\-?=&%,]+
What it does:	This finds most URLs that begin with http:// or https://.
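
The following Python sketch shows what the pattern captures in a made-up sentence. The example.com and example.org addresses are placeholders.

```python
import re

# Pattern from above: the scheme, then a run of word characters and a few
# punctuation characters that commonly occur in URLs.
URL = re.compile(r"https?:\/\/[\w\.\/\-?=&%,]+")

text = ("See https://www.example.com/docs/index.html?id=42&lang=en for details, "
        "or http://example.org.")

urls = URL.findall(text)
# The character class includes "." and ",", so sentence punctuation directly
# after a URL is captured too; trimming it afterwards is one possible cleanup.
cleaned = [u.rstrip(".,") for u in urls]

print(urls)
print(cleaned)
```

Characters outside the class, such as ":" (port numbers) and "#" (fragments), end the match early, which is why the pattern captures most URLs rather than all of them.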

Regex:	'\w+?'
What it does:	This finds single words that are surrounded by apostrophes.

Regex:	([-A-Za-z0-9_]*?([-A-Za-z_][0-9]|[0-9][-A-Za-z_])[-A-Za-z0-9_]*)
What it does:	This finds alphanumeric part numbers and references such as 1111_A, AA1AAA, 1-1-1-A, 21A1, 10UC10P-BACW, abcd-1234, 1234-pqtJK, sft-0021, 21-1_AB, 55A or AK7_GY. This can be very useful if you are translating documents that contain a lot of alphanumeric codes or references and you need to be able to find them easily.
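
A Python sketch of how such a pattern behaves on running text. It assumes the form of the pattern with a lazy *? after the first character class and a * after the last one, so that whole codes are matched rather than fragments. That exact form is an assumption, and the sample sentence is invented.

```python
import re

# Assumed form of the pattern: an optional run of code characters, then a
# letter/hyphen/underscore next to a digit (in either order), then the rest
# of the code.
PART_NO = re.compile(
    r"([-A-Za-z0-9_]*?([-A-Za-z_][0-9]|[0-9][-A-Za-z_])[-A-Za-z0-9_]*)"
)

# Invented sample sentence.
text = "Replace part 10UC10P-BACW with abcd-1234 or sft-0021 before fitting 55A."

# group(0) is the whole match; the inner group only holds the digit/letter pair.
codes = [m.group(0) for m in PART_NO.finditer(text)]

print(codes)
```

Ordinary words are skipped because they never place a digit next to a letter, hyphen or underscore. A hyphen counts as a code character, so a bare "-5" would also be reported.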

Regex:

What it does:	This finds text that begins with "the" or "The" and ends with a stop word such as is, are, was, can, shall, must, that, which, about, by, at, if, when, should, among, above or under, or with the end of the segment.
This is particularly useful when you need to extract terminology. Suppose you have segments like these:

The Web based look up is our new feature.
A project manager should not proofread...
Our Product Name is...

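The Regex shown above finds everything from The up to, but not including, the next stop word, such as is or should. In the first segment it finds The Web based look up; segments that begin with A or Our need the variants listed below. With most texts, there is a good chance that whatever this Regex finds is a good term that you can add to your Termbase.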
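
The pattern itself is not reproduced in this entry. The Python sketch below therefore uses a reconstruction modelled on the a/an variant listed further down, with the stop word "should" added from the description. Treat the pattern as an assumption rather than the original.

```python
import re

# Assumed pattern: "the"/"The", then as little text as possible, up to a word
# boundary that is followed by a stop word (or the end of the segment).
THE_TERM = re.compile(
    r"\b(?:the|The)\b.*?\b"
    r"(?=\W?\b(?:is|are|was|can|shall|must|that|which|about|by|at|if|when|"
    r"should|among|above|under|$)\b)"
)

# The three sample segments from the text above.
segments = [
    "The Web based look up is our new feature.",
    "A project manager should not proofread...",
    "Our Product Name is...",
]

# Collect the first candidate term from each segment that has one.
candidates = [m.group(0) for s in segments if (m := THE_TERM.search(s))]

print(candidates)
```

Only the first segment yields a candidate, because the other two do not start with "the" or "The".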

Regex:	\b(a|an|A|An)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)
What it does:	This works much like the Regex shown above, except that it finds text that begins with a or an, rather than the. This can also be very helpful when you need to extract terminology from a project.

Regex:	\b(this|these|This|These)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)
What it does:	This works much like the Regex shown above, except that it finds text that begins with this or these. This can also be very helpful when you need to extract terminology from a project.

Regex	Description
\b(our|your)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)	Find any text starting with "your" or "our" and ending with stop words such as "is", "are", etc. (or the end of the segment).
\b(all|every)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)	Find any text starting with "all" or "every" and ending with stop words such as "is", "are", etc. (or the end of the segment).
\b(the|an|a|all|every|this|these|our|your)\b.*?\b(?=\W?\b(is|are|was|can|shall|must|that|which|about|by|at|if|when|among|above|under|$)\b)	Find any text starting with "the", "an", "a", "all", "every", "this", "these", "our" or "your" and ending with stop words such as "is", "are", etc. (or the end of the segment).
\b(?:high|low|medium)\b	Find segments containing "high", "low" or "medium"
\bcolou?r\b	Find "colour" or "color"
\b\w*phobia\b	Find words ending with "phobia"
\b(?!low\b)\w*low\w*	Find any word containing "low" except the word "low" itself (finds: flow, lower, lowest, allow)
\b\w+\b(?=\W+are\b)	Find any word followed by a specific word (here, finds any word followed by "are")
(?<!\balarm)(?:\W+|^)(\w+)	Find any word not preceded by a specific word (here, "alarm")
(?<=\balarm)(?:\W+|^)(\w+)	Find any word that is preceded by a specific word (here, "alarm")
\b(?:acknowledged\W+(?:\w+\W+){0,5}?alarm|alarm\W+(?:\w+\W+){0,5}?acknowledged)\b	Find two words near each other, up to 5 words apart (finds: "alarm can and must be acknowledged", "alarm is acknowledged" and "acknowledged an alarm")
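
The lookahead, lookbehind and proximity entries in the table are easier to follow with sample data. The Python sketch below runs three of them against invented sentences. Python's re module is used for illustration only, and its behaviour may differ in places from the engine Déjà Vu uses.

```python
import re

# Lookahead: any word that is followed by "are".
before_are = re.compile(r"\b\w+\b(?=\W+are\b)")

# Lookbehind: the word that comes right after "alarm" (captured in group 1).
after_alarm = re.compile(r"(?<=\balarm)(?:\W+|^)(\w+)")

# Proximity: "alarm" and "acknowledged" in either order, up to 5 words apart.
near = re.compile(
    r"\b(?:acknowledged\W+(?:\w+\W+){0,5}?alarm"
    r"|alarm\W+(?:\w+\W+){0,5}?acknowledged)\b"
)

# Invented sample sentences.
words_before_are = before_are.findall("Alarms are logged and reports are archived.")
words_after_alarm = after_alarm.findall("Reset the alarm panel now.")
hit = near.search("The alarm can and must be acknowledged by the operator.")
proximity_match = hit.group(0) if hit else None

print(words_before_are)
print(words_after_alarm)
print(proximity_match)
```

With findall, a pattern that contains one capturing group returns the captured word rather than the whole match, which is why the lookbehind example returns only the word after "alarm".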
