Complete List of Regular Expressions Basics - ByteScout

In computing, we spend an enormous amount of time working with text, and for good reason: computers communicate a lot with other computers and also need to interact with humans. One of the most basic ways to communicate is via text.

But you’ll often need to check whether what the user typed in your super form is appropriate. You will also want to extract data from text, or filter text to replace some things with others (for example, mass-renaming a variable or function in source code).

Regular expressions are invaluable for all of these tasks: they offer some of the most advanced techniques available for each of them. Below, I will show you 20 regular expression examples to showcase how they can be useful to you.

Introducing Regular Expressions

Every language, and sometimes different libraries in the same language, can have their own regular expression syntax. However, the most common symbols tend to be shared. For example, many languages and tools (PHP, Python, JavaScript, sed in extended mode, etc.) define the + symbol as “repeat one or more times”.

Regular expressions are then used in two modes: match and replace. The two are linked, but they don’t do the same thing. Match mode checks whether some text follows a specific regular expression pattern and extracts data from it. Replace mode, on the other hand, converts text from one format to another.
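
As a rough illustration of the two modes, here is a minimal Python sketch. The sample sentence and the variable names are made up for this example; re.search and re.sub are the standard library calls for matching and replacing.

```python
import re

# Hypothetical sample text, used only for this illustration.
text = "the cat sat, the cat slept"

# Match mode: check whether the pattern occurs and find out where.
match = re.search(r"cat", text)
first_position = match.start() if match else None

# Replace mode: rewrite every occurrence of the pattern.
rewritten = re.sub(r"cat", "dog", text)
```

Here the pattern is a plain word, so it behaves like an ordinary text search; the special characters introduced in the rest of the article are what make patterns flexible.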

In this section, we’re going to focus mainly on the match mode and on regular expression pattern matching.

First, you may want to match a phone number by checking if it’s in US format. With Python regular expressions, we might do this:

import re  # needed by every Python snippet in this article
re.search(r"\([0-9]+\) [0-9]+-[0-9]+", phoneNumber)

Well, let me explain. A regular expression is a string containing the literal text you want to match, plus special characters that let you match variable content. For instance, the + symbol here means “match one or more of the preceding element” in regular expression syntax.

[0-9] means “any digit”. It uses a technique called character classes: a character class matches a single character if it is one of those listed in the brackets. We could also have written [0123456789], but that’s much longer and brings no benefit.

Then, the parentheses have a backslash before them because parentheses have a special meaning in regex language (they group parts of a pattern). This way, you can say to the computer “I want to find the parenthesis character itself” instead of using that special meaning.

Okay, but if you test it in a Python terminal, you will see that it also matches numbers such as “(3) 234-3493”, because the regex only asks for “one or more” digits and not for a specific count. This can be fixed as follows:

re.search(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}", phoneNumber)

Here {3} and {4} mean “exactly three” and “exactly four” repetitions. With that, it should be more than enough to restrict users. However, as you can see, regular expressions like this are more complex to write. When you’re using them for a quick task you likely won’t go that far; this level of precision is more for production usage.
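
To see the difference between the two quantifiers, the following Python sketch runs both patterns against the same inputs. The constant names and the sample numbers are invented for illustration.

```python
import re

LOOSE = re.compile(r"\([0-9]+\) [0-9]+-[0-9]+")          # "one or more" digits everywhere
STRICT = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")  # exact digit counts

samples = ["(555) 234-3493", "(3) 234-3493"]

# Keep the samples each pattern accepts.
loose_hits = [s for s in samples if LOOSE.search(s)]
strict_hits = [s for s in samples if STRICT.search(s)]
```

The loose pattern accepts both samples, while the strict one keeps only the number with a three-digit area code.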

Now, if you build a web app, you’ll probably have to check for the format of usernames so it doesn’t go out of bounds. A general rule for usernames is that they should only contain alphanumeric characters. Display names may contain a space but usernames generally do not.

Let’s do that in Python again:

re.search("^[a-z0-9]+$", username, re.IGNORECASE | re.ASCII)

In Python, [a-z] combined with the ignore-case flag can match a few non-ASCII characters, while programmers usually expect an ASCII string; the re.ASCII flag prevents that. Also, when a regular expression starts with ^ and ends with $, it tells Python to match only if the whole string follows the pattern (with one exception: $ also matches just before a trailing newline).
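
Both points can be checked with a short Python sketch. The Kelvin sign is one of the few non-ASCII letters that a case-insensitive [a-z] accepts; the variable names are made up for this example. One further detail this sketch also shows: $ tolerates a trailing newline, whereas re.fullmatch does not.

```python
import re

KELVIN_SIGN = "\u212a"  # looks like "K" but is not an ASCII letter

# Without re.ASCII, case-insensitive [a-z] also accepts a few non-ASCII letters.
unicode_hit = re.search(r"^[a-z0-9]+$", KELVIN_SIGN, re.IGNORECASE)
ascii_hit = re.search(r"^[a-z0-9]+$", KELVIN_SIGN, re.IGNORECASE | re.ASCII)

# "$" tolerates one trailing newline; re.fullmatch checks the entire string.
dollar_hit = re.search(r"^[a-z0-9]+$", "alice\n", re.IGNORECASE | re.ASCII)
full_hit = re.fullmatch(r"[a-z0-9]+", "alice\n", re.IGNORECASE | re.ASCII)
```

If a trailing newline must be rejected, re.fullmatch is one option worth considering instead of ^ and $.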

This also means that our phone regex above, which has no ^ or $, matches if it finds a phone number in the middle of a string. That’s convenient for data extraction, as we’ll see soon.
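
As a small preview of that extraction use, here is a Python sketch that pulls every phone number out of a longer text with re.findall. The message is a hypothetical example.

```python
import re

PHONE = re.compile(r"\([0-9]{3}\) [0-9]{3}-[0-9]{4}")

# Hypothetical message; only the two phone numbers matter here.
message = "Call (555) 234-3493 before noon, or (555) 867-5309 after."

# With no capturing group in the pattern, findall returns the full matches.
numbers = PHONE.findall(message)
```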

Now, if you register a user, you’ll ask for an e-mail address, right? Then let’s run a basic check:

re.search(r"^[a-z0-9!#$%&'*+/=?^_`{|}~.-]+@[a-z0-9.-]+\.[a-z0-9-]+$", emailAddress, re.IGNORECASE | re.ASCII)

Okay, that one isn’t so basic, I agree, but it’s simpler than it appears, and its thoroughness is what makes it useful. The first part represents the name of the mailbox (on Gmail, basically your username); the Internet standard for e-mail allows a lot of special characters in this part.

Then comes the at sign (@), and finally the domain name. To catch common mistakes, the regex checks that the domain name contains at least one dot, as in gmail.com.

If you insert this e-mail address into a SQL database, please make sure to pass the value through a prepared statement or to escape it properly, because of the special characters an e-mail address may contain.
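
A minimal sketch of the prepared-statement approach, using Python’s built-in sqlite3 module. The in-memory database, the table name and the sample address are assumptions made for this example, not part of any real application.

```python
import sqlite3

# In-memory database and table name are placeholders for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT)")

email = "o'brien+news@example.com"  # the quote would break a naively built query

# The "?" placeholder lets the driver pass the value separately from the SQL text.
conn.execute("INSERT INTO users (email) VALUES (?)", (email,))

stored = conn.execute("SELECT email FROM users").fetchone()[0]
```

The address is stored unchanged, quote included, because it never becomes part of the SQL text itself.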

Now, let’s check out a date via a regex. We’ll assume here that the format is MM-DD-YYYY, and we don’t support two-digit years because that’s just a nightmare to manage:

re.search("[01]?[0-9]-[0-3]?[0-9]-[0-9]+", theDate)

A regex can’t check on its own whether a date is valid (and you would probably want to exclude some ranges of years as well if you’re not building an archive), but it helps to filter a lot.

Here we filtered out dates with MM above 19, as those probably come from people who typed the date as DD-MM-YYYY instead. The same kind of filter applies to days, which can’t be above 39 with this pattern. Zero-padding has been made optional thanks to the ? operator, which means “it may appear once or not at all”.
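
Since the regex is only a first filter, one possible way to finish the job in Python is to hand the captured numbers to the datetime module. This is a sketch under assumptions: the capturing groups, the function name and the use of fullmatch are additions made for this example.

```python
import re
from datetime import date

# Same date pattern, with capturing groups added around month, day and year.
DATE_RE = re.compile(r"([01]?[0-9])-([0-3]?[0-9])-([0-9]+)")

def is_real_date(text):
    """Return True if text looks like MM-DD-YYYY and is a real calendar date."""
    match = DATE_RE.fullmatch(text)
    if match is None:
        return False  # rejected by the regex filter
    month, day, year = (int(part) for part in match.groups())
    try:
        date(year, month, day)  # raises if the date does not exist
    except (ValueError, OverflowError):
        return False
    return True
```

For example, “02-30-2024” passes the regex but is rejected by the calendar check, while “25-12-2023” is already rejected by the regex.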

Now let’s match time instead of date. Here’s the regex implemented in Java this time:

import java.util.regex.Pattern; // I'll omit it in next snippets for readability

// Wrap snippets inside a class and a function
Pattern timeRegex = Pattern.compile("[01]?[0-9](?::[0-5]?[0-9])+ [AP]M", Pattern.CASE_INSENSITIVE);

timeRegex.matcher(timeString).find();

Yes, it’s not a one-liner like in Python, but that’s how Java regular expressions are designed after all. I don’t use the Pattern.matches static method because it doesn’t let you pass regular expression flags, something you often need (ignore case, etc.).

There’s a new syntax I used here: (?:something), a group that is not captured. I have used it because minutes and seconds share the same syntax, so I can use + to avoid typing the same rule twice in the regex. Shorter regular expressions tend to be more readable. Regexes can also be combined, and often carried over from one language to another; here the date pattern from the Python example is placed in front of the time pattern:

Pattern dateTimeRegex = Pattern.compile("[01]?[0-9]-[0-3]?[0-9]-[0-9]+ [01]?[0-9](?::[0-5]?[0-9])+ [AP]M", Pattern.CASE_INSENSITIVE);

dateTimeRegex.matcher(dateTimeString).find();
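
The non-capturing group can be tried in Python too. In this sketch the colon is placed inside the repeated group, so that “:MM” and “:SS” really share one rule; that placement, the constant name and the sample times are assumptions made for this illustration.

```python
import re

# Hour, then one or more ":NN" groups (minutes, optionally seconds), then AM/PM.
TIME_RE = re.compile(r"[01]?[0-9](?::[0-5]?[0-9])+ [AP]M", re.IGNORECASE)

with_minutes = TIME_RE.fullmatch("9:05 am")
with_seconds = TIME_RE.fullmatch("11:59:59 PM")
hour_only = TIME_RE.fullmatch("9 PM")

# A (?:...) group is not captured, so the pattern exposes no groups at all.
group_count = TIME_RE.groups
```

The first two inputs match, while “9 PM” does not, because + requires the group to appear at least once.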

Sometimes regular expressions apply better to files than to user input. Let’s find JavaScript variable declarations with a regex:

Pattern varRegex = Pattern.compile("(var|let) ([a-zA-Z0-9_$]+)(?: = .*)?");

varRegex.matcher(lineString).find();

I’ve used new operators: the . dot and the * asterisk. In a regular expression, a dot means “any character”, while * means “zero, one, or more occurrences”. Note that if a developer doesn’t put spaces around the equal sign, only the declaration itself is matched and the assignment is left out, even though that syntax is valid JavaScript.
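
The same pattern can be tried from Python to see what each group captures. This is only a sketch: the sample lines are invented, and the character class for the variable name is limited to letters, digits, “_” and “$” as an assumption about valid JavaScript identifiers.

```python
import re

# Keyword, variable name, then an optional " = ..." part that is not captured.
VAR_RE = re.compile(r"(var|let) ([a-zA-Z0-9_$]+)(?: = .*)?")

spaced = VAR_RE.search("let total = 42;")
compact = VAR_RE.search("var x=5;")

spaced_groups = spaced.groups()   # keyword and variable name
compact_text = compact.group(0)   # the part of the line that was matched
```

With spaces around the equal sign the whole line is matched; without them, the match stops right after the variable name.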

What about matching MIME types? You often use them when setting up your HTTP server, in HTTP requests or when you write HTML.

Pattern mimeRegex = Pattern.compile("([a-z0-9.+-]+)/([a-z0-9.+-]+)");

mimeRegex.matcher(mimeString).find();

For our last Java regex in this list, let’s showcase a common regular expression example: matching a file name with a proper extension. Here’s the code:

Pattern fileRegex = Pattern.compile("([a-z0-9._ -]+)\\.([a-z0-9]+)", Pattern.CASE_INSENSITIVE);

fileRegex.matcher(filename).find();
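
For comparison, here is how the same file name pattern could be used in Python to split a name from its extension. The function name is hypothetical, and fullmatch is used so that the whole file name has to fit the pattern.

```python
import re

FILE_RE = re.compile(r"([a-z0-9._ -]+)\.([a-z0-9]+)", re.IGNORECASE)

def split_extension(filename):
    """Return (name, extension), or None if the file name does not fit the pattern."""
    match = FILE_RE.fullmatch(filename)
    return match.groups() if match else None
```

Because + is greedy, a name such as “archive.tar.gz” is split at the last dot, giving “archive.tar” and “gz”.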

We’ve matched enough things for now. There’s also another powerful side of regex: data extraction and replacement! Time to take a look more closely at this convenient capability offered by regular expressions!

Replacing with Regular Expressions

Replacing with regular expressions lets you extract data and change its format. And if you’re worried because you have mainly used your text editor’s search & replace function and find it rather simple, rest assured: regular expression search & replace is far more powerful.

Let’s take an example with JavaScript regular expressions. You want a date to be transformed from MM-DD-YYYY to DD/MM/YYYY? Well:

let dateRegex = /([01]?[0-9])-([0-3]?[0-9])-([0-9]+)/g;

dateString.replace(dateRegex, "$2/$1/$3");

JavaScript has a literal syntax that lets you type a regex easily without using strings, as you can see in the example above. In the replacement string, $1, $2 and $3 refer to the first, second and third parenthesized groups of the regex. The small letter g after the regex means “replace all occurrences, not only the first”. Because yes, replace will change any date it finds, even in the middle of a document.

If you use it in the middle of a document, however, you may want to replace a date only when it stands on its own. Otherwise, you may literally match dates in the middle of a longer number. So:

let dateRegex = /\b([01]?[0-9])-([0-3]?[0-9])-([0-9]+)\b/g;

dateString.replace(dateRegex, "$2/$1/$3");

The \b operator matches a word boundary: it only allows spaces or other word separators around the match, so we know we’re not in the middle of a number. Note that it also works in Java and Python regexes, so don’t hesitate to use it; but when you write the regex inside an ordinary string, you need to double the backslash: “\\b”.
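
The effect of \b can be checked with a Python sketch that runs the date replacement with and without it. The sample text is invented. Note that Python writes group references in the replacement as \1, \2 and \3, and that re.sub replaces every occurrence by default.

```python
import re

PLAIN = re.compile(r"([01]?[0-9])-([0-3]?[0-9])-([0-9]+)")
BOUNDED = re.compile(r"\b([01]?[0-9])-([0-3]?[0-9])-([0-9]+)\b")

# Hypothetical text: a real date followed by a reference number that only looks like one.
text = "Due 12-25-2023, ref 9912-25-2023"

plain_result = PLAIN.sub(r"\2/\1/\3", text)
bounded_result = BOUNDED.sub(r"\2/\1/\3", text)
```

Without \b, the reference number is rewritten as well; with \b, only the standalone date changes.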

Next example in our regular expression list: normalizing file extensions. For example, .jpg and .jpeg are equivalent, but you may not want to use both on your server, so you do this replacement:

let jpgRegex = /\b([a-z0-9._ -]+)\.jpe?g\b/g;

filename.replace(jpgRegex, "$1.jpg");

It’s also common for programmers to replace things in URLs. In some cases, you might use a URL parsing library, but you can also use a regex. Say we need to replace Google’s Spain or Mexico URLs with google.com:

let urlRegex = /https?:\/\/(?:www\.)?google\.(?:com\.mx|es)\/?/g;

url.replace(urlRegex, "https://www.google.com/");
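
The same replacement can be sketched in Python with re.sub. The function name and the sample URLs are assumptions made for this example.

```python
import re

GOOGLE_RE = re.compile(r"https?://(?:www\.)?google\.(?:com\.mx|es)/?")

def to_google_com(url):
    """Rewrite Google Spain/Mexico URLs to google.com; leave other URLs alone."""
    return GOOGLE_RE.sub("https://www.google.com/", url)
```

The path and query string after the domain are kept, since the pattern only covers the scheme, the host and an optional trailing slash.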

How about replacing one HTML tag with another? There are many situations where the HTML tag used isn’t the most appropriate one, but you realize it too late. Well, not that late in fact. This one replaces div with span:

let htmlRegex = /<(\/?)div\b/g;

htmlString.replace(htmlRegex, "<$1span");
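
Here is a Python sketch of the same tag replacement. The \b after div is an addition of this sketch, so that a longer tag name starting with “div” is left alone; the sample HTML is invented.

```python
import re

# \b is an addition in this sketch: it keeps tags such as <divider> untouched.
DIV_RE = re.compile(r"<(/?)div\b")

html = '<div class="note">Hello</div> <divider>'

# Group 1 holds the optional "/" so closing tags stay closing tags.
converted = DIV_RE.sub(r"<\1span", html)
```

Both the opening and the closing tag are rewritten, and the attributes are kept because the pattern stops right after the tag name.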

Some of us use PHP more than JavaScript (we’re looking at you, WordPress plugin & theme developers!), and after all, any check must be done on the server side as well. In the US, the dollar sign goes before the number, not after. Let’s enforce this:

preg_replace("/\\b([0-9]+) ?\\\$/", "\\\$\$1", $textContent);

As the $ is not a word character, the \b operator doesn’t work after it; that’s why the \b is only at the beginning. This regular expression rule is pretty hard to guess.

If you run a radio station or keep a list of songs, you may have written the titles in this format:

Title - Artist

But you may wish to reverse it, as media players generally do. PHP regular expressions can help you with that:

preg_replace("/^(.+) - (.+)\$/", "$2 - $1", $songTitle);
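
A Python sketch of the same swap follows. It uses .+ for both parts so that titles and artist names may contain spaces, which is an assumption about the data; the function name is hypothetical.

```python
import re

def artist_first(song_title):
    """Turn 'Title - Artist' into 'Artist - Title'; leave other strings unchanged."""
    return re.sub(r"^(.+) - (.+)$", r"\2 - \1", song_title)
```

A string without the “ - ” separator does not match and is returned as is.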

PHP is often used for websites with user-generated content, and sometimes you want to control the formatting. If you want to allow users to use HTML IMG tags in their content, you may still want to filter the attributes:

preg_replace("/<img[^>]*src=[\"'](https:\\/\\/[^'\"]+)[\"'][^>]*>/", "<img src=\"\$1\">", $userContent);

This one is long and complex enough that it deserves the extended regular expression badge! Note that it only matches img tags whose src is an HTTPS URL; tags with any other kind of URL are not matched and need separate handling.

Regular expressions not only work on specific computing formats; you can also use them on free text. Say you want the brand Coca-Cola to always appear with its proper spelling and case:

preg_replace("/\\b(?:(?:Coca[- ]?)?Cola|Coca|Coke)\\b/i", "Coca-Cola®", $userContent);

Please note that ordering matters. Putting the Cola check first, with the optional Coca before it, prevents the string “Coca-Cola” from being replaced twice, as Coca and Cola can also be matched separately.
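
The effect of ordering can be observed with a Python sketch that runs the same alternatives in two different orders. In this sketch the separator is only allowed right after “Coca”, and the reversed pattern exists purely for comparison; both are assumptions made for illustration.

```python
import re

BRAND = "Coca-Cola®"

# Longest alternative first; the separator is only allowed right after "Coca".
GOOD_ORDER = re.compile(r"\b(?:(?:Coca[- ]?)?Cola|Coca|Coke)\b", re.IGNORECASE)
# Same alternatives, shortest first (for comparison only).
BAD_ORDER = re.compile(r"\b(?:Coca|Coke|(?:Coca[- ]?)?Cola)\b", re.IGNORECASE)

good = GOOD_ORDER.sub(BRAND, "coca cola")
bad = BAD_ORDER.sub(BRAND, "coca cola")
```

With the longest alternative first, “coca cola” becomes a single brand name; with the reversed order, each word is replaced on its own and the brand appears twice.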

For our last example in this regular expressions guide, you may want to put in bold a specific word that a user searched for. As PHP regular expressions are strings, it’s really convenient to do:

preg_replace("/\\b(" . preg_quote($wordSearched, "/") . ")\\b/", "<b>\$1</b>", $content);
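
A comparable Python sketch is shown below. Wrapping the word in <b> tags is an assumption about the intended markup, and the function name is hypothetical; re.escape makes sure that special characters in the searched word are treated literally.

```python
import re

def highlight(word, content):
    """Wrap each whole-word occurrence of `word` in <b> tags."""
    # Escape the user's word so characters like "." lose their special meaning.
    pattern = r"\b(" + re.escape(word) + r")\b"
    return re.sub(pattern, r"<b>\1</b>", content)
```

For instance, searching for “a.b” highlights “a.b” but not “aXb”, because the dot is escaped before it is placed in the pattern.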

Conclusion

You knew the power of Superman; now you know the power of regular expressions! Don’t hesitate to use them, but keep in mind that regular expressions generally check many cases, but not all of them, unless you make them really complex.
Thanks for reading this article – stay tuned with ByteScout!

About the Author

ByteScout Team of Writers. ByteScout has a team of professional writers proficient in different technical topics. We select the best writers to cover interesting and trending topics for our readers. We love developers and we hope our articles help you learn about programming and programmers.