RegEx¶
Settings¶
Setting | Description |
---|---|
re.match(pattern, string) |
First match in string from start |
re.search(pattern, string) |
First match in string |
re.findall(pattern, string) |
Find all matches in string |
re.sub(pattern, repl, string) |
Replace all matches in string |
re.compile(pattern, flags=0) |
Compile str to RegEx object |
Meta Characters¶
Meta Character | Description | Example |
---|---|---|
[ ] |
Match any character in the brackets | [a-c] |
[^ ] |
Match any character except in [ ] | [^5] |
\ |
Escape character to escape class or specific match | \d |
\d |
Escape character for digits | \d |
\w |
Escape character for word characters | \w |
\s |
Escape character for whitespace | \s |
\S |
Escape character for non-whitespace | \S |
* |
Character occurs 0 or more times | a* |
+ |
Character occurs 1 or more times | a+ |
? |
Character occurs 0 or 1 times | a? |
{n} |
Character occurs exactly n times | a{3} |
{n,} |
Character occurs n or more times | a{3,} |
| | Either or | a|b |
Flags¶
Flag | long name | Description |
---|---|---|
re.A |
re.ASCII |
ASCII only matching |
re.I |
re.IGNORECASE |
Case insensitive matching |
re.M |
re.MULTILINE |
Multiline matching |
re.S |
re.DOTALL |
. special character, matches all characters |
re.X |
re.VERBOSE |
Verbose RegEx |
re.L |
re.LOCALE |
Matching based on locale language |
Common Patterns¶
Remove redundant whitespaces
text = "if you want to remove redundant whitespaces"
re.sub(r"\s+", " ", text)
# >>> 'if you want to remove redundant whitespaces'
Remove special characters
text = "remove # special @@ characters % ..."
re.sub(r"[^a-zA-Z0-9]+", "", text)
## >>> 'remove special characters'
Keep only numeric/alphabetic characters
text = "numbers 123 and letters abc"
re.sub(r"[^a-zA-Z]+", "", text)
re.sub(r"[^0-9]+", "", text)
# >>> 'numbersandletters'
# >>> '123'
Get URLs from text
text = "This is a text with a URL www.google.com and another URL https://www.zhaw.ch"
re.findall(r"https?://\S+|www\.\S+", text)
# >>> ['www.google.com', 'https://www.zhaw.ch']
Get email addresses from text
text = "This is a text with an email hallo@gmail.com"
re.findall("[\w\.-]+@[\w\.-]+\.\w+", text)
# >>> ['hallo@gmailcom']
Get HTML tags from text
text = "This is a text with a <b>bold</b> tag and a <a href='https://www.zhaw.ch'>link</a>"
re.findall(r"<.*?>", text)
# >>> ['<b>', '</b>', '<a href='https://www.zhaw.ch'>', '</a>']
Remove HTML tags from text
text = "This is a text with a <b>bold</b> tag and a <a href='https://www.zhaw.ch'>link</a>"
re.sub(r"<.*?>", "", text)
# >>> 'This is a text with a bold tag and a link'