RegEx

Functions

Function	Description
`re.match(pattern, string)`	Match only at the start of the string
`re.search(pattern, string)`	First match anywhere in the string
`re.findall(pattern, string)`	List of all non-overlapping matches in the string
`re.sub(pattern, repl, string)`	Replace all matches in the string with `repl`
`re.compile(pattern, flags=0)`	Compile a pattern `str` into a reusable RegEx object
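
The difference between these functions is easiest to see on one short string. The following is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

text = "cat bat cat"

# re.match only succeeds if the pattern matches at position 0.
no_match = re.match(r"bat", text)             # None: "bat" is not at the start
at_start = re.match(r"cat", text)             # match object for "cat"

# re.search scans the string and returns the first match anywhere.
first_bat = re.search(r"bat", text)           # match at span (4, 7)

# re.findall returns all non-overlapping matches as a list of strings.
all_words = re.findall(r"[cb]at", text)       # ['cat', 'bat', 'cat']

# re.sub replaces every match with the replacement string.
replaced = re.sub(r"cat", "dog", text)        # 'dog bat dog'

# re.compile builds a pattern object whose methods mirror the module functions.
word = re.compile(r"\w+")
compiled_words = word.findall(text)           # ['cat', 'bat', 'cat']
```

Compiling is mostly worthwhile when the same pattern is applied many times, for example inside a loop.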

Meta Characters

Meta Character	Description	Example
`[ ]`	Match any one character listed in the brackets	[a-c]
`[^ ]`	Match any one character not listed in the brackets	[^5]
`\`	Escape a meta character or introduce a special sequence	\.
`\d`	Match any digit	\d
`\w`	Match any word character (letter, digit, or underscore)	\w
`\s`	Match any whitespace character	\s
`\S`	Match any non-whitespace character	\S
`*`	Preceding element occurs 0 or more times	a*
`+`	Preceding element occurs 1 or more times	a+
`?`	Preceding element occurs 0 or 1 times	a?
`{n}`	Preceding element occurs exactly n times	a{3}
`{n,}`	Preceding element occurs n or more times	a{3,}
`|`	Match either the left or the right alternative	a|b
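
A short sketch of how the meta characters above behave in practice. The sample strings and variable names are illustrative assumptions, chosen only to make each result easy to check.

```python
import re

# Character classes: [a-c] accepts a, b, or c; [^5] accepts anything except "5".
in_class = re.findall(r"[a-c]", "abcde")            # ['a', 'b', 'c']
not_five = re.findall(r"[^5]", "4556")              # ['4', '6']

# A backslash turns the meta character "." into a literal dot.
dots = re.findall(r"\.", "a.b.c")                   # ['.', '.']

# Shorthand sequences.
digits = re.findall(r"\d+", "room 12, floor 3")     # ['12', '3']
words = re.findall(r"\w+", "snake_case, x1")        # ['snake_case', 'x1']
chunks = re.findall(r"\S+", "a  b\tc")              # ['a', 'b', 'c']

# Quantifiers apply to the element directly before them.
star = re.fullmatch(r"ba*", "b") is not None        # True: zero "a" is allowed
plus = re.fullmatch(r"ba+", "b") is not None        # False: needs at least one "a"
optional = re.findall(r"colou?r", "color colour")   # ['color', 'colour']
exact = re.findall(r"a{3}", "aa aaa aaaa")          # ['aaa', 'aaa']
at_least = re.findall(r"a{3,}", "aa aaa aaaa")      # ['aaa', 'aaaa']

# Alternation matches either side of the pipe.
pets = re.findall(r"cat|dog", "cat, dog, bird")     # ['cat', 'dog']
```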

Flags

Flag	Long name	Description
`re.A`	`re.ASCII`	`\w`, `\d`, `\s` and similar classes match ASCII characters only
`re.I`	`re.IGNORECASE`	Case-insensitive matching
`re.M`	`re.MULTILINE`	`^` and `$` also match at the start and end of each line
`re.S`	`re.DOTALL`	`.` matches any character, including newline
`re.X`	`re.VERBOSE`	Allow whitespace and comments inside the pattern
`re.L`	`re.LOCALE`	Locale-dependent matching (bytes patterns only)
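
Flags are passed through the `flags` argument and can be combined with `|`. The sketch below shows the effect of the most common ones; the sample strings and variable names are assumptions made for illustration.

```python
import re

text = "First line\nsecond LINE"

# re.I: ignore case.
ignore_case = re.findall(r"line", text, flags=re.I)        # ['line', 'LINE']

# re.M: ^ matches at the start of every line, not only the start of the string.
line_starts = re.findall(r"^\w+", text, flags=re.M)        # ['First', 'second']

# re.S: without it "." stops at the newline, with it "." crosses lines.
without_s = re.search(r"First.*LINE", text)                # None
with_s = re.search(r"First.*LINE", text, flags=re.S)       # match object

# re.A: \w is limited to ASCII, so the accented letter is not a word character.
unicode_words = re.findall(r"\w+", "caf\u00e9")            # ['café']
ascii_words = re.findall(r"\w+", "caf\u00e9", flags=re.A)  # ['caf']

# re.X: whitespace and comments inside the pattern are ignored.
decimal = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
""", flags=re.X)
decimals = decimal.findall("pi is 3.14, e is 2.718")       # ['3.14', '2.718']

# Flags combine with the bitwise OR operator.
combined = re.findall(r"^second", "First\nSECOND", flags=re.I | re.M)  # ['SECOND']
```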

Common Patterns

Remove redundant whitespace

import re

text = "if     you    want    to    remove    redundant    whitespaces"
re.sub(r"\s+", " ", text)

# >>> 'if you want to remove redundant whitespaces'

Remove special characters

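text = "remove # special @@ characters % ..."
cleaned = re.sub(r"[^a-zA-Z0-9\s]+", "", text)  # keep letters, digits and whitespace
re.sub(r"\s+", " ", cleaned).strip()            # collapse the gaps left behind

# >>> 'remove special characters'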

Keep only numeric/alphabetic characters

text = "numbers 123 and letters abc"
re.sub(r"[^a-zA-Z]+", "", text)  # keep letters only
re.sub(r"[^0-9]+", "", text)     # keep digits only

# >>> 'numbersandlettersabc'
# >>> '123'

Get URLs from text

text = "This is a text with a URL www.google.com and another URL https://www.zhaw.ch"
re.findall(r"https?://\S+|www\.\S+", text)

# >>> ['www.google.com', 'https://www.zhaw.ch']

Get email addresses from text

text = "This is a text with an email hallo@gmail.com"
re.findall(r"[\w\.-]+@[\w\.-]+\.\w+", text)

# >>> ['hallo@gmail.com']

Get HTML tags from text

text = "This is a text with a <b>bold</b> tag and a <a href='https://www.zhaw.ch'>link</a>"
re.findall(r"<.*?>", text)

# >>> ['<b>', '</b>', "<a href='https://www.zhaw.ch'>", '</a>']

Remove HTML tags from text

text = "This is a text with a <b>bold</b> tag and a <a href='https://www.zhaw.ch'>link</a>"
re.sub(r"<.*?>", "", text)

# >>> 'This is a text with a bold tag and a link'
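
Several of the patterns above can be chained into a single cleaning step. The following is a minimal sketch, assuming a hypothetical helper `clean_text` and precompiled patterns named `TAG` and `WHITESPACE`; none of these names come from the `re` module itself.

```python
import re

# Precompiled patterns (names are illustrative).
TAG = re.compile(r"<.*?>")
WHITESPACE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML-like tags, then collapse runs of whitespace (hypothetical helper)."""
    no_tags = TAG.sub(" ", raw)
    return WHITESPACE.sub(" ", no_tags).strip()


cleaned = clean_text("A <b>bold</b>   tag and a <a href='https://www.zhaw.ch'>link</a>")
# cleaned == 'A bold tag and a link'
```

Tags are replaced with a space rather than an empty string so that words separated only by a tag do not get glued together; the whitespace pass then removes the extra spaces.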