regular expression

Naranjito 2022. 4. 26. 13:58
  • re

Regular Expression.

<1. Data>

0                   (직) 데친고사리 1kg(냉장}
1                   (직) 깐도라지채 1kg(냉장}
2                   콩나물 박스 4kg(상 곱슬이)
3                   *삼색수제비 1kg(동성 냉동)
4                   *)자숙바지락살 350g(냉동)

<2. Remove parenthesized text>
import re
# Match "(", then one or more characters other than ")", then ")"
ptn = r"\([^)]+\)"
prodlist = [re.sub(ptn, "", s).strip() for s in prodlist]


['데친고사리 1kg(냉장}',
 '깐도라지채 1kg(냉장}',
 '콩나물 박스 4kg',
 '*삼색수제비 1kg',
 '*)자숙바지락살 350g']
<3. Remove quantities and weight units>
prodlist = [re.sub("([0-9]+)?(k)?g", "", s).strip() for s in prodlist]

['데친고사리 (냉장}',
 '깐도라지채 (냉장}',
 '콩나물 박스',
 '*삼색수제비',
 '*)자숙바지락살']
<4. Remove the leading "*)" and "*" markers>
prodlist = [s.replace("*)", "") if s.startswith("*)") else s for s in prodlist]
prodlist = [s.replace("*", "") if s.startswith("*") else s for s in prodlist]


['데친고사리 (냉장}',
 '깐도라지채 (냉장}',
 '콩나물 박스',
 '삼색수제비',
 '자숙바지락살']
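
The three cleanup steps above can be sketched as one runnable script. This is only an illustration: the helper name `clean_name` is hypothetical, and the pattern in step 2 is assumed to be `\([^)]+\)`, which is what the listed outputs imply.

```python
import re

# Sample product names copied from the data listing above.
prodlist = [
    "(직) 데친고사리 1kg(냉장}",
    "(직) 깐도라지채 1kg(냉장}",
    "콩나물 박스 4kg(상 곱슬이)",
    "*삼색수제비 1kg(동성 냉동)",
    "*)자숙바지락살 350g(냉동)",
]

def clean_name(name):  # hypothetical helper name
    # Step 2: drop "(...)" groups that are closed by a matching ")"
    name = re.sub(r"\([^)]+\)", "", name).strip()
    # Step 3: drop an optional number followed by "g" or "kg"
    name = re.sub(r"([0-9]+)?(k)?g", "", name).strip()
    # Step 4: drop a leading "*)" or "*" marker
    if name.startswith("*)"):
        name = name.replace("*)", "")
    if name.startswith("*"):
        name = name.replace("*", "")
    return name

cleaned = [clean_name(name) for name in prodlist]
print(cleaned)
```

Note that "(냉장}" survives because it opens with "(" but closes with "}", so the step 2 pattern never finds its closing ")".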

A similar cleanup can be written as a single function with precompiled patterns, as below.

<1. Remove every character except spaces, Korean characters, and brackets>
ptn1 = re.compile(r'[^ ㄱ-ㅎㅏ-ㅣ가-힣\(\)\{\}\[\]]')

<2. Collapse repeated spaces into a single space>
ptn2 = re.compile(' +')

<3. Remove bracketed text, from the first opening bracket to the last closing bracket>
ptn3 = re.compile(r"[\(\{\[].*[\)\}\]]")

<4. Remove unit and packaging words>
ptn4 = re.compile(' +박스| +한판| +마리| +개| +망| +단위발주량| +단위발주| +봉| +추천| +내외| +통| +과일| +직| +공급중| +월요| +입고불가| +매| +할| +ea|[km]?[gl]| +완제품|할+ ')

def processing(text):
    result = ptn1.sub("", text)
    result = ptn2.sub(' ', result)
    result = ptn3.sub("", result)
    result = ptn4.sub("", result)
    return result.strip()

from tqdm.notebook import tqdm
result = [processing(t) for t in tqdm(prodlist)]
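
One thing to watch in the bracket pattern is that `.*` is greedy. The sketch below compares it with a non-greedy variant on one sample name. The non-greedy pattern is an assumption added for illustration, not part of the original code.

```python
import re

# One of the sample names after digits and Latin letters have been removed.
name = "(직) 데친고사리 (냉장}"

# Greedy: ".*" runs from the first opening bracket to the last closing bracket.
greedy = re.compile(r"[\(\{\[].*[\)\}\]]")
# Non-greedy variant (an assumption, not the original pattern):
# ".*?" stops at the nearest closing bracket instead.
lazy = re.compile(r"[\(\{\[].*?[\)\}\]]")

greedy_result = greedy.sub("", name).strip()
lazy_result = lazy.sub("", name).strip()

print(repr(greedy_result))
print(repr(lazy_result))
```

With the greedy pattern the whole name is removed, because the first "(" and the final "}" enclose everything. The non-greedy variant removes each bracketed group separately and keeps the product name.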


1. Search and Match

search : Scans through the whole string and returns the first match.

match : Matches only at the beginning of the string.

import re
>>><re.Match object; span=(3, 6), match='abc'>


>>><re.Match object; span=(0, 3), match='abc'>
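
A minimal sketch of the difference between the two functions. The input strings here are hypothetical, chosen only so that the spans agree with the outputs shown above.

```python
import re

# Hypothetical sample strings; only the pattern 'abc' comes from the outputs above.
found = re.search("abc", "xyzabc")     # scans the whole string
at_start = re.match("abc", "abcxyz")   # succeeds: the string starts with 'abc'
missed = re.match("abc", "xyzabc")     # fails: 'abc' is not at the beginning

print(found)
print(at_start)
print(missed)
```

When nothing matches, both functions return None, so the result should be checked before calling methods such as span() or group() on it.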


2. split

re.split(pattern, string)

<Split by whitespace>

text='Family is not an important thing. It\'s everything.'
re.split(' ',text)

>>>['Family', 'is', 'not', 'an', 'important', 'thing.', "It's", 'everything.']

<Split by '+'>
text='Family +is +not +an important thing. It\'s everything.'
re.split(r'\+',text)

>>>['Family ', 'is ', 'not ', "an important thing. It's everything."]
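
Because "+" is a quantifier in regular expressions, it has to be escaped when it is used as a literal separator. The following sketch shows what happens without the escape, and adds a whitespace-absorbing variant that is an assumption of this example rather than something from the original post.

```python
import re

text = "Family +is +not +an important thing. It's everything."

# An unescaped "+" is a quantifier with nothing to repeat, so re rejects it.
try:
    re.split("+", text)
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Escaped "+": split on the literal plus sign.
parts = re.split(r"\+", text)

# Optional variant: also absorb the whitespace around each "+".
trimmed = re.split(r"\s*\+\s*", text)

print(unescaped_ok)
print(parts)
print(trimmed)
```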


3. findall

Return all non-overlapping matches of pattern in string.

\d : Matches any decimal digit.

example="Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes. 1, 2, 3"
re.findall(r'\d',example)

['1', '2', '3']
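
With single-digit numbers, `\d` and `\d+` give the same result. The difference shows up on multi-digit numbers, as in this small sketch. The sample string is hypothetical.

```python
import re

sample = "Room 12, floor 3"  # hypothetical input

# \d matches one digit at a time.
single_digits = re.findall(r"\d", sample)
# \d+ matches each run of consecutive digits as one item.
whole_numbers = re.findall(r"\d+", sample)

print(single_digits)
print(whole_numbers)
```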


4. sub

Replaces every match of the pattern with another string.

re.sub(pattern, repl, string)

# Keep only letters: replace every non-letter character with a space
re.sub('[^a-zA-Z]',' ',example)
'Complete in all things  names and heights and soundings  with the single exception of the red crosses and the written notes         '
# Lowercase each sentence and replace runs of non-alphanumeric characters with a single space.
for string in sent_text:
    tokens = re.sub(r"[^a-z0-9]+", " ", string.lower())
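
The loop above relies on a list named `sent_text` that is not defined in this post. A self-contained sketch, assuming it is simply a list of sentences, could look like this. The sample sentences are hypothetical.

```python
import re

# Hypothetical sentences standing in for sent_text.
sent_text = ["Hello, World!", "It's 2022."]

cleaned = []
for string in sent_text:
    # Lowercase first, then turn every run of characters outside a-z and 0-9
    # into one space; strip() removes the space left at the end.
    tokens = re.sub(r"[^a-z0-9]+", " ", string.lower()).strip()
    cleaned.append(tokens)

print(cleaned)
```

Lowercasing before the substitution matters here: the character class only lists a-z, so uppercase letters would otherwise be replaced as well.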


  • escape

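Escape special characters with a backslash. In the raw-string pattern below, \d matches a digit and \2 refers back to the second group, so the pattern matches the same digit repeated five or more times in a row.

pattern = r'((\d)\2{4,})'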
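
A short sketch of how this pattern behaves, together with `re.escape`, which escapes special characters in a literal string automatically. The input strings are hypothetical, and pairing the pattern with `re.escape` is a choice made for this example.

```python
import re

repeated_digit = re.compile(r"((\d)\2{4,})")

# Hypothetical inputs: one contains five identical digits in a row, one does not.
hit = repeated_digit.search("order 1111122")
miss = repeated_digit.search("order 12345")

# re.escape builds a pattern that matches the given text literally,
# even when it contains characters such as "(", "+", or ")".
literal = "1kg(+)"
escaped = re.escape(literal)
matches_itself = re.fullmatch(escaped, literal) is not None

print(hit.group(1))
print(miss)
print(matches_itself)
```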




  • RegexpTokenizer

Tokenization using regular expressions.

gaps: True if this tokenizer's pattern should be used to find separators between tokens; False (the default) if the pattern matches the tokens themselves.

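from nltk.tokenize import RegexpTokenizer

# Assumed pattern for the first tokenizer: runs of word characters
tokenizer1=RegexpTokenizer(r'\w+')
tokenizer2=RegexpTokenizer(r'\s+', gaps=True)
print(tokenizer1.tokenize(example))
print(tokenizer2.tokenize(example))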


['Complete', 'in', 'all', 'things', 'names', 'and', 'heights', 'and', 'soundings', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '1', '2', '3']
['Complete', 'in', 'all', 'things--names', 'and', 'heights', 'and', 'soundings--with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes.', '1,', '2,', '3']
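
The two outputs above can be approximated with the standard `re` module alone, which may help to see what `gaps` changes. This is a sketch of the idea, not NLTK's implementation, and it assumes the first output came from a word-character pattern such as `\w+`.

```python
import re

example = ("Complete in all things--names and heights and soundings--with "
           "the single exception of the red crosses and the written notes. 1, 2, 3")

# Pattern describes the tokens (like gaps=False): keep runs of word characters.
word_tokens = re.findall(r"\w+", example)

# Pattern describes the separators (like gaps=True): split on whitespace,
# so punctuation stays attached to the neighbouring word.
gap_tokens = re.split(r"\s+", example)

print(word_tokens)
print(gap_tokens)
```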

