- re
Regular expressions with Python's built-in `re` module.
<1. Data>
prodlist
>>>
0 (직) 데친고사리 1kg(냉장}
1 (직) 깐도라지채 1kg(냉장}
2 콩나물 박스 4kg(상 곱슬이)
3 *삼색수제비 1kg(동성 냉동)
4 *)자숙바지락살 350g(냉동)
<2. Replace bracketed text with the empty string>
ptn = r"\([^)]+\)"
prodlist = [re.sub(ptn, "", s).strip() for s in prodlist]
prodlist
>>>
['데친고사리 1kg(냉장}',
'깐도라지채 1kg(냉장}',
'콩나물 박스 4kg',
'*삼색수제비 1kg',
'*)자숙바지락살 350g']
<3. Remove the numbers and units (kg, g)>
prodlist = [re.sub(r"([0-9]+)?(k)?g", "", s).strip() for s in prodlist]
prodlist
>>>
['데친고사리 (냉장}',
'깐도라지채 (냉장}',
'콩나물 박스',
'*삼색수제비',
'*)자숙바지락살']
<4. Remove the leading special characters>
prodlist = [s.replace("*)", "") if s.startswith("*)") else s for s in prodlist]
prodlist = [s.replace("*", "") if s.startswith("*") else s for s in prodlist]
prodlist
>>>
['데친고사리 (냉장}',
'깐도라지채 (냉장}',
'콩나물 박스',
'삼색수제비',
'자숙바지락살']
The steps above can be done in one pass with the compiled patterns below.
<1. Remove everything except spaces, Hangul, and brackets>
ptn1 = re.compile(r'[^ ㄱ-ㅎㅏ-ㅣ가-힣\(\)\{\}\[\]]')
<2. Collapse runs of spaces into a single space>
ptn2 = re.compile(r' +')
<3. Remove bracketed text>
ptn3 = re.compile(r"[\(\{\[].*[\)\}\]]")
<4. Remove unit and quantity words>
ptn4 = re.compile(r' +박스| +한판| +마리| +개| +망| +단위발주량| +단위발주| +봉| +추천| +내외| +통| +과일| +직| +공급중| +월요| +입고불가| +매| +할| +ea|[km]?[gl]| +완제품|할+ ')
def processing(text):
    result = ptn1.sub("", text)
    result = ptn2.sub(" ", result)
    result = ptn3.sub("", result)
    result = ptn4.sub("", result)
    return result.strip()
from tqdm.notebook import tqdm
result = [processing(t) for t in tqdm(prodlist)]
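As a quick sanity check, the pipeline can be exercised on one raw product name. This is a self-contained sketch: it repeats the patterns above, but keeps only a subset of the ptn4 word list for brevity.

```python
import re

# Same patterns as above; ptn4 keeps only the words needed for this demo
ptn1 = re.compile(r'[^ ㄱ-ㅎㅏ-ㅣ가-힣\(\)\{\}\[\]]')  # keep spaces, Hangul, brackets
ptn2 = re.compile(r' +')                              # runs of spaces
ptn3 = re.compile(r'[\(\{\[].*[\)\}\]]')              # bracketed text (greedy)
ptn4 = re.compile(r' +박스| +개')                      # unit words (subset)

def processing(text):
    result = ptn1.sub('', text)     # drop digits, Latin letters, symbols
    result = ptn2.sub(' ', result)  # collapse spaces
    result = ptn3.sub('', result)   # drop bracketed annotations
    result = ptn4.sub('', result)   # drop unit words
    return result.strip()

print(processing('콩나물 박스 4kg(상 곱슬이)'))  # → 콩나물
```

Note that ptn3 uses a greedy `.*`, so a name containing two separate bracket groups would lose everything between the first opening and the last closing bracket; a non-greedy `.*?` may be safer in that case.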
1. Search and Match
search : scans through the whole string and returns the first match found anywhere.
match : returns a match only if the pattern matches at the beginning of the string.
import re
r=re.compile('ab.')
r.search('kkkabc')
>>><re.Match object; span=(3, 6), match='abc'>
r.match('kkkabc')
>>> (no output: match() returned None because 'ab.' does not match at the start)
r.match('abccc')
>>><re.Match object; span=(0, 3), match='abc'>
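The Match object returned by search and match carries the matched text and its position; a short sketch of the standard accessors (`group`, `start`, `end`, `span`):

```python
import re

r = re.compile('ab.')
m = r.search('kkkabc')
print(m.group())           # abc — the matched text
print(m.start(), m.end())  # 3 6 — indices into the original string
print(m.span())            # (3, 6)

# match() returns None when the pattern is absent at position 0,
# so guard before using the result
if r.match('kkkabc') is None:
    print('no match at the beginning')
```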
2. split
re.split(pattern, string)
<Split by whitespace>
text='Family is not an important thing. It\'s everything.'
re.split(' ',text)
>>>['Family', 'is', 'not', 'an', 'important', 'thing.', "It's", 'everything.']
<Split by '+'>
text='Family +is +not +an important thing. It\'s everything.'
re.split(r'\+', text)
>>>['Family ', 'is ', 'not ', "an important thing. It's everything."]
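Splitting on a literal space leaves empty strings wherever the text contains consecutive spaces; splitting on the pattern `\s+` (one or more whitespace characters) avoids that:

```python
import re

text = 'Family  is   not an important thing.'
print(re.split(' ', text))     # empty strings appear between consecutive spaces
print(re.split(r'\s+', text))  # ['Family', 'is', 'not', 'an', 'important', 'thing.']
```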
3. findall
Return all non-overlapping matches of pattern in string.
\d : matches a decimal digit
example="Complete in all things--names and heights and soundings--with the single exception of the red crosses and the written notes. 1, 2, 3"
re.findall(r'\d+', example)
>>>
['1', '2', '3']
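One detail worth knowing about findall: when the pattern contains a capturing group, it returns what the group captured rather than the whole match:

```python
import re

s = 'a1 b22 c333'
print(re.findall(r'[a-z]\d+', s))    # ['a1', 'b22', 'c333'] — whole matches
print(re.findall(r'[a-z](\d+)', s))  # ['1', '22', '333']    — group contents only
```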
4. sub
Replace every match of the pattern with another string.
re.sub(pattern, repl, string)
# Keep only the alphabet; replace every non-alphabetic character with a space
re.sub('[^a-zA-Z]',' ',example)
>>>
'Complete in all things names and heights and soundings with the single exception of the red crosses and the written notes '
# Remove punctuation and convert to lowercase
# (sent_text is assumed to be a list of sentences)
tokens = [re.sub(r"[^a-z0-9]+", " ", string.lower()) for string in sent_text]
- escape
Escapes regex metacharacters in a string so they can be matched literally.
pattern = r'((\d)\2{4,})'
print(re.escape(pattern))
>>>
\(\(\\d\)\\2\{4,\}\)
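A typical use of escape is embedding a literal string that contains metacharacters inside a pattern. A small sketch (the sample strings here are made up from the product data above):

```python
import re

literal = '1kg(냉장)'  # '(' and ')' are regex metacharacters
ptn = re.compile(re.escape(literal))

print(ptn.search('데친고사리 1kg(냉장)'))  # finds the literal text
print(ptn.search('데친고사리 1kg 냉장'))   # None — the parentheses must match literally
```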
- RegexpTokenizer
Tokenization with a regular expression, using NLTK's RegexpTokenizer.
gaps: True if this tokenizer's pattern should be used to find separators between tokens.
from nltk.tokenize import RegexpTokenizer
tokenizer1 = RegexpTokenizer(r'\w+')
tokenizer2 = RegexpTokenizer(r'\s+', gaps=True)
print(tokenizer1.tokenize(example))
print(tokenizer2.tokenize(example))
>>>
['Complete', 'in', 'all', 'things', 'names', 'and', 'heights', 'and', 'soundings', 'with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes', '1', '2', '3']
['Complete', 'in', 'all', 'things--names', 'and', 'heights', 'and', 'soundings--with', 'the', 'single', 'exception', 'of', 'the', 'red', 'crosses', 'and', 'the', 'written', 'notes.', '1,', '2,', '3']
reference : https://wikidocs.net/21703