Replacement function for list of stuff. #23
Perhaps there could be an argument to the
The following code might be suitable for inclusion:
function replace_many(s::AbstractString, replacements::Dict{Regex, String})
    # Apply each pattern => replacement pair in turn.
    for (pattern, replacement) in replacements
        s = replace(s, pattern => replacement)
    end
    s
end
function replace_many!(d::FileDocument, replacements::Dict{Regex, String})
    error("FileDocument cannot be modified")
end
function replace_many!(d::StringDocument, replacements::Dict{Regex, String})
    d.text = replace_many(d.text, replacements)
    nothing
end
function replace_many!(d::TokenDocument, replacements::Dict{Regex, String})
    for i in 1:length(d.tokens)
        d.tokens[i] = replace_many(d.tokens[i], replacements)
    end
end
function replace_many!(d::NGramDocument, replacements::Dict{Regex, String})
    # Iterate over a snapshot of the keys, since the loop mutates d.ngrams.
    for token in collect(keys(d.ngrams))
        new_token = replace_many(token, replacements)
        if new_token != token
            if haskey(d.ngrams, new_token)
                d.ngrams[new_token] = pop!(d.ngrams, token) + d.ngrams[new_token]
            else
                d.ngrams[new_token] = pop!(d.ngrams, token)
            end
        end
    end
end
function replace_many!(crps::Corpus, replacements::Dict{Regex, String})
    for doc in crps
        replace_many!(doc, replacements)
    end
end

It handles simple replacements quite well, but I'm not sure whether Julia's replacement strings support the capture-group references that are commented out below:

REPLACEMENT_PATTERNS = Dict{Regex, String}(
    r"(?i)won\'t" => "will not",
    r"((?i)can\'t|(?i)can not)" => "cannot",
    r"(?i)i\'m" => "i am",
    r"(?i)ain\'t" => "am not",
    r"\'cause" => "because",
    # doesn't support these
    #r"(\w+)\'ll" => "\g<1> will",
    #r"(\w+)n\'t" => "\g<1> not",
    #r"(\w+)\'ve" => "\g<1> have",
    #r"(\w+t)\'s" => "\g<1> is",
    #r"(\w+)\'re" => "\g<1> are",
    #r"(\w+)\'d" => "\g<1> would",
)
doc = StringDocument("i won't have seen any of this, yet. i'm ready though. i'll be waiting.")
replace_many!(doc, REPLACEMENT_PATTERNS)
text(doc) # => "i will not have seen any of this, yet. i am ready though. i'll be waiting."
We need some way to pass in a dictionary or list of patterns and have the matching words replaced. This is important for standardizing synonyms, dates, and contractions. I think code like this makes tokenizing easier, especially since it eliminates apostrophes.
In Python, I had code that replaces contractions using regex.
'''
replacement_patterns = [
    (r"(?i)won't", "will not"),
    (r"(?i)(can't|can not)", "cannot"),
    (r"(?i)i'm", "i am"),
    (r"(?i)ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+t)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
    (r"'cause", "because"),
]
'''
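For context, here is a minimal sketch of how a pattern list like this might be applied in Python. The names `PATTERNS`, `COMPILED`, and `replace_contractions` are hypothetical, chosen only for illustration, and the list is a shortened subset. It relies on the standard `re` module, applies the pairs in list order, and puts the specific contractions before the generic capture-group rules.

```python
import re

# Hypothetical, shortened pattern list for illustration. Specific
# contractions come first so the generic capture-group rules below
# do not rewrite them incorrectly (e.g. "won't" -> "wo not").
PATTERNS = [
    (r"(?i)won't", "will not"),
    (r"(?i)(can't|can not)", "cannot"),
    (r"(?i)i'm", "i am"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
]

# Compile once so repeated calls do not re-parse the patterns.
COMPILED = [(re.compile(pattern), repl) for pattern, repl in PATTERNS]


def replace_contractions(text):
    """Apply every (pattern, replacement) pair to text, in list order."""
    for regex, repl in COMPILED:
        text = regex.sub(repl, text)
    return text


print(replace_contractions("i won't go, i'm tired, but i'll wait."))
```

Because later rules see the output of earlier ones, the order of the list is part of the behavior; an unordered container would not guarantee the same result.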
I'm not sure whether it is possible to do this in Julia.