Replacement function for list of stuff. #23

Open
ccsv opened this issue Oct 6, 2014 · 2 comments
ccsv commented Oct 6, 2014

We need a function that takes a dictionary or list of replacements and applies them to the text. This is important for standardizing synonyms, dates, and contractions. I think code like this makes tokenizing easier, especially since it eliminates apostrophes.

In Python I had code that replaces contractions using regex:

'''
# Order matters: specific contractions come before the generic suffix rules.
replacement_patterns = [
    (r"(?i)won't", "will not"),
    (r"(?i)can't|can not", "cannot"),
    (r"(?i)i'm", "i am"),
    (r"(?i)ain't", "is not"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"(\w+t)'s", r"\g<1> is"),
    (r"(\w+)'re", r"\g<1> are"),
    (r"(\w+)'d", r"\g<1> would"),
    (r"'cause", "because"),
]
'''
I'm not sure whether it is possible to do this in Julia.
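
For reference, here is a minimal sketch of how a pattern list like the one above could be applied in Python, assuming the rules are run in list order with the standard `re` module. The helper name `expand_contractions` and the shortened rule list are made up for illustration and are not part of any library.

```python
import re

# Hypothetical rule list for illustration (a subset of the patterns above).
# Specific contractions come first so the generic suffix rules do not
# rewrite them first (e.g. "won't" must not become "wo not").
REPLACEMENT_PATTERNS = [
    (r"(?i)won't", "will not"),
    (r"(?i)can't|can not", "cannot"),
    (r"(?i)i'm", "i am"),
    (r"(\w+)'ll", r"\g<1> will"),
    (r"(\w+)n't", r"\g<1> not"),
    (r"(\w+)'ve", r"\g<1> have"),
    (r"'cause", "because"),
]

# Compile once so repeated calls do not re-parse the patterns.
COMPILED = [(re.compile(p), r) for p, r in REPLACEMENT_PATTERNS]


def expand_contractions(text):
    """Apply each (pattern, replacement) pair to text, in list order."""
    for pattern, replacement in COMPILED:
        text = pattern.sub(replacement, text)
    return text


print(expand_contractions("I won't go, but she'll ask and they don't mind."))
```

Because the rules are kept in a list rather than a dictionary, the order in which they fire is explicit, which matters when a specific rule such as `won't` overlaps with a generic one such as `n't`.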

@timClicks

Perhaps there could be an argument to the TokenDocument constructor that allows for further pre-processing than what is provided by default.

@timClicks

The following code could be included in preprocessing.jl:

function replace_many(s::AbstractString, replacements::Dict{Regex, String})
  # Apply every pattern in turn; note that Dict iteration order is not guaranteed.
  for (pattern, replacement) in replacements
    s = replace(s, pattern => replacement)
  end
  s
end

function replace_many!(d::FileDocument, replacements::Dict{Regex, String})
    error("FileDocument cannot be modified")
end

function replace_many!(d::StringDocument, replacements::Dict{Regex, String})
    d.text = replace_many(d.text, replacements)
    nothing
end

function replace_many!(d::TokenDocument, replacements::Dict{Regex, String})
    for i in 1:length(d.tokens)
        d.tokens[i] = replace_many(d.tokens[i], replacements)
    end
end

function replace_many!(d::NGramDocument, replacements::Dict{Regex, String})
  for token in collect(keys(d.ngrams))  # snapshot the keys; the loop mutates d.ngrams
    new_token = replace_many(token, replacements)
    if new_token != token
      if haskey(d.ngrams, new_token)
        d.ngrams[new_token] = pop!(d.ngrams, token) + d.ngrams[new_token]
      else
        d.ngrams[new_token] = pop!(d.ngrams, token)
      end
    end
  end
end

function replace_many!(crps::Corpus, replacements::Dict{Regex, String})
  for doc in crps
    replace_many!(doc, replacements)
  end
end

It handles simple replacements quite well, but I'm not sure if Julia's replace function knows how to deal with regular expressions that refer back to groups.

REPLACEMENT_PATTERNS = Dict{Regex, String}(
  r"(?i)won\'t" => "will not",
  r"((?i)can\'t|(?i)can not)" => "cannot",
  r"(?i)i\'m" => "i am",
  r"(?i)ain\'t" => "am not",
  r"\'cause" => "because"

  # doesn't support these
  #r"(\w+)\'ll" => "\g<1> will",
  #r"(\w+)n\'t" => "\g<1> not",
  #r"(\w+)\'ve" => "\g<1> have",
  #r"(\w+t)\'s" => "\g<1> is",
  #r"(\w+)\'re" => "\g<1> are",
  #r"(\w+)\'d" => "\g<1> would",
)

doc = StringDocument("i won't have seen any of this, yet. i'm ready though. i'll be waiting.")
replace_many!(doc, REPLACEMENT_PATTERNS)
text(doc) # => "i will not have seen any of this, yet. i am ready though. i'll be waiting." ("i'll" is unchanged because the 'll rule is commented out)

@Ayushk4 mentioned this issue Jun 23, 2019