2009-03-01, 02:21   #1
ixfd64
Bemusing Prompter

"Danny"
Dec 2002
California

2×3×397 Posts
regular expression help

I'm having a bit of trouble with regular expressions in R.

I know the following regular expression

Code:
'^.*([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'

is supposed to return e-mail addresses using backreferences. However,

Quote:
emailpat = '^.*([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'
gsub(emailpat, '\\1', 'adadasd>aaaaaaabbbbbbb@cccc.com')

returns

Quote:
 [1] "b@cccc.com"
and cuts off the part of the e-mail address before the @ except for its last letter. Does anyone know what I'm doing wrong?

Thanks.

 2009-03-01, 03:33 #2
wblipp

I've run into different definitions of regular expressions from time to time. But assuming your situation matches what is described here: http://www.regular-expressions.info/reference.html your problem is that the first ".*" is greedy: it tries the longest possible match first, so it leaves the parenthesized group only the one character it needs before the @. I think you need to make it lazy, changing the .* to .*?

Code:
emailpat = '^.*?([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'
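For what it's worth, here is a minimal sketch of the greedy versus lazy behaviour in R. It assumes the PCRE engine, selected with perl = TRUE; whether the default engine accepts .*? may depend on the R version, so treat that part as an assumption. The names input, greedy_pat and lazy_pat are only for illustration, and the input is the string from the first post with its closing quote added.

```r
# Sketch only: compare the greedy and lazy prefixes under PCRE (perl = TRUE).
input <- 'adadasd>aaaaaaabbbbbbb@cccc.com'

greedy_pat <- '^.*([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'
lazy_pat   <- '^.*?([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'

# Greedy: the leading .* grabs as much as it can and leaves the group
# only the single character it needs before the "@".
greedy_result <- gsub(greedy_pat, '\\1', input, perl = TRUE)

# Lazy: the leading .*? grabs as little as it can, so the group starts
# at the first position from which a whole address can be matched.
lazy_result <- gsub(lazy_pat, '\\1', input, perl = TRUE)

print(greedy_result)  # [1] "b@cccc.com"
print(lazy_result)    # [1] "aaaaaaabbbbbbb@cccc.com"
```

If the lazy version does nothing for you without perl = TRUE, the engine in use probably does not treat the ? after .* as a laziness marker.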

If your environment doesn't support lazy quantifiers, then I'd suggest putting a "not a regular character" before the "one or more regular characters" and outside the parentheses, so

Code:
emailpat = '^.*[^-A-Za-z0-9_.%]([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'
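A rough sketch of how this pattern behaves, using R's default regex engine (no perl = TRUE). The second input string and the optional-prefix variant are hypothetical additions for illustration, not part of the original suggestion; the variant moves the address to group 2.

```r
# Sketch only: the boundary-character pattern with R's default regex engine.
input <- 'adadasd>aaaaaaabbbbbbb@cccc.com'

boundary_pat <- '^.*[^-A-Za-z0-9_.%]([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'

# The group must be preceded by a non-address character, and ">" is the
# only such character that is followed by a complete address.
boundary_result <- gsub(boundary_pat, '\\1', input)
print(boundary_result)  # [1] "aaaaaaabbbbbbb@cccc.com"

# Caveat: the pattern demands one non-address character before the address.
# With a hypothetical input that starts with the address there is no match,
# and gsub() returns the input unchanged.
leading_input <- 'bob@example.com is my address'
unchanged <- gsub(boundary_pat, '\\1', leading_input)
print(unchanged)  # [1] "bob@example.com is my address"

# Hypothetical variant: make the whole prefix optional, so the address
# becomes group 2 and may also sit at the start of the string.
optional_pat <- '^(.*[^-A-Za-z0-9_.%])?([-A-Za-z0-9_.%]+@[-A-Za-z0-9_.%]+\\.[A-Za-z]+).*$'
optional_leading <- gsub(optional_pat, '\\2', leading_input)
optional_original <- gsub(optional_pat, '\\2', input)
print(optional_leading)   # [1] "bob@example.com"
print(optional_original)  # [1] "aaaaaaabbbbbbb@cccc.com"
```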

good luck

William
 2009-03-01, 06:19 #3
ixfd64

The first one didn't work but the second did. Thanks so much!

