[NTLK] Removing mail addresses (Was: I want to publish NewtonTalk since 2000)

Frank Gruendel newtontalk at pda-soft.de
Mon Jan 9 14:32:58 EST 2012


> just run a find-replace command on the text files replacing the @ sign
> with something meaningless and machine-unreadable.

I've briefly contemplated this, but decided against it for two reasons:

1) There are a lot of @s in our digests that aren't part of an E-Mail
address. It wouldn't exactly improve comprehensibility if those were simply
removed or replaced with something else.

2) I'm pretty sure that modern spam-harvesting software would have a fit of
the giggles if it encountered this kind of obfuscation.

For example, in our digests each post has a line that looks like this:

	From: "Enrico Caruso" <enricocaruso@sopranoharem.com>

If we just removed the @, we'd get

	From: "Enrico Caruso" <enricocarusosopranoharem.com>

If yours truly were to write E-Mail address harvesting software, it would
be coded like this:

a) Find a line that starts with "From:"
b) Find the character "<"
c) Remove this character and all that's to the left of it.
d) Check that the last character of what remains is ">". If it is, remove
it.

OK, that's the address that our stupid author obfuscated. Let's see what we
can do to resurrect it...

e) Find the rightmost occurrence of the character ".".
f) Everything to the right of this position is the domain suffix. Put it
away for later.
g) Remove the suffix including the ".". The remainder is the concatenated
mail and domain name.

For performance reasons the following two steps are restricted to the
namespace denoted by our suffix.

h) Check what we have left against a reasonably current domain list.
i) Search for a domain whose name is a right-aligned substring of what we
have left.

j) If there is one, remove the domain name from the right. What's left now
is our E-Mail name.
k) Put a "@" between the E-Mail name and the domain name. Add a "." to the
end, then add the domain suffix.
l) Check if this is a valid address.
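The steps above can be sketched in a few lines of Python. This is only a rough illustration, not real harvesting software: the names KNOWN_DOMAINS, looks_valid and resurrect are made up for this example, the tiny dictionary stands in for "a reasonably current domain list", and the validity check is purely syntactic.

```python
import re

# Hypothetical stand-in for a reasonably current domain list, grouped by
# suffix so that the search stays inside one namespace.
KNOWN_DOMAINS = {
    "com": ["sopranoharem", "example"],
    "de": ["pda-soft"],
}


def looks_valid(address):
    """Assumed validity check: exactly one "@" and a dotted domain."""
    return re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", address) is not None


def resurrect(line, known_domains=KNOWN_DOMAINS):
    """Try to rebuild an address whose "@" was simply removed."""
    # Find a line that starts with "From:" and contains "<".
    if not line.lstrip().startswith("From:"):
        return None
    start = line.find("<")
    if start == -1:
        return None
    # Drop "<" and everything to its left, then a trailing ">".
    rest = line[start + 1:].strip()
    if rest.endswith(">"):
        rest = rest[:-1]
    # Split off the suffix at the rightmost ".".
    head, dot, suffix = rest.rpartition(".")
    if not dot:
        return None
    # Look for a known domain that is a right-aligned substring of the rest.
    # Longer domain names are tried first.
    for domain in sorted(known_domains.get(suffix, []), key=len, reverse=True):
        if head.endswith(domain) and len(head) > len(domain):
            candidate = head[:-len(domain)] + "@" + domain + "." + suffix
            if looks_valid(candidate):
                return candidate
    return None


print(resurrect('From: "Enrico Caruso" <enricocarusosopranoharem.com>'))
```

With the example line from above, this prints the address with the "@" back in place. A real harvester would presumably use a much larger domain list and an actual deliverability check.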

Of course we could put more effort into obfuscating the address. But if we
did so, more of what should remain untouched would change, too. Apart from
that, people who write E-Mail address harvesting software are paid for
putting more effort into outsmarting us, and more often than not they
succeed.

In my opinion the safest way is to find EVERY E-Mail address and simply
replace it with a fixed address that is... well... not what spammers
really want. For example

	death_to_mail_address_harvesting_software_programmers at fbi.gov

Provided, of course, the American government doesn't have a problem with it
and is willing to establish this E-Mail account. Alternatively one could use
something like

	what_a_pity_you_did_not_find_an_address_here
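A minimal sketch of this replacement, assuming Python and a deliberately simplified address pattern (it does not try to cover everything the mail RFCs allow). The names PLACEHOLDER, ADDRESS and scrub are invented for this example. Because the pattern needs name characters directly on both sides of the "@", a stand-alone @ in ordinary text is left alone.

```python
import re

# The fixed replacement text suggested above.
PLACEHOLDER = "what_a_pity_you_did_not_find_an_address_here"

# Assumed, simplified shape of an address: name, "@", dotted domain.
ADDRESS = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def scrub(text):
    """Replace every match of the address pattern with the placeholder."""
    return ADDRESS.sub(PLACEHOLDER, text)


sample = 'From: "Enrico Caruso" <enricocaruso@sopranoharem.com>\n3 tickets @ 5 each'
print(scrub(sample))
```

The second line of the sample stays untouched, since its @ has a space on either side.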

Frank

-- Newton software and hardware at http://www.pda-soft.de




