Suspicious Blog Activity – any advice?

May 15th, 2012

I’ve been noticing a number of odd things happening surrounding my blog lately, and I thought it’s about time to figure out what’s going on and how to stop it.

The first problem is that people are illegally copying my posts, probably using RSS scraping, and putting them up on their own ad-infested sites. It is trivial to find them by searching Google for any somewhat unique word or phrase from one of my posts. Lately one of them, linux-support.com, has actually been sending me pingbacks announcing the fact that they’ve scraped me! Most of these sites seem to be nothing but content farms for selling ad impressions, and almost none of them identify their owners by name.

(There is an exception: I have specifically set up sites like Planet Debian and Goodreads to copy my blog posts.)

I’m obviously an advocate of open content, but I do not feel it right that others should be profiting by putting photos and stories about Free Software, or photos of my family, on their ad farms. While I release a great deal of content under GPL or Creative Commons licenses, I have never done so with my blog – an intentional decision.

What should I do about this? Is it worth fighting a battle over, or is it about as useless as trying to block every spam follower on my Twitter account?

So that’s the first weird thing. The second weird thing just started within the last few weeks. I have been getting a surprising amount (a few a week) of email addressed to me. It does not bear the appearance of being 100% automated spam, though it is possible that it is. It’s taken a few forms:

  • Someone wanting to buy an ad on my blog
  • Someone wanting to send me a story hyping their product (and intending me to pretend that I wrote the story)
  • Someone wanting me to write a story about their website and link to it

The profit motive in all of these is high, and in at least the second and third, so is the sleaze factor.

I’ve gotten two emails lately of this form:

Hi John,

I am curious if you are the administrator for this site: changelog.complete.org/archives/174-house-outlaws-fast-forwarding-senate-pres-next

I am a researcher / writer involved with a new project whose mission it is to provide accurate and useful information for those interested in the practice of law, whether as a lawyer or paralegal. I recently produced an article detailing the complex relationship between law and technology and the legal implications on personal privacy and free speech. I would love to share this resource with those who might find it useful and am curious of you are the correct person to contact about such a request?

Thank you!

All my best,

The details vary – the URLs appear to be random (the one cited above was little more than a link to an article), and the topics the websites claim to discuss range from law to schizophrenia (that one actually came with a link to the site, which again seemed to be a content farm). I am slightly tempted to reply to one of these and ask where the heck people are getting my name. It seems as if somebody has put me on a mailing list of sleazebag bloggers that they sell.

Frankly, I am puzzled at this attention. I guess I haven’t checked, but I can’t imagine that my blog has anything even remotely resembling a high PageRank or anything else. It’s not high-traffic, not Slashdot, etc. Either people are desperate, naive, failing to be selective, or maybe working some scam on me that I don’t know yet.

In any case, I’m interested if others have seen this, or any advice you might have.

Categories: Online Life


17 Comments

  1. Shae Erisson

    Perhaps your blog has a high page rank, and the spammers want to exploit that?
    Perhaps the spammers are hoping for a reply to see if your email is real and read?
    I’m not sure what else might be going on here.


    John Goerzen Reply:

    I’m guessing it’s the page rank, but hard to know…


  2. Ingo

    Recently I found some of my blog post content (not the design) copied onto websites unknown to me, without my permission and without anyone even asking me.
    Like yours, my blog has very modest traffic (about 8000 views per year).


  3. kevix

    http://www.educause.edu/blog/crevier/blockdisplayofyourimagesonothe/164959
    that might be one minor way.


  4. Flameeyes

    I’d suggest you just set up ModSecurity with proper antispam rules. I have my own set published at https://github.com/Flameeyes/modsec-flameeyes and it solves a lot of issues with scraping and spammers.


  5. Anonymous

    If you can choke back the bad taste it’ll leave you with, DMCA notices work quite well to eliminate copies of your content. Send them to the hosting company, not just to the content farm.


  6. Steve

    You could go the route of trying to get content removed from sites, but ultimately that is going to end up in a game of whack-a-mole. I think you would be better off introducing technology to prevent the web scraping in the first place. I’m not an expert in this and don’t have specific suggestions. You will likely have to balance that with the ease of content availability to your legit readers (not unlike the captcha form that I am going to have to fill out to post this message). As to understanding the rationality of this activity and of the email messages, that is probably futile. One thing I learned from this business is to never underestimate the stupidity of spammers…


    John Goerzen Reply:

    That (not underestimating the stupidity of spammers) is sage advice. I’m going to have to see what Flameeyes is doing with mod_security here too.


  7. nobody

    Check your logfiles; maybe you’ll find an overlap between the IPs accessing your RSS feed and the IPs or IP ranges of the content farmers … just block them (or even better: deliver the most useless content you can imagine).


  8. Alan Knowles

    I noticed this a while back with my content. In the end I just added a line at the top of each article in the RSS feed: a link pointing to the original page, saying that this article was originally published here…

    If I was really smart I would probably keyword link spam in it, and beat the spammers at their own game….


  9. Justin Dugger

    I figure there’s not much you can do to stop the copying if you want to keep publishing RSS. So instead, you just abuse it. How many of those copy farms are altering your links? By simply referring to previous posts in your new posts, you might end up with a dozen copycat sites linking to yours.

    We did get an odd spam the other day. It claimed to be someone trying to get off google’s spam list, and pointed out that one of our hosted project’s trac was full of spam links. But a) it didn’t link to them and b) the author’s homepage was clearly spam. I guess it worked from a ‘get clicks’ perspective but wtf.


  10. me

    I really like @nobody’s solution. If you find an IP overlap, you could implement a system where those IPs get tons of p0rrn and other non-safe content when they access your website. This way, they will probably be excluded from most Google (and other) searches by the “safety filter”. This will reduce their visibility on the Internet, totally making their plans fail :)


    John Goerzen Reply:

    It is a cool idea, but the problem is that it isn’t a “fire and forget” sort of solution. If I have to keep putting time into it every so often, it’s not going to work out too well for me.


  11. Andreas

    Lacking expertise in licensing issues, I would guess that if you do not license your blog posts at all, the default is that they are in the public domain, so everyone can legally do whatever they like with your content.


  12. ZERUCH

    Ironically, I found this blog post while looking into the very thing it’s about. I got a similar email just today and I found it…not right.

    Thanks for helping confirm my suspicions.


  13. Tom (amindfv)

    @Andreas: that’s not right. For all creative works in the U.S. the default (if you write no notice) is copyright ownership by the author.


  14. J S

    You received the requests because your blog is linked into Planet Debian, which has a high ranking itself and offers no way to post links there directly.

    I did some backlink checks on Goodreads and found Planet Debian is one of their higher ranked backlinks due to a couple of posts, like yours.

    (I’m an indie fiction author who uses Linux and thought I’d follow the rabbit hole a little).

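
The log check suggested by “nobody” in the comments above could be sketched roughly as follows. This is only an illustration under several assumptions: the web server writes Apache-style “combined” access logs, the feed lives under a path beginning with /feed, and you have already worked out some suspect address ranges by other means (for example, by resolving the scraper sites’ hostnames). The function names and the /feed prefix are made up for this sketch.

```python
import ipaddress
import re
from collections import Counter

# Assumption: "combined" log format, i.e. the client IP comes first and the
# request line is the first quoted field, e.g.
#   203.0.113.7 - - [15/May/2012:10:00:00 +0000] "GET /feed/ HTTP/1.1" 200 ...
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|HEAD) (\S+)')


def feed_clients(lines, feed_prefix="/feed"):
    """Count requests per client IP for paths starting with feed_prefix."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(feed_prefix):
            counts[match.group(1)] += 1
    return counts


def overlap(counts, suspect_networks):
    """Return {ip: hits} for feed clients that fall inside a suspect network."""
    networks = [ipaddress.ip_network(net) for net in suspect_networks]
    hits = {}
    for ip, n in counts.items():
        try:
            addr = ipaddress.ip_address(ip)
        except ValueError:
            continue  # the log held a hostname or a malformed entry
        if any(addr in net for net in networks):
            hits[ip] = n
    return hits
```

To use it, pass an open log file (or any iterable of lines) to feed_clients and hand the result to overlap together with the suspect ranges in CIDR notation. An empty result proves little, since a scraper may fetch the feed from different addresses than the ones that host its site.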
