Dec 8, 2009

Feed Scrapers: Guest Post By Dave Townsend

Permission. It's not that long of a word, only 10 letters. My 4-year-old daughter understands the word, but unfortunately it's completely unfamiliar to many people online. Who am I talking about? Feed scrapers. If you've been blogging for any length of time, you've probably found your blog somewhere you didn't expect it to be. Maybe you wondered how that happened, why it happened, or who did it. It's an insidious problem that just seems to be getting worse. Over the last three months I've found my blog (The Home Garden) snuggled into strange URLs in strange places.





Here's how it happens. A blogger sets up a feed for their blog so that subscribers can have the convenience of reading their material in a consolidated feed reader at home. Some choose to use email subscriptions, but most blogs have both options readily available. Feeds are a nice, convenient way of providing your material, but this is where feed scrapers practice their abuse. Scrapers subscribe to the blog and use a bot or a spider (a bit of fancy computer coding that grabs the text and pictures of the blog) to harvest the unsuspecting blogger's posts, then regurgitate them on another website. It is extremely hard to stop them through a feed service, since the subscriptions may not reveal the identity of the website taking the content. Scrapers then use the stolen material as ready-made, text-rich content perfect for search engines. Once the search engines index the site, the scrapers can make money through ads posted on it. Ads aren't the only motive, though: at least two of the last four scrapers I have seen are attempting to build the PageRank of the site, presumably to resell it later to a high bidder.


An unsuspecting blogger won't even notice feed scrapers. They accomplish their thievery behind the scenes without asking permission and can easily get away with it unless you add some protections to your blog. Nothing is foolproof, but there are a couple of ways you can find a feed scraper.
  • The first way is to regularly check your links by using link:www.yourblog.com in a Google search. This will show you anyone who is linking to your URL and will only indicate a feed scraper if they accidentally left a link to your blog somewhere in the original post. Usually they just remove the links.
  • You can also copy a random section of a post and paste it into a search engine with quotes around the phrase. The quotes tell the engine to look for that exact phrase, so it will match your text against any site that has been indexed.
  • The next way I'm about to tell you works well but takes a few more steps to implement. First go into your blog and insert a post feed footer of some kind that contains text and a link to your blog. Something like "This post was written by ________ for the blog www.yourblog.com Copyright 2009" can work. The more unique you make it the better. Then sign up for Google Alerts and copy your whole post footer to use as the search term. Anytime the text of your footer is found by Google you should see an alert appear in your inbox. If it's your blog, ignore it. If it's not your blog, it's time to investigate.
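
For anyone comfortable with a little scripting, the second and third checks above can be roughed out in Python. This is only a sketch under my own assumptions: the function names are made up for illustration, the search URL is just the ordinary Google search address with the phrase in the q parameter, and you would still have to fetch and read the suspect page yourself.

```python
import re
from urllib.parse import quote_plus


def exact_phrase_query(post_text, start_word=0, num_words=10):
    """Take a run of words from a post and wrap it in double quotes."""
    words = post_text.split()
    phrase = " ".join(words[start_word:start_word + num_words])
    return f'"{phrase}"'


def search_url(query):
    """Build a search URL for the query (assumes Google's usual ?q= format)."""
    return "https://www.google.com/search?q=" + quote_plus(query)


def contains_footer(page_text, footer):
    """True if the footer text shows up in the page, ignoring case and spacing."""
    def normalize(s):
        return re.sub(r"\s+", " ", s).strip().lower()
    return normalize(footer) in normalize(page_text)
```

Paste the result of search_url(exact_phrase_query(your_post)) into a browser to run the exact-phrase search, and use contains_footer on the text of any suspicious page to see whether your footer survived the scrape.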

No method is foolproof, and many of these feed scrapers will do everything they can to make your blog nondescript by removing links and attribution. Recently I began watermarking my photos with the URL of my blog to ensure that whoever is looking at my pictures knows where they originated. One feed scraper removed all the links from my blog and posted Mr. Gardener as the author. (It was very disturbing to find pictures of a family vacation with my children in them on someone else's site.) If by some chance the scrapers leave the links intact, you may benefit down the road from the extra links coming to your site, but it is still theft. Copyright has no meaning where they are concerned.

Once you find them, what then? Prepare for battle. It's not an easy thing to get yourself removed, and it's even harder if they are in another country. Some countries recognize copyright law while others don't, and in those cases it may be extremely hard to do anything. The first step that most people take is to contact the scraper and ask for removal - that has never worked for me. The first scraper I got myself removed from ignored my repeated attempts to reach them through their website's contact form. Then I moved to commenting on my stolen posts, but of course the comments were all moderated and never appeared on the site. Finally I looked up the Whois information and was fortunate to find a name and email listed as a contact. (Whois is simply a public record of who owns a domain, and you can look it up through many Whois finders. I bought my domain through GoDaddy, which provides a Whois search; many places do.) I contacted the email address and soon my blog was no longer in use. Today the whole site has been parked. It's very likely that you will run into a roadblock called a private domain registration, and then you will have to find out the web host for the site. Contact the web host, explain the situation, and ask them how to proceed. Most should contact the site owner for you or at least forward your email. The second feed scraper I got myself removed from was in Australia and had a privacy service on its Whois listing, so I had to contact the host.
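
If you would rather script the Whois lookup than use a web form, here is a rough Python sketch. Whois is a plain-text protocol on TCP port 43, so a few lines of socket code are enough. The assumptions are mine: the default server below is the registry Whois server for .com and .net domains only, the field labels ("Registrar", "Name Server") are typical but vary between registries, and the function names are invented for this example. Name servers are only a hint at who hosts a site, not proof.

```python
import re
import socket


def whois_query(domain, server="whois.verisign-grs.com", timeout=10):
    """Send a Whois request over TCP port 43 and return the text reply.

    The default server is an assumption that only suits .com/.net domains.
    """
    with socket.create_connection((server, 43), timeout=timeout) as sock:
        sock.sendall((domain + "\r\n").encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def summarize_whois(text):
    """Pull out the fields most useful when asking for a takedown."""
    text = text.replace("\r\n", "\n")

    def field(label):
        pattern = rf"^[ \t]*{re.escape(label)}:[ \t]*(.+?)[ \t]*$"
        return re.findall(pattern, text, flags=re.IGNORECASE | re.MULTILINE)

    emails = sorted(set(re.findall(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", text)))
    return {
        "registrar": field("Registrar"),
        "name_servers": [ns.lower() for ns in field("Name Server")],
        "emails": emails,
    }
```

Calling summarize_whois(whois_query("example.com")) returns the registrar, the name servers, and any email addresses in the record. When the listing is private, the registrar's abuse address and the name servers are usually the best leads for finding someone to contact.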

Lastly, if they don't respond to any of the aforementioned methods, you should write a Cease and Desist letter to send to the host. I've never had to take this step and hopefully won't have to. I suspect that most feed scrapers would rather concede to a lone blogger than risk blowing their whole feed scraping enterprise. I've also used the Google spam report through Webmaster Tools to report the feed scraping sites for stealing content. I can't verify its effectiveness, but I've reported two out of the four scrapers and both have been removed, so maybe it worked.

At some point you may be forced to prove you actually are the owner of the copyright, so it's a good idea to take screenshots of your blog and of the scraper's site where your stolen articles appear. Match the articles and save them so that you have them if you need them.
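
Screenshots are the main thing, but if you also save the text of both pages, a short Python script can put a number on how closely they match. This is a sketch with hypothetical function and field names, using the standard library's difflib to compare the two texts word by word. The similarity score is only supporting evidence, not a legal test.

```python
import datetime
import difflib
import hashlib


def similarity(original, suspect):
    """Return a 0..1 score of how closely two texts match, word by word."""
    a = original.lower().split()
    b = suspect.lower().split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def evidence_record(original_url, copy_url, original, suspect):
    """Bundle the details worth saving alongside your screenshots."""
    return {
        "recorded_on": datetime.date.today().isoformat(),
        "original_url": original_url,
        "copy_url": copy_url,
        "similarity": round(similarity(original, suspect), 3),
        # A hash lets you show later that the saved copy was not altered.
        "copy_sha256": hashlib.sha256(suspect.encode("utf-8")).hexdigest(),
    }
```

A score near 1.0 means the copy is essentially word for word. Save the record next to the screenshots so the dates and URLs stay together.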

I was fortunate to find a passionate and extremely helpful fighter of plagiarism in Jonathan Bailey, who contacted me through Twitter. He runs the site www.PlagiarismToday.com and gave me some great advice for dealing with these people. His site is filled with good information about combating plagiarism and is well worth your time to visit if you are concerned about your content being stolen.

I have one last piece of advice that may help you find scrapers: get involved in a community of bloggers who watch out for each other. I can't stress this enough. When your friends see your posts somewhere they'll be happy to let you know. There is always another scraper around the corner, so be watchful, be wary, and don't give up the fight!

Dave Townsend is an avid gardener, stay-at-home dad, and garden blogger (www.GrowingTheHomeGarden.com). He's appeared on Better Homes and Gardens (BHG.com), talks occasionally on local radio, and is active in the local garden club. On his blog he discusses vegetables, plant propagation, and pretty much anything garden related!


Related Post: Feed Scrapers II

11 comments:

  1. Dave,

    Thanks for writing such a useful post! It seems like these blog scrapers are especially active this time of the year.

  2. I hope everyone benefits from it and thanks for the invitation to guest blog. 2 Scrapers down for me, 2 to go!

  3. Dave or MBT, you might want to tell folks how to find the webhost URL when looking up the Whois info. Most of these sites have no way of contacting them so looking them up via Whois is about the only way to go. Thanks for the help!

  4. Good info. Thanks for posting it!

    Two comments:

    1. Many sites (article directories) provide free content for republishing, on condition that original author is credited and links are retained. This has created and fueled the misperception that content on the Internet is free. Some "republishers" might not even realize they're violating copyrights. Yes, probably, most know.

    2. Many hosts have a posted policy and procedure for reporting copyright violations. A typical example is Google's "Blogger" policy, available here: http://www.google.com/blogger_dmca.html I know from experience that this works, but it's a rather slow process.

  5. Dave

    Informative stuff as ever.

    Thanks for this, useful.

    All the best and cheers for pointing me to this great blog.

    Rob

  6. Jean,

    That is good info to add isn't it! There are probably many sites that do this but one is www.whoishostingthis.com. There are some pop ups but otherwise I think it's free.

    Chuck,

    I don't mind these sites so much if they get permission. The problem is that some of them (even the "good" ones) repost despite explicit statements from the bloggers not to. I know in my case I stated "not to be reused or reblogged without permission" on every feed. You're right that the more docile ones definitely cause confusion. When they choose not to reply to you that puts them in a whole new category for me.
    That's good info on the DMCA stuff. I looked into it but never had to go that far, at least not yet.

    Thanks Rob!

  7. Excellent info and quite a bit was new to me even though I've had a few scrapes myself!

    I did manage to get myself removed from one of these, just by emailing. However, the scraper first of all said he was doing me a favour and driving traffic to my blog by scraping my feed, the cheek :o

  8. Thanks, Dave, that's very comprehensive info. No one scraping me yet, but I'll keep on the lookout. Need to find an easy way to watermark photos. Perhaps I'll actually have to break down and get Photoshop.

  9. VP,

    I've heard the same excuse. Doesn't quite make up for the theft part to me!

    Helen,

    I just use the text option in Picasa. The only problem is I have to remember to do it. It's not a habit yet.

  10. I'm sick and tired of blog scraping.

    If you feel the same way, perhaps you would like to join us in trying to combat the problem at http://bit.ly/69UMW8

    We're trying to expose the cheat. Actually, we're tricking them into exposing themselves! Could be fun...

  11. That's funny, I'm always amused when they end up publishing posts that are negative. Sure way to know if anyone at those scraper sites is actually reading what you wrote.


