You may not know it, but criminals and thieves are waiting for you to write your next blog post. They’re waiting to scrape your feed and weave it into their spam splogs, hoping either to build links for illicit sites or to make a quick buck doing nothing on Made for Adsense, or MFA, sites. I’ll show you how to beat them using your already existing .htaccess file. Whether you’ve been blogging for a long time or a short one, you should be able to spot spammers a mile away. If not – I’m going to help you figure it out.
There are three ways I check for spam. The first is the “Incoming Links” panel in my WordPress dashboard. If you don’t use WordPress, that’s ok – all you have to do is go to Google Blog Search and do a search for “link:www.yoursitename.com”. That’s where WordPress gets its results.

You can’t always tell what’s spam just from the titles, but in this case – I think the first result is plainly spam. Here’s what I saw when I clicked on the link…

The splog is a scraper of the worst kind – the kind that steals entire posts, images and all. This post contains a link to my blog, but it’s not actually a post of mine at all. It was written by Alan over at Affiliate Confession, and he just happened to link back to my site. You can find lots of spammers and scrapers by looking at who’s linking to your site.
The second way I check for scrapers leeching my RSS feed is through the comments I receive. Sure, you can receive spammy comments on your blog, but scrapers also leave spammy “trackbacks”, just like the droppings of a foul animal. I can hear a few people in the background asking, “what’s a trackback?” A trackback is kind of like an “auto-comment”. Normally a blogger (or his blog software) sends out a “ping”, manually or automatically, to every URL linked in his latest post, and the trackback comes into the linked blog as a comment that its owner has to approve. A scraper site steals your content using your RSS feed or directly from your page, and then attempts to communicate with your blog by sending one of those trackback pings.

In the example above, I actually received three trackbacks overnight that I caught this morning – which prompted me to write this article. The first was obviously a spam trackback without even clicking, given the inappropriate keywords in the link title. The third was from a post I recognized – I had commented on it myself yesterday – so I knew that one was good. The second one I had to click on to see whether it was spam or not.

Once I clicked on it, at first I didn’t think it was my content at all. You can see above that the first paragraph is about sports. But the second paragraph (and the rest of the page) was clearly from one of my posts. You can see in the earlier trackback example image that after each trackback is a link to the page that was “tracked back to” (stolen). The content stolen on this splog was from my How to Earn Money Using Affiliate eBay WordPress plugin BayRSS post.
So – why was the first paragraph not my content? Look at it: that first paragraph has a link to auto insurance, and further down in the content (not pictured) are links to home equity loans and business hosting. This is what they call an “auto-blog” or “re-blog” that “spins content”. It will mix your original posts (links and images included), “spin” them together with some real content it gets from another source (to make its page look like “original content”), and then insert its money links at key points. The spammer makes money either from link building (as in this example) or from Adsense (as in the earlier example).
The third and last way I check for scrapers stealing my blog content is by using Google Alerts. It basically works the same as the Google Blog Search query earlier, except it can search a bit more AND it sends you automatic daily emails with the results. All you have to do is set a Google Alert for the name of your domain. I set one for “www.jtpratt.com”. You can set your own at Google Alerts.

The example above is a Google Alert I got last weekend. It looks like it’s just an alert for my own post, until you look at the URL below it (videositemap). I know this is spam without even clicking.
So now we’ve identified spam and some scrapers stealing our content. What do we do about it? I’m going to show you how to use your .htaccess file. Read this post about .htaccess at Plagiarism Today. It explains what an .htaccess file is, and how to use it to prevent people from stealing not only your content, but also your images and files (and your bandwidth). If you use WordPress (or other blogging software), chances are you already have an .htaccess file in the root of your site, because that’s how your blog turns ugly URLs like ?p=382 into pretty permalinks like /my-post-about-dogs. All you have to do is add some additional code to that file telling your web server whom to let in, and whom to throw out! If scrapers can’t get to your content, they can’t scrape it!
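As an aside, here’s what the WordPress-generated portion of a default .htaccess typically looks like, so you can recognize it when you open the file (yours may differ slightly depending on your install path):

```apache
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
```

These are the rewrite rules that hand every “pretty” URL off to WordPress – leave them alone and add your blocking code around them.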
There are many, many ways to block, redirect, and stop scrapers by putting code in your .htaccess file, but I prefer the method in that article…
order allow,deny
deny from xxx.xxx.xxx.xxx
allow from all
Now, the x’s above need to be turned into numbers (you can use multiple “deny from” lines). The numbers are the IP address of the server you want to turn away. So – we need to find the IP addresses of the two scraper splogs we found earlier. There are many ways to do this; I do mine on the command line using “nslookup”, but you can use a web-based tool, like the free one from zoneedit.com. Just enter the domain you want to look up. I found the IP of videositemap.com is 70.87.226.18, and the one for fantasyfootballpassport.com is 216.139.234.32. Now I update the code to add to my .htaccess file with that information, like this…
order allow,deny
deny from 70.87.226.18
deny from 216.139.234.32
allow from all
I added that code to the top of my .htaccess file before anything else, uploaded it back to my web site root, and then visited my blog in both Firefox and IE to make sure it loaded properly. Now those two scrapers won’t be getting to my content again. While I prefer to block the scrapers I know regularly come to my blog, you can be proactive and use block lists of known spam and scraper sites to prevent plagiarism before it happens.
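If your blocklist grows beyond a couple of entries, it’s easy to script the boilerplate. Here’s a minimal sketch in Python (purely illustrative – any scripting language will do) that turns a domain-to-IP mapping into the same deny block format used above. The two domain/IP pairs are the examples from this article; in practice you’d resolve your own list with nslookup or a web-based lookup tool first.

```python
# Turn a list of scraper servers into an .htaccess deny block.
# The two domain/IP pairs are the examples from this article.
scrapers = {
    "videositemap.com": "70.87.226.18",
    "fantasyfootballpassport.com": "216.139.234.32",
}

lines = ["order allow,deny"]
for domain in sorted(scrapers):
    # Apache config doesn't allow trailing comments, so the note
    # naming the domain goes on its own line above each rule.
    lines.append(f"# {domain}")
    lines.append(f"deny from {scrapers[domain]}")
lines.append("allow from all")

print("\n".join(lines))
```

Paste the output at the top of your .htaccess file, just as described above.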
Some site owners and bloggers prefer to block “user agents” instead of IP addresses, because IPs (once found out) can be changed. This is a little different, because you need access to your server’s “raw access log” to search for bad user agents crawling your feed or content. What is a “user-agent”? Simple: when you visit a site, it may record that your “user-agent” is a particular version of Firefox or Internet Explorer. Google’s search crawler comes in on the user-agent “Googlebot”. Nefarious scraper robots and indexers have known names too, and you can block them by their user-agent name instead of their IP address.
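As a sketch, user-agent blocking in .htaccess looks like this. The bot names here are just common examples from published blacklists – build your own list from what you actually find in your raw access log:

```apache
# Tag requests whose User-Agent matches a known bad bot...
SetEnvIfNoCase User-Agent "HTTrack" bad_bot
SetEnvIfNoCase User-Agent "WebCopier" bad_bot
SetEnvIfNoCase User-Agent "EmailSiphon" bad_bot
# ...then deny everything that got tagged
order allow,deny
allow from all
deny from env=bad_bot
```

The SetEnvIfNoCase matches are case-insensitive substring checks against the User-Agent header, and “deny from env=bad_bot” rejects any request that was tagged.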
As I said, I am blocking scrapers using the method I just showed you, but there are many other ways to do it – all from your .htaccess file. Here is a list of resources you can check out for more information if you’re interested.
Fighting Scrapers and Splogs Resource List
How to Block Bots, Ban IP Addresses with .htaccess
.htaccess – Blocking IP Addresses, Robots, and Offline Browsers
Blocking Bad Bots and Site Rippers (Offline Browsers)
Ultimate .htaccess Blacklist 2
Joe Maller .htaccess blacklist
How You Can Stop Dirty Feed Scrapers in 3 Easy Steps
Block Website Content Thieves, Proxy Services & Exploited Servers, with this Apache Server “.htaccess” Blocklist
As always, if you have something to add to make this article better, or a question – please comment now!

I’d just been checking out a splog that hit RR yesterday, so your article is (once again) very timely! How d’ya do that? ;)
I’ve now signed up for Google Alerts on my domains – thanks for the tip – and I’m now going to take a peek at this .htaccess file coding – but I’m shaking already! =) tee hee…
Great info! I thought scrapers and sploggers were just one of the things you had to put up with, like the weather. I didn’t realise .htaccess was so useful, and the plagiarismtoday.com link is a ripper.
thanks! =)
@Layne – glad to help once again! =)
@James – giving you great resources is what I’m all about… =)
Nice article… will try this one.
Hello sir, thanks for this post. I have been looking for this tip for a couple of days and finally found it. However, can you please tell me the exact place to put that code? Here’s my .htaccess file.
I hope you can give a clear answer so that I can directly copy it into my .htaccess. Thanks, and sorry for my bad English. I hope you understand what I meant. :)
You can put it wherever you like, before or after your WordPress code.
Thank you for posting this information. I just had this happen to me and didn’t know what to do. I put a post on my blog to let my visitors know how to read your information on what to do in this situation. Thanks again :)
Thank you for this information.
I use a bot trap on a site; it’s blocking most of the spammy bots, but some bots are still getting onto the site. I hope it will be better now.
If you are using a Windows server, you will have to use PHP or something else to block the IP address.
Does anyone know how to block bots by user agent in PHP?
Thanks for sharing these tips. This is very helpful for avoiding scrapers and sploggers.
People who steal content are doing themselves no favors whatsoever. The search engines are getting WAY too smart to get past with duplicated content.