You may not know it, but criminals and thieves are waiting for you to write your next blog post. They’re waiting so they can scrape your feed and weave it into their spam blog (“splog”), hoping to either build links for illicit sites or make a quick buck doing nothing on Made for AdSense (MFA) sites. I’ll show you how to beat them using the .htaccess file you already have. Whether you’ve been blogging for a long or short time, you should be able to spot spammers a mile away. If not, I’m going to help you figure it out.
I have 3 ways to check for spam. The first is the “incoming links” box in my WordPress dashboard. If you don’t use WordPress, that’s OK: all you have to do is go to Google Blogsearch and search for “link:www.yoursitename.com”. That’s where WordPress gets its results.
You can’t always tell what’s spam just from the titles, but in this case – I think the first result is plainly spam. Here’s what I saw when I clicked on the link…
The splog is a scraper of the worst kind: the kind that steals your entire posts (images and all). This post contains a link to my blog, but it’s not actually a post of mine at all. It was written by Alan over at Affiliate Confession, and he just happened to link back to my site. You can find lots of spammers and scrapers by looking at who’s linking to your site.
The second way I check for scrapers leeching my RSS feed is by the comments I receive. Sure, you can receive spammy comments on your blog, but scrapers also leave spammy “trackbacks”, just like the droppings from a foul animal. I can hear a few people in the background asking “what’s a trackback?” A trackback is kind of like an “auto-comment”. Normally a blogger (or his blog software) sends out a “ping”, manually or automatically, to every URL linked in the latest post. A scraper site steals your content using your RSS feed or directly from your page, and then it too sends a “ping” to your blog for a trackback. Either way, the trackback comes into your blog as a comment that you have to approve.
In the example above, I received 3 trackbacks during the night, which I caught this morning and which prompted me to write this article. The first one was obviously spam without even clicking, thanks to the inappropriate keywords in the link title. The third came from a post I recognized, because I had commented on it myself yesterday, so I knew that one was good. The second one I had to click on to see whether it was spam or not.
Once I clicked on it, at first I didn’t think it was my content at all. You can see above that the first paragraph is about sports. But the second paragraph (and the rest of the page) was clearly from one of my posts. You can see in the earlier trackback example image that each trackback is followed by a link to the page that was “tracked back to” (stolen). The content stolen by this splog was from my post “How to Earn Money Using Affiliate eBay WordPress plugin BayRSS”.
So why was the first paragraph not my content? Look at it: that first paragraph has a link to auto insurance, and further down in the content (not pictured) are links to home equity loans and business hosting. This is what they call an “auto-blog” or “re-blog” that “spins content”. It takes your original posts (links, images and all), “spins” them together with some real content it gets from another source (to make its page look like “original content”), and then inserts the links it cares about at key points. The spammer makes money either from link building (as in this example) or from AdSense (as in the earlier example).
The third and last way I check for scrapers stealing my blog content is by using Google Alerts. It basically works the same as the Google Blogsearch check from earlier, except it can search a bit more AND it automatically emails you the results daily. All you have to do is set a Google Alert for the name of your domain. I set one for “www.jtpratt.com”. You can set your own at Google Alerts.
The example above is a Google Alert I got last weekend. It looks like just an alert for my own post, until you look at the URL below it (videositemap). I know this is spam without even clicking.
So now we’ve identified spam and some scrapers stealing our content. What to do about it? I’m going to show you how to use your .htaccess file. Read this post about .htaccess at Plagiarism Today. It explains what an .htaccess file is and how to use it to stop people from stealing not only your content, but also your images and files (and your bandwidth). If you use WordPress (or other blogging software), chances are you already have an .htaccess file in the root of your site, because that’s how your blog turns URLs like ?p=382 into pretty permalinks like /my-post-about-dogs. All you have to do is add some code to that file telling your web server who to let in and who to throw out! If scrapers can’t get to your content, they can’t scrape it!
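For reference, the rewrite block that WordPress usually writes into .htaccess when pretty permalinks are turned on looks roughly like the sketch below. This assumes WordPress is installed in the site root; a subdirectory install or a plugin may produce something different, so treat it as an illustration rather than a copy of any particular file.

```apache
# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
# Requests for index.php itself are left alone.
RewriteRule ^index\.php$ - [L]
# Anything that is not a real file or directory on disk...
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# ...is handed to index.php, which looks up the matching post.
RewriteRule . /index.php [L]
</IfModule>
# END WordPress
```

Any blocking rules you add are separate from this block and are best kept outside the BEGIN/END markers, since WordPress may rewrite whatever sits between them.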
There are many, many ways to block, redirect, and stop scrapers by putting code in your .htaccess file, but I prefer the method in that article…
order allow,deny
deny from xxx.xxx.xxx.xxx
allow from all
Now, the x’s above need to be replaced with numbers (you can use multiple “deny from” lines). The number needs to be the IP address of the server you want to turn away. So we need to find the IP addresses of the 2 scraper splogs we found earlier. There are many ways to do this. I do mine on the command line using “nslookup”, but you can use a free web-based tool like the one from zoneedit.com. Just enter the domain you want to look up. I found that the IP of videositemap.com is 220.127.116.11. Then I needed the one for fantasyfootballpassport.com, which is 18.104.22.168. Now I update the code I’m adding to my .htaccess file with that information, like this…
order allow,deny
deny from 220.127.116.11
deny from 18.104.22.168
allow from all
I added that code to the top of my .htaccess file before anything else, uploaded it back to my web site root, and then visited my blog in both Firefox and IE to make sure it loaded properly. Now those 2 scrapers won’t be getting to my content again. While I prefer to block the scrapers I know regularly come to my blog, you can be proactive and use block lists of known spam and scraper sites to prevent plagiarism before it happens.
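One caveat: the order/deny/allow directives come from Apache 2.2. On Apache 2.4 they only work if the mod_access_compat module is loaded, and the documented replacement is the Require directive. If your host runs 2.4, a rough equivalent might look like this. The addresses are placeholders from ranges reserved for documentation, not real scraper IPs.

```apache
# Apache 2.4 sketch: let everyone in except the listed addresses.
# 192.0.2.10 and 198.51.100.20 are placeholder IPs (assumptions, not real scrapers).
<RequireAll>
    Require all granted
    Require not ip 192.0.2.10
    Require not ip 198.51.100.20
</RequireAll>
```

As with the 2.2 version, load your site afterwards to check nothing broke. A syntax error in .htaccess usually shows up as a 500 error on every page.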
Some site owners and bloggers prefer to block “user agents” instead of IP addresses, because a scraper can change its IP once it figures out it has been blocked. This is a little different, because you need access to the “raw access log” on your server to search for bad user agents crawling your feed or content. What is a “user-agent”? Simple: when you visit a site, your browser reports a “user-agent” such as a particular version of Firefox or Internet Explorer. Google’s search crawler identifies itself with the user-agent “Googlebot”. Nefarious scraper robots and indexers have known names, and you can block them by their user-agent name instead of their IP address.
As I said, I am blocking scrapers using the method I just showed you, but there are many other ways to do it, all from your .htaccess file. Here’s a list of resources you can check out if you’d like more information.
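To give a feel for what user-agent blocking can look like, here is a minimal sketch in the same Apache 2.2 style as the IP block above. It is not the method I use myself, and the two bot names are made up for illustration; swap in names you actually find in your own access log.

```apache
# Flag any request whose User-Agent contains one of these strings (case-insensitive).
# "ExampleScraperBot" and "FeedThief" are hypothetical names; use ones from your own logs.
SetEnvIfNoCase User-Agent "ExampleScraperBot" bad_bot
SetEnvIfNoCase User-Agent "FeedThief" bad_bot

# Let everyone in except requests flagged above.
order allow,deny
allow from all
deny from env=bad_bot
```

If you also block by IP, keep a single order/allow section and list the “deny from” lines for both the IPs and env=bad_bot inside it. Keep in mind that a user-agent is just a string the visitor sends, so a determined scraper can fake it.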
Fighting Scrapers and Splogs Resource List
How to Block Bots, Ban IP Addresses with .htaccess
.htaccess – Blocking IP Addresses, Robots, and Offline Browsers
Blocking Bad Bots and Site Rippers (Offline Browsers)
Ultimate .htaccess Blacklist 2
Joe Maller .htaccess blacklist
How You Can Stop Dirty Feed Scrapers in 3 Easy Steps
Block Website Content Thieves, Proxy Services & Exploited Servers, with this Apache Server “.htaccess” Blocklist
As always, if you have something to add to make this article better, or a question – please comment now!