How to Keep Feed Scrapers, Spammers, and Splogs From Stealing Your Content
You may not know it, but criminals and thieves are waiting for you to write your next blog post. They’re waiting so they can scrape your feed and weave it into their spam splog, hoping either to build links for illicit sites or to make quick bucks doing nothing on Made for AdSense (or MFA) sites. I’ll show you how to beat them using your already existing .htaccess file. Whether you’ve been blogging for a long or short time you should be able to spot spammers a mile away. If not - I’m going to help you figure it out.
There are 3 ways I check for spam. The first is in my WordPress dashboard under “incoming links”. If you don’t use WordPress, that’s OK - all you have to do is go to Google Blogsearch and do a search for “link:www.yoursitename.com”. That’s where WordPress gets its results.

You can’t always tell what’s spam just from the titles, but in this case - I think the first result is plainly spam. Here’s what I saw when I clicked on the link…

The splog is a scraper of the worst kind - the kind that steals your entire posts (images and all). This post contains a link to my blog, but actually it’s not a post of mine at all. This is a post written by Alan over at Affiliate Confession, and he just happened to link back to my site. You can find lots of spammers and scrapers by looking at who’s linking to your site.
The second way I check for scrapers leeching my RSS feed is by the comments I receive. Sure, you can receive spammy comments on your blog, but scrapers leave spammy “trackbacks” just like the droppings from a foul animal. I can hear a few people in the background asking “what’s a trackback?”. A trackback is kind of like an “auto-comment”. A scraper site steals your content using your RSS feed or directly from your page, and then it attempts to communicate with your blog by sending a “ping” for a trackback. Normally a blogger (or his blog software) might send out a “ping” manually or automatically to every URL linked in the latest post. The trackback comes into your blog as a comment that you have to approve.

In the example above I actually received 3 trackbacks during the night, which I caught this morning - and which prompted me to write this article. The first one was obviously a spam trackback, without even clicking, from the inappropriate keywords in the link title. The third was a post I recognized commenting on myself yesterday, so I knew that one was good. The second one I had to click on to see if it was spam or not.

Once I clicked on it, at first I didn’t think it was my content at all. You can see above that the first paragraph is about sports. But the second paragraph (and the rest of the page) was clearly from one of my posts. You can see in the earlier trackback example image that after each trackback is a link to the page that was “tracked back to” (stolen). The content that was stolen on this splog was from my How to Earn Money Using Affiliate eBay Wordpress plugin BayRSS post.
So - why was the first paragraph not my content? Look at it: that first paragraph has a link to auto insurance, and further down in the content (not pictured) are links to home equity loans and business hosting. This is what they call an “auto-blog” or “re-blog” that “spins content”. It will mix your original posts (and links and images) and “spin” them together with some real content it gets from another source (to make its page “original content”) and then insert important links at key points. The spammer either makes money from link building (as in this example) or from AdSense (in the earlier example).
The third and last way I check for scrapers stealing my blog content is by using “Google Alerts”. It basically works the same as the Google Blogsearch earlier, except it can search a bit more AND it sends you automatic emails daily with the results. All you have to do is set a Google alert for the name of your domain. I set one for “www.jtpratt.com”. You can set your own at Google Alerts.

The example above is a Google alert I got last weekend. It looks like it’s just an alert for my own post, until you look at the URL below (videositemap). I know this is spam without even clicking.
So now we’ve identified spam and some scrapers stealing our content. What to do about it? I’m going to show you how to use your .htaccess file. Read this post about .htaccess at Plagiarism Today. It explains what an .htaccess file is, and how to use it to prevent people from stealing not only your content, but also your images and files (and your bandwidth). If you use WordPress (or other blogging software), chances are you already have an .htaccess file in the root of your site, because that’s how your blog changes page URLs like ?p=382 into pretty permalinks like /my-post-about-dogs. All you have to do is add some additional code to that file telling your web server who to let in, and who to throw out! If scrapers can’t get to your content, they can’t scrape it!
There are many, many ways to block, redirect, and stop scrapers by putting code in your .htaccess file, but I prefer the method in that article…

    order allow,deny
    deny from xxx.xxx.xxx.xxx
    allow from all

Now, the x’s above need to be turned into numbers (you can use multiple “deny from” lines). The number needs to be the IP address of the server you want to turn away. So - we need to find out the IP addresses of the 2 scraper splogs we found earlier. There are many ways to do this. I do mine on the command line using “nslookup” - however, you can use a web-based tool like the one from zoneedit.com for free. Just enter the domain you want to look up. I found the IP of videositemap.com is 70.87.226.18. Now I need to get the one for fantasyfootballpassport.com, which is 216.139.234.32. Now I update my code to add to my .htaccess file with that information like this…

    order allow,deny
    deny from 70.87.226.18
    deny from 216.139.234.32
    allow from all

I added that code to the top of my .htaccess file before anything else, uploaded it back to my web site root, and then visited my blog in both Firefox and IE to make sure it loaded properly. Now, those 2 scrapers won’t be getting to my content again. While I prefer to block the scrapers I know regularly come to my blog, you can be proactive and use block lists of known spam and scraper sites to prevent plagiarism before it happens.
Some site owners and bloggers prefer to block “user agents” instead of the IP addresses of computers, because IPs (once found out) can be changed. This is a little different, because you have to have access to the “raw access log” on your server to search for bad user agents crawling your feed or content. What is a “user-agent”? Simple: when you visit a site, your browser may say that your “user-agent” is a particular version of Firefox or Internet Explorer. Google’s search crawler comes in with the user-agent “Googlebot”. Nefarious scraper robots and indexers have known names, and you can block them by their user-agent name instead of their IP address.
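For reference, a user-agent block might look something like the sketch below. This is not the setup described in this post, which blocks by IP. It assumes an Apache server using the same 2.2-style order/allow/deny directives as the earlier example, with mod_setenvif enabled, and the names BadScraperBot and EvilFeedGrabber are made-up placeholders rather than real scrapers.

```apache
# Sketch only: assumes Apache 2.2-style access control with mod_setenvif enabled.
# "BadScraperBot" and "EvilFeedGrabber" are hypothetical placeholder names;
# substitute the user-agent strings you actually find in your raw access log.

# Tag any request whose User-Agent header matches (case-insensitively)
# by setting the environment variable bad_bot.
SetEnvIfNoCase User-Agent "BadScraperBot" bad_bot
SetEnvIfNoCase User-Agent "EvilFeedGrabber" bad_bot

# Let everyone in, then turn away any request tagged as bad_bot.
order allow,deny
allow from all
deny from env=bad_bot
```

If you already block by IP address, the “order” and “allow from all” lines only need to appear once; the “deny from env=bad_bot” line can sit alongside your existing “deny from” lines. Keep in mind that a scraper can change its user-agent string just as it can change its IP.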
As I said, I am blocking scrapers by IP address using the method I showed you, but there are many other ways to do it - all from your .htaccess file. I’m going to give you a list of resources you can check out to get more information if you’re interested.
Fighting Scrapers and Splogs Resource List
How to Block Bots, Ban IP Addresses with .htaccess
.htaccess - Blocking IP Addresses, Robots, and Offline Browsers
Blocking Bad Bots and Site Rippers (Offline Browsers)
Ultimate .htaccess Blacklist 2
Joe Maller .htaccess blacklist
How You Can Stop Dirty Feed Scrapers in 3 Easy Steps
Block Website Content Thieves, Proxy Services & Exploited Servers, with this Apache Server “.htaccess” Blocklist
As always, if you have something to add to make this article better, or a question - please comment now!
Tags: htaccess, scraper, spam, splog
























April 29th, 2008 at 7:15 am
I’d just been checking out a splog that hit RR yesterday, so your article is (once again) very timely! How d’ya do that?
I’ve now signed up for Google Alerts on my domains - thanks for the tip - and I’m now going to take a peek at this .htaccess file coding - but I’m shaking already!
tee hee…
April 29th, 2008 at 7:17 am
Great info! I thought scrapers and sploggers were just one of the things you had to put up with, like the weather. I didn’t realise .htaccess was so useful, and the plagiarismtoday.com link is a ripper.
thanks! =)
April 29th, 2008 at 8:01 am
@Layne - glad to help once again! =)
@James - giving you great resources is what I’m all about…
May 1st, 2008 at 5:52 pm
nice article.. will try this one