How to prevent web crawlers? It is enough to read this article.

WeTest Tencent Quality Open Platform (wetest.qq.com) is a one-stop game testing platform officially launched by Tencent Games. In the spirit of open and win-win, Tencent Games has been precipitated for more than ten years. After thousands of game tempered excellent testing programs and tools, it has been opened to game developers to help improve users' research and development efficiency and product quality.

Have you been harassed by reptiles? When you see the word "reptile", is it already a little bloody? Be patient, do something a little, you can let them win in name, and actually let them suffer.

First, why should anti-reptiles 1. Reptiles account for a higher proportion of total PV, so that money is wasted (especially March reptiles)

What is the concept of the crawler in March? In March of each year, we will meet a peak of reptiles.

At first we were puzzling. Until one time, in April, we deleted a url, and then a crawler continually crawled the url, causing a lot of errors, and the test started to trouble us. We had to deliberately release a site for the crawler and restore the deleted url back.

But at that time one of our team members said that they were very dissatisfied, saying that we can't kill the reptiles, and we have to publish them specifically for it. This is really too faceless. So an idea was made, saying: url can be on, but definitely not giving real data.

So we posted a static file. The error stopped and the reptile didn't stop, which means the other party didn't know that things were fake. This matter has given us a lot of inspiration and has directly become the core of our anti-reptile technology: change.

Later, a student came to apply for an internship. We looked at the resume and found that she had climbed Ctrip. Later, when I was interviewed, I confirmed that she was the guy who killed us in April. However, because it is a sister, the technology is not bad, and then we were recruited. It is now officially ready to join.

Later, when we discussed together, she mentioned that a large number of masters would choose to crawl OTA data and conduct public opinion analysis when writing a thesis. Because the papers were handed in May, so everyone read the books. You know, the various DotA and LOL in the early days are in March, and it’s too late. Grab the data quickly, analyze it in April, and submit the paper in May.

It is such a rhythm.

2. The resources that the company can check for free are taken away in batches and lose competitiveness, so that they make less money.

The price of the OTA can be queried directly in the non-login state. This is the bottom line. If you force a login, you can let the other party pay the price by blocking the account. This is also the practice of many websites. But we can't force the other party to log in. Then if there is no anti-reptile, the other party can copy our information in batches, and our competitiveness will be greatly reduced.

Competitors can catch our prices. After a long time, users will know that they only need to go to competitors. There is no need to come to Ctrip. This is not good for us.

3. Is the reptile suspected of breaking the law? If so, can I sue for compensation? This can make money.

I specifically consulted the law on this issue, and finally found that this is still a sideline in the country, that is, it may be successful, or it may be completely invalid. Therefore, it is still necessary to use technical means to make the final guarantee.

Second, what kind of reptiles 1. Very low-level graduating graduates

The March crawler we mentioned at the beginning is a very obvious example. The reptiles of recent graduates are usually simple and rude. Regardless of the server pressure and the unpredictable number of people, it is easy to hang the site.

By the way, it is no longer possible to get the offer by climbing Ctrip. Because we all know that the first person who said that a beautiful woman is like a flower is a genius. And the second one. . . Do you understand?

2. Very low-level startup small company

There are more and more startup companies now, and I don’t know who is being fooled. Then everyone started to find out what they are doing. I think big data is hot and I started to do big data.

The analysis program was almost written, and I found that I had no data at hand.

How to do? Write reptile crawling and there are countless small reptiles, constantly crawling data for the company's life and death.

3. I accidentally misunderstood the uncontrolled small reptile that no one stopped.

Comments on Ctrip may sometimes be as high as 60% of the visitor is a crawler. We have chosen to block it directly, they are still crawling tirelessly.

What does that mean? That is to say, they simply can't climb any data. Except that httpcode is 200, everything is wrong, but the crawler still doesn't stop. This is probably a small crawler hosted on some servers. It has been unclaimed. Working hard.

4. Formed commercial opponents

This is the biggest opponent. They have the skills, the money, what is there, and if you die with you, you can only bite the bullet and die.

5. Breathing search engine

Don't think that search engines are good people, they also have a time to vent, and a gust of wind will lead to server performance degradation, request volume is no different from cyber attacks.

three. What are reptiles and anti-reptiles?

Because anti-reptiles are a relatively new area for the time being, some definitions have to be down. Our internal definition is like this:

Reptile: A method of obtaining website information in bulk using any technical means. The key is in batches.

Anti-crawling: A way of using any technical means to prevent others from obtaining their own website information in bulk. The key is also the batch.

Injury: In the process of anti-reptiles, the ordinary user is mistakenly identified as a crawler. The anti-reptile strategy with high accidental injury rate can't be used even better.

Intercept: Successfully blocked crawler access. There is a concept of interception rate here. Generally speaking, the higher the interception rate, the higher the possibility of accidental injury. Therefore, we need to make a trade-off.

Resources: The sum of machine costs and labor costs.

It is important to remember that labor costs are also resources and more important than machines. Because, according to Moore's Law, machines are getting cheaper. According to the development trend of the IT industry, programmers are getting more and more expensive. Therefore, letting the other party work overtime is king, and the cost of the machine is not particularly valuable.

Fourth, know yourself and know how to write simple crawlers

To do anti-reptiles, we first need to know how to write a simple crawler.

Currently, the crawler data found on the web is very limited, usually just a piece of python code. Python is a good language, but it is really not the best choice for crawling sites with anti-reptile measures.

More ironically, the python crawler code that is usually found will use a lynx user-agent. What should you do with this user-agent, so I don't need it for me?

Usually writing reptiles goes through several processes:

Analyze page request format

Create a suitable http request

Send http requests in batches to get data

For example, look directly at the Ctrip production url. Click the "OK" button on the details page to load the price. Assuming the price is what you want, which request is the result you want after grabbing the network request?

The answer is unexpectedly simple, you only need to use the amount of data transmitted in the network to reverse the order. Because other confusing urls are more complex, developers will not be willing to add data to him.

5. Know yourself and know how to write advanced crawlers

So how should the reptile advancement be done? Usually the so-called advanced has the following:

Distributed

There are usually some textbooks telling you that in order to climb efficiently, you need to distribute crawlers to multiple machines. This is totally deceptive. The only role of distribution is to prevent the other party from sealing the IP. Sealing IP is the ultimate means, the effect is very good, of course, the user is also very cool.

2. Simulating JavaScript

Some tutorials will say that simulating javascript and crawling dynamic web pages is an advanced technique. But in fact this is just a very simple function. Because, if the other party has no anti-reptiles, you can directly catch the ajax itself, without having to care about how js handles it. If the other party has anti-reptiles, then javascript must be very complicated, the focus is on analysis, not just simple simulation.

In other words: this should be the basic skill.

3. PhantomJs

This is an extreme example. This thing was originally intended to be used for automated testing. As a result, the effect is very good and many people use it as a crawler. But this thing has a bad injury, that is: efficiency. In addition, PhantomJs can also be caught, for a variety of reasons, not to mention here.

Six, the advantages and disadvantages of different levels of reptiles

The lower the level of reptiles, the easier it is to be blocked, but the performance is good and the cost is low. The more advanced reptiles, the harder it is to be blocked, but the lower the performance and the higher the cost.

When the cost is high enough, we can eliminate the need to block the reptiles. There is a term in economics called the marginal effect. The cost is high to a certain extent, and the benefits are not many.

Then if we compare the resources of both parties, we will find that it is not worthwhile to die unconditionally with the other party. There should be a golden point. If you exceed this point, let it climb. After all, our anti-reptiles are not for the sake of face, but for business reasons.

How to design an anti-crawl system (conventional architecture)

A friend once gave me such an architecture:

Pre-processing the request for easy identification;

Identify if it is a reptile;

Properly handle the recognition results;

At the time, I felt that it sounded very reasonable. It was a structure, and the idea was different from us. Later, we really did not react to it. because:

If you can identify a reptile, how much nonsense? I want to do it if I want to do it. If you can't identify the reptile, who do you handle properly?

There are two sentences in three sentences that are nonsense, only one sentence is useful, and no specific implementation has been given. So: What is the use of this architecture (teacher)?

Because there is currently an architect worship problem, many small startup companies are recruited in the name of architects. The given TItle is: the junior architect, the architect itself is a senior position, why there is a primary structure. This is equivalent to: junior general / junior commander.

Finally went to the company and found ten people, one CTO, nine architects, and maybe you are a junior architect, others are senior architects. However, the junior architects are not too arrogant, and some small startups are recruiting CTOs for development.

Traditional anti-reptile means

The background counts access and is blocked if a single IP access exceeds the threshold.

Although this effect is not bad, but there are actually two defects, one is very easy to accidentally injure ordinary users, the other is that IP is actually not worth the money, dozens of dollars may even buy hundreds of thousands of IP. So overall it is more deficient. But for the reptiles in March, this is still very useful.

The background counts the access, and if a single session access exceeds the threshold, it is blocked.

This looks a bit more advanced, but in fact the effect is even worse, because the session is completely worthless, just re-apply one.

The background counts the access, and if a single userAgent access exceeds the threshold, it is blocked.

This is a big move, similar to antibiotics, the effect is surprisingly good, but the lethality is too great, the accidental injury is very serious, and you should be very careful when using it. So far we have only temporarily blocked Firefox under mac.

Combination above

The combination ability becomes larger, the accidental injury rate decreases, and it is easier to use when encountering low-level reptiles.

From the above we can see that in fact, the crawler anti-reptile is a game, and the RMB player is the most powerful.

Because of the methods mentioned above, the effects are average, so it is still more reliable with JavaScript.

Some people may say: If you do javascript, can you skip the front-end logic and pull the service directly? How can it be reliable? Because ah, I am a headline party AI JavaScript is not just doing the front end. Skip the front end is not the same as skipping JavaScript. In other words: our server is done by nodejs.

Thinking questions: When we write the code, what code are we most afraid of? What code is not easy to debug?

Eval

Eval is notorious, it is inefficient and readable. It is exactly what we need.

Goto

Js is not good for goto support, so you need to implement goto yourself.

Confuse

The current minify tool is usually a simple name like minify to abcd, which does not meet our requirements. We can minify to be better, such as Arabic. why? Because Arabic sometimes writes from left to right, sometimes from right to left, and sometimes from bottom to top. Unless the other party hires an Arab programmer, it is not a headache.

Unstable code

What bug is not easy to fix? A bug that is not easy to reproduce is not easy to fix. Therefore, our code is full of uncertainty, and it is different every time.

Code demo

The download code itself can be easier to understand. Here is a brief introduction to the following ideas:

The pure JAVASCRIPT anti-reptile DEMO, by changing the connection address, allows the other party to grab the wrong price. This method is simple, but it is easy to find if the other person looks at it in a targeted way.

Pure JAVASCRIPT anti-reptile DEMO, change the key. This is simple and not easy to find. But it can be achieved by intentionally crawling the wrong price.

Pure JAVASCRIPT anti-repeller DEMO, change the dynamic key. This method can make the cost of changing the key zero, so the cost is lower.

Pure JAVASCRIPT anti-reptile DEMO, very complicated to change the key. This method can make the other party difficult to analyze. If you add the browser detection mentioned later, it is more difficult to be crawled.

So far.

Earlier we mentioned the marginal effect, that is, we can stop here. Subsequent reinvestment in manpower will not be worth the loss. Unless there is a special opponent to die with you. But this time is to fight for dignity, not for business reasons.

Browser detection

Our detection methods are different for different browsers.

IE, detecting bugs;

FF, the degree of strictness of the test;

Chrome, detecting powerful features.

Eight, I caught you - then what to do will not trigger production events - direct interception

May trigger production events - give false data (also called poisoning)

There are also some divergent ideas. For example, is it possible to do SQL injection in the response? After all, it is the first mover's hand. However, this issue has not given a specific reply to the law, and it is not easy to explain to her. So for the time being it is just an idea.

Technical suppression

We all know that there is a de command in DotAAI. When the AI ​​is killed, the multiple of its experience will increase. Therefore, too many AIs are killed in the early stage, and the AI ​​will be dressed in a gods and cannot be killed.

The correct way is to suppress the opponent's level, but not kill. The anti-reptiles are the same. Don't overdo it at the beginning, forcing people and you to die.

2. Psychological warfare

Provocative, compassionate, ridiculous, and wretched.

I don’t mention it above, everyone can understand the spirit.

3. Release water

This may be the highest state.

Programmers are not easy, and it is especially difficult to be a crawler. Poor pity they give them a small meal to eat. Maybe in a few days, you will be a reptile because of the anti-reptiles.

Current Type Voltage Transformer

Current Type Voltage Transformer,Mini Voltage Transformer,Current Voltage Transformer,Mini Encapsulated Voltage Transformer

Zibo Tongyue Electronics Co., Ltd , https://www.tongyueelectron.com

This entry was posted in on