Proxisch: An Optimization Approach of Large-Scale Unstable Proxy Servers Scheduling

Large companies such as Google and Microsoft, which have ample
proxy servers, can run web crawlers against a single website in
parallel. Researchers, however, usually cannot afford dedicated
proxy servers, so crawling large amounts of information from a
single website in parallel remains difficult for them. In this case,
free public proxy servers collected from the Internet are a
practical alternative. To improve the efficiency of such a web
crawler, two issues must be addressed: (1) tasks may fail owing to the
instability of free proxy servers; (2) a proxy server will be blocked
if it visits a single website too frequently. In this paper, we propose
Proxisch, an optimization approach to scheduling large-scale unstable
proxy servers, which allows anyone to run a web crawler efficiently
at extremely low cost. Proxisch is designed to work efficiently
by making maximum use of reliable proxy servers. To address the second
issue, it establishes a frequency control mechanism that keeps
the visiting frequency of any chosen proxy server below the
website’s limit. The results show that our approach outperforms
the other scheduling algorithms evaluated.
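
The two ideas above (preferring reliable proxies and capping each proxy's visiting frequency) can be illustrated with a minimal sketch. It is not the Proxisch algorithm itself, which the abstract does not specify. The class names, the smoothed success-rate score, and the fixed `min_interval` cooldown are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ProxyStats:
    """Per-proxy bookkeeping (hypothetical structure, not from the paper)."""
    successes: int = 0
    failures: int = 0
    next_allowed: float = 0.0  # earliest time this proxy may be used again

    @property
    def reliability(self) -> float:
        # Smoothed success rate: an unseen proxy starts at 0.5.
        return (self.successes + 1) / (self.successes + self.failures + 2)


class ProxyScheduler:
    """Picks the most reliable proxy that is not in its cooldown window."""

    def __init__(self, proxies, min_interval):
        # min_interval: assumed minimum number of seconds between two visits
        # by the same proxy, chosen to stay under the website's limit.
        self.min_interval = min_interval
        self.stats = {p: ProxyStats() for p in proxies}

    def acquire(self, now):
        """Return a usable proxy at time `now`, or None if all are cooling down."""
        ready = [p for p, s in self.stats.items() if s.next_allowed <= now]
        if not ready:
            return None
        best = max(ready, key=lambda p: self.stats[p].reliability)
        # Frequency control: block this proxy until the interval has passed.
        self.stats[best].next_allowed = now + self.min_interval
        return best

    def report(self, proxy, ok):
        """Record the outcome of a task so later choices favor stable proxies."""
        s = self.stats[proxy]
        if ok:
            s.successes += 1
        else:
            s.failures += 1
```

A crawler loop would call `acquire(time.monotonic())` before each request, skip or wait when it returns `None`, and call `report` with the outcome. Passing the clock value in explicitly keeps the sketch deterministic and easy to test.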



