Abstract: Given a fixed budget, purchasing fewer hosts of higher capability or, conversely, more hosts of lower capability is a trade-off that must be made in practice when building a Hadoop big data platform. An exploratory study is presented for a Housing Big Data Platform project (HBDP), where the typical big data computing consists of SQL queries with aggregate, join, and space-time selection conditions executed on massive data from more than 10 million housing units. In HBDP, an empirical formula was introduced to predict the performance of candidate host clusters on the intended big data computing; the formula was fitted via a regression approach. With this empirical formula, it is easy to suggest an optimal cluster configuration. The investigation was based on a typical Hadoop computing ecosystem, HDFS+Hive+Spark. A suitable metric was proposed to measure the performance of Hadoop clusters in HBDP, and measured values were tested and compared against their predicted counterparts on three kinds of typical SQL query tasks. Tests were conducted with respect to the factors of CPU benchmark score, memory size, virtual host division, and the number of physical hosts in the cluster. The research has been applied to practical cluster procurement for housing big data computing.
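A minimal sketch of how such an empirical formula could be fitted, assuming a linear form over the four factors named in the abstract; the class name, sample values, and the Apache Commons Math dependency are illustrative assumptions, not the paper's actual method:

```java
import org.apache.commons.math3.stat.regression.OLSMultipleLinearRegression;

// Sketch: fit an empirical cluster-performance formula by ordinary
// least squares over measured configurations (values are invented).
public class ClusterPerfModel {
  public static void main(String[] args) {
    // Each row: {cpuBenchmark, memoryGB, vHostsPerPhysicalHost, physicalHosts}
    double[][] x = {
        {1200, 32, 2, 4}, {1200, 64, 2, 4}, {2400, 32, 4, 4},
        {2400, 64, 4, 8}, {3600, 128, 4, 8}, {3600, 128, 8, 16},
    };
    // Measured performance metric per configuration, e.g. the inverse
    // of mean query time (illustrative numbers only).
    double[] y = {0.41, 0.52, 0.66, 0.95, 1.30, 1.82};

    OLSMultipleLinearRegression ols = new OLSMultipleLinearRegression();
    ols.newSampleData(y, x);
    double[] beta = ols.estimateRegressionParameters();

    // beta[0] is the intercept; beta[1..4] weight the four factors,
    // giving a formula to rank candidate procurement options.
    System.out.printf("perf ~ %.4f + %.6f*cpu + %.4f*mem + %.4f*vdiv + %.4f*n%n",
        beta[0], beta[1], beta[2], beta[3], beta[4]);
  }
}
```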
Abstract: The system for analyzing and eliciting public grievances serves to receive and process all kinds of complaints from the public and respond to users. As the volume of complaints grows, the data becomes big data that is difficult to store and process. The proposed system uses HDFS to store the big data and MapReduce to process it. A cache was incorporated into the system to provide immediate response and timely action using big data analytics; cache-enabled big data shortens the system's response time. The unstructured data provided by users is handled efficiently through the MapReduce algorithm. Complaints are processed in the order of the authority hierarchy. The drawbacks of the traditional database system used in the existing system are addressed by our system through a cache-enabled Hadoop Distributed File System. MapReduce framework code can leak sensitive data during the computation process, so we propose a system that adds noise to the output of the reduce phase to avoid signaling the presence of sensitive data. If a complaint is not processed within ample time, it is automatically forwarded to a higher authority, which guarantees that every complaint is acted upon. A copy of the filed complaint is sent to the user's email address as a digitally signed PDF document, which serves as proof of filing. The system's reports serve as essential data when making important decisions based on legislation.
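The noise-adding reduce phase described above could look like the following sketch, which perturbs per-category complaint counts with Laplace noise in the style of the differential-privacy Laplace mechanism; the class name, key/value types, and privacy budget are assumptions, since the abstract does not specify the noise distribution:

```java
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: sums complaint counts per category and perturbs
// the total with Laplace noise before emitting it, so the exact count
// (and hence the presence of any single sensitive record) is masked.
public class NoisyCountReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private static final double EPSILON = 1.0; // assumed privacy budget
  private final Random rng = new Random();

  // Draw from Laplace(0, scale) by inverting the CDF of a uniform sample.
  private double laplace(double scale) {
    double u = rng.nextDouble() - 0.5;
    return -scale * Math.signum(u) * Math.log(1.0 - 2.0 * Math.abs(u));
  }

  @Override
  protected void reduce(Text category, Iterable<LongWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    long sum = 0;
    for (LongWritable c : counts) sum += c.get();
    // Sensitivity of a count query is 1, so the scale is 1 / epsilon.
    long noisy = Math.max(0, Math.round(sum + laplace(1.0 / EPSILON)));
    ctx.write(category, new LongWritable(noisy));
  }
}
```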
Abstract: The current Hadoop block placement policy does not fairly and evenly distribute replicas of blocks written to datanodes in a Hadoop cluster.
This paper presents a new solution that helps keep the cluster in a balanced state while an HDFS client is writing data to a file in a Hadoop cluster. The solution was implemented, and tests were conducted to evaluate its contribution to the Hadoop distributed file system.
It was found that the solution lowered the global execution time taken by the Hadoop balancer to 22 percent of its original value. It was also found that the Hadoop balancer over-replicated 1.75 and 3.3 percent of all redistributed blocks in the modified and original Hadoop clusters, respectively.
The feature that keeps the cluster in a balanced state works as a core part of the Hadoop system, not just as a utility like the traditional balancer. This is one of the significant achievements and unique aspects of the solution developed in the course of this research work.
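Purely as an illustration of the idea of balancing at write time rather than with an after-the-fact balancer, the following sketch picks the least-utilized datanodes as replica targets; the types and method names are invented for clarity and are not the real HDFS BlockPlacementPolicy API:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative utilization-aware replica target chooser in the spirit
// of the modified placement policy (names are hypothetical).
public class BalancedPlacement {
  public static class DataNode {
    final String id;
    final long usedBytes;
    final long capacityBytes;

    DataNode(String id, long usedBytes, long capacityBytes) {
      this.id = id;
      this.usedBytes = usedBytes;
      this.capacityBytes = capacityBytes;
    }

    double utilization() { return (double) usedBytes / capacityBytes; }
  }

  // Pick the `replicas` least-utilized datanodes so that new blocks keep
  // the cluster balanced as data is written, instead of relying on a
  // later balancer run to redistribute them.
  public static List<DataNode> chooseTargets(List<DataNode> nodes, int replicas) {
    return nodes.stream()
        .sorted(Comparator.comparingDouble(DataNode::utilization))
        .limit(replicas)
        .collect(Collectors.toList());
  }
}
```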