A CLOUD BASED MODEL FOR DEDUPLICATION OF LARGE DATA

Archit Gupta
5 min read · Sep 15, 2021

INTRODUCTION

Due to recent technological advances, automation systems, and the availability of large applications such as digital libraries, a large amount of data is generated every day. Processing this data and storing the results requires ever more storage space. As the data accumulates, it becomes difficult to search this mass storage efficiently and accurately: searching and processing take longer than usual, and the delay keeps growing as the data grows. To address these problems, the concept of deduplication is used to reduce the data to a smaller amount by storing only unique data; that is, all duplicate data is removed during the data backup process. In this way, the amount of storage space required is reduced, and the time taken to retrieve data is also reduced, because the number of stored data blocks is smaller than that of the original data. The deduplication process is challenging and relies on suitable tools and well-chosen data structures. The main steps in deduplication are as follows:
•Preprocessing of input data
•Chunking
•Hash generation
•Duplicate identification

The concept of deduplication is explained in more detail, together with the various algorithms available, in the sections that follow. Existing algorithms and their limitations are also reviewed. The proposed cloud-based model aims to solve some of the problems in the deduplication process and offers a new research direction. By combining cloud computing technology with data deduplication, the stored data can also be protected through the variety of security services offered by the cloud environment.

2. DEDUPLICATION AND EXISTING METHODS

The input data is first preprocessed if necessary. In some cases this initial processing takes considerable time, depending on the type of preprocessing used, such as data cleaning, generalization, or data encryption. The next step is chunking, the process of splitting the input data into multiple blocks or clusters of data. This is an important step in the deduplication process, because the number of duplicates found changes with the size of each chunk. Depending on the input data, the chunk size should be chosen so that the resulting chunks contain as many duplicates as possible and the storage size is reduced as far as possible. The saving is measured by the deduplication ratio:

deduplication ratio = (data size before deduplication) / (data size after deduplication)

The deduplication ratio therefore depends on the chunk size selected during the chunking process. Chunking can be performed using a variety of methods, such as:
•Fixed-size chunking: data is split into chunks of the same size.
•Variable-size chunking: data is divided into chunks of various sizes depending on specific features or patterns, such as each line, sentence, or paragraph.
•Dynamic chunking: chunk boundaries are decided on the fly as the data is read and processed, with no pattern or boundary criteria fixed in advance, so the chunk sizes are effectively random.
•Whole-file chunking: the entire file is treated as a single chunk. This is only used when the input data consists largely of frequently repeated files.

The number of chunks produced depends on the chunk size and the total size of the input data. The next step after chunking is hash generation, in which a hash value is computed for each generated chunk so that duplicates can be identified; a minimal sketch of these two steps follows.
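Here is a small Python sketch of fixed-size chunking followed by hash generation and duplicate identification. The chunk size and the choice of SHA-256 are assumptions for illustration; the article does not fix either.

```python
import hashlib

CHUNK_SIZE = 4096  # assumed fixed chunk size in bytes; the article does not specify a value


def fixed_size_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split the input into fixed-size chunks (the first chunking method above)."""
    for offset in range(0, len(data), chunk_size):
        yield data[offset:offset + chunk_size]


def deduplicate(data: bytes):
    """Hash each chunk and keep only one copy per unique chunk, keyed by its hash."""
    unique_chunks = {}  # hash -> chunk content (the deduplicated store)
    recipe = []         # ordered list of hashes needed to rebuild the original data
    for chunk in fixed_size_chunks(data):
        digest = hashlib.sha256(chunk).hexdigest()  # hash generation step
        if digest not in unique_chunks:             # duplicate identification step
            unique_chunks[digest] = chunk
        recipe.append(digest)
    return unique_chunks, recipe


if __name__ == "__main__":
    sample = b"A" * CHUNK_SIZE * 4 + b"B" * CHUNK_SIZE
    store, recipe = deduplicate(sample)
    print(f"chunks produced: {len(recipe)}, unique chunks stored: {len(store)}")
    # -> chunks produced: 5, unique chunks stored: 2
```

With a fixed chunk size, only chunks whose contents are byte-for-byte identical are merged; variable-size or dynamic chunking would change where the boundaries fall and hence how many duplicates are found.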

3. CLOUD BASED DEDUPLICATION

Cloud computing is an emerging technology in IT and is used in almost all areas and applications. Using a cloud environment makes processing easier because of the variety of resources available. In the proposed method, the cloud service is used to provide data storage to the user and to secure the data after the deduplication process is done. The proposed cloud-based deduplication model is shown in Fig. 1. As shown in Fig. 1, the user acts as the client and the cloud acts as the server. The user performs the first two steps of deduplication on the client side (preprocessing and chunking). The generated chunks are then transferred to a server system hosted in the cloud; this stage is called data upload. The upload service is provided by a “Service Provider” in the cloud, which also provides the storage space required to hold the user’s data.
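The article does not pin down a concrete upload protocol, so the sketch below only illustrates the client/server split it describes: the client chunks the data and hashes each chunk, while a stand-in for the Service Provider’s storage keeps a single copy of each unique chunk plus a per-user “recipe” for reassembly. The names CloudStorageServer and client_upload are hypothetical; a real deployment would expose these operations over a network API.

```python
import hashlib


class CloudStorageServer:
    """Stand-in for the Service Provider's storage: one copy per unique chunk."""

    def __init__(self):
        self.chunk_store = {}   # chunk hash -> chunk bytes
        self.user_recipes = {}  # (user, object name) -> ordered list of chunk hashes

    def has_chunk(self, digest: str) -> bool:
        return digest in self.chunk_store

    def upload_chunk(self, digest: str, chunk: bytes) -> None:
        # Store the chunk only if an identical chunk is not already present.
        self.chunk_store.setdefault(digest, chunk)

    def save_recipe(self, user: str, name: str, recipe: list) -> None:
        self.user_recipes[(user, name)] = recipe


def client_upload(server: CloudStorageServer, user: str, name: str,
                  data: bytes, chunk_size: int = 4096) -> None:
    """Client side: chunk the data and send only chunks the server does not hold yet."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if not server.has_chunk(digest):
            server.upload_chunk(digest, chunk)
        recipe.append(digest)
    server.save_recipe(user, name, recipe)
```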

These data chunks are then encrypted by a “Cloud Broker” using user identification information such as name, personal ID, username, and password as the basis for the encryption key. This way, only the user can access the data, because only the user knows the password. The cloud usually stores the user’s password in encrypted form, so even the Cloud Broker does not know it. While uploading the chunked data, the user also supplies the information required for the encryption process.
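The article does not specify the cipher or key derivation the Cloud Broker uses, so the following sketch makes assumptions: it derives a symmetric key from the username and password with PBKDF2 and encrypts each chunk with Fernet (AES-128-CBC plus an HMAC) from the third-party cryptography package. The function names, salt handling, and iteration count are illustrative only.

```python
import base64
import os

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC


def derive_key(username: str, password: str, salt: bytes) -> bytes:
    """Derive a symmetric key from user credentials (assumed broker behaviour)."""
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32,
                     salt=salt, iterations=480_000)
    secret = (username + ":" + password).encode()
    return base64.urlsafe_b64encode(kdf.derive(secret))


def encrypt_chunk(chunk: bytes, key: bytes) -> bytes:
    return Fernet(key).encrypt(chunk)


def decrypt_chunk(token: bytes, key: bytes) -> bytes:
    return Fernet(key).decrypt(token)


# Example usage (the salt would be stored with the user's account metadata):
salt = os.urandom(16)
key = derive_key("alice", "correct horse battery staple", salt)
token = encrypt_chunk(b"example chunk bytes", key)
assert decrypt_chunk(token, key) == b"example chunk bytes"
```

Because only material derived from the user’s own credentials can decrypt the chunks, the broker and other tenants cannot read the stored data, which matches the access guarantee described above.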

4. PERFORMANCE EVALUATION

The performance of the proposed model depends on the following parameters or metrics:
•Deduplication ratio: the percentage of storage saved due to deduplication. This depends on the type of chunking and the chunk size used (a small computation sketch follows this list).
•Lookup delay: the time taken to find duplicates. This depends on the algorithm used to identify duplicates.
•Collision rate: the percentage of chunks incorrectly reported as duplicates, i.e. the accuracy of duplicate identification. This depends on the hash generation algorithm used.
•Security rate: the level of security provided for user data. This depends on the type of security used and the number of unauthorized attempts to access user data.
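As a small illustration of the first metric, the snippet below computes the deduplication ratio and the percentage of storage saved from the data sizes before and after deduplication; the helper name dedup_metrics is just for illustration.

```python
def dedup_metrics(size_before: int, size_after: int) -> dict:
    """Compute basic storage metrics from sizes in bytes."""
    ratio = size_before / size_after                 # e.g. 4.0 means a 4:1 reduction
    saved_pct = 100.0 * (1 - size_after / size_before)
    return {"deduplication_ratio": ratio, "storage_saved_percent": saved_pct}


print(dedup_metrics(100 * 2**30, 25 * 2**30))
# {'deduplication_ratio': 4.0, 'storage_saved_percent': 75.0}
```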

In the future, this model could be extended with features related to security policy, and a new model could be designed specifically for cloud-based deduplication, including a novel approach to duplicate discovery and hash generation. Multiple users can access the service simultaneously, and the details of one user remain unknown to the other users.

5. CONCLUSION

Data deduplication has become an important concern in recent times due to the increase in the amount of data. At the same time, cloud computing technology has gained ground in many fields of current research. This paper provides a cloud-based model for deduplication of large data. In addition, the proposed model aims to provide security for user data stored in cloud storage. The proposed model can be implemented using any of the algorithms described in the literature study. Depending on the algorithm used, the model can be more efficient and accurate than existing deduplication processes. Existing methods can also be extended and integrated into the proposed model using the same algorithms employed there.

REFERENCES

[1] Joao Paulo, Jose Pereira; “A Survey and Classification of Storage Deduplication Systems”, ACM Computing Surveys, Article No. 11, Volume 47, Issue 1, July 2014.

[2] Manvizhi. N, Suguna. M; “Duplication Tool for a Data Repository in E-Shopping Using Evolutionary Computing”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 6, June 2013.

[3] J.R. Waykole, S.M. Shinde; “A Survey Paper on Deduplication by using Genetic Algorithm Along with Hash-Based Algorithm”, International Journal of Engineering Research and Applications, Volume 4, Issue 1, January 2014.
