Abstract

With the advancement of technology and the accumulation of archival information resources in the era of big data, using archival information resources to improve archival service capability and promote intelligent archive management has become a pressing task for data resource management. Given the significance of archive management, this research focuses on applying data mining technology in the archival information resource sharing platform: it explains the core ideas of data mining and, on that basis, discusses the construction of a data mining model. The functional modules and operational process of the data mining model are examined in detail in order to select and create archival information data sources. The preparation of data mining samples and the creation of the prediction and profiling models are then discussed to lay the foundation of a data mining-based model for archival information resources. Furthermore, this study provides a more effective scientific solution for achieving the comprehensive digitization of archival information.

1. Introduction

With the rapid evolution of data-related technology in today’s Internet era, data have become increasingly important. In the past, data held by different departments often sat in chimney-like silos: the data were highly correlated with one another but could not be fully exploited, a problem that is especially pronounced for archival information resources [1]. The development of information technology (IT) has enabled deep mining of archival information, so that the various elements of archival information management form intrinsic links and the sharing of archival information resources can be fully realized [2, 3]. At present, the application of archival information resource data is mainly focused on the maintenance and retrieval of existing data, whereas data mining is not only surface-level numerical analysis but also content-based deep semantic knowledge discovery [4].

In the era of big data, the widespread use of IT has changed traditional ways of working, and the same is true for archives management. In terms of ease of use and mobility in archive preservation, digitized archives offer unrivaled benefits over traditional archives. Digitized archives can use big data and cloud computing technology to calculate and adjust actual demand, quickly match the relevant resources, provide rapid feedback, and effectively solve the “bloated library” problem of traditional archives management. For the archives themselves, digitization opens a new era by exploiting the core of big data [5, 6], which lies in mining value from massive amounts of data. First, digitization has completely changed the way archives are used and greatly improved their utilization rate. After digital processing, the needed data can be located within seconds, which not only reduces the workload of archivists but also greatly improves the efficiency of archival work. In addition, digitization changes the storage mechanism of archives.

Although archive digitization significantly transforms the way archives are managed and brings a great deal of convenience, it also introduces several major issues. In terms of ideology, insufficient attention is paid to the digitization of archives; in terms of technology, digitization processing remains difficult; in terms of talent, there is a shortage of personnel skilled in digital archives management; and in terms of confidentiality, data security is a central concern of archives management in the era of big data.

From the perspective of data stream mining, data mining of archival information resources is a huge challenge in the big data setting [7]. By establishing and implementing a data mining model for archival information resources, we can provide a reference data mining solution for the archival information resource sharing platform. The basic aim of this study is to develop an archival information resource sharing platform by exploiting current technologies such as data mining, deep learning, cloud storage, and big data. To evaluate the model’s validity, provide a more effective scientific solution, and achieve the goal of comprehensive archival information construction, a prediction model and a profiling model set are also produced.

The remainder of the paper is organized as follows: Section 2 discusses data mining-related concepts, including data analysis, data mining, and common analysis methods in data mining. Section 3 presents the establishment of a data mining model for the archival information resource sharing platform, covering the establishment of the data mining model, archival information resource data selection, and archival information resource data source creation. Section 4 describes the implementation of the data mining model for the archival information resource sharing platform, including the functional modules of the data mining model, the data mining model processing flow, data mining model sample preparation, predictive model creation, and profiling model set creation. Section 5 presents the experiment and analysis, and Section 6 concludes the paper.

2. Data Mining-Related Concepts

Data in an archival information resource sharing platform can be structured, such as data in relational databases; semistructured, such as textual, graphical, and image data; or unstructured, such as data distributed across a network [8, 9]. Data mining techniques include statistical analysis, sequence pattern discovery, and data extraction [8], and data mining for archival information resource sharing platforms involves two types of tasks: prediction tasks and description tasks [10]. The refinement of data mining tasks is usually exploratory in nature and generally requires subsequent data processing to validate and interpret the results of the previous round of data mining.

2.1. Data Analysis

With the development of information technology, mathematics and computer science have combined to produce the branch of data analysis. Data analysis of archival information resources entails the automated analysis of collected archival information resource data using appropriate statistical methods. It also emphasizes aggregating the collected data into statistical reports in order to maximize the use of existing archival data and the utility of shared platform data. Traditional Excel data analysis functions can meet basic statistics and analysis requirements, producing histograms, correlation coefficients, covariances, various probability distributions, samples, dynamic simulations, and judgments about population means. Modern data analysis generally relies on the powerful analysis functions of databases such as Microsoft SQL Server, Oracle, and other medium and large database systems, which come with data analysis tools and can be configured to report data as needed.
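As a minimal illustration of the kind of descriptive statistics mentioned above, the following Python/pandas sketch computes distributions and correlation coefficients for a small, hypothetical table of archival access records; the column names and values are assumptions made purely for the example and are not part of the platform described in this paper.

```python
import pandas as pd

# Hypothetical archival access records; column names are illustrative only.
records = pd.DataFrame({
    "retrievals_per_month": [12, 7, 30, 3, 18, 25],
    "record_age_years":     [2, 15, 1, 40, 5, 3],
    "digitized":            [1, 0, 1, 0, 1, 1],
})

# Descriptive statistics (mean, standard deviation, quartiles) per column.
print(records.describe())

# Pairwise correlation coefficients, e.g. record age versus retrieval frequency.
print(records.corr())

# A simple frequency distribution (histogram counts) of monthly retrievals.
print(pd.cut(records["retrievals_per_month"], bins=3).value_counts().sort_index())
```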

2.2. Data Mining

The demand for data utilization is increasing, and data mining research has become a popular direction in this context. The rise of big data has put conventional data mining methods to the test [11], and data mining of archival information resources has remained a popular topic in recent years. Modern technologies including database technology, artificial intelligence (AI), statistical modeling, pattern matching, visualization, and other relevant technologies are used in the data mining process [12]. Furthermore, data mining transforms the analysis of archival information resource data from traditionally semiautomated to fully automatic, using established techniques to identify potentially important data that helps archival services make forward-looking judgments. This is particularly beneficial for multimedia archival information resources: archival resources were traditionally stored in a structured manner because they existed mainly in text form, but with the upgrading of hardware and the development of modern Internet technologies they now exist in a variety of forms, including text, images, videos, and structured records [1, 13] such as lists and tables, which can also contain unstructured data.

2.3. Common Analysis Methods in Data Mining

Among the many analysis methods for data mining, this paper focuses on two: neural network analysis and connected (link) analysis. The neural network approach trains a neural network on a series of archival information resource data and summarizes what it has learned in a compact form; when new archival information resource data arrive, the network can generalize from its past learning to derive valuable data references. The connected analysis method, in contrast, is based on graph theory and builds a model from the relationships among different data, which may be relationships between people, between things, or between people and things. This method can be used to create a variety of intelligent applications, and further research along these lines can improve archival service capability.
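To make the connected (link) analysis idea concrete, the sketch below uses the networkx library to build a small relationship graph among hypothetical archival entities and compute simple link measures; the entity names and relationships are invented for illustration and do not come from this paper.

```python
import networkx as nx

# Hypothetical relationships among archival entities (people, documents, projects).
edges = [
    ("researcher_A", "doc_001"),
    ("researcher_A", "doc_002"),
    ("researcher_B", "doc_002"),
    ("doc_001", "project_X"),
    ("doc_002", "project_X"),
]
graph = nx.Graph(edges)

# Degree centrality highlights heavily connected records or users.
print(nx.degree_centrality(graph))

# Connected components group entities related through any chain of links.
for component in nx.connected_components(graph):
    print(component)
```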

3. Establishment of Data Mining Model for Archive Information Resource Sharing Platform

3.1. Establishment of Data Mining Models

To fully exploit the benefits of the archival information resource sharing platform, a comprehensive processing architecture based on users’ actual demands should be built [14]. Before the data mining model of the sharing platform is established, the business problems involving archival information resource data in archival services should be well defined. How to solve these problems should then be determined, and the broad problem objectives should be specified and refined. In building the data mining model for archival information resources, several key factors need to be addressed: identifying target platform users, establishing feasible behavior patterns for them, determining common user operation habits, listing high-frequency information in the process of data utilization, analyzing the access characteristics of high-frequency data, and establishing special access channels for specific archival information resources to improve data utilization efficiency. The specific establishment process is illustrated in Figure 1.
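One of the steps listed above, identifying high-frequency utilization data and common user operation habits, can be sketched as follows; the access-log fields are hypothetical and serve only to illustrate how records that deserve a dedicated fast-access channel might be located.

```python
import pandas as pd

# Hypothetical access log of the sharing platform; field names are assumptions.
log = pd.DataFrame({
    "user_id":   ["u1", "u2", "u1", "u3", "u2", "u1"],
    "record_id": ["r10", "r10", "r11", "r10", "r12", "r10"],
})

# Records accessed most often are candidates for a special access channel.
print(log["record_id"].value_counts().head(5))

# Per-user operation habits: which records each user retrieves repeatedly.
habits = log.groupby(["user_id", "record_id"]).size().reset_index(name="hits")
print(habits.sort_values("hits", ascending=False))
```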

In this process, an intermediate model is used to capture the relationship between input data variables and target variables. An accurate and thorough understanding of the data mining objects and tasks is required; the archival information resource data cannot be translated into mining tasks until this is fully understood. Before implementing the data mining task, it is also necessary to clarify how the data mining results of archival information resources will be used and how they will be delivered.

3.2. File Information Resource Data Selection

The data storage points for archival information resources initially sit on the big data platform connected to the sharing platform, but greatly enlarging the data is a challenging task for the platform itself [15]. Before entering the system, the archival information resource big data stored in the warehouse are cleaned and verified. The sharing platform then uses data integration technology to merge diverse data resources. During real data integration, fields with identical names but different meanings will be encountered, so a data migration procedure is carried out in between.

Certain data in the data set produce greater value than others. The essence of data mining is to use past data to predict future trends, and for data mining of archival information resources, more recent historical data are more valuable. Data with a long history need deeper processing because they carry less additional attribute information; only in this way can the value of older archival information resource data be brought into play.
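The recency argument above can be illustrated with a small weighting scheme: more recent records receive a larger weight when building the mining data set, while older records contribute less unless they are enriched with further attributes. The record fields and the specific weighting formula are assumptions made for the sake of the example.

```python
import pandas as pd

# Hypothetical archival records with an archive date; names are illustrative.
records = pd.DataFrame({
    "record_id":    ["r1", "r2", "r3", "r4"],
    "archive_date": pd.to_datetime(["2021-06-01", "2015-03-10", "2008-01-20", "2022-02-14"]),
})

reference_date = pd.Timestamp("2022-06-01")
records["age_years"] = (reference_date - records["archive_date"]).dt.days / 365.25

# Newer records are weighted more heavily in the mining data set.
records["weight"] = 1.0 / (1.0 + records["age_years"])
print(records.sort_values("weight", ascending=False))
```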

3.3. File Information Resource Data Source Creation

Data exploration must be done thoroughly before the data are used to build the model; otherwise, quality problems in the archival information resources will not be discovered until later use. When creating specific data sources, the following steps can be followed.

To examine the data distribution of archival information resources, visualization tools can be used during the initial exploration of the database. Existing visualization tools such as Excel provide powerful aggregate analysis support for the archival information data to be mined. After acquiring data files from archival information resource data sources, their contents need to be analyzed; this analysis may raise warnings about inconsistencies or definition problems, which must be resolved to avoid unnecessary trouble in subsequent analysis.

Problematic data descriptions can be identified by comparing and double-checking the values and descriptions stored in archival information resource data, observing the archival information resource big data, and comparing them with the variable descriptions in existing documents. It is therefore important to ensure that the stored data are consistent with the data to be described and that the field definition of each record is clear, so that inconsistent data source errors do not arise in subsequent data mining. Problematic archival information resource data are then discussed and studied. Unlike the data of ordinary business systems, archival information resource data place very high demands on authenticity and accuracy, so any data in the storage system that look doubtful need to be recorded. This process takes time and consideration, which is especially vital for the outcomes of archive data mining [16].
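A simple way to catch the inconsistencies between stored values and field definitions described above is a field-by-field validation pass. The field names, expected types, and the check itself are illustrative assumptions, not the platform’s actual schema.

```python
# Hypothetical field definitions for an archival record; purely illustrative.
FIELD_DEFINITIONS = {
    "record_id":     str,
    "year":          int,
    "storage_years": int,
}

def check_record(record: dict) -> list:
    """Return a list of consistency problems found in one archival record."""
    problems = []
    for field, expected_type in FIELD_DEFINITIONS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(
                f"{field} has type {type(record[field]).__name__}, "
                f"expected {expected_type.__name__}"
            )
    return problems

# Doubtful records are recorded for later review rather than silently corrected.
print(check_record({"record_id": "r7", "year": "1998"}))
```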

4. Deployment of the Proposed Model Based on Data Mining Algorithm

4.1. Functional Module for Data Mining Models

The functional modules of the archival information resource data mining model primarily comprise a data collection module, a storage module, a data classification module, a data analysis module, and so on. The data collection module covers backend management data collection and user behavior data collection; after data sharing, the public interface interacts with the data regularly, and new data are gathered in the archival information resource sharing platform. To improve the data mining module’s performance, the storage module regularly backs up the archival information resource data stored on the platform and automatically separates historical data from current operational data. The data classification module takes the original data obtained from the archival information resource sharing platform and applies intelligent algorithms to summarize and aggregate them, which is crucial if the source data to be mined are to serve their intended purpose. The data analysis module intelligently extracts value keywords from the classified data, which is where the value of mining archival information resources lies. The specific functional modules are shown in Figure 2.
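The module structure described above can be summarized as a set of interfaces. The following skeleton is only an illustrative reading of Figure 2, with class and method names chosen for the example rather than taken from the platform’s actual code.

```python
from abc import ABC, abstractmethod

class CollectionModule(ABC):
    @abstractmethod
    def collect(self) -> list:
        """Gather backend management data and user behavior data."""

class StorageModule(ABC):
    @abstractmethod
    def backup(self) -> None:
        """Back up platform data and separate historical from current data."""

class ClassificationModule(ABC):
    @abstractmethod
    def classify(self, raw_records: list) -> dict:
        """Summarize and aggregate raw records into mining-ready groups."""

class AnalysisModule(ABC):
    @abstractmethod
    def extract_keywords(self, grouped: dict) -> list:
        """Extract value keywords from the classified data."""
```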

4.2. Data Mining Model Processing Flow

The processing flow of the archival information resource data mining model includes data preparation, data preprocessing, data mining, evaluation, and other processes [17]. The data preparation stage obtains basic information from the archival information resource sharing platform according to user needs. The data preprocessing stage converts the large volume of imperfect, fuzzy, and redundant data collected in the first stage into accurate and effective data; the data and attributes relevant to data mining are selected here, which lays the foundation for mining. The data mining stage can be divided into three parts: determining the mining task and its knowledge type, determining the algorithm, and carrying out the actual data mining with that algorithm. The evaluation stage is dominated by verification of the prediction results; unqualified results are re-mined according to the three steps above until the prediction results meet the requirements, and redundant knowledge in the mining results is deleted.
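The preparation, preprocessing, and mining stages can be strung together as a single pipeline. The sketch below uses scikit-learn, with an unsupervised clustering step standing in for the mining algorithm; the feature values are invented, and the paper does not prescribe this particular toolchain.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical numeric features extracted from archival records (illustrative only).
X = np.array([[12, 2.0], [7, 15.0], [np.nan, 1.0], [3, 40.0], [18, 5.0]])

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # repair imperfect or missing values
    ("scale",  StandardScaler()),                # normalize attribute ranges
    ("mine",   KMeans(n_clusters=2, n_init=10, random_state=0)),  # mining step
])

labels = pipeline.fit_predict(X)
print(labels)
# Evaluation: results that do not meet requirements would be re-mined with
# adjusted parameters, as described in the processing flow above.
```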

4.3. Data Mining Model Sample Preparation

Knowledge discovery algorithms must learn from the raw data of archival information resources: without a sufficient number of examples of the archival information resource big data, data mining models cannot produce the desired predictive models. In this case, edge samples are used to enrich the sample model set and improve the success rate for particular prediction results. Archival information resource data mining models also require a long time series to build, since models based on a short time series risk failing to reflect the true trend information in the final learned knowledge. In practical time-series applications, multiple time-series signals need to be combined in the model to eliminate the impact on trend analysis of the mere passage of time.
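A common way to turn a long time series into enough training samples is sliding-window construction, sketched below on invented monthly utilization counts; the window length and the series are assumptions made for illustration.

```python
import numpy as np

# Hypothetical monthly utilization counts for one archival collection.
series = np.array([40, 42, 45, 50, 48, 55, 60, 58, 63, 70, 72, 75])

def make_windows(values, window=4):
    """Turn a long series into (input window, next value) training samples."""
    samples = []
    for i in range(len(values) - window):
        samples.append((values[i:i + window], values[i + window]))
    return samples

for inputs, target in make_windows(series):
    print(inputs, "->", target)
```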

4.4. Predictive Model Creation

When an archival information resource data mining model is used for prediction, it must specify the time span covered by the model set, as well as the precise time period over which the predictive model interprets the most recent outputs using the past model set. Once deployed in a production environment, a predictive model can continue learning and use continuously updated data to predict the future. If recent periods must themselves be forecast before forecasting further ahead, one possible solution is to skip the most recent data as inputs to the model set.
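One way to realize the idea of skipping the most recent inputs is to lag the input variables by the forecast horizon, so that predicting a period uses only data available sufficiently far in advance. The series, field names, and the two-month horizon below are illustrative assumptions.

```python
import pandas as pd

# Hypothetical monthly series of archive utilization; names are illustrative.
df = pd.DataFrame({
    "month": pd.period_range("2021-01", periods=8, freq="M"),
    "utilization": [40, 42, 45, 50, 48, 55, 60, 58],
})

horizon = 2  # the most recent `horizon` months are skipped as inputs

# Inputs are lagged so that predicting month t only uses data up to t - horizon.
df["input_lagged"] = df["utilization"].shift(horizon)
df["target"] = df["utilization"]
print(df.dropna())
```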

4.5. Profiling Model Set Creation

The difference between the archival information resource profiling model set and the predictive model is that the time frame of the profiling model’s target overlaps with the time frame of its inputs. This difference has a significant impact on modeling because the inputs may bias the target model; a strict selection of the inputs to the profiling model can avoid this problem. A model is a profiling model when the time frame of the target variable matches the time frame of the input variables. Profiling model input variables may introduce unstable mining patterns that can confuse data mining techniques.
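The contrast with the predictive model can be seen in a profiling setup where inputs and target are observed over the same period. The same-period fields below are invented, and the correlation check simply flags inputs that may trivially restate the target.

```python
import pandas as pd

# Hypothetical same-period user attributes and target (illustrative only).
df = pd.DataFrame({
    "downloads_2022":  [3, 25, 1, 14],
    "searches_2022":   [10, 60, 2, 30],
    "heavy_user_2022": [0, 1, 0, 1],  # target observed in the SAME time frame
})

# Because inputs and target share a time frame, inputs that almost restate the
# target should be screened out before modeling.
print(df.corr()["heavy_user_2022"])
```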

5. Experiment and Analysis

The application system of digital archives shall support the classification, cataloging, naming, and storage of electronic documents, fully electronic document archiving and preservation, and assist in the classification and cataloging of paper and other traditional carrier archived documents, which were marked as B-CP and Q-CP. For comparison, L1 regularization was adopted for maximum likelihood estimation in the conditional random field (CRF) model, as shown in Figure 3. The system must be able to assign values to the year, organization, project code, file number, piece number, storage duration, and other classification elements of various types of electronic or traditional carrier archived documents automatically or semiautomatically. It should also adjust their arrangement order to complete the classification and organization of electronic documents or auxiliary traditional carrier archived documents and maintain the correspondence between documents on the same subject.

It must also be able to synchronize electronic and traditional carrier archived materials, issue file numbers automatically based on the classification results, and finish the archiving process. It must be able to create and name folders automatically, level by level, based on each component of the file number, categorize and store electronic files, and name electronic files with the file number [6]. Technical means provided by a third-party authority shall be used to generate solidified information for the original text of the electronic archives at an appropriate time and to provide verification channels. The functional requirements for uploading, hooking, and storing digital copies of traditional carrier archives can be implemented by reference. Figure 3 compares the F1 values of the CRF model and our method on the B-CP, Q-CP, and O labels, and the results are good.
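The CRF comparison in Figure 3 is not specified here in implementation detail; as a rough sketch of what an L1-regularized CRF sequence labeler with B-CP/Q-CP/O labels might look like, the following uses the sklearn-crfsuite package with toy, invented token features (in sklearn-crfsuite, the c1 coefficient controls the L1 penalty).

```python
import sklearn_crfsuite
from sklearn_crfsuite import metrics

# Toy labeled sequences; tokens and labels are invented for illustration.
X_train = [
    [{"word": "project"}, {"word": "code"}, {"word": "2019"}],
    [{"word": "storage"}, {"word": "duration"}, {"word": "30y"}],
]
y_train = [["B-CP", "Q-CP", "O"], ["B-CP", "Q-CP", "O"]]

# c1 > 0 adds an L1 penalty to the maximum likelihood estimation.
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.5, c2=0.0, max_iterations=100)
crf.fit(X_train, y_train)

y_pred = crf.predict(X_train)
print(metrics.flat_f1_score(y_train, y_pred, average="weighted",
                            labels=["B-CP", "Q-CP", "O"]))
```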

5.1. Recognition of the Correlation between Various Pairs of Arguments

In this study, the appropriate teaching concepts were identified using the exam results of 6,000 freshmen from 42 project institutions in China. Table 1 shows the AUC results of different pairs of arguments used in this study.

The system should automatically construct file-level and document-level electronic directories of electronic archives based on the file number and support the description of metadata such as title, responsible person, document number, time, and storage duration for various kinds of electronic archives. It shall also provide the necessary description windows according to the different description requirements of electronic archives; offer drop-down menus, portable input, calendars, time axes, and other automatic description tools for the author; and improve the automation of description. As shown in Tables 1 and 2, it shall be able to automatically verify the integrity, standardization, and effectiveness of description information, prompt modification and correction, and support the addition, modification, and deletion of metadata or catalog data within the scope of authority.

As shown in Figure 4, taking the research literature on the theme of “archives website” included in the CNKI database as the data source, using word frequency statistics and co-word analysis, and applying SPSS factor analysis and cluster analysis to the high-frequency keywords, it is found that research on archives websites in China mainly focuses on seven topics: investigation of the current situation, construction of archives websites in colleges and universities, resource integration and organization, website evaluation, website design and construction, website technology and service, and analysis of and reference to excellent foreign archives websites, together with detailed analysis of each research theme. The aim is to provide references for future research in the field of archives websites in China.
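A co-word analysis like the one described above can also be approximated without SPSS; the sketch below builds a keyword co-occurrence matrix and clusters the keywords with k-means, using invented paper-keyword strings and an arbitrary number of clusters (the actual study reports seven topics).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Hypothetical keyword lists of retrieved papers (illustrative only).
papers = [
    "archives website evaluation service",
    "archives website resource integration organization",
    "university archives website construction",
    "website design construction technology",
]

# Keyword frequency matrix; its transpose product gives the co-word matrix.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(papers)
coword = (X.T @ X).toarray()
np.fill_diagonal(coword, 0)

# Cluster keywords by their co-occurrence profiles to surface research topics.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coword)
for word, label in zip(vectorizer.get_feature_names_out(), kmeans.labels_):
    print(label, word)
```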

6. Conclusions

Data have become increasingly significant in the Internet era; therefore, it is crucial to keep data safe and secure. The progress and development of IT has made deep mining of archive data possible, and data mining entails not just surface-level data analysis but also deep semantic knowledge discovery. The usage of archives has changed dramatically as a result of digitization, and the utilization rate of archives has increased markedly. This study therefore focuses on the application of data mining technology in the archival information resource sharing platform by addressing the development of the data mining model, building on an explanation of data mining concepts. To identify and produce archival information data sources, the functional modules and processing flow of the data mining model are thoroughly analyzed. The prediction model and profiling model set are developed in order to verify the model’s viability, give a more effective scientific solution, and accomplish the objective of comprehensive archival information development [18].

Data Availability

The data used to support the findings of this study are available from the author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This study was supported by the Domestic Visiting Research Project of the Excellent Young Backbone Teachers of Higher Education of Anhui Province (Project no.: GXGNFX2020068); the Key Project of Humanities and Social Science Research of the Anhui Education Department, “Research on Integration and Service of University Archival Resources in Big Data Environment” (Project no.: SK2020A0741); and the Humanities and Social Sciences Planning Fund Project of the Ministry of Education, “Research on the Cost Measurement and Control of Digital Preservation” (Project no.: 20YJA870018), of which this paper is a phase achievement.