Research Article

Design and Development of a Big Data Platform for Disease Burden Based on the Spark Engine

Table 1

Technical differences between Spark and Hadoop.

Item | Hadoop | Spark
Type | Basic platform covering computation, storage, and scheduling | Purely a distributed computing tool
Scenario | Batch processing of massive data (disk-based iterative computation) | Batch processing of massive data (in-memory iterative computation, interactive computation); stream computation over massive data
Price | Low | High
Programming paradigm | Map + Reduce; the API is relatively low-level and adapts poorly to algorithms | RDD operations organized as a DAG (directed acyclic graph); the API is high-level and easy to use
Data storage structure | Computation results are written to HDFS on disk, with high latency | Intermediate RDD results are kept in memory, with low latency
Operation mode | Tasks run as processes and start slowly | Tasks run as threads and start quickly; they can be created in batches to improve parallelism
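
The programming-paradigm and data-storage rows of Table 1 can be illustrated with a small, self-contained sketch. It is a pure-Python toy, not Spark's or Hadoop's actual API: `mapreduce_word_count` and `ToyRDD` are hypothetical names introduced here, and the "disk" and "memory" behaviour is only imitated. The first function runs explicit map, shuffle, and reduce phases, while the toy RDD only records a chain of operations (a linear DAG), evaluates it when `collect()` is called, and keeps the result in memory if `cache()` was requested.

```python
from collections import defaultdict


def mapreduce_word_count(lines):
    """Word count in the Map + Reduce style, with three explicit phases."""
    # Map phase: emit a (word, 1) pair for every word.
    mapped = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: group values by key (Hadoop materializes this stage on disk).
    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)
    # Reduce phase: sum the values collected for each key.
    return {word: sum(ones) for word, ones in groups.items()}


class ToyRDD:
    """Hypothetical stand-in for an RDD: records lineage, evaluates lazily."""

    def __init__(self, source=None, parent=None, op=None, name="source"):
        self._source = source
        self._parent = parent
        self._op = op
        self._name = name
        self._keep = False    # set by cache()
        self._result = None   # in-memory result, filled on first collect()

    def _derive(self, name, op):
        # A transformation only adds a node to the lineage; nothing runs yet.
        return ToyRDD(parent=self, op=op, name=name)

    def flat_map(self, fn):
        return self._derive("flat_map", lambda items: [y for x in items for y in fn(x)])

    def map(self, fn):
        return self._derive("map", lambda items: [fn(x) for x in items])

    def reduce_by_key(self, fn):
        def op(items):
            merged = {}
            for key, value in items:
                merged[key] = fn(merged[key], value) if key in merged else value
            return list(merged.items())
        return self._derive("reduce_by_key", op)

    def cache(self):
        # Ask for the computed result to be kept in memory for reuse.
        self._keep = True
        return self

    def lineage(self):
        # Names of the operations from the source up to this node.
        chain = [] if self._parent is None else self._parent.lineage()
        return chain + [self._name]

    def collect(self):
        # Action: walk the lineage and compute, reusing a cached result if any.
        if self._result is not None:
            return self._result
        if self._parent is None:
            data = list(self._source)
        else:
            data = self._op(self._parent.collect())
        if self._keep:
            self._result = data
        return data


lines = ["spark hadoop spark", "hadoop spark"]
counts = (
    ToyRDD(lines)
    .flat_map(str.split)
    .map(lambda word: (word, 1))
    .reduce_by_key(lambda a, b: a + b)
    .cache()
)
print(mapreduce_word_count(lines))
print(counts.lineage())
print(dict(counts.collect()))
```

Both versions produce the same counts. The difference the sketch aims to show is structural: the chained form describes the whole job before anything executes, and a repeated `collect()` on a cached node reuses the in-memory result instead of recomputing it, which is the property Table 1 credits for Spark's lower latency in iterative workloads.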