Research Article
Design and Development of a Big Data Platform for Disease Burden Based on the Spark Engine
Table 1
Technical differences between Spark and Hadoop.
| | Hadoop | Spark |
| --- | --- | --- |
| Type | Basic platform, including computation, storage, and scheduling | Purely a distributed computing tool |
| Scenario | Batch processing of massive data (disk-based iterative computation) | Batch processing of massive data (in-memory iterative computation, interactive computation) and stream computation on massive data |
| Price | Low | High |
| Programming paradigm | Map + Reduce; the API is relatively low-level, and algorithm adaptability is poor | RDD operations organized as a directed acyclic graph (DAG); the API is high-level and easy to use |
| Data storage structure | Computation results are written to disk on HDFS, with high latency | Intermediate RDD results are kept in memory, with low latency |
| Operation mode | Tasks are maintained as processes, and task startup is slow | Tasks are maintained as threads, with fast task startup, and can be created in batches to improve parallelism |
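The contrast in Table 1 between disk-based and in-memory iteration can be illustrated with a small toy model. The sketch below is plain Python and does not use Spark's API: the `LazyDataset` class and its methods are hypothetical names introduced only for illustration. It assumes a simplified picture in which transformations are merely recorded as a lineage (a linear stand-in for the DAG), computation happens only when an action is called, and a cached intermediate result is reused from memory instead of being recomputed.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, compute, lineage):
        self._compute = compute    # zero-argument function that produces a list
        self.lineage = lineage     # names of the recorded steps, in order
        self._use_cache = False
        self._cached = None
        self.evaluations = 0       # how many times this step was actually computed

    @classmethod
    def from_list(cls, data):
        items = list(data)
        return cls(lambda: list(items), ["source"])

    def map(self, fn):
        # Transformation: only recorded here, nothing is computed yet.
        return LazyDataset(
            lambda: [fn(x) for x in self._materialize()],
            self.lineage + ["map"],
        )

    def filter(self, pred):
        # Transformation: only recorded here, nothing is computed yet.
        return LazyDataset(
            lambda: [x for x in self._materialize() if pred(x)],
            self.lineage + ["filter"],
        )

    def cache(self):
        # Ask for this intermediate result to be kept in memory once computed.
        self._use_cache = True
        return self

    def _materialize(self):
        if self._use_cache and self._cached is not None:
            return self._cached    # reuse the in-memory result
        self.evaluations += 1
        result = self._compute()
        if self._use_cache:
            self._cached = result
        return result

    def collect(self):
        # Action: triggers evaluation of the recorded lineage.
        return list(self._materialize())

    def sum(self):
        # Action: triggers evaluation of the recorded lineage.
        return sum(self._materialize())


records = LazyDataset.from_list(range(1, 11))
squares = records.map(lambda x: x * x).cache()
evens = squares.filter(lambda x: x % 2 == 0)

total = evens.sum()            # first action: computes and caches `squares`
count = len(evens.collect())   # second action: reuses the cached `squares`
```

After both actions, `squares.evaluations` is 1: the second action reads the intermediate result from memory, which is the behavior the "Data storage structure" row attributes to Spark. A disk-based pipeline in the MapReduce style would instead write each stage's output to storage and read it back for the next stage.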