Abstract

In recent years, the increasing disparity between cache access speed and processor speed has become a major bottleneck in achieving high-performance 2-dimensional (2D) data processing, such as that in scientific computing and image processing. To solve this problem, this paper proposes a new dual unit tile/line access cache memory based on a hierarchical hybrid Z-ordering data layout and a multibank cache organization supporting a skewed storage scheme. The proposed layout improves 2D data locality and efficiently reduces L1 cache misses and Translation Lookaside Buffer (TLB) misses; it is generated from the conventional raster layout by a simple hardware-based address translation unit. In addition, we propose an aligned tile set replacement algorithm (ATSRA) to reduce the hardware overhead of the tag memory in the proposed cache. Simulation results using Matrix Multiplication (MM) show that the proposed cache with parallel unit tile/line accessibility considerably reduces both L1 cache misses and TLB misses compared with the conventional raster layout and the Z-Morton order layout. The number of parallel load instructions for parallel unit tile/line access was reduced to only about one-fourth of the number of conventional load instructions, and the execution time with parallel load instructions was reduced to about one-third of that required with conventional load instructions. Using 40 nm Complementary Metal-Oxide-Semiconductor (CMOS) technology, we combined the proposed cache with a SIMD-based data path and designed a 5 × 5 mm2 Large-Scale Integration (LSI) chip. With the ATSRA method, the entire hardware scale of the proposed ATSRA-cache was suppressed to only 105% of that of a conventional cache.

1. Introduction

Although computer performance has improved considerably year by year, the demand for highly efficient 2D data processing, such as image processing, video coding, and scientific computing, continues to grow. Improving both processor speed and memory access speed is key to improving the performance of 2D data processing. Currently, processing speed can be improved relatively easily by using Single Instruction Multiple Data (SIMD) extensions and multicore technology [1, 2]. However, efficient column- and row-directional data access for SIMD parallel processing has not yet been achieved, because conventional cache memory based on raster layouts (row- and column-major layouts) does not adequately support operations on data with 2D spatial reference locality. For example, TLB misses and conflict misses often occur in large-scale MM because the cache memory can neither hold all the data needed for the program's execution nor exploit 2D data locality in the column direction. The raster layout is also ill suited to image processing because it does not provide access to 2D localized block data and causes excessive data transfers. Therefore, inefficient column-directional cache access has been a major bottleneck to achieving high-performance 2D data processing with extended SIMD instructions.

Conventional tiling reduces cache capacity misses by dividing the loop iteration space into smaller tiles; reuse of array elements within each tile is maximized by ensuring that the working set of the tile fits into the data cache. However, 2D data processing with SIMD operations is restricted by the lack of a column-directional parallel access function, so it cannot avoid the performance degradation caused by frequent TLB misses and conflict misses. In recent years, recursive data layouts with tiling techniques for multidimensional (2D) arrays have been proposed to reduce column-directional conflict misses and improve 2D data locality regardless of the data array size [3–5]. However, conventional software-based tiling and recursive data layouts suffer from a significant execution time overhead, because the loop partitioning required for tiling and the address calculations required for recursive data layouts are complex [6–11].

The recursive data layout with tiling can be implemented in hardware to minimize the address calculation overhead. For example, mapping data onto Dynamic Random Access Memory (DRAM) using a recursive data layout has been investigated by several researchers [12, 13]. A 3D-stacked DRAM with a hardware address translation unit has been proposed to improve the performance of multidimensional data applications, such as matrix transposition and cube rotation. However, it is difficult to mass-produce this DRAM at low cost because of its special specifications; thus the cost of main memory becomes significantly high. In addition, a cache-based hardware tiling method has been proposed to reduce cache capacity misses and minimize the address calculation overhead [14, 15]. Compared with the DRAM-based tiling method, a low hardware scale overhead can be attained for the cache implementation. However, these previous studies focus mainly on eliminating the address calculation overhead and do not provide parallel row- and column-directional access capability.

To solve these problems, we present a new parallel unit tile/line access cache memory in which data in the column direction, as well as data in the row direction, can be accessed in parallel. The proposed cache architecture is based on a hierarchical hybrid Z-ordering data layout to improve 2D data locality and a multibank cache organization supporting a skewed storage scheme to provide a parallel unit tile/line access function.

This paper makes the following contributions as compared with our previous work [16]:
(i) We present a cache memory with 8 × 8 byte unit tile and 64-byte unit line accessibility. Parallel unit tile/line access, corresponding to row-/column-directional access, can exploit 2D data locality, minimize TLB misses, and considerably reduce the number of required load instructions.
(ii) Based on the conventional Z-Morton order layout and the conventional Morton hybrid order layout, we present a hierarchical hybrid Z-ordering data layout and show that the proposed layout provides substantial 2D data locality, minimizes TLB misses, exploits hardware prefetching, and avoids allocation of unused regions in virtual memory.
(iii) We propose an aligned tile set replacement algorithm (ATSRA) to reduce the hardware overhead of the tag memory in the proposed cache. The ATSRA method considerably reduces the entire hardware scale of the proposed cache, reduces the LRU memory area, and simplifies the cache architecture without degrading cache performance.
(iv) We implement the proposed cache in hardware and evaluate the hardware scale overhead of the ATSRA-cache. Compared with the conventional cache, the entire hardware scale overhead of the ATSRA-cache is only 4% for a 2-way set associative cache and 5% for an 8-way set associative cache.
(v) We use a modified version of the SimpleScalar simulator to evaluate the performance of parallel unit tile/line access for MM in terms of the numbers of L1 cache, L2 cache, and TLB misses and the execution time (cycles). The evaluation results reveal that the ATSRA-cache and the Non-ATSRA-cache incur the fewest cache and TLB misses compared with the conventional cache.
(vi) We evaluate the reduction of load instructions (SIMD processing) achieved by parallel unit tile/line access. The results indicate that the number of load instructions for unit tile/line access is reduced to about one-fourth of that required for conventional load instructions. The execution time (number of cycles) for unit tile/line access is reduced to about one-third of that required for raster line access, owing to the effect of L1 and L2 cache misses.

2. Recursive Data Layouts

In computing, the raster layout (row- and column-major layout) is a method for storing multidimensional arrays in contiguous memory locations. However, the raster layout has poor locality for some data access patterns (2D data) because vertically adjacent data are stored far apart. Conflict misses and TLB misses therefore occur frequently for 2D data access. The raster layout also cannot provide efficient access to 2D localized block data. For example, in motion estimation, the typical cache line size is too long to load the maximum 16 × 16 pixel reference block efficiently.

2.1. Z-Morton Order Layout

In recent years, recursive data layouts have been proposed as replacements for row- and column-major layouts. The most popular among them is the Z-Morton order layout [17, 18], shown in Figure 1. The Z-Morton order maps multidimensional data onto one-dimensional memory while preserving data locality. Figure 1 shows that the Z-Morton order divides the original 16 × 16 matrix into four quadrants (I, II, III, and IV) and lays out these submatrices in memory in Z-order (row-major). Each quadrant is recursively divided into four subquadrants in the same Z-order (row-major) until the recursion terminates at individual elements.
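As a concrete illustration, the Z-Morton offset of an element can be obtained by interleaving the bits of its row and column indices. The short C sketch below (our own example with a hypothetical function name, assuming a square power-of-two matrix) shows this bit interleaving, with the row bit placed above the column bit at every level so that quadrants are visited in the row-major Z-order described above.

#include <stdint.h>

/* Interleave the bits of the row (r) and column (c) indices so that
 * column bits occupy the even positions and row bits the odd ones.
 * The result is the element's offset in Z-Morton order.            */
static uint64_t morton_index(uint32_t r, uint32_t c)
{
    uint64_t z = 0;
    for (int b = 0; b < 32; b++) {
        z |= ((uint64_t)(c >> b) & 1u) << (2 * b);      /* column bit */
        z |= ((uint64_t)(r >> b) & 1u) << (2 * b + 1);  /* row bit    */
    }
    return z;
}

/* Example: element (r, c) of an N x N matrix of doubles stored in
 * Z-Morton order is found at base[morton_index(r, c)].             */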

The Z-Morton order layout has the following features:
(i) The Z-Morton order layout can take better advantage of cache locality at all levels, from the L1 cache to paging. Not only can cache conflict misses and TLB misses be reduced efficiently, but unused regions are also not allocated in virtual memory if the quadrant size of the Z-Morton order layout matches the page size.
(ii) The Z-Morton order provides substantial 2D spatial reference locality when traversed in the row and column directions. Given a cache with any power-of-two cache line size, the cache hit rate of row-directional access is the same as that of column-directional access. For row- and column-directional data access, the theoretical cache hit rate is 1 - (1/√L), where L is the number of elements per cache line.

Tables 1 and 2 show the theoretical hit rates for row-directional access and column-directional access, respectively. Compared with the raster layout, the Z-Morton order provides the same cache hit rate for both row- and column-directional access. However, this hit rate is low because the Z-Morton order provides neither cache-line-sized data access nor column-directional parallel access capability.

2.2. Morton Hybrid Order Layout

The Z-Morton order layout can also be applied only to the higher-level quadrants, while the lowest-level quadrants retain a row- or column-major layout; this is the Morton hybrid order layout [19, 20]. Figure 2 shows a Morton hybrid order layout that divides the original 16 × 16 matrix into equally sized quadrants. Elements are stored in column-major order within the lowest-level quadrants. Outside the lowest-level quadrants, the higher-level quadrants are stored in Z-Morton ordering. This provides 2D data locality inside each quadrant with fast address calculation among different quadrants.
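The corresponding address calculation can be sketched as follows (our own illustration with hypothetical names, assuming a square power-of-two matrix and a Q × Q lowest-level quadrant): the quadrant is located by the same bit interleaving as the Z-Morton order, while the offset inside the quadrant is a plain column-major offset.

#include <stdint.h>

#define Q 16  /* assumed lowest-level quadrant dimension (elements) */

/* Bit interleaving: column bits in even positions, row bits in odd ones. */
static uint64_t interleave_bits(uint32_t r, uint32_t c)
{
    uint64_t z = 0;
    for (int b = 0; b < 32; b++) {
        z |= ((uint64_t)(c >> b) & 1u) << (2 * b);
        z |= ((uint64_t)(r >> b) & 1u) << (2 * b + 1);
    }
    return z;
}

/* Morton hybrid offset of element (r, c):
 * Z-Morton order among Q x Q quadrants, column-major inside a quadrant. */
static uint64_t morton_hybrid_index(uint32_t r, uint32_t c)
{
    uint64_t quadrant = interleave_bits(r / Q, c / Q);   /* which quadrant       */
    uint64_t inner    = (uint64_t)(c % Q) * Q + (r % Q); /* column-major offset  */
    return quadrant * (uint64_t)(Q * Q) + inner;
}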

The Morton hybrid order layout is designed to take advantage of the hardware architecture and compiler optimizations while still exploiting the benefits of a hierarchical tiling data layout. To reduce conflict and TLB misses, the lowest-level quadrant size is typically chosen to be equal to the cache size or the page size. To take advantage of Intel libraries and compiler optimizations on current processors, such as the BLAS technique, the lowest-level quadrant of a hierarchical tiling data layout must be in row- or column-major order. Therefore, the BLAS optimization can be combined with the Morton hybrid order layout to improve cache performance.

A related work [19] applied the Morton hybrid order layout to dense BLAS routines and significantly improved the performance of dense matrix multiplication. The evaluation results also showed that the best lowest-level quadrant size depends not only on the memory architecture but also on the BLAS implementation. This is one disadvantage of the Morton hybrid order layout: whereas the performance of the Z-Morton order layout depends only on the memory architecture, the Morton hybrid order layout requires significant cache testing to achieve the best performance for a given memory architecture and BLAS implementation.

3. Hierarchical Hybrid Z-Ordering Data Layout

Parallel data access in both the row (unit line) and column (unit tile) directions, as well as the reduction of cache misses and TLB misses, is essential for high-performance 2D data processing. Although the Z-Morton order and the Morton hybrid order reduce cache misses and TLB misses efficiently, they do not provide parallel row- and column-directional accessibility. In addition, the conventional Z-Morton order and Morton hybrid order have the following problems:
(i) To provide parallel unit tile access, the memory must be divided into several two-byte-wide banks corresponding to 2-byte sublines, because Z-Morton ordering takes a 2D array stored in row-major order and reorders it as a 2 × 2 tile array in which the items of each tile are stored in row-major order. This division increases the area of the cache memory.
(ii) The Z-Morton order layout and the Morton hybrid order layout are not suitable for parallel row or column data access, because the cascaded subline addresses are noncontiguous in both the row and column directions.
(iii) The Morton hybrid order requires significant cache testing to achieve high performance for a given architecture and BLAS implementation of MM.

To eliminate these problems, we propose a new data layout, based on the conventional Z-Morton order layout and the Morton hybrid order layout, called the hierarchical hybrid Z-ordering data layout. Data in the proposed layout are recursively subdivided into equally sized tiles, and the subdivision stops at large tiles with a size of T (8-byte word × 32). As shown in Figure 3, the size of a large tile is equal to 4 KB because modern computer systems normally support a 4 KB page size. The division of the address space into small and large tiles, whose internal data tags are equal to one another, matches the 2D spatial reference locality and minimizes not only TLB updates and misses but also excessive data transfers between the main memory and auxiliary storage. Each conventional 64-byte cache line in a large tile is handled as an 8 × 8-byte unit tile. Unit tile access corresponds to parallel access of contiguous double words in the column direction. As shown in Figure 4, row-major order is adopted inside the unit tiles and large tiles: a large tile composed of 8 × 8 unit tiles stores data in row-major order. The numbers in the unit tile represent 8-byte data that are accessed contiguously in the column direction. In addition, data outside the large tiles are stored in Z-Morton ordering among the large tiles. With the proposed layout, all large tiles are mapped onto contiguous memory locations, and the allocation of unused regions in virtual memory can be significantly reduced.
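The overall mapping can be sketched in a few lines of C (our own illustration, not the hardware address translation unit described in Section 4; it assumes a matrix of 8-byte words whose dimensions are multiples of the large-tile shape, and uses the 16-row × 32-word large-tile shape chosen for the MM example later in this section): elements keep a row-major offset inside a 4 KB large tile, and the large tiles themselves are ordered by Z-Morton interleaving of their tile coordinates.

#include <stdint.h>

/* Assumed large-tile shape: 16 rows x 32 eight-byte words = 4 KB (one page). */
#define TILE_ROWS 16
#define TILE_COLS 32

/* Z-Morton interleaving of the large-tile coordinates
 * (column bits in even positions, row bits in odd positions).               */
static uint64_t morton(uint32_t tr, uint32_t tc)
{
    uint64_t z = 0;
    for (int b = 0; b < 32; b++) {
        z |= ((uint64_t)(tc >> b) & 1u) << (2 * b);
        z |= ((uint64_t)(tr >> b) & 1u) << (2 * b + 1);
    }
    return z;
}

/* Word offset of element (r, c) under the hierarchical hybrid Z-ordering
 * layout: row-major inside a 4 KB large tile, Z-Morton among large tiles.   */
static uint64_t hhz_word_index(uint32_t r, uint32_t c)
{
    uint64_t tile  = morton(r / TILE_ROWS, c / TILE_COLS);
    uint64_t inner = (uint64_t)(r % TILE_ROWS) * TILE_COLS + (c % TILE_COLS);
    return tile * (TILE_ROWS * TILE_COLS) + inner;
}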

The hierarchical hybrid Z-ordering data layout can also exploit constant-stride hardware prefetching and cache tiling, as well as the locality of the Z-Morton order layout [21]. Athanasaki [22] showed that prefetching, combined with cache tiling and a block data layout whose optimal tile size equals the L1 cache size, can decrease L2 and TLB misses. However, that approach can exploit prefetching only for row-directional access. Compared with this previous research, the proposed hierarchical hybrid Z-ordering data layout has three main features:
(i) It can exploit constant-stride prefetching in both row- and column-directional access.
(ii) It can be used for 2D data processing because all levels of tiles are mapped onto contiguous memory locations. Data in each tile are accessed in row-major order, so prefetching can be fully exploited.
(iii) The large tile size is equal to the cache way size and the page size, so prefetching is effective for column-directional access within the same large tile. As a result, the probability of crossing a 4 KB page boundary is less than 1/32 during column-directional contiguous access, and TLB misses can be minimized in processing that uses cache tiling techniques.

For example, consider the tiled MM shown in Listing 1. Arrays A[i][k] and B[k][j] can be loaded by constant-stride prefetching in the row direction and the column direction, respectively. The tile size is 32 × 32. We define the size of a large tile to be 256 × 16 bytes because the MM accesses data in the row direction in blocks of 32 × 8 bytes (256 bytes).

int main()
{
    double A[N][N], B[N][N], C[N][N];   /* N: matrix dimension */
    int i, j, k, ii, jj, kk;

    for (ii = 0; ii < N; ii += 32)
        for (jj = 0; jj < N; jj += 32)
            for (kk = 0; kk < N; kk += 32)
                for (i = ii; (i < N && i < ii + 32); i++)
                    for (j = jj; (j < N && j < jj + 32); j++)
                        for (k = kk; (k < N && k < kk + 32); k++)
                            C[i][j] += A[i][k] * B[k][j];
    return 0;
}

By using constant-stride prefetching, only one miss occurs in the row-directional access (32 accesses) of array A[i][k]. In addition, two misses occur in the column-directional access (32 accesses) of array B[k][j] because there are only 16 rows in each large tile: the column-directional access crosses the boundary of the large tile, which causes additional cache misses and TLB misses. Therefore, the total cache hit rate is 1 - 3/64 = 95.3%. In terms of TLB performance, column-directional access causes one TLB miss per 16 accesses and row-directional access causes one TLB miss per 32 accesses. Therefore, the TLB hit rate for column-directional access is 1 - 1/16 = 93.8%, and that for row-directional access is 1 - 1/32 = 96.9%. Table 3 summarizes the theoretical cache and TLB hit rates.

For the raster layout, row-directional access causes only 1 miss per 32 accesses if sequential unit-stride prefetching is used. Column-directional access causes 32 misses per 32 accesses because each stride access crosses a 4 KB page boundary. As a result, hardware prefetching cannot be used, and the cache hit rate is only 1 - 33/64 = 48.4%. Table 3 shows the theoretical hit rates at each level of the memory hierarchy for MM using double-precision floating-point operations.

We define the size of a large tile to be 128 × 32 bytes for another MM that uses single-precision floating-point operations, because this MM accesses data in the row direction in blocks of 32 × 4 bytes (128 bytes). By using constant-stride prefetching, only one miss occurs in the row-directional access (32 accesses) of array A[i][k]. In addition, only one miss occurs in the column-directional access (32 accesses) of B[k][j] because there are 32 rows in each large tile. Therefore, the total cache hit rate is 1 - 2/64 = 96.9%. In terms of TLB performance, column-directional access causes one TLB miss per 32 accesses and row-directional access causes one TLB miss per 16 accesses. Therefore, the TLB hit rate for column-directional access is 1 - 1/32 = 96.9%, and that for row-directional access is 1 - 1/16 = 93.8%. Table 4 summarizes the theoretical cache and TLB hit rates.

For the raster layout, row-directional access causes only 1 miss per 32 accesses if sequential unit-stride prefetching is used. Column-directional access causes 32 misses per 32 accesses because each stride access crosses a 4 KB page boundary. As a result, hardware prefetching cannot be used, and the cache hit rate is only 1 - 33/64 = 48.4%. Table 4 shows the theoretical cache and TLB hit rates.

4. Proposed Cache Memory with Unit Tile/Line Accessibility

The proposed hierarchical hybrid Z-ordering data layout minimizes TLB misses, exploits hardware prefetching, and improves column-directional data locality. To provide parallel row- and column-directional accessibility, we propose a cache memory that supports access to an 8 × 8-byte unit tile or a 64-byte unit line on top of the hierarchical hybrid Z-ordering data layout. Parallel unit tile/line access improves instruction throughput with low latency overhead in the cache memory. By using SIMD extensions, the number of load instructions required for column-directional access (8 × 8-byte unit tile access) or row-directional access (64-byte unit line access) can be reduced to only one-eighth of that required for conventional raster line access.
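To make the load-count argument concrete, the sketch below contrasts the two cases for one column of eight doubles; unit_tile_load is a purely hypothetical stand-in for the proposed parallel load instruction (here emulated in software so the sketch is self-contained), not an existing intrinsic.

#include <stddef.h>

/* Conventional raster cache: one scalar load per column element,
 * i.e., 8 load instructions to gather a column of 8 doubles.          */
static void column_load_scalar(double dst[8], const double *B, int ldb, int r, int c)
{
    for (int i = 0; i < 8; i++)
        dst[i] = B[(size_t)(r + i) * ldb + c];
}

/* Hypothetical parallel load: with the proposed cache, the 8 x 8-byte
 * unit tile containing element (r, c) is delivered by ONE load
 * instruction. Here we merely emulate its effect in software.          */
static void unit_tile_load(double dst[8], const double *B, int ldb, int r, int c)
{
    column_load_scalar(dst, B, ldb, r, c);   /* software emulation only */
}

With SIMD extensions, the same eight-to-one reduction applies to row-directional unit line access relative to scalar raster access.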

4.1. Parallel Unit Tile/Line Data Access Mechanism

Parallel unit tile access is also effective in reducing excessive data transfers and exploiting 2D data locality, since image processing applications operate on small blocks of 2D data [23]. However, unit line access, i.e., conventional cache line access, remains essential for current applications. Therefore, we propose a dual unit tile/line access cache memory based on an 8-bank memory array structure supporting a skewed storage scheme, as shown in Figure 5. The memory banks are defined as tag-bank0~tag-bank7 for the tag memory and data-bank0~data-bank7 for the data memory. A unit tile loaded from main memory is composed of multiple sublines, and each subline is stored in a separate memory bank. However, if some of the sublines are stored in the same bank, neither a unit tile nor a unit line can be accessed in parallel [16, 18].

To eliminate this problem, we adopt a skewed array scheme when storing a unit tile or a unit line into the cache memory. The 8-bank memory array structure with the skewed storage scheme in Figure 6 provides parallel unit tile/line access. For example, the first, second, third, and fourth rows are stored with additional skews of 0, 1, 2, and 3, respectively. Since the sublines are skewed, additional operations are needed to align the elements after they have been loaded into the processor registers: the sublines in the Nth row generally need to be rotated (N-1) positions to the left to be aligned. An N-bank memory structure with a skewed storage scheme increases the entire hardware scale and the transfer delay, because each memory bank needs an address generation circuit. The proposed hardware-based address translation unit is shown in Figure 7. Since the translation circuit consists of single-stage 2-to-1 multiplexers that select either the hierarchical hybrid Z-ordering address or the conventional address, neither the access cycle time nor the latency increases.
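The bank selection and the realignment step can be modeled as follows (a minimal software sketch under our own assumptions, not the RTL of the proposed cache: 8-byte sublines, 8 banks, and a skew equal to the row index modulo 8; the realignment shown is for unit line access).

#include <stdint.h>

#define NUM_BANKS 8

/* Skewed storage: the subline of row r, column block c is placed in
 * bank (c + r) mod 8, so the 8 sublines of a unit tile (one column
 * block over 8 consecutive rows) all land in different banks, and so
 * do the 8 sublines of a unit line (8 column blocks of one row).      */
static int bank_of(int r, int c_block)
{
    return (c_block + r) % NUM_BANKS;
}

/* After a unit line of row r is read from the 8 banks in parallel,
 * the data arrives rotated by r positions; rotate it back left by r
 * so the sublines appear in their natural column order.               */
static void realign_unit_line(uint64_t subline[NUM_BANKS], int r)
{
    uint64_t tmp[NUM_BANKS];
    for (int i = 0; i < NUM_BANKS; i++)
        tmp[i] = subline[(i + r) % NUM_BANKS];   /* rotate left by r */
    for (int i = 0; i < NUM_BANKS; i++)
        subline[i] = tmp[i];
}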

4.2. Multi-TLB Organization for Unaligned Unit Tile/Line Access

The proposed cache provides aligned/unaligned unit tile/line accessibility. Unaligned unit tile access means that the accessed column-directionally adjacent unit tiles have different tag values, as shown in Figure 8. Unaligned unit line access means that the accessed unit line crosses a large tile boundary in the row direction, also shown in Figure 8.

To access the TLB in parallel for unaligned unit tile/line accesses, the proposed cache adopts a multi-TLB organization, as shown in Figure 5. TLB1 performs address translation when the tag value of the unit tile is an odd number, and TLB0 performs address translation when the tag value is an even number.
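The dispatch rule can be modeled in a few lines (an illustrative sketch; tlb_t and tlb_lookup are hypothetical placeholders, not an existing API).

#include <stdint.h>

typedef struct tlb tlb_t;                               /* opaque TLB state (hypothetical)  */
extern uint64_t tlb_lookup(tlb_t *tlb, uint64_t tag);   /* hypothetical lookup routine      */

/* Dispatch an address translation to TLB0 or TLB1 by the parity of the
 * unit tile's tag value, so the two translations of an unaligned unit
 * tile/line access can proceed in parallel.                             */
static uint64_t translate(tlb_t *tlb0, tlb_t *tlb1, uint64_t tag)
{
    return (tag & 1u) ? tlb_lookup(tlb1, tag)   /* odd tag  -> TLB1 */
                      : tlb_lookup(tlb0, tag);  /* even tag -> TLB0 */
}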

4.3. Hardware Scale Optimization of the Proposed Cache

As shown in Figure 5, the capacity of each tag memory bank is only 8 words because the tag memory of the proposed cache is divided into 8 banks. For VLSI hardware design, each tag memory bank is composed of one SRAM (Static Random Access Memory) macro. However, the entire hardware scale of the proposed cache increases when the required bank capacity is smaller than 32 words, because the conventional SRAM macro size is commonly 32 words [17].

Here, we propose the ATSRA method to reduce the entire hardware scale overhead of the proposed cache. The ATSRA method focuses mainly on reducing the tag memory capacity, and it also simplifies the proposed cache architecture. The proposed ATSRA method has the following features:
(i) The proposed cache selects the aligned tile set that has been unused for the longest time for replacement, according to the least recently used (LRU) replacement policy.
(ii) If a unit tile or unit line miss occurs, the processor loads an aligned tile set from the lower-level cache or the main memory. An aligned tile set contains eight aligned tiles, and every tile of an aligned tile set must be stored in the same cache way set.
(iii) If a unit tile or unit line hit occurs, the cache updates the LRU bits of the aligned tile set. Therefore, the LRU memory capacity of the proposed ATSRA-cache can be reduced to only one-eighth of that required for the proposed Non-ATSRA-cache.

Figure 9 shows the structure of a large tile. Each aligned tile set is composed of eight aligned unit tiles. All the unit tiles in an aligned tile set have the same tag value and the same high-order 3 bits of the index field. If a cache miss occurs on a unit tile or unit line access, the lower-level cache or main memory transfers the corresponding aligned tile set to the cache or the processor, and all the aligned unit tiles are stored in the same cache set. In addition, the cache updates the LRU information per aligned tile set when a unit tile or unit line hit occurs. With this method, it is not necessary to read the tag values from eight tag memory banks, so the proposed cache can provide unit tile/line accessibility with only two tag memory banks. The capacity of each tag bank of the proposed cache is thereby increased to 32 words, as shown in Figure 10, and the LRU memory capacity is reduced from 64 words to only 8 words. Figure 11 shows the proposed ATSRA-cache architecture, which adopts a two-bank tag memory structure.
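The LRU bookkeeping implied by ATSRA can be modeled as follows (an illustrative software sketch for the 2-way, 64-set configuration; all names are ours): because the eight unit tiles of an aligned tile set share one LRU entry, hits and victim selection index the LRU memory by the aligned-tile-set number, i.e., the high-order 3 bits of the set index, which shrinks the LRU memory from 64 to 8 entries.

#include <stdint.h>

#define NUM_SETS       64   /* 2-way, 8 KB, 64-byte lines -> 64 sets        */
#define SETS_PER_ATS    8   /* an aligned tile set spans 8 cache sets       */
#define NUM_ATS (NUM_SETS / SETS_PER_ATS)   /* 8 LRU entries instead of 64  */

/* One LRU entry per aligned tile set (a single bit suffices for 2 ways). */
static uint8_t lru[NUM_ATS];

/* Aligned-tile-set number: the high-order 3 bits of the set index. */
static int ats_of(int set_index)
{
    return set_index / SETS_PER_ATS;
}

/* On a unit tile/line hit in 'way', mark that way most recently used
 * for the whole aligned tile set.                                      */
static void atsra_on_hit(int set_index, int way)
{
    lru[ats_of(set_index)] = (uint8_t)way;     /* MRU way of the group  */
}

/* On a miss, choose the victim way for the whole aligned tile set; the
 * fill then stores all eight unit tiles of the incoming aligned tile
 * set into this way.                                                    */
static int atsra_victim(int set_index)
{
    return 1 - lru[ats_of(set_index)];         /* LRU way in a 2-way cache */
}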

5. Hardware Implementation

In the previous section, we showed that the proposed cache can minimize TLB and conflict misses and provide parallel unit tile/line accessibility. Here, we combined the proposed 2-way set associative 8 KB cache with a 64-byte cache line and an SIMD-based data path [24] and fabricated an LSI test chip to show the feasibility of the proposed cache. We measure the proposed cache in terms of three main aspects: hardware scale overhead, clock frequency, and read latency. The logic design is synthesized from SystemVerilog with Synopsys Design Compiler under 40 nm CMOS technology in the chip fabrication program of the VLSI Design and Education Center (VDEC).

There are two versions of the proposed cache: a 2-way set associative 8 KB cache and an 8-way set associative 32 KB cache. The cache line size is 64 bytes. Both versions adopt a 3-stage structure composed of a first tiling-address-conversion stage, a second tag and data memory access stage for hit/miss determination, and a third way-selection stage for loading the unit tile/line; thus the latency is 3 cycles, equal to that of the L1 data cache in recent Intel and Advanced RISC Machine (ARM) high-performance processors, as shown in Table 6. The clock period of the SRAM macro we use is 3.43 ns, and the clock period of the three types of cache memory is suppressed to 3.9 ns. Therefore, the evaluation results show that the clock period of the proposed cache with the dual data access mode does not increase compared with that of the conventional cache.

5.1. Hardware Scale Overhead and Speed Performance Evaluation

The chip layout of the 5 × 5 mm2 die and the detailed specifications are shown in Figure 12 and Table 5, respectively; Table 5 also gives the area breakdown by module. The proposed cache memory is composed of 96 SRAM macros for the data and tag memories, which account for 51% and 10% of the total layout area, respectively. The peripheral circuits of the proposed cache are small, representing only 7% of the total layout area. As a result, the proposed cache occupies 68% of the total layout area.

Tables 7 and 8 show the evaluation results for the hardware scale overhead. The conventional cache adopts a two-tag-bank structure for aligned/unaligned row-directional data access. The proposed Non-ATSRA-cache adopts an eight-tag-bank structure for row-/column-directional data access, as shown in Figure 5, whereas the proposed ATSRA-cache adopts a two-tag-bank structure. The peripheral circuit scale of the proposed cache increases to two times that of the conventional cache. As a result, the entire hardware scale overhead of the proposed Non-ATSRA-cache (2-way) is suppressed to 25%. With the ATSRA method, the tag memory area of the proposed cache does not increase compared with that of the conventional cache. Furthermore, the LRU memory area of the proposed ATSRA-cache is reduced to only 1/8 of that required for the conventional cache. As a result, the entire hardware scale of the proposed ATSRA-cache (2-way) is only 1.04 times that of the conventional cache, and that of the proposed ATSRA-cache (8-way) is only 1.05 times that of the conventional cache.

5.2. Critical Path for Loading

The latency of the proposed cache is 3 cycles. The critical path delays of the proposed caches are as follows: (1) the time to convert a raster scan order address into a hierarchical hybrid Z-ordering address; (2) the time to access the SRAM macros (tag memory and data memory); (3) the time to select data from the 2 ways and deliver the result to the processor.

As shown in Figure 13, in the first cycle, the address converter performs the Z-ordering address translation and generates multiple addresses by bit-order interchange and addition operations. Each address is sent to the corresponding cache memory bank for loading a unit tile or unit line. Since the address translator consists of multiple adder circuits and single-stage 2-to-1 multiplexers that select either the hierarchical hybrid Z-ordering address or the raster address, it does not increase the access cycle time. In addition, the tile/line way selection for line access consists of eight single-stage 8-to-1 multiplexers that select the sublines from the cache, so it does not cause a long delay. The longest delay is the tag and data memory access time. The minimum clock period of the proposed cache is suppressed to 3.9 ns.

6. Parallel Unit Tile/Line Access Evaluation

We implement the parallel unit tile/line access function in the SimpleScalar simulator [25–28] and propose a unit tile/line access evaluation method to evaluate the ATSRA-cache performance for MM. Because the SimpleScalar simulator does not support SIMD instructions, we instead use double-precision accesses corresponding to unit tile or unit line accesses for SIMD parallel processing.

We simulate a common 4-way issue superscalar processor configuration with the memory hierarchy shown in Table 9. We set the sizes of the L1 and L2 caches in accordance with the corresponding cache sizes of a current Intel processor and measure the L1 cache, L2 cache, and TLB misses and the execution time (cycles). We compare the proposed hierarchical hybrid Z-ordering data layout against the row-major layout and the Z-Morton order layout. We use a 6-loop tiled MM with ijk order, as shown in Listing 1. The tile size is set to 32 × 32 because this size allows all three arrays of the MM to fit in the L1 cache: one tile of each array contains 32 × 32 elements and occupies 32 × 32 × 8 bytes, and since the MM uses 3 arrays (A, B, C), the total is 3 × 32 × 32 × 8 = 24,576 bytes < 32 Kbytes.

6.1. Evaluation of Parallel Unit Tile/Line Access

The MM algorithm is shown in Listing 1, where the numbers of row- and column-directional accesses are equal to each other. We create all three matrices A, B, and C in C using malloc() to guarantee that all elements of each matrix are contiguously allocated in the memory space. If the read address is within the address range of matrix A, the unit line access mode is selected, because accesses to matrix A are row-directional. If the read address is within the address range of matrix B, the unit tile access mode is selected, because accesses to matrix B are column-directional. Figures 14–17 show the evaluation results for parallel unit tile/line access.
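The address-range test can be expressed as a small helper (an illustrative sketch with hypothetical names; the actual selection is performed inside the modified simulator).

#include <stddef.h>

enum access_mode { UNIT_LINE, UNIT_TILE, RASTER };

/* Select the access mode for a load address: matrix A is traversed in
 * the row direction (unit line access), matrix B in the column
 * direction (unit tile access); anything else falls back to raster.    */
static enum access_mode select_mode(const void *addr,
                                    const double *A, const double *B,
                                    size_t elems)
{
    const char *p = (const char *)addr;
    if (p >= (const char *)A && p < (const char *)(A + elems))
        return UNIT_LINE;   /* row-directional access to A    */
    if (p >= (const char *)B && p < (const char *)(B + elems))
        return UNIT_TILE;   /* column-directional access to B */
    return RASTER;
}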

Figure 14 shows the number of DL1 cache misses. In Listing 1, matrix A[i][k] is accessed in row-major order, and matrix B[k][j] is accessed in column-major order. Therefore, the column-directional access to matrix B[k][j] causes frequent conflict misses and TLB misses, especially for power-of-two-sized matrices whose rows exceed the 4 KB page size. This is because the conventional cache stores data in row-major order and therefore cannot exploit 2D data locality well. For non-power-of-two-sized matrices, the proposed cache and the conventional cache achieve similar performance because the conventional cache does not suffer from significant conflict misses when the matrix size is not a power of two. In addition, because the large tile size is 256 bytes × 16, the number of conflict misses for row-directional access is reduced compared with the Z-Morton order layout.

Figure 15 shows the number of UL2 cache misses; the raster scan order and the proposed layouts are almost equal to each other for non-power-of-two-sized matrices. Figure 16 shows the TLB performance. The hierarchical hybrid Z-ordering data layout provides the best TLB performance because the large tile size matches the page size and the number of TLB misses for row-directional access is low. Therefore, the proposed cache provides column-directional parallel access that is as efficient as row-directional parallel access, enabling efficient SIMD operation that requires no transposition in MM.

Figure 17 shows the overall speedup of each configuration normalized to the raster scan order configuration. In all cases, the proposed ATSRA-cache provides performance similar to that of the Z-Morton order layout. In addition, for non-power-of-two-sized matrices, the Z-Morton order and the hierarchical hybrid Z-ordering achieve execution time performance similar to that of the raster scan order, even though they suffer from a slightly higher number of DL1 cache misses, because their TLB performance is better than that of the raster scan order in all cases.

6.2. Evaluation of Reduction to Load Instructions

Because the SimpleScalar simulator does not support SIMD instructions, we revise the method of calculating execution cycles for load instructions and evaluate the effectiveness of column-directional parallel access. As explained in Section 4, the number of load instructions required for parallel unit tile/line access can be reduced, by using SIMD extensions, to only one-eighth of that required for conventional raster line access without SIMD extensions, because a single load instruction can read an 8 × 8-byte unit tile or a 64-byte unit line from the proposed cache. We carried out simulations to verify that the proposed cache can efficiently reduce the number of load instructions through parallel unit tile/line access.

Only the frequency of the reduced load instructions for unit tile/line access is counted in the simulation. Parallel load instructions are used for unit tile/line access, whereas for conventional raster line access, parallel load instructions can only be used for row-directional data access. Therefore, the number of load instructions for column-directional data access is reduced to one-eighth of that required for conventional raster line access, while the number of load instructions for unit line access is equal to that required for conventional raster line access (row-directional data access). The evaluation results are shown in Figure 18. In all cases, the number of load instructions required for parallel unit tile/line access is reduced to about one-fourth of that required for conventional raster line access. As a result, the proposed cache with unit tile/line accessibility can considerably improve the effective transfer rate of load instructions.

We also evaluate the improvement in execution time (cycles) for parallel unit tile/line access using parallel load instructions. We calculate the execution time with the following equation: execution time = (number of load instructions × load throughput) + (number of DL1 cache misses × DL1 cache miss penalty) + (number of UL2 cache misses × UL2 cache miss penalty). As shown in Table 9, the DL1 cache miss penalty is 6 cycles and the UL2 cache miss penalty is 30 cycles, and we assume a load throughput of 1 cycle. The equation for the execution time with conventional load instructions is given in (1), and that for parallel load instructions is given in (2).
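Equations (1)–(3) are not reproduced in this text; based on the description above, they presumably take the following form (our reconstruction, where N_load, N_DL1, and N_UL2 denote the numbers of load instructions, DL1 cache misses, and UL2 cache misses for the conventional (conv) and parallel (par) load cases):

Execution time_conv = N_load,conv × 1 + N_DL1,conv × 6 + N_UL2,conv × 30    (1)
Execution time_par  = N_load,par × 1 + N_DL1,par × 6 + N_UL2,par × 30       (2)
Improvement = Execution time_conv / Execution time_par                      (3)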

We evaluate the performance improvement by using (3). The evaluation results are shown in Table 10. Although the number of parallel load instructions for parallel unit tile/line access is only about one-fourth of the number of conventional load instructions, the performance of the proposed cache is also affected by the DL1 and UL2 cache performance. As a result, the execution time with parallel load instructions is about one-third of that required with conventional load instructions.

7. Conclusion

The main achievement and contribution of this paper is a new dual unit tile/line access cache memory that improves the performance of column-directional cache memory access. We proposed a hierarchical hybrid Z-ordering data layout that improves 2D data locality and exploits hardware prefetching well. In addition, we proposed the ATSRA tag memory reduction method, which minimizes the hardware scale of the proposed cache. The experimental results showed that the proposed cache achieves parallel unit tile/line accessibility with a minimal hardware increase while outperforming the previous studies in access efficiency.

Finally, we modified the SimpleScalar simulator to evaluate the performance of parallel unit tile/line access and drew the following conclusions: (1) While providing column-directional parallel access capability, the proposed cache suppresses the increase in conflict misses for power-of-two-sized matrix computations compared with the raster layout and the Z-Morton order layout. As a result, the proposed cache enables an almost one-cycle load of 8 double-precision data in both the row and column directions, which not only makes matrix transposition unnecessary but also allows effective utilization of 2D reference locality by SIMD operations. (2) The proposed ATSRA-cache provides high-performance parallel unit tile/line access with a minimal hardware increase over the normal DL1 cache structure. (3) The number of parallel load instructions required for parallel unit tile/line access is reduced to about one-fourth of that required for conventional raster line access. The proposed cache with tile/line accessibility can further improve the performance of 2D applications that use SIMD instructions.

In the future, we will combine the proposed cache with a SIMD processor and evaluate the unit tile/line access performance for parallel data access.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

This work was supported by a Grant-in-Aid for Scientific Research (C), Grant Number 24500059, and by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Synopsys, Inc.