Performance Awakens

The Performance of New Processors Based on the Soft Machines™ VISC™ Architecture

Soft Machines™ has invented a new way to improve performance scaling. We make it possible for a single software thread/task to run on multiple cores using existing instruction sets. We refer to this architecture as VISC™. Each operating system thread sees one “virtual core” that is composed of multiple physical cores, resulting in higher performance and lower power. In October 2014, Soft Machines demonstrated a prototype VISC Processor and SoC in 28nm silicon proving that VISC delivers a 2X-3X IPC speedup over existing processors. The architectural performance simulation models we have developed for Shasta, Shasta+ and Tahoe have been correlated and proven using the 28nm VISC processor. In October 2015, we debuted the first VISC processor, code name Shasta, and today, for the first time, we are showing the projected performance of a new generation of VISC processors compared with today’s fastest, most energy-efficient mobile processors. We run today’s code—x86 and ARM, for instance—on our processors using software and hardware conversion technologies that make it possible for current and future applications to benefit.

Performance Innovation

The graphs below compare upcoming VISC processors to the ARM A72, Intel Skylake and Apple A9X, the latest and most powerful processors in the industry. We carefully measured their power and performance, and then extrapolated the results as processor frequency scales up and down. Leveraging proven methodologies from the VISC prototype, we simulate the performance of the Soft Machines VISC multicore processors Shasta/Shasta+/Tahoe so that our customers and developers can plan for upcoming devices and applications. All processors’ performance and power are baselined to the advanced 16nm FinFET process technology. The same compiler settings are used for all platforms for an apples-to-apples comparison of performance. GCC 4.9 is used for the A72, Skylake and VISC platforms, and Clang for the A9X.

[Figure: energy graph, scaled energy (y-axis) versus SPEC2006 score (x-axis)]

Let’s examine the energy graph. On the x-axis is the SPEC2006 score, which is a geo-mean of the individual integer and floating point tests. The y-axis is the scaled energy at each operating point, so the graph shows the energy used at each performance point.

Looking at the left side of the graph, you will notice that the A72, A9X and Intel Skylake processors are all on a similar performance scaling curve. This is because traditional CPU microarchitecture techniques have been locked in a state of refinement, and have generally exhausted the methods available to improve on today’s approach. Differences in performance are primarily driven by the width of the machines, the pipeline depth and their transistor count. Bigger processors have more performance, consume more power and are seen on the higher end of the same curve. The ARM A72 is an energy-efficient design that operates at the bottom end of the curve below 1 watt, whereas the Apple A9X generally operates in a higher power range but fundamentally sits on the extension of the A72’s curve, in the middle portion of the same general curve. The Intel Skylake processors enable substantial performance and a full range of productivity applications, but are out of the range of today’s mobile computing platforms, which must depend upon compact and passive cooling solutions and must operate for extended periods using compact batteries.

VISC delivers an average of 2-4X performance/watt advantage due to its fundamental breakthrough of using dynamic resource scaling. VISC uses all of the cores already in the chip to improve performance, with the advantage of increasing power only linearly. To gain more performance, today’s RISC and CISC cores have to scale up frequency and voltage; because dynamic power is proportional to frequency times voltage squared, this has a cubic effect on power, which translates to more energy.

You can see that the VISC processor family uses anywhere from ~2x to 7x less energy depending on where you look on the graph. We like to simplify this performance/watt advantage to ~2-4x, to be conservative.
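The linear-versus-cubic scaling argument above can be illustrated with a minimal sketch. It assumes the textbook dynamic-power relation (power proportional to frequency times voltage squared), voltage rising in proportion to frequency, and perfectly linear speedup from additional cores. It ignores leakage and every other real-world effect, the function names are hypothetical, and it is not Soft Machines' simulation model.

```python
def power_from_frequency_scaling(speedup):
    """Relative dynamic power when all of the speedup comes from clock frequency.

    Assumption: P ~ C * V^2 * f with V raised in proportion to f, so P ~ f^3.
    """
    return speedup ** 3


def power_from_core_scaling(speedup):
    """Relative power when the speedup comes from using more cores at a fixed
    frequency and voltage. Idealized assumption: power grows linearly."""
    return speedup


def relative_energy(relative_power, speedup):
    """Energy for a fixed amount of work = power * time = power / speedup."""
    return relative_power / speedup


if __name__ == "__main__":
    for s in (1.0, 1.5, 2.0):
        p_freq = power_from_frequency_scaling(s)
        p_core = power_from_core_scaling(s)
        print(
            f"speedup {s:.1f}x: "
            f"frequency scaling power {p_freq:.2f}x, "
            f"energy {relative_energy(p_freq, s):.2f}x | "
            f"core scaling power {p_core:.2f}x, "
            f"energy {relative_energy(p_core, s):.2f}x"
        )
```

Under these idealized assumptions, doubling performance through frequency costs eight times the power and four times the energy per unit of work, while doubling it through additional cores costs twice the power at unchanged energy.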

[Figure: performance/watt graph]

Now let’s compare the processors in the performance/watt graph in two different ways. One way is to look at the performance of various processors all at the same power level (note the horizontal dashed lines). The dual-core Shasta processor shows a 160-180% performance advantage over ARM A72 at the same power levels of 1.5-3.0 watts when running a single-instance SPEC2006. In the spirit of full disclosure, we had to extend the A72’s curve to reach the 1.5-3 watt range, so it is shown in a dashed line for higher-end implementations beyond mobile. If we didn’t extend the curve, then Shasta would be 150% faster at 1 watt. The second way to compare processors is to look at a fixed performance point and see how much power each processor requires (note the vertical dashed lines). Shasta can realize a 3-5x advantage in power when comparing the same performance levels, as shown by the dashed lines. The Tahoe processor allows one virtual core to run on four physical cores, and has a >6x power advantage over a 2.8GHz Skylake at the same performance level. Again, we like to simplify the chart’s 1.5-6x performance/watt advantage down to a simple 2-4x average window.
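The two ways of reading the graph, at equal power and at equal performance, can be sketched as follows. The operating points are hypothetical placeholders chosen only to make the arithmetic easy to follow; they are not the measured or simulated values behind the graphs, and the curve and function names are illustrative assumptions.

```python
from bisect import bisect_left


def interpolate(xs, ys, x):
    """Piecewise-linear interpolation; xs must be strictly increasing."""
    if not xs[0] <= x <= xs[-1]:
        raise ValueError("x is outside the range covered by the curve")
    i = bisect_left(xs, x)
    if xs[i] == x:
        return ys[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# HYPOTHETICAL operating points (benchmark score, watts); not real data.
BASELINE = {"score": [10.0, 20.0, 30.0], "watts": [1.0, 3.0, 9.0]}
CANDIDATE = {"score": [20.0, 40.0, 60.0], "watts": [1.0, 2.0, 4.0]}


def score_at_power(curve, watts):
    """Read a curve horizontally: what score does it reach at this power?"""
    return interpolate(curve["watts"], curve["score"], watts)


def power_at_score(curve, score):
    """Read a curve vertically: how much power does this score cost?"""
    return interpolate(curve["score"], curve["watts"], score)


def performance_ratio_at_power(base, cand, watts):
    """Same-power comparison (the horizontal dashed lines)."""
    return score_at_power(cand, watts) / score_at_power(base, watts)


def power_ratio_at_score(base, cand, score):
    """Same-performance comparison (the vertical dashed lines)."""
    return power_at_score(base, score) / power_at_score(cand, score)


if __name__ == "__main__":
    print(performance_ratio_at_power(BASELINE, CANDIDATE, 2.0))
    print(power_ratio_at_score(BASELINE, CANDIDATE, 20.0))
```

The two readings generally give different ratios for the same pair of curves, which is why the text quotes separate figures for the same-power and same-performance comparisons.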

How VISC™ Works

The evolution of computing is driven by improving the performance of computer processors and interfaces. For processors to run faster, we increase the speed with which each task performs, and break big tasks into many parallel subtasks and reassemble the results at the end. Due to maximum power limitations, modern processors use multiple cores, but ultimately, if each core isn’t getting faster, performance plateaus and computing evolution slows down.

VISC processors allocate compute resources dynamically across the physical cores “under the hood.” The hardware automatically breaks each software thread into multiple hardware threadlets, which are then managed by a virtual core. The virtual core uses the entire pool of physical processors, and dynamically assigns the exact amount of execution resources required by the threadlet, providing near-perfect optimization for both throughput and latency performance. Less demanding threads automatically get a smaller amount of execution resources, and when a demanding thread is present, it immediately gets more resources. By handling this scheduling at the hardware level inside the processor, these resources are made available at cycle-level granularity, enabling ideal balancing for mixed workloads in real-world systems without requiring complex software multithreading or OS scheduler intervention.
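As a purely conceptual illustration of demand-driven allocation, the toy model below hands out a fixed pool of execution units each cycle, one unit at a time in round-robin order, until every thread has what it asked for or the pool is empty. The actual VISC hardware mechanism is not described in this document, so the policy, the `allocate` function and the thread names are all hypothetical.

```python
from typing import Dict


def allocate(total_units: int, demands: Dict[str, int]) -> Dict[str, int]:
    """Toy per-cycle allocator (hypothetical policy, not the VISC hardware).

    Grants units one at a time in round-robin order, so light threads are
    fully served and heavy threads absorb whatever capacity is left.
    """
    grants = {name: 0 for name in demands}
    pending = {name for name, demand in demands.items() if demand > 0}
    remaining = total_units
    while remaining > 0 and pending:
        # sorted() makes a copy, so removing from `pending` here is safe.
        for name in sorted(pending):
            if remaining == 0:
                break
            grants[name] += 1
            remaining -= 1
            if grants[name] == demands[name]:
                pending.discard(name)
    return grants


if __name__ == "__main__":
    # Cycle 1: the heavy thread wants more than the pool can offer.
    print(allocate(8, {"light": 2, "heavy": 10}))
    # Cycle 2: demand drops, so both threads get exactly what they need.
    print(allocate(8, {"light": 2, "heavy": 3}))
```

In this sketch the less demanding thread never receives more than it requests, and the demanding thread picks up the spare capacity as soon as it is free, which mirrors the behavior described above at a very coarse level.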

[Figure: dual software threads (SMI_Dual_SW_Threads-04)]

Computing Innovation

Soft Machines is developing platform approaches for emerging client and server devices.

Emerging consumer and client computing applications in mobile virtual and augmented reality create the most challenging workloads we have seen, while demanding wearable, free-moving, low-power form factors. Virtual reality is emerging as the most disruptive innovation in modern gaming, but it demands mobile performance well beyond current capabilities.

Today’s traditional clients are also evolving. Next-generation game consoles will need to power totally immersive virtual reality experiences. Next-generation tablets will completely replace our laptops once they are capable of content creation. Together, these trends are forming a new landscape of client computing, one that demands desktop performance in mobile form factors. Gaming and simulation are notoriously demanding and among the most challenging applications to run across multiple cores, and early VR head-mounted displays (HMDs) are still tethered to powerful desktop systems. VISC is well-positioned to exceed today’s architectures, powering the next generation of client computing.

Server architectures continue to evolve to meet diverse workloads across storage, cloud and enterprise, while also driving down power to reduce the total cost of ownership. Each segment presents distinct requirements that VISC is well positioned to meet.

Cloud servers are responsible for providing dedicated Web application sessions to millions of users. These sessions represent many concurrent requests that present themselves as multiple software threads requiring high-throughput performance. VISC’s unique ability to dynamically allocate the exact compute resources required by each software thread on a cycle-by-cycle basis provides near perfect load balancing and enables highly optimized throughput performance for user responsiveness and low power consumption.

Traditional enterprise servers typically run corporate database applications, which usually require the maximum single-thread performance possible. CPU vendors have spent billions of dollars and many thousands of engineer-years chasing the “holy grail” of improved single-thread performance, but have not made any major architectural advancements since the late 1990s. VISC’s ability to use all of the available processor cores to execute a single thread is the first major processor innovation in years. VISC delivers a 2-3x speedup in instructions per cycle over the A72 and Skylake processors, resulting in unprecedented single-thread performance and making it ideal for enterprise servers.

VISC will bring server performance to mobile devices. It can enable entirely new kinds of computing devices. It can power extremely small, affordable computing devices, improving connectivity and collaboration for emerging markets and applications such as VR, autonomous systems, high-end client and cloud. VISC completes this era of computing by reconciling the need for single-threaded performance with the trend toward multicore processor design.