Day 2: Infineon’s Embedded Systems Summer Course (2024) part 1
On 06/04/2024, Prakash Sir began with a revision of the previous class and then continued with memory architectures.
Memory architectures are divided into three types:
- Shared memory
- Distributed memory
- Hybrid shared-distributed memory
1. Shared memory is divided into Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA).
Uniform Memory Access means all CPUs have equal memory access latency.
Non-Uniform Memory Access means the access latencies of local and remote memories are different.
2. Distributed Memory - Private address space
Distributed Memory with private address space is a fundamental architecture in parallel computing, commonly deployed in High-Performance Computing (HPC) systems.
In this setup, each processing unit, typically a CPU, possesses its own dedicated memory space. Unlike shared memory architectures where all CPUs can access a common memory pool, in distributed memory systems, CPUs cannot directly access each other’s memory.
This isolation ensures data integrity and security within each processing unit. To facilitate communication and collaboration among CPUs, message passing protocols or interconnect mechanisms are employed.
While this architecture requires explicit data exchange between CPUs, it offers scalability and flexibility in handling large-scale computations.
Moreover, the private address space approach minimizes contention and ensures efficient utilization of memory resources, making it a preferred choice for parallel computing tasks requiring high performance and robustness.
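As a concrete illustration of explicit data exchange between private address spaces, here is a minimal message-passing sketch in C using MPI (the lecture did not name a specific library; MPI is simply a common choice in HPC). Rank 0 cannot write into rank 1's memory directly, so it sends the value as a message:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) {
        value = 42;
        /* Rank 0 cannot reach into rank 1's private memory;
           the data must be sent as an explicit message. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```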
3. Hybrid model
The Hybrid Memory Model combines aspects of both shared and distributed memory architectures, offering a versatile approach to parallel computing. In this model, nodes or processing units have their own private memory space, akin to distributed memory systems, but they can also access a shared memory pool. This shared memory region allows for efficient communication and data sharing between nodes, enhancing collaboration and synchronization in parallel computations. By leveraging the strengths of both shared and distributed memory paradigms, the Hybrid Memory Model accommodates diverse computational workloads, optimizing performance and scalability for a wide range of HPC applications.
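In practice, the hybrid model often appears as message passing between nodes combined with threads inside each node. The short C sketch below is an illustration (not from the lecture): separate MPI ranks keep private address spaces, while OpenMP threads within a rank share that rank's memory.

```c
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

/* Each MPI rank owns a private address space (distributed memory);
   within a rank, OpenMP threads share that rank's memory. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel
    {
        printf("rank %d, thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```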
The memory hierarchy in microcontrollers often consists of multiple levels, each serving different purposes and accessible by different components within the system. At the top of the hierarchy are the CPU cores, which have their separate instruction (I$) and data (D$) caches.
Alongside the caches, cores often have closely coupled instruction and data memories (I-CCM and D-CCM; CCM = Closely Coupled Memory), which are tightly integrated with the CPU cores for fast, predictable access. Additionally, there might be a shared L2 cache serving all CPU cores, providing a larger and more unified cache space for frequently accessed data.
The communication between different components, including CPU cores and caches, is facilitated by a bus or interconnect. Below the L2 cache, there are typically Level-0, Level-1 (possibly NUMA), and Level-2 memories (UMA), providing varying levels of access latency and capacity for storing and accessing data.
This hierarchical arrangement optimizes memory access speed and efficiency in microcontroller systems, catering to the requirements of embedded applications across a diverse range of use cases.
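In embedded firmware, this hierarchy is typically exploited by placing latency-critical data into the closely coupled memories via the linker. The sketch below assumes a GCC-style toolchain and a linker script that defines a `.dccm` output section (the section name is an assumption, not something specified in the course):

```c
#include <stdint.h>

/* Section name ".dccm" is toolchain/linker-script dependent and assumed
   here to map onto the data closely coupled memory region. */
__attribute__((section(".dccm")))
static volatile uint32_t fast_buffer[64];   /* placed in D-CCM */

/* Ordinary globals land in the default RAM region, which sits lower
   in the hierarchy and has higher access latency. */
static uint32_t slow_buffer[64];
```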
Note that CPU-2 can access the CCMs of CPU-1 (and vice versa), because these memories are mapped into the common address space as shared memories.
In this memory mapping configuration, both CPU-A and CPU-B have their own private memories, labeled MEM-A and MEM-B respectively. Additionally, each CPU has its own instruction cache (ICCM) and data cache (DCCM), denoted as ICCM-A, ICCM-B, DCCM-A, and DCCM-B.
The shared memories, accessible to both CPUs, are mapped to specific address ranges in the CPU address space. For example:
- Addresses 0x5000–0x5FFF correspond to ICCM-A.
- Addresses 0x6000–0x6FFF correspond to DCCM-A.
- Addresses 0x7000–0x7FFF correspond to ICCM-B.
- Addresses 0x8000–0x8FFF correspond to DCCM-B.
- Addresses 0x9000–0x9FFF correspond to the shared memories accessible to both CPUs.
When an address in the shared memory range (0x9000–0x9FFF) is accessed, both CPUs reach the same physical shared memory. The private memories MEM-A and MEM-B, by contrast, are not mapped into the other CPU's address space, so CPU-A's private memory cannot be accessed by CPU-B and vice versa, maintaining data integrity and security between the CPUs.
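Using the address ranges from the notes above, a minimal C sketch of how firmware on either CPU might touch the shared region could look as follows (the mailbox offset and macro names are illustrative assumptions):

```c
#include <stdint.h>

/* Address-map constants taken from the notes above. */
#define ICCM_A_BASE   0x5000u  /* 0x5000-0x5FFF: ICCM of CPU-A  */
#define DCCM_A_BASE   0x6000u  /* 0x6000-0x6FFF: DCCM of CPU-A  */
#define ICCM_B_BASE   0x7000u  /* 0x7000-0x7FFF: ICCM of CPU-B  */
#define DCCM_B_BASE   0x8000u  /* 0x8000-0x8FFF: DCCM of CPU-B  */
#define SHARED_BASE   0x9000u  /* 0x9000-0x9FFF: shared memory  */

/* Hypothetical mailbox word in the shared region; both CPUs see the
   same physical location at this address. */
#define MAILBOX (*(volatile uint32_t *)(SHARED_BASE + 0x10u))

void cpu_a_publish(uint32_t value) {
    MAILBOX = value;            /* CPU-A writes to shared memory  */
}

uint32_t cpu_b_consume(void) {
    return MAILBOX;             /* CPU-B reads the same location  */
}
```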
Cache Memories
Cache memory mitigates the performance drawbacks of slow main memory by storing frequently accessed data and nearby data in a smaller, faster memory closer to the CPU. By exploiting principles of temporal and spatial locality, cache memory reduces pipeline stalls, minimizes energy consumption associated with data movement, and enhances overall system performance in computing architectures.
Cache memory operates on the principle of cache line fill, where upon accessing a memory location, not only the requested data but also adjacent data within the same cache line are brought into the cache. This ensures that subsequent accesses to nearby memory locations within the same cache line result in fetches from the cache, reducing latency. Key considerations in cache management include cache policies like Write Through and Write Back, cache line mapping (associativity), eviction and refill policies, as well as ensuring cache coherence to maintain data consistency across multiple caches in a multi-core system.
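Spatial locality is easy to see in code. In the sketch below (assuming, say, 64-byte cache lines and 4-byte ints), the row-major loop reuses each fetched line for many consecutive elements, while the column-major loop strides across lines and tends to miss far more often:

```c
#include <stddef.h>

#define N 1024

/* Row-major traversal touches consecutive addresses, so each cache
   line fill (e.g., 64 bytes = 16 ints) serves several iterations. */
long sum_row_major(const int a[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];       /* good spatial locality */
    return sum;
}

/* Column-major traversal jumps N*sizeof(int) bytes per access, so
   almost every access can miss in a small cache. */
long sum_col_major(const int a[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];       /* poor spatial locality */
    return sum;
}
```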
Cache policies (Write Through, Write Back)
Cache policies, namely Write Through and Write Back, dictate how data updates are managed between the cache and the main memory.
- Write Through: In this policy, whenever the CPU writes data to the cache, the corresponding data in the main memory is also updated immediately. This ensures that the main memory always reflects the latest data. While this approach maintains data consistency, it can lead to increased memory traffic and potentially lower performance due to the frequent updates to main memory.
- Write Back: Unlike Write Through, a Write Back cache delays updating main memory until the cache line is evicted from the cache. When the CPU writes to a location in the cache, the corresponding data in main memory remains unchanged until the cache line is replaced. This strategy reduces memory traffic and can improve performance by batching multiple writes before updating main memory. However, it introduces the risk of data inconsistency between the cache and main memory until the cache line is written back.
Both policies offer trade-offs between performance, consistency, and memory bandwidth usage, and their suitability depends on the specific requirements of the system and the workload characteristics.
Write Through: Used in systems requiring immediate consistency, like banking or real-time control systems.
Write Back: Ideal for general-purpose computing and graphics processing, optimizing performance by delaying updates to main memory.
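To make the difference concrete, here is a deliberately simplified single-line cache model in C (a sketch, not a real controller): the write-through path updates main memory on every store, while the write-back path only marks the line dirty and defers the memory update to eviction.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 32u            /* assumed line size */

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     dirty;              /* only meaningful for write-back */
    uint8_t  data[LINE_SIZE];
} cache_line_t;

extern uint8_t main_memory[];    /* assumed backing store */

/* Write-through: update the line and main memory on every store. */
void write_through(cache_line_t *line, uint32_t addr, uint8_t value) {
    line->data[addr % LINE_SIZE] = value;
    main_memory[addr] = value;   /* memory is always up to date */
}

/* Write-back: update only the line and mark it dirty; main memory
   is written later, when the line is evicted. */
void write_back(cache_line_t *line, uint32_t addr, uint8_t value) {
    line->data[addr % LINE_SIZE] = value;
    line->dirty = true;
}

void evict(cache_line_t *line, uint32_t line_base_addr) {
    if (line->valid && line->dirty) {
        memcpy(&main_memory[line_base_addr], line->data, LINE_SIZE);
        line->dirty = false;     /* memory now matches the cache */
    }
    line->valid = false;
}
```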
Cache line mapping (associativity), Eviction and refill policies
Cache line mapping, eviction, and refill policies are crucial aspects of cache management:
Cache Line Mapping: Determines how cache lines are mapped to specific locations within the cache. Common mappings include direct-mapped, set-associative, and fully associative, each offering different trade-offs between cache size, access latency, and complexity.
Eviction Policies: Dictate which cache line is replaced when the cache is full and a new line needs to be fetched. Popular eviction policies include Least Recently Used (LRU), First In, First Out (FIFO), and Random Replacement. These policies aim to maximize cache hit rates by retaining the most relevant data in the cache.
Refill Policies: Govern how data is fetched from main memory to fill cache lines. Typically, data is fetched either on demand (fetching data only when a cache miss occurs) or proactively (prefetching data into the cache before it’s requested), aiming to minimize cache miss penalties and improve overall system performance.
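For a direct-mapped cache, the mapping step is just arithmetic on the address. The sketch below assumes a toy geometry (64-byte lines, 256 sets) and shows how an address splits into offset, set index, and tag:

```c
#include <stdint.h>

/* Assumed toy geometry: 64-byte lines, 256 sets, direct-mapped. */
#define LINE_SIZE 64u
#define NUM_SETS  256u

typedef struct {
    uint32_t tag;
    uint32_t set;
    uint32_t offset;
} cache_addr_t;

/* Split an address into offset, set index, and tag: the set index
   selects the one line the address may occupy (direct-mapped), and
   the tag disambiguates addresses that share that set. */
cache_addr_t split_address(uint32_t addr) {
    cache_addr_t a;
    a.offset = addr % LINE_SIZE;
    a.set    = (addr / LINE_SIZE) % NUM_SETS;
    a.tag    = addr / (LINE_SIZE * NUM_SETS);
    return a;
}
```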
Cache Coherence
Cache coherence refers to the consistency of data stored in multiple caches in a shared-memory multiprocessor system. In such systems, each processor (or core) typically has its own cache, and when multiple processors access the same memory location, it’s possible for each processor’s cache to have a different copy of the data.
Ensuring cache coherence is essential to prevent data inconsistency and maintain program correctness. There are several protocols and mechanisms to achieve cache coherence:
1. Invalidation-Based Protocols: Invalidation-based protocols operate by invalidating (marking as stale) copies of data in other caches whenever one cache modifies a shared memory location. This ensures that only one copy of the data is considered valid at any given time.
2. Update-Based Protocols: Update-based protocols propagate updates to shared memory locations to all other caches that have a copy of the data. This approach reduces the need for cache invalidations but requires more complex mechanisms to ensure consistency and synchronization.
3. Snooping Protocols: Snooping protocols monitor (snoop) the interconnect between caches for memory access requests. When a cache access is detected that could potentially affect the coherence of shared data, the appropriate actions (such as invalidation or update propagation) are taken to maintain coherence.
4. Directory-Based Coherence: In directory-based coherence, a centralized directory maintains information about which caches have copies of each memory block. When a cache modifies a memory block, it communicates with the directory to ensure coherence by coordinating invalidations or updates to other caches as necessary.
Cache coherence protocols ensure that all processors in a shared-memory system observe a consistent view of memory, despite the presence of multiple caches with potentially conflicting copies of data. Maintaining cache coherence is crucial for the correctness and reliability of parallel and distributed computing systems.
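The lecture does not name a specific protocol, but MESI is a widely used invalidation-based example. The sketch below models only the state of a cache line as seen by one cache when it snoops another core's read or write:

```c
/* MESI states for a single cache line (illustrative sketch only). */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state_t;

/* State of a line in this cache when another core writes the same
   address (a snooped invalidate / read-for-ownership). */
mesi_state_t on_remote_write(mesi_state_t s) {
    (void)s;
    return INVALID;              /* all other copies become stale */
}

/* State of a line in this cache when another core reads the same
   address: a MODIFIED line is written back first, then downgraded. */
mesi_state_t on_remote_read(mesi_state_t s) {
    if (s == MODIFIED || s == EXCLUSIVE)
        return SHARED;           /* downgrade to a shared copy */
    return s;                    /* SHARED and INVALID are unchanged */
}
```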
In a typical microcontroller unit (MCU), various interconnect types are used to facilitate communication between multiple components such as CPUs (masters) and peripherals (slaves). These interconnects can include shared buses and crossbars, with varying degrees of partial or full connectivity.
For instance, in a setup with three masters (M1, M2, M3) and five slaves (S1, S2, S3, S4, S5), connecting every master directly to every slave (as in a full crossbar) would require 3 × 5 = 15 wire bundles. To reduce this complexity and bundle count, a different mechanism is needed.
One solution is to introduce a common wire bundle, which simplifies the interconnection scheme. This bundle connects all CPUs and slaves to an arbiter and decoder, which manage communication between them. The arbiter handles requests from CPUs and grants access to the bus, while the decoder decodes addresses and selects the appropriate slave device. This setup reduces the wire count and complexity, making the interconnect system more manageable and efficient.
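Behaviorally, the decoder and arbiter can be thought of as the two small C functions sketched below (the address windows and the fixed-priority scheme are assumptions for illustration; in hardware this logic is combinational):

```c
#include <stdint.h>

typedef enum { SLAVE_1, SLAVE_2, SLAVE_3, SLAVE_4, SLAVE_5, SLAVE_NONE } slave_t;

/* Decoder: map an address to a slave select. The windows here are
   hypothetical; a real map comes from the SoC memory map. */
slave_t decode(uint32_t addr) {
    if (addr < 0x1000u) return SLAVE_1;
    if (addr < 0x2000u) return SLAVE_2;
    if (addr < 0x3000u) return SLAVE_3;
    if (addr < 0x4000u) return SLAVE_4;
    if (addr < 0x5000u) return SLAVE_5;
    return SLAVE_NONE;           /* unmapped: bus error */
}

/* Arbiter (fixed priority): grant the lowest-numbered requesting master. */
int arbitrate(uint8_t request_bits /* bit i set = master i requests */) {
    for (int m = 0; m < 3; m++)
        if (request_bits & (1u << m))
            return m;
    return -1;                   /* no requests pending */
}
```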
In many microcontroller units (MCUs), a common multi-level bus structure is utilized for interconnection. This structure employs a common wire bundle that connects multiple components, including CPUs, slaves (peripherals), arbiters, and decoders. For instance, one level of the bus may connect CPU-1, CPU-2, and a subset of slower slaves (e.g., Slave-1, Slave-2, Slave-3), managed by an arbiter and decoder. Another level may connect high-speed slaves (e.g., memories) or additional peripherals (e.g., Slave-4, Slave-5, Slave-6, Slave-7, Slave-8) through a separate arbiter and decoder. This multi-level bus architecture helps streamline interconnection, manage communication between various components, and accommodate different types of peripherals with varying speed requirements.
APB (Advanced Peripheral Bus): Low-speed bus for slower peripherals like timers and GPIOs.
AHB (Advanced High-performance Bus): High-speed bus for faster and critical peripherals, memory components, and high-speed communication interfaces.
The choice between using a crossbar and a shared bus depends on various factors such as system requirements, performance goals, and design constraints. Each approach has its advantages and disadvantages:
1. Crossbar:
Advantages: A crossbar offers full connectivity between all masters and slaves, allowing concurrent transactions without contention. It provides high bandwidth and low latency, making it suitable for systems with demanding performance requirements.
Disadvantages: Crossbars can be complex and expensive to implement, especially in large systems with many masters and slaves. They may also consume more power and area compared to simpler interconnects.
2. Shared Bus:
Advantages: Shared buses are simpler and more cost-effective to implement compared to crossbars. They are suitable for systems with fewer masters and slaves or lower performance requirements. Shared buses can also be easier to design and debug.
Disadvantages: Shared buses can become a bottleneck in systems with high traffic or many concurrent transactions. They may introduce contention and latency, especially when multiple masters attempt to access the bus simultaneously.
In scenarios where high performance, low latency, and full connectivity are crucial, a crossbar may be preferred. However, for simpler systems or applications where cost and complexity are primary concerns, a shared bus can offer a more practical solution. Ultimately, the choice between a crossbar and a shared bus depends on the specific requirements and constraints of the system being designed.
Click here for part 2