How to solve the resource synchronization problems of priority inversion and deadlock?
* Critical section pattern
pros: avoids both problems
cons: high cost, since all other tasks are delayed while preemption is disabled
* Priority inheritance
pros: avoids priority inversion and is simple
cons: deadlock and chain blocking can still happen
chain blocking: J1 needs S1 and S2, but S1 is held by J2 and S2 by J3, with priorities P1 > P2 > P3. J1 must therefore wait for both J2 and J3 to finish their critical sections.
* Highest locker pattern
One priority ceiling is defined for each resource at system design time. The basic idea is that the task owning a resource runs at the highest priority ceiling of all the resources it currently owns, provided that it is blocking one or more higher-priority tasks. In this way, chain blocking is avoided (see the POSIX sketch after this list).
pros: avoids chain blocking
cons: deadlock still exists
* Priority ceiling pattern
The idea is to ensure that when a job J preempts the critical section of another job and executes its own critical section, the priority at which this new critical section executes is guaranteed to be higher than both the inherited priorities (so it can run) and the ceiling priorities (so it can continue) of all the preempted critical sections in the system. The difference from the highest locker pattern is that the job is not immediately assigned the priority ceiling of the locked resource.
pros: avoids both problems
cons: high cost
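For reference, POSIX threads expose both protocols on mutexes: PTHREAD_PRIO_INHERIT gives priority inheritance, and PTHREAD_PRIO_PROTECT raises the owner to a fixed ceiling, which is close to the highest locker idea (PTHREAD_PRIO_PROTECT raises the priority immediately on locking, not only while a higher-priority task is blocked). A minimal sketch, assuming a platform that implements these options; the ceiling value 10 is arbitrary.

/* Configure one mutex with priority inheritance and one with a priority
   ceiling (highest locker). */
#define _XOPEN_SOURCE 700
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m_inherit, m_ceiling;

    pthread_mutexattr_init(&attr);

    /* Priority inheritance: the owner is boosted to the priority of the
       highest-priority task blocked on this mutex. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    pthread_mutex_init(&m_inherit, &attr);

    /* Priority ceiling / highest locker: the owner runs at the ceiling
       assigned to the resource at design time. */
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);
    pthread_mutexattr_setprioceiling(&attr, 10);   /* arbitrary ceiling */
    pthread_mutex_init(&m_ceiling, &attr);

    puts("mutexes configured");

    pthread_mutex_destroy(&m_inherit);
    pthread_mutex_destroy(&m_ceiling);
    pthread_mutexattr_destroy(&attr);
    return 0;
}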
Sunday, June 22, 2008
Memory Patterns
* Static allocation pattern
pros: deterministic allocation/deallocation time, no memory fragmentation, easy to maintain.
cons: long initialization time and a large memory supply required; inflexible.
* Dynamic allocation pattern
pros: flexible; memory use can be sized at run time to actual needs.
cons: non-deterministic allocation/deallocation time and memory fragmentation; the pointers to allocated memory are hard to maintain.
* Pool allocation pattern
pros: more flexible than static allocation for satisfying dynamic requirements; nearly deterministic allocation/deallocation time (though the pool can run out) and no memory fragmentation (see the sketch after this list).
cons: the number of objects in the pool must be tuned per system for the best performance.
* Fixed sized buffer pattern
pros: no memory fragmentation, since it always allocates the worst-case memory requirement.
cons: wastes memory on average; this can be improved by managing several heaps of different fixed sizes.
* Smart ptr pattern
* Garbage collection pattern
* Garbage compactor pattern
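As a concrete illustration of the pool allocation pattern above, here is a minimal fixed-size block pool in C: a statically reserved array of blocks threaded onto a free list, giving O(1) allocation/deallocation and no fragmentation. The names (pool_init, pool_alloc, pool_free) and the sizes are illustrative only, not any particular RTOS API.

#include <stddef.h>
#include <stdio.h>

#define BLOCK_SIZE  64
#define NUM_BLOCKS  16

typedef union block {
    union block  *next;                /* link, valid while the block is free */
    unsigned char data[BLOCK_SIZE];    /* payload while the block is in use   */
} block_t;

static block_t  pool[NUM_BLOCKS];      /* statically reserved backing store */
static block_t *free_list;

static void pool_init(void)
{
    for (size_t i = 0; i + 1 < NUM_BLOCKS; ++i)
        pool[i].next = &pool[i + 1];
    pool[NUM_BLOCKS - 1].next = NULL;
    free_list = &pool[0];
}

static void *pool_alloc(void)
{
    block_t *b = free_list;
    if (b)                             /* the pool can run out: caller must check */
        free_list = b->next;
    return b;
}

static void pool_free(void *p)
{
    block_t *b = p;
    b->next = free_list;
    free_list = b;
}

int main(void)
{
    pool_init();
    void *a = pool_alloc();
    void *b = pool_alloc();
    printf("allocated %p and %p\n", a, b);
    pool_free(b);
    pool_free(a);
    return 0;
}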
Thursday, January 3, 2008
Multiprocessing and Multithread
Multiprocessing: multiple CPUs, e.g. CMP, SMP, etc.
Multithreading: one CPU, usually with extra hardware support for threads, such as multiple register files and program counters.
Superscalar: hardware-supported dynamic instruction issue and branch prediction
VLIW: software (compiler) supported multiple-instruction issue and branch prediction
ccNUMA SMP: symmetric multiprocessing. The OS can run on any of the CPUs, with shared resources protected by locks. "cc" means cache coherent, and each CPU has its own local memory banks (NUMA).
Saturday, September 29, 2007
Board Support Package
* Two Steps
- Basic init
1) Disable interrupts and cache
2) Init memory controller and cache
3) Init stack register and other registers
4) Init UART for debugging
This step is generally implemented in assembly.
- Board init
1) Clear memory
2) Init interrupts
3) Init timers
4) Init other hardware
5) Init RTOS
This step runs C code.
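A minimal sketch of what this C stage could look like; all hw_* and rtos_* functions are hypothetical stubs standing in for vendor- or RTOS-specific calls, and bss_region merely stands in for the real .bss section.

#include <stdint.h>
#include <string.h>

static uint8_t bss_region[1024];                  /* stand-in for the real .bss */

static void hw_init_interrupt_controller(void) { /* vendor-specific */ }
static void hw_init_timers(void)               { /* vendor-specific */ }
static void hw_init_peripherals(void)          { /* vendor-specific */ }
static void rtos_start(void)                   { /* hands control to the RTOS */ }

void board_init(void)
{
    memset(bss_region, 0, sizeof bss_region);     /* 1) clear memory        */
    hw_init_interrupt_controller();               /* 2) init interrupts     */
    hw_init_timers();                             /* 3) init timers         */
    hw_init_peripherals();                        /* 4) init other hardware */
    rtos_start();                                 /* 5) init/start the RTOS */
}

/* On a real target the reset/startup code, not main, would call board_init. */
int main(void)
{
    board_init();
    return 0;
}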
Tuesday, August 28, 2007
I/O Subsystem
* I/O Subsystem
- Interfaces between a device and the main processor occur in two ways: port mapped and memory mapped.
- I/O devices are classified as either character-mode devices or block-mode devices. The classification refers to how the device handles data transfer with the system.
- I/O devices can be active (issuing interrupts periodically or aperiodically) or passive (no interrupts; the CPU must poll to read).
- DMA controllers allow data transfers that bypass the main processor.
- I/O Modes
1) Interrupt-driven I/O (active)
An input buffer is filled by DMA at device interrupt time (Ready) and emptied by the processes that read the device, typically triggered by the DMA-complete interrupt; an output buffer is filled by processes that write to the device and emptied by DMA at device interrupt time (Ready). This mode can overload the CPU if I/O traffic is heavy.
2) Polling in real-time (passive)
On each periodic timer interrupt, the CPU polls every I/O device.
- Task Assignments for I/O
1) ISR-associated tasks for active I/O devices
2) Timer-triggered tasks for polling passive I/O devices
3) Resource control task to control a shared I/O device or a group of devices
4) Request dispatcher that distributes requests from devices to multiple tasks
- The I/O subsystem is closely associated with ISRs.
- I/O subsystems must be flexible enough to handle a wide range of I/O devices, which hides device peculiarities from applications.
- The I/O subsystem maintains a driver table that associates uniform I/O calls, such as create, delete, read, write, ioctl, etc., with driver-specific I/O routines (see the sketch after this list).
- The I/O subsystem maintains a device table and forms an association between this table and the driver table.
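A minimal sketch of such a driver table: a struct of function pointers per driver, so that the uniform I/O calls dispatch to driver-specific routines. The field names and the uart stubs are illustrative, not a real RTOS API.

#include <stddef.h>
#include <stdio.h>

typedef struct io_driver {
    const char *name;
    int    (*create)(const char *name);
    int    (*destroy)(int dev);
    size_t (*read)(int dev, void *buf, size_t len);
    size_t (*write)(int dev, const void *buf, size_t len);
    int    (*ioctl)(int dev, int cmd, void *arg);
} io_driver_t;

/* driver-specific routines (stubs here) */
static size_t uart_read(int dev, void *buf, size_t len)
{
    (void)dev; (void)buf;
    return len;                          /* pretend len bytes were read */
}

static size_t uart_write(int dev, const void *buf, size_t len)
{
    (void)dev; (void)buf;
    return len;                          /* pretend len bytes were written */
}

/* one entry per driver; the device table maps device handles to rows here */
static const io_driver_t driver_table[] = {
    { .name = "uart", .read = uart_read, .write = uart_write },
};

int main(void)
{
    char c = 'x';
    size_t n = driver_table[0].write(0, &c, 1);   /* uniform call -> uart_write */
    printf("wrote %zu byte(s) via the %s driver\n", n, driver_table[0].name);
    return 0;
}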
Exception and Interrupt
* Classification of General Exceptions
- Async non-maskable
- Async maskable
- Sync precise: the PC points to the exact instruction that caused the exception
- Sync imprecise
* Priority
- Async non-maskable
- Sync precise
- Sync imprecise
- Async maskable
- Programmable Task
* Processing General Exceptions
- Install exception handlers
This requires replacing the appropriate vector table entry (indexed by the IRQ) with the address of the desired ESR or ISR. Install handlers when the device is actually used, rather than at initialization, so that the scarce interrupt resources can be shared.
- Exception frame or interrupt stack
The main reasons for needing an exception frame are 1) to handle nested exceptions, and 2) the portion of the ISR written in C/C++ needs a stack on which to pass function parameters and to invoke library functions. Do not use the task stack for the exception frame, because it can cause a stack overflow that is very hard to debug (it depends on which task was running, which interrupt fired, its frequency and timing, etc.).
- Three ways to mask interrupts (no effect on non-maskable interrupts)
1) Disable the device
2) Disable the interrupts of the same or lower priority levels
3) Disable global system-wide interrupts
They might be used because:
1) the ISR tries to reduce the total number of interrupts raised by the device,
2) the ISR is non-reentrant,
3) the ISR needs to perform some atomic operations
Masking is done by setting the interrupt mask register, which is saved at the beginning of the interrupt handler and restored at the end. An ISR can therefore prevent other ISRs from running, and must save and restore the status register, whereas exception handlers do not have this ability.
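A minimal sketch of this save/mask/restore pattern around an atomic operation; irq_save() and irq_restore() are hypothetical wrappers, stubbed here, for the CPU-specific instructions that read and write the interrupt mask/status register.

#include <stdint.h>

static volatile uint32_t shared_counter;

/* Stubs: on a real target these would read/modify the interrupt mask or
   status register (e.g. the I bit in the ARM CPSR). */
static uint32_t irq_save(void)              { return 0; /* previous mask */ }
static void     irq_restore(uint32_t flags) { (void)flags; }

void increment_shared(void)
{
    uint32_t flags = irq_save();    /* mask interrupts, remember old state */
    shared_counter++;               /* atomic with respect to ISRs         */
    irq_restore(flags);             /* restoring (not blindly re-enabling)
                                       keeps this safe when nested         */
}

int main(void)
{
    increment_shared();
    return 0;
}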
- Exception processing time
The interrupt frequency of each device that can assert an interrupt is very important for the ISR design. It is possible for the entire processing to be done within the context of the interrupt, that is, with interrupts disabled. Notice, however, that the processing time for a higher priority interrupt is a source of interrupt latency for the lower priority interrupt. Another approach is to have one section of ISR running in the context of the interrupt and another section running in the context of a task. The first section of the ISR code services the device so that the service request is acknowledged and the device is put into a known operational state so it can resume operation. This portion of the ISR packages the device service request and sends it to the remaining section of the ISR that executes within the context of a task. This latter part of the ISR is typically implemented as a dedicated daemon task. Note, however, the interrupt response time increases. The increase in response time is attributed to the scheduling delay, and the daemon task might have to yield to higher priority tasks. In conclusion, the duration of the ISR running in the context of the interrupt depends on the number of interrupts and the frequency of each interrupt source existing in the system.
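A minimal sketch of this split-ISR approach; queue_t and the queue_post_from_isr()/queue_wait() calls are hypothetical stand-ins (stubbed so the sketch compiles) for the message-queue primitives of whatever RTOS is in use.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical RTOS primitives, stubbed for illustration. */
typedef struct { uint32_t dummy; } queue_t;
static bool queue_post_from_isr(queue_t *q, uint32_t msg) { (void)q; (void)msg; return true; }
static bool queue_wait(queue_t *q, uint32_t *msg)         { (void)q; (void)msg; return true; }

static queue_t io_requests;

/* Section 1: runs in interrupt context and is kept as short as possible. */
void device_isr(void)
{
    /* read and acknowledge the device's status register here */
    uint32_t status = 0;                          /* placeholder value      */
    queue_post_from_isr(&io_requests, status);    /* hand off to the daemon */
}

/* Section 2: dedicated daemon task, runs in task context at a priority
   chosen relative to the rest of the system. */
void io_daemon_task(void *arg)
{
    (void)arg;
    uint32_t status = 0;
    for (;;) {
        if (queue_wait(&io_requests, &status)) {
            /* lengthy, possibly blocking processing goes here */
        }
    }
}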
* General Guides
On architectures where interrupt nesting is allowed:
- An ISR should disable interrupts of the same level if the ISR is non-reentrant.
- An ISR should mask all interrupts if it needs to execute a sequence of code as one atomic operation.
- An ISR should avoid calling non-reentrant functions. Some standard library functions are non-reentrant, such as many implementations of malloc and printf. Because interrupts can occur in the middle of task execution and because tasks might be in the midst of the "malloc" function call, the resulting behavior can be catastrophic if the ISR calls this same non-reentrant function.
- An ISR must never make any blocking or suspend calls. Making such a call might halt the entire system.
- If an ISR is partitioned into two sections with one section being a daemon task, the daemon task does not have a high priority by default. The priority should be set with respect to the rest of the system.
Friday, August 24, 2007
Memory Management
* How to Use Memory?
In terms of how to use memory/buffer, components could be divided into three categories.
- In place read-only
- In place read and write
- Transform
Transform components accept the data in the input buffer, apply some kind of transformation to it, put the new data into a newly requested buffer, and free the old one.
* Two Kinds of Buffer Factory
- Discrete buffer factory: fixed size data buffers
- Circular buffer factory: variable size data buffers
* Discrete Buffer
- Reference count for zero-copy reuse
In the case of sharing discrete data buffers, one solution is to make copies of the buffer data, but that wastes memory. Reference passing is a better way: the reference count of a buffer equals the number of filters that have access to the buffer, and the buffer memory is freed only when its reference count falls to zero. To support reference counting, all downstream components must be in-place read-only. The buffer's reference count is automatically incremented by one, or by N for a multi-output stream connection, when the buffer is written to a stream connection; each component must release the buffer once it has finished using it and has passed its pointer on to the stream connection.
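A minimal sketch of such a reference-counted buffer, single-threaded for brevity (a real implementation would update the count atomically or under a lock); the function names are illustrative.

#include <stdlib.h>

typedef struct {
    int    refcount;
    size_t size;
    unsigned char data[];        /* flexible array member for the payload */
} buffer_t;

buffer_t *buf_create(size_t size)
{
    buffer_t *b = malloc(sizeof *b + size);
    if (b) { b->refcount = 1; b->size = size; }
    return b;
}

void buf_addref(buffer_t *b, int readers)   /* +1, or +N for multi-output */
{
    b->refcount += readers;
}

void buf_release(buffer_t *b)               /* each user calls this once */
{
    if (--b->refcount == 0)
        free(b);
}

int main(void)
{
    buffer_t *b = buf_create(64);
    if (!b) return 1;
    buf_addref(b, 2);       /* written to a connection with two readers */
    buf_release(b);         /* reader 1 done  */
    buf_release(b);         /* reader 2 done  */
    buf_release(b);         /* producer done: count hits zero, memory freed */
    return 0;
}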
* Circular buffer
- One exclusive write and multiple reads
One write pointer and multiple read pointers are maintained for this kind of buffer. The amount of empty space in the buffer equals the distance from the write pointer to the closest read pointer; in other words, the slowest reader controls the amount of available space in the buffer, and in this way mutual exclusion is realized. Components that underflow the circular buffer are responsible for combining the new data with the old; alternatively, they can leave their read pointer unchanged and try again later.
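A minimal sketch of the space computation described above: the free space is the distance from the write index back to the slowest reader, with one slot kept unused to distinguish full from empty. The structure and sizes are illustrative.

#include <stddef.h>
#include <stdio.h>

#define NREADERS 2

typedef struct {
    size_t capacity;
    size_t write_idx;
    size_t read_idx[NREADERS];
} circ_buf_t;

/* Free space is limited by the slowest (closest-behind) reader. */
size_t circ_free_space(const circ_buf_t *cb)
{
    size_t min_free = cb->capacity;
    for (int i = 0; i < NREADERS; ++i) {
        size_t used = (cb->write_idx + cb->capacity - cb->read_idx[i]) % cb->capacity;
        size_t free_for_reader = cb->capacity - used - 1;
        if (free_for_reader < min_free)
            min_free = free_for_reader;
    }
    return min_free;
}

int main(void)
{
    circ_buf_t cb = { .capacity = 8, .write_idx = 5, .read_idx = { 1, 4 } };
    printf("free space: %zu\n", circ_free_space(&cb));   /* limited by the reader at 1 */
    return 0;
}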
Data Streaming Model
* Data streaming model is a fundamental software architecture, which has wide applications in the field of data processing. This model is based on the following basic ideas for the embedded applications:
- Multi-threading
- Object-oriented: components
* Components
A component is a hardware-independent software abstraction of a hardware or microcode function. It has its own interface that performs component-specific functions. Components are the basic units used to construct a data stream or data flow. A component may or may not be associated with a thread. Two key features of components are inheritance/polymorphism and dynamic creation and linking. Dynamic creation means components are dynamically allocated at startup or runtime rather than using global or local allocation. Dynamic creation and linking keep the application and the low-level supports, such as middleware, as separated as possible.
* Types of Components
- Stream filter
It is the processing element (brick) in a data stream. In general, it is associated with one thread. It has input/output ports and command/message ports for async events.
- Stream connection
It is used to connect two or more stream filters to construct a data stream (mortar). In essence, a stream connection is a message queue. In general, no thread is bound to a stream connection. Communication among tasks is therefore implemented with message passing instead of shared data structures. This model may be more suitable for multi-core applications, since it helps hide the existence of other cores.
- Buffer factory
It is used to manage memory usage within a data stream. One or more buffer factories can be associated with one flow as needed. No thread is associated with them.
- Flow controller
This macro component creates, initializes, and connects multiple building components, as described above, to construct a data flow. It is also the interface to the higher application level: it emits commands to and receives status messages from the whole flow. It can be considered a special case of stream filter.
New components are created and old components might be deleted when a new data flow is built, in order to efficiently make use of memory. However, some kinds of components provide facilities for all data streams, so they need to be created once at startup and stay in the whole system lifetime.
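A minimal sketch of a stream filter component expressed in C as a struct of function pointers plus per-instance state, allocated dynamically so that a flow controller can build and tear down flows at runtime. All names are illustrative.

#include <stdlib.h>

typedef struct stream_conn stream_conn_t;     /* message-queue connection, opaque here */

typedef struct stream_filter {
    stream_conn_t *input;                     /* input port                    */
    stream_conn_t *output;                    /* output port                   */
    void (*process)(struct stream_filter *self);           /* filter-specific work */
    void (*command)(struct stream_filter *self, int cmd);  /* async command port   */
    void *state;                              /* filter-specific state         */
} stream_filter_t;

/* Dynamic creation: the flow controller allocates and wires filters at
   startup or runtime instead of relying on global/static instances. */
stream_filter_t *filter_create(void (*process)(stream_filter_t *),
                               void (*command)(stream_filter_t *, int))
{
    stream_filter_t *f = calloc(1, sizeof *f);
    if (f) {
        f->process = process;
        f->command = command;
    }
    return f;
}

static void passthrough_process(stream_filter_t *self)          { (void)self; }
static void passthrough_command(stream_filter_t *self, int cmd) { (void)self; (void)cmd; }

int main(void)
{
    stream_filter_t *f = filter_create(passthrough_process, passthrough_command);
    if (f) {
        f->process(f);      /* normally driven by the filter's own thread */
        free(f);
    }
    return 0;
}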
Sunday, August 19, 2007
Memory Organization
* Memory Hierarchy
- Objective: Reduce the performance gap between memory and CPU
- Principle: The principle of locality (temporal and spatial)
- Why hierarchy? Trade-off between performance and cost
- Result: Program code and data are spread among multiple locations in the memory hierarchy.
- Order in performance: register, cache, main memory, and disk
- Performance metric:
latency (response time, execution time): how fast to get?
throughput (bandwidth): how many/much to get?
For example, registers have the lowest latency and highest throughput, while each level further down the hierarchy has higher latency and lower throughput.
Sometimes, power consumption is important.
- Two things are very important in memory hierarchy: Cache and virtual memory
* Cache
- Basic unit
A block/line of multiple bytes. The address of the first byte is aligned to the cache line size.
- Where to put cache lines?
Direct mapped (1-way set associative, n sets), (n/m)-way set associative (m sets), and fully associative (one set).
- How to locate cache lines?
Memory address: Tag + Set index + Block offset
- Replacement rule
FIFO, LRU (Least Recently Used), and random
- Write policy
Write back and write through
Write allocate and non-write allocate
Write buffer, non-blocking cache operations
- Cache size
Cache line component = cache tag + state bits (valid, dirty for WB) + cache data
Cache size in bytes = block size * associativity (ways) * number of sets
- Cache Organization
Two logic dimensions:
Horizontal: associativity (ways)
Vertical: sets
Each way could be one cache bank. Note that a large block of data with sequential addresses is stored vertically (across sets), not horizontally (across ways).
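A small worked example of the address breakdown above, assuming a 32 KB, 4-way set-associative cache with 64-byte lines, i.e. 128 sets (6 offset bits, 7 set-index bits, the rest tag):

#include <stdint.h>
#include <stdio.h>

#define BLOCK_SIZE  64u                  /* bytes per line    */
#define WAYS        4u                   /* associativity     */
#define CACHE_SIZE  (32u * 1024u)        /* total data bytes  */
#define NUM_SETS    (CACHE_SIZE / (BLOCK_SIZE * WAYS))   /* = 128 sets */

int main(void)
{
    uint32_t addr   = 0x20001234u;                       /* example address */
    uint32_t offset = addr % BLOCK_SIZE;                 /* low 6 bits      */
    uint32_t set    = (addr / BLOCK_SIZE) % NUM_SETS;    /* next 7 bits     */
    uint32_t tag    = addr / (BLOCK_SIZE * NUM_SETS);    /* remaining bits  */

    printf("tag=0x%x set=%u offset=%u\n",
           (unsigned)tag, (unsigned)set, (unsigned)offset);
    return 0;
}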
* Virtual Memory
- Virtual address space
Each process has its own virtual address space, whose size is determined by the address width of the system. The process can use this address space fully, and its code and data can be stored anywhere in the space. The text, data, and bss regions, the stack, and the heap are organized in the virtual address space, not in the physical address space. Therefore the addresses a program sees are virtual addresses, not physical addresses of main memory.
- Virtual memory
Virtual memory expands the concept of main memory to include the disk; main memory can be viewed as a cache of the disk.
- Advantage of virtual memory
1) Make multiple processes share main memory
2) No need to worry about the size of main memory
3) Make it possible to relocate code and data (swapping)
- Virtual address to physical address
Cache and memory are accessed with physical addresses, so a virtual address must be translated to a physical address after it leaves the CPU. The TLB and the per-process page table are used for this purpose: the page table maps virtual addresses to physical addresses. Since the page table may be large and resides in main memory, each memory access would otherwise require two memory accesses; exploiting the principle of locality, the TLB (translation lookaside buffer) is used to make this translation fast.
- Basic unit
Page or segment; the size of page could be the size of one cache bank (sets * block size)
- Where to put pages?
Fully associative
- How to locate pages?
TLB and Page Table for each process
- Replacement rule
LRU
- Write policy
Write back
- Virtual address format
Virtual address component = virtual address index + page offset
Virtual address index is used to look up the page table.
- TLB
The TLB is a cache of the process's page table in main memory and has the same properties as a cache. The difference is that a cache is controlled completely by hardware, whereas once a page fault happens the OS takes charge, since the cost of accessing the disk is high enough to justify switching to another process's context.
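A minimal sketch of the virtual-to-physical split described above, assuming 4 KB pages and a single-level, flat page table indexed by virtual page number (real MMUs use multi-level tables plus the TLB):

#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE  4096u
#define NUM_PAGES  1024u

static uint32_t page_table[NUM_PAGES];    /* virtual page number -> physical frame number */

uint32_t translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;  /* virtual page number (page table index) */
    uint32_t offset = vaddr % PAGE_SIZE;  /* page offset, copied through unchanged  */
    return page_table[vpn] * PAGE_SIZE + offset;
}

int main(void)
{
    page_table[2] = 7;                    /* map virtual page 2 to physical frame 7 */
    printf("0x%x -> 0x%x\n", 0x2ABCu, (unsigned)translate(0x2ABCu));
    return 0;
}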
* Main Memory Organization
Main memory could be divided into multiple contiguous address sections, or banks. Each memory bank could be mapped to memory chips. Inside the memory chip, bits could be organized in banks and address interleaves among banks in order to improve the access bandwidth. Two "banks" here are different in terms of address mapping.