The ϕEngine ISA has gone through a huge number of iterations and permutations over its seven years of existence. That is R&D at its finest.

The ϕSemiVec portion of ϕEngine supports vectors with an arbitrary number of elements (called their Counts). It was recently expanded to double the maximum number of elements in each Vector Segment that can be processed in parallel. Because ϕSemiVec shares the Scalar Register Files with its scalar-operation counterparts, accessing the same values when mixing scalar and vector operations is easy. However, vector operations can now use up registers more quickly. So, once again, I used a technique inspired by the AMD64 REX byte, which extends the number of integer registers in the x86. But instead of quadrupling the number, I am just doubling the number of both integer and floating-point registers. This requires two prefixes: one for the Subject and one for the Object when it is a register.

Vector operations are done in segments, so whole vectors do not have to reside in registers at the same time. Instead, they are cycled between registers and memory, working on a chunk at a time. These chunks are limited to a maximum magnitude of 16 units (up from 8): for integers this is 16 bytes, and for floats 32 bytes.

One quarter of all data registers are Scratch Registers that also serve as Argument Registers when passing arguments by register. All of these can be reached without a prefix, as can an equal number of Protected Registers beyond those. All of the new registers that require a prefix are Protected Registers.
"ϕEngine ISA: R&D and Vector Operations"
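The segment mechanism described above can be sketched in Python. This is a hypothetical illustration of the idea, not ϕEngine code; `SEGMENT_UNITS` and `segmented_add` are made-up names, and the "parallel" element-wise work is just modeled as a loop over a slice:

```python
# Hypothetical sketch of ϕSemiVec-style segmented processing: a vector of
# arbitrary Count is handled in chunks of at most SEGMENT_UNITS elements,
# so the whole vector never has to be register-resident at once.
SEGMENT_UNITS = 16  # per the post: recently doubled from 8

def segmented_add(a, b):
    out = []
    for start in range(0, len(a), SEGMENT_UNITS):
        seg_a = a[start:start + SEGMENT_UNITS]   # "load" one segment
        seg_b = b[start:start + SEGMENT_UNITS]
        # the hardware would process this segment's elements in parallel
        out.extend(x + y for x, y in zip(seg_a, seg_b))
    return out
```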
More Relevant Posts
Did you know that since #Angular v20 you can make your host element type-safe? 🤩 You simply add typeCheckHostBindings to the angularCompilerOptions in your tsconfig, and your host bindings become fully type-checked 🎉 Stop using @HostBinding and @HostListener, and start using the host property inside your component decorator ✅
Here is an interesting problem I had to address recently. When debugging input data for sign-off timing and EM/IR runs with Liberate, Tempus, and Voltus, reviewing the ccsp tables often reveals issues such as missing arcs or toggle conditions, typically stemming from the library characterization setup or a library update. Basic checks on input pins, output pins, power pins, switching conditions, and timing criteria can quickly assess the quality of the ccsp data. For quick, targeted checks without setting up or utilizing an EDA tool, a minimal Python function can extract all toggle conditions present in the ccsp library. Ideally, we want to do such input-data quality checks as part of the EDA automation, so running such code whenever ccsp libraries are read is good practice before using up expensive compute resources for timing and electrical analysis. We can use a similar concept to selectively pick specific group, complex, or simple attributes (e.g. timing and internal power) and scale up using Dask. BTW, I have the full code optimized for better performance on large Liberty files. Reach out to me if you want it. #ccsp #emir #eda #voltus #tempus #liberate
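A minimal sketch of such an extractor might look like the following. This is not the author's code: it assumes toggle conditions appear as Liberty `when : "..."` attributes, and the function name `extract_toggle_conditions` is mine. A production version would also track the enclosing pin and arc groups rather than scanning the raw text:

```python
import re

# Assumption: toggle conditions show up as `when : "<boolean expr>";`
# attributes in the Liberty/ccsp text.
WHEN_RE = re.compile(r'when\s*:\s*"([^"]*)"')

def extract_toggle_conditions(liberty_text):
    """Return the unique toggle ('when') conditions, in file order."""
    seen, out = set(), []
    for cond in WHEN_RE.findall(liberty_text):
        if cond not in seen:
            seen.add(cond)
            out.append(cond)
    return out
```

Running this on every ccsp library as it is read is cheap compared to a full timing run, which is the point of the front-loaded quality check.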
What is a Thread? Answering this question helps us gain a deeper understanding of tasks and threads in .NET. In the previous post, I explained interactions. Golden sentence: the operating system can execute only as many truly parallel operations as it has cores, but it manages and schedules most tasks concurrently alongside that parallelism. For each process or application, the OS creates and provides a portion of heap and stack space. The heap is expandable and shared, while each thread that is created gets its own separate stack. A thread, therefore, is a resource consisting of CPU capability and a stack in RAM, provided to us by the OS and the application, along with the heap space shared among all threads, so that instructions can be scheduled, interleaved, and processed by the CPU. Question: what is the TASK in .NET?
This is a re-post with a new picture. I had modified the RISC-V picture without changing the bottom portion to reflect x86. ϕEngine has only 4.5× the code density of x86; the 19× figure was for RISC-V. Sorry about that.

We compared RISC-V with ϕEngine on saturated add. But how does the x86 compare? At least it has flags, so the code won't be quite as bad as RISC-V's. For an even comparison, we will assume that the calling convention allows passing of the arguments in EAX and EBX:

0000: 01 D8           add eax,ebx
0002: 71 07           jno 0B <NoSat>
0004: 72 06           jb  0C <Minus>
0006: B8 FF FF FF 7F  mov eax,0x7FFFFFFF
000B: C3              ret
000C: B8 00 00 00 80  mov eax,0x80000000
0011: C3              ret
0012:

Keep in mind that, when overflow happens, the sign is the opposite of what it should be. The code came to 18 bytes, which is 4.22× the code density of RISC-V. Much of that is because x86 has flags and can detect overflow in hardware. But the x86 can't do clamping in hardware, so it has to test the flags and load the appropriate value into the register that returns the result. So ϕEngine's code density is 4.5× that of the x86. The x86 is about midway between RISC-V and ϕEngine on a log scale.
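The clamping logic the listing implements can be written out in a higher-level sketch (Python here, purely to mirror the control flow; `sat_add32` is my name for it): add, return the sum if no overflow, otherwise clamp to the 32-bit limit whose sign matches the true result:

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def sat_add32(a, b):
    """Saturating 32-bit signed add, mirroring the x86 sequence:
    add; jno -> return sum; otherwise clamp high or low by sign."""
    s = a + b                 # Python ints don't wrap, so the true sum
    if s > INT32_MAX:         # overflowed positive -> 0x7FFFFFFF
        return INT32_MAX
    if s < INT32_MIN:         # overflowed negative -> 0x80000000
        return INT32_MIN
    return s
```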
🔹 𝗦𝗲𝗿𝗶𝗲𝘀 𝗧𝗶𝘁𝗹𝗲: “Demystifying RISC-V External Debug” 𝗣𝗼𝘀𝘁 𝟯 — 𝗠𝗲𝗺𝗼𝗿𝘆 𝗔𝗰𝗰𝗲𝘀𝘀 𝗶𝗻 𝗥𝗜𝗦𝗖-𝗩 𝗗𝗲𝗯𝘂𝗴𝗴𝗶𝗻𝗴 “Accessing memory during debug sounds simple… until cache, PMP, and bus restrictions join the party.” Three main methods exist to access memory during debug: 1️⃣ 𝙎𝙮𝙨𝙩𝙚𝙢 𝘽𝙪𝙨 𝘼𝙘𝙘𝙚𝙨𝙨: Direct, core-independent — but can hit cache incoherence unless fenced. 2️⃣ 𝙋𝙧𝙤𝙜𝙧𝙖𝙢 𝘽𝙪𝙛𝙛𝙚𝙧: Uses core load/store instructions — simple, coherent, but depends on the core correctness. 3️⃣ 𝘼𝙗𝙨𝙩𝙧𝙖𝙘𝙩 𝘾𝙤𝙢𝙢𝙖𝙣𝙙 𝘼𝙘𝙘𝙚𝙨𝙨: Direct hardware path — coherent and core-independent, but adds design complexity. The right choice depends on balancing 𝘀𝗽𝗲𝗲𝗱, 𝗰𝗼𝗵𝗲𝗿𝗲𝗻𝗰𝗲, and 𝗿𝗼𝗯𝘂𝘀𝘁𝗻𝗲𝘀𝘀 for your debug scenario. 𝗡𝗲𝘅𝘁 𝘂𝗽: We’ll break down how loading a program and setting a breakpoint really works under the hood. 𝗛𝗮𝘀𝗵𝘁𝗮𝗴𝘀: #HardwareDebug #ProcessorDesign #RISCV #HardwareSecurity
Day 156/160 of GFG 160 Days of DSA 🔥 Problem: Maximum XOR of Two Numbers in an Array ⚡ 🔹 Problem in simple words: Given an array of integers, find two numbers whose XOR is maximum. 👉 Brute force = O(N^2) (check all pairs). Too slow for large arrays. 🔹 Approach used: I used a Binary Trie (bitwise trie). Each number is represented in binary (up to 20 bits in my case). Insert each number into the Trie, bit by bit (0 or 1). While inserting the next number, try to move in the opposite direction (if current bit = 1 → try to go 0) to maximize XOR. Keep track of the maximum XOR found so far. 🔹 Key Insight: To maximize XOR, we always try to pick the opposite bit at each level. Example: if current bit = 1, best choice is 0 (and vice versa). 🔹 Complexity: Insert each number → O(20) Query each number for max XOR → O(20) Overall → O(N * logM) where M = max value in array. ⚡ This is a powerful technique, often used in network routing (longest prefix match), competitive coding, and bit manipulation problems. #GFG160 #GeekStreak2025 #DSA #Trie #BitManipulation #XOR
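The trie approach above can be sketched as follows (20 bits matches the post's assumption; the class and function names are mine):

```python
class TrieNode:
    __slots__ = ("children",)
    def __init__(self):
        self.children = [None, None]  # index 0 -> bit 0, index 1 -> bit 1

def max_xor_pair(nums, bits=20):
    """Max XOR of any two numbers, via a binary trie (MSB first)."""
    root = TrieNode()
    best = 0
    for num in nums:
        # Insert num bit by bit.
        node = root
        for i in range(bits - 1, -1, -1):
            b = (num >> i) & 1
            if node.children[b] is None:
                node.children[b] = TrieNode()
            node = node.children[b]
        # Query: greedily take the opposite bit whenever it exists.
        node, cur = root, 0
        for i in range(bits - 1, -1, -1):
            b = (num >> i) & 1
            if node.children[1 - b] is not None:
                cur |= 1 << i
                node = node.children[1 - b]
            else:
                node = node.children[b]
        best = max(best, cur)
    return best
```

Each insert and each query walks at most 20 levels, giving the O(N · log M) overall bound from the post.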
Day 158 of #gfg160 – Find Missing Number Using XOR Solved using Bit Manipulation (XOR trick) ➤ Problem: Given an array of size n−1 containing distinct numbers from 1 to n, find the missing number ➤ XOR properties: • a ^ a = 0 • a ^ 0 = a ➤ Approach: • Compute xor1 = XOR of all elements in the array • Compute xor2 = XOR of numbers from 1 to n • The missing number = xor1 ^ xor2 ➤ Why it works: XOR cancels out all common numbers, leaving only the missing one Time complexity: O(N) Space complexity: O(1) #gfg160 #geekstreak2025 #160DaysOfCode #DSA #GeeksforGeeks
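A direct translation of the approach (the function name is mine):

```python
def missing_number(arr, n):
    """XOR all array elements with 1..n; pairs cancel, leaving the
    missing value (a ^ a = 0, a ^ 0 = a)."""
    x = 0
    for v in arr:          # xor1: XOR of array elements
        x ^= v
    for v in range(1, n + 1):   # xor2: XOR of 1..n, folded in directly
        x ^= v
    return x
```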
Weak References Performance Penalties: Indirection and Cache Misses You’ve likely heard about “indirection” of weak references leading to performance hits, and we can see it directly in WeakReference.h, where weak references themselves are defined in the runtime.  The WeakReferenceBits are, quite literally, the bits making up the pointer to the side table. This is the weak reference. To access the side table, it calls getNativeOrNull() on the pointer’s bits. Then, calling tryRetain() on the side table’s pointer to the object, it attempts to create and return a strong reference to the HeapObject (or returns nil). This adds many more steps to fetching memory compared to strong references, which already exist as pointers to HeapObject. The performance penalty isn't just from the extra function call overhead here. The extra pointer dereferencing puts us at risk of a CPU cache miss every time we access a weak reference. Read my comprehensive deep dive on how Swift reference counting is implemented 👻 https://lnkd.in/exMCs4Ua
The results vaulted Alchip into clear 3DIC technology leadership because they validated an entire, integrated 3DIC solution, as well as its various elements. The test chip provided CPU/NPU core demonstration, UCIe and PCIe PHY preparation, Lite-IO infrastructure, and third-party IP. The latter is particularly important because proven 3DIC IP is hard to find. #alchip #3dic #cpu #npu #ucie #pcie #phy #ip #semieda #semiconductor #semiconductors #semiconductorindustry #semiconductormanufacturing #semiwiki https://lnkd.in/ggDtkGnf
🔧 Understanding FDF Files in EDK II: The Blueprint of UEFI Firmware

Ever wondered how UEFI firmware components are organized into that final image that boots your system? The Flash Definition File (FDF) handles this critical job in the EDK II build process.

What is an FDF File? Think of FDF as the "architect's blueprint" for your firmware:
• DSC file: defines WHAT to build (components, libraries, modules)
• FDF file: defines HOW to organize those built components into the final flash image

Real-World Example: Intel Platform FDF Structure

# Flash Layout Definition
[FD.PLATFORM_FLASH]
BaseAddress = 0xFF800000
Size = 0x800000
ErasePolarity = 1

# Boot Block Region (Recovery)
0x00000000|0x00040000
gUefiOvmfPkgTokenSpaceGuid.PcdOvmfFlashNvStorageVariableBase|gUefiOvmfPkgTokenSpaceGuid.PcdOvmfFlashNvStorageVariableSize
FV = FVMAIN_COMPACT

# Main Firmware Volume
0x00040000|0x007C0000
FV = FVMAIN

# DXE Core and Drivers
[FV.FVMAIN]
BlockSize = 0x1000
NumBlocks = 0x7C0
FvAlignment = 16
ERASE_POLARITY = 1
MEMORY_MAPPED = TRUE
STICKY_WRITE = TRUE

# INF Files (what goes where)
INF MdeModulePkg/Core/Dxe/DxeMain.inf
INF MdeModulePkg/Universal/PCD/Dxe/Pcd.inf
INF PlatformPkg/Drivers/CustomDriver/CustomDriver.inf

Key FDF Sections Explained:
• [FD], the Flash Device: defines the overall flash layout, the base address and size of the flash chip, and the memory regions and their purposes.
• [FV], a Firmware Volume: a container within the flash, with its block alignment and erase policies, specifying which modules go into each volume.
• INF entries: the actual driver/module placement, mapping built .efi files to specific locations and defining load order and dependencies.

Why This Matters:
• Boot Order: critical modules load first (SEC, PEI, DXE)
• Memory Efficiency: optimal placement reduces waste
• Security: separate regions for secure vs. non-secure code
• Recovery: dedicated boot block for system recovery

Pro Tips:
• Use FDF compression to save space: COMPRESS PI_NONE
• Implement proper alignment for performance
• Plan for firmware updates with dedicated regions
• Test different layouts for optimal boot times

The FDF file bridges firmware engineering and system architecture. It's not just about building code—it's about creating a bootable, efficient system that actually works in production. #UEFI #EDK2 #FirmwareDevelopment #SystemArchitecture #EmbeddedSystems #Intel #AMD #BIOS