Cpu features not working - Fxpansion.com

Cpu features not working

User and tech support for Strobe2.

mrkodak
Posts: 9
Joined: Sat Jan 22, 2011 6:07 pm

Postby mrkodak » Tue Sep 15, 2015 1:43 pm

Hi. Just wanted to let you know that Strobe2 does not seem to recognize CPU features properly. The diagnostics in Strobe2 report that my CPU does not support SSE4_1, AVX or FMA instructions, when in fact it supports them all.

You are probably using an Intel compiler which is causing this behavior.
My CPU is AMD.

Drew_fx
Posts: 3828
Joined: Fri Jul 21, 2006 5:32 pm
Location: London, UK

Postby Drew_fx » Tue Sep 15, 2015 2:37 pm

Please provide full system specs, and we'll look into it.

Postby mrkodak » Tue Sep 15, 2015 3:17 pm

Drew_fx wrote: Please provide full system specs, and we'll look into it.


AMD FX-8350, 4GHz, 8-core, released: 2012
Instruction sets:
MMX (+), SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, SSE4A, x86-64, AMD-V, AES, AVX, XOP, FMA3, FMA4

OS: Win10 Pro 64bit
Strobe2 v2.0.0.3

Let me know if there's other info you require.


All my own code works fine with AVX, FMA3 and FMA4 without issues. I'm using VS2013 Professional and the default MS compilers.

Angus_FX
Herder of Cats
Posts: 4172
Joined: Fri Sep 10, 2004 3:00 pm
Location: FX HQ, London

Postby Angus_FX » Wed Sep 16, 2015 10:39 am

We're using MS compilers and hand-coded SSE/AVX intrinsics. Had to disable the newer instruction sets for AMD CPUs as we were getting crashes during beta & don't currently have any suitable Bulldozer hardware to debug on.

What Synthmark score is the diagnostic reporting?
-- Angus F. Hewlett - CEO - FXpansion --

Postby mrkodak » Sun Oct 18, 2015 7:10 pm

Angus_FX wrote: We're using MS compilers and hand-coded SSE/AVX intrinsics. Had to disable the newer instruction sets for AMD CPUs as we were getting crashes during beta & don't currently have any suitable Bulldozer hardware to debug on.

What Synthmark score is the diagnostic reporting?


Hi, I forgot to get back to this. You might be running into compatibility problems because you are using FMA3 (per the AVX spec) and not FMA4, which is the only FMA variant the Bulldozer micro-architecture supports. It was released before Intel had their AVX spec ready. I'm running a processor based on the newer Piledriver micro-architecture, which supports both FMA3 and FMA4.

For debugging and building a route for both FMA versions take a look here: https://en.wikipedia.org/wiki/FMA_instruction_set
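
A route for both FMA variants would probably start from the CPUID feature bits. The sketch below only decodes raw register values, so it stays compiler-independent; reading them (for example with `__cpuid` on MSVC) and verifying that the OS saves the YMM registers are left out. The struct, enum and function names are made up for illustration and are not taken from Strobe2.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical feature summary; field names are illustrative only.
struct FmaSupport {
    bool avx;   // CPUID leaf 1, ECX bit 28
    bool fma3;  // CPUID leaf 1, ECX bit 12
    bool fma4;  // CPUID leaf 0x80000001, ECX bit 16 (AMD extension)
};

// True if bit n of value is set.
inline bool bit(std::uint32_t value, unsigned n) {
    return ((value >> n) & 1u) != 0;
}

// Decode the ECX values returned by CPUID leaf 1 and extended leaf 0x80000001.
inline FmaSupport decode_fma_support(std::uint32_t leaf1_ecx,
                                     std::uint32_t ext1_ecx) {
    return FmaSupport{bit(leaf1_ecx, 28), bit(leaf1_ecx, 12), bit(ext1_ecx, 16)};
}

enum class FmaRoute { None, Fma3, Fma4 };

// Prefer FMA3 when both variants are present (Piledriver), fall back to FMA4
// (Bulldozer), and use neither without AVX, since both are VEX-encoded.
inline FmaRoute choose_fma_route(const FmaSupport& s) {
    if (!s.avx) return FmaRoute::None;
    if (s.fma3) return FmaRoute::Fma3;
    if (s.fma4) return FmaRoute::Fma4;
    return FmaRoute::None;
}
```

With these rules a CPU reporting AVX, FMA3 and FMA4 takes the FMA3 route, while one reporting only AVX and FMA4 takes the FMA4 route. Preferring FMA3 is a design choice of this sketch, not something stated in the thread.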

Here's also a list of the Visual Studio intrinsics for FMA4
"https://msdn.microsoft.com/en-us/library/vstudio/gg445134(v=vs.100).aspx"
and an overall look on compiler intrinsics
"https://msdn.microsoft.com/en-us/library/vstudio/26td21ds(v=vs.100).aspx"
(these links didn't work correctly on your forum.. had to force them to text)

But I guess you might know these already.

I also tested the diagnostic thing on Strobe2. Here's what I got out of it.

Strobe 2 : V2.0.0.3
Oscillator time: 0.352366
Filter time: 0.148141
Amp time: 0.018377
Full System time: 0.528251
Synthmark score: 1.893041

Then I noticed there was an update and installed it..

Strobe 2 : V2.0.1.0
Oscillator time: 0.353216
Filter time: 0.145943
Amp time: 0.017875
Full System time: 0.519857
Synthmark score: 1.923606

Postby mrkodak » Wed Nov 04, 2015 11:28 pm

Angus_FX wrote: don't currently have any suitable Bulldozer hardware to debug on.


Any chance of getting an "uncrippled" build to test on my Vishera processor? Just remove all the CPUID tests. I'm happy to work as your guinea pig in this matter. I also have a quad-core Bulldozer Phenom2 processor on another machine which I can use for testing if needed.

Btw, the product doesn't quite seem to be as advertised at the moment because of this. I almost thought of returning it for a refund. But it does run just fine and it sounds lovely. No complaints there. Maybe I just need a strong cup of stfu.. and to get back to making music.

Postby Angus_FX » Thu Nov 05, 2015 11:09 am

If we remove the CPUID tests entirely, it'd crash immediately - on your CPU and 80% of others.

Vishera appears to support AVX1 and FMA3, but not AVX2. We can, in that instance, take advantage of AVX1 but not FMA3 -- Visual Studio doesn't provide an option to generate FMA3 w/o AVX2.

I'm happy to release a build as you describe, but it may be a little while yet.

Will post a bit more info on the development process shortly, but in summary we're developing the next Strobe2 update in parallel with Cypher2 (which is coming along very nicely!). Takes a bit longer this way but means we can make bigger improvements.

Postby mrkodak » Thu Nov 05, 2015 12:07 pm

Angus_FX wrote: If we remove the CPUID tests entirely, it'd crash immediately - on your CPU and 80% of others.

Vishera appears to support AVX1 and FMA3, but not AVX2. We can, in that instance, take advantage of AVX1 but not FMA3 -- Visual Studio doesn't provide an option to generate FMA3 w/o AVX2.

I'm happy to release a build as you describe, but it may be a little while yet.

Will post a bit more info on the development process shortly, but in summary we're developing the next Strobe2 update in parallel with Cypher2 (which is coming along very nicely!). Takes a bit longer this way but means we can make bigger improvements.


Ok, I see. From the earlier posts I got the picture that you were using AVX and FMA3 with "hand written intrinsics code", in which case there is no limitation on using FMA3 without AVX2. That limitation applies only to compiler-generated code. Sorry for that. I understand you have lots of better things to do than fiddle with obscure SIMD optimizations. You just need to be more clear when answering tech questions.

I can also extend an olive branch here and offer to do some "hand written intrinsics code" for you if you someday want to extend that way. It is usually more efficient than the generated one.

Btw. It's a bitch when Intel comes and steals a great instruction set (SSE5, XOP) from AMD and makes it into its own little "standard" and starts to push the original out of the market. Well, I blame MS a bit too as they go along with it.

Postby Angus_FX » Thu Nov 05, 2015 12:44 pm

We do use hand written intrinsics.

But the way it's architected, there's a high level dispatcher which looks at the available instruction set & dispatches to a compilation unit compiled with a specific instruction set in mind, with compiler flags set a particular way.

It's actually pretty sophisticated - we have normal, human readable DSP code, i.e. x = a * b - (2.f * c); which can automatically be compiled down to different SoA layouts & vector instruction sets. There's even the possibility of changing data interleave factors to get better performance on chips with different levels of instruction latency. The whole thing resolves down to compact vector intrinsics (relying somewhat on the compiler to be not entirely stupid in how it handles templates, inline functions, temporaries, RVO etc.) which the compiler's optimiser can further rearrange.
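
To make the idea concrete, here is a minimal sketch of writing the DSP expression once and instantiating it at different widths. `Pack<N>` is a hypothetical stand-in that loops over plain floats; a real implementation would presumably wrap `__m128` / `__m256` intrinsics instead, and nothing here is FXpansion's actual code.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a SIMD register: N floats handled lane by lane.
template <std::size_t N>
struct Pack {
    std::array<float, N> lane{};

    Pack() = default;
    explicit Pack(float v) { lane.fill(v); }  // broadcast one value to all lanes

    friend Pack operator*(const Pack& a, const Pack& b) {
        Pack r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] * b.lane[i];
        return r;
    }
    friend Pack operator-(const Pack& a, const Pack& b) {
        Pack r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = a.lane[i] - b.lane[i];
        return r;
    }
};

// The human-readable expression from the post, written once against a
// generic value type V (float, Pack<4>, Pack<8>, ...).
template <class V>
V kernel(const V& a, const V& b, const V& c) {
    return a * b - (V(2.f) * c);
}
```

`kernel(3.f, 4.f, 1.f)` evaluates the scalar version, while `kernel(Pack<8>(3.f), Pack<8>(4.f), Pack<8>(1.f))` runs the same expression over eight lanes. Swapping the type is the only change at the call site.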

On the down side - each code path has to be QA tested individually (it's supposed to All Just Work, but there's always scope for something to break), so in practice we limit to four optimised/tuned targets, as supported by Visual Studio: SSE2 on Conroe, SSE4.1 on Penryn, AVX1 on Sandy Bridge & AVX2 on Haswell. It'll work fine on everything else, but we can't claim to have done everything possible to squeeze every last drop of performance out on every uArch - there are just too many.
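
A high-level dispatcher over those four targets might look roughly like the sketch below. `CpuInfo`, `select_target`, `dispatch` and the `render_*` placeholders are hypothetical names, and the rule that clamps AMD CPUs to SSE2 is an assumption modelled on the earlier remark about disabling the newer instruction sets for AMD.

```cpp
#include <cassert>

// The four tuned targets named in the post.
enum class Target { SSE2, SSE41, AVX1, AVX2 };

// Hypothetical summary of what CPUID reported; field names are illustrative.
struct CpuInfo {
    bool is_amd;
    bool sse41;
    bool avx;   // assumed to already include the OS-support check
    bool avx2;
};

// Pick the most capable target the CPU reports. Unless trust_amd is set,
// AMD CPUs are clamped to the SSE2 baseline (an assumed policy).
inline Target select_target(const CpuInfo& cpu, bool trust_amd = false) {
    if (cpu.is_amd && !trust_amd) return Target::SSE2;
    if (cpu.avx2)  return Target::AVX2;
    if (cpu.avx)   return Target::AVX1;
    if (cpu.sse41) return Target::SSE41;
    return Target::SSE2;
}

// Placeholder entry points. In a real build each would live in its own
// translation unit compiled with the matching instruction-set flags.
inline const char* render_sse2()  { return "SSE2"; }
inline const char* render_sse41() { return "SSE4.1"; }
inline const char* render_avx1()  { return "AVX1"; }
inline const char* render_avx2()  { return "AVX2"; }

using RenderFn = const char* (*)();

// Map a target to its entry point once at startup, then call through it.
inline RenderFn dispatch(Target t) {
    switch (t) {
        case Target::AVX2:  return &render_avx2;
        case Target::AVX1:  return &render_avx1;
        case Target::SSE41: return &render_sse41;
        case Target::SSE2:  return &render_sse2;
    }
    return &render_sse2;
}
```

Resolving the function pointer once keeps the feature test out of the audio loop, and every `Target` value corresponds to one code path that has to be QA tested.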

During testing we were getting crashes on AMD machines: either the machines were incorrectly reporting CPU feature support or, more likely, the compiler on a given setting was generating instructions that those particular (pre-2009) Athlons couldn't actually handle.

I don't really want to get into the whole Intel vs AMD thing here.. suffice to say Intel's chips have more, wider vector units right now (AFAIK, two 256-bit VFPUs per core in Haswell vs. AMD's one 256-bit (fused dual 128-bit) VFPU per two cores) - but Zen is looking very promising.

Postby mrkodak » Tue Dec 01, 2015 10:52 pm

Whoops. Took me a while to get back to this.
Thanks Angus for the detailed information. I'm eagerly waiting for the new build to see what it does.

But I still have to stress that I'm not in any way displeased with the current performance! I was just wondering why a feature that was so highly advertised was missing. I'm more than happy with the product. An excellent synth. And it runs remarkably well on my system, even without AVX.

I made a patch to "max out" the sound generators to test out the performance (no modulations though.. but this gives a general picture for simple sounds). The patch included:
Saw + square + the subs + filter + filter-env + amp-env
Stack at full 5.0 with some detune, just to get the beefiest sound possible..

I can run 16 instances all playing 32 voices at 44.1 kHz with x2 oversampling. That is way more than I'll ever need.

Postby Angus_FX » Wed Dec 02, 2015 11:20 am

Glad to hear it!

Having done a little more research - switching these AVX optimizations on for AMD processors actually won't bring that much of a performance benefit, so you're not missing out on much.

Why..? Because Bulldozer has two 128-bit FMA-capable units for vector maths, which can operate together as a single 256-bit AVX unit.

The big speedup we get going from SSE to AVX on the newer Intel chips is because their vector units are 256 bits wide. So whereas we can crunch 4 numbers per unit per clock using SSE, that becomes 8 with AVX.

On Bulldozer/Piledriver, you can either send *two* 128-bit instructions per clock, or *one* 256-bit. So the total throughput doesn't change whether you're sending 128-bit or 256-bit instructions.
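
The arithmetic behind that can be written out as a small sketch. The issue rates are the ones stated in this post (per shared FPU on Bulldozer/Piledriver), not measurements, and the function names are made up for illustration.

```cpp
#include <cassert>

// Single-precision lanes in one vector register of the given width.
constexpr int lanes(int register_bits) { return register_bits / 32; }

// Floats processed per clock = lanes per instruction * instructions per clock.
constexpr int floats_per_clock(int register_bits, int instr_per_clock) {
    return lanes(register_bits) * instr_per_clock;
}

// One vector unit on a newer Intel chip: AVX doubles the work per instruction.
static_assert(floats_per_clock(128, 1) == 4, "SSE: 4 floats per unit per clock");
static_assert(floats_per_clock(256, 1) == 8, "AVX: 8 floats per unit per clock");

// Bulldozer/Piledriver: two 128-bit instructions or one 256-bit per clock.
static_assert(floats_per_clock(128, 2) == floats_per_clock(256, 1),
              "same throughput with either instruction width");
```

Both Bulldozer cases come out at 8 floats per clock, which is why widening the instructions alone does not raise throughput there.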

(There's a small speedup going from SSE2 to SSE4, and from AVX1 to AVX2 (including FMA3), but it's nothing like the difference between 128-bit & 256-bit processing on Intel chips).

Further, it appears Microsoft's compiler won't generate FMA and target an AVX1 processor, at least not automatically -- you either have to target AVX1 without FMA, or AVX2 with FMA -- not ideal for targeting Bulldozer / Piledriver, which support FMA but don't seem to have all the AVX2 integer stuff.
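
One way to see which of those settings a given compilation unit was built with is to check the predefined macros. This is a sketch under the assumption that the build uses options such as MSVC's `/arch:AVX` and `/arch:AVX2` (or `-mavx` / `-mavx2` on GCC and Clang), which define `__AVX__` and `__AVX2__`. `compiled_target` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>

// Reports which vector instruction set this translation unit was compiled
// for, based on macros the compiler predefines for the chosen target option.
inline const char* compiled_target() {
#if defined(__AVX2__)
    return "AVX2";      // e.g. /arch:AVX2 or -mavx2
#elif defined(__AVX__)
    return "AVX1";      // e.g. /arch:AVX or -mavx
#else
    return "baseline";  // no AVX option given
#endif
}
```

There is no intermediate "AVX1 plus FMA" result here because, as described above, the compiler setting offers no such step. Hand-written FMA intrinsics would have to cover that case.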

Postby mrkodak » Sun Feb 21, 2016 9:04 pm

Angus_FX wrote: Having done a little more research - switching these AVX optimizations on for AMD processors actually won't bring that much of a performance benefit, so you're not missing out on much.


Yes, I've done some more research too. That seems to be the case for AMD processors. I've benchmarked a few systems now for my own DSP projects and AVX is usually just hurting performance. Some algorithms do get a slight boost from using MACC code even on the AMD, for example IIR filters. But the difference is so marginal that it's not worth the risk of incompatibilities.

On the other hand, it's remarkable how much speedup this Broadwell I'm benching now gets from using AVX: nearly twice the performance on a simple ZDF-filter algorithm. It's a shame though that there aren't many programs that leverage this feature. It also seems to make memory-dependent algorithms slower: I'm seeing a 15% decline in performance with a fractional delay algorithm. That looks very similar to what happens on an AMD processor in a similar case, so AVX is only hurting there. All in all, this 2015 Broadwell i5 is running at a little under half the performance of my 2012 Vishera AMD. Not bad. And neck and neck with non-memory-dependent AVX code. Remarkable. Though it's a bummer that normal x64 code is so much slower.

Next I'll test how much of a difference this has on Strobe2. It all depends on the complexity of the code.

Postby mrkodak » Sun Feb 21, 2016 9:55 pm

So. With the i5 I'm getting a Synthmark score of 3.7.
On the AMD Vishera, 1.9.
This would suggest the i5 can do about twice as much per core. It has 4 threads and the AMD has 8. A tie in performance?
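
Spelled out as a rough estimate, the "tie" looks like this. The sketch assumes the Synthmark score is a per-thread figure that scales linearly with thread count, which nothing in the thread confirms, and `aggregate` is a hypothetical helper.

```cpp
#include <cassert>

// Rough aggregate estimate: per-thread score times hardware thread count.
// Linear scaling is an assumption, not something Synthmark is known to promise.
constexpr double aggregate(double synthmark, int threads) {
    return synthmark * threads;
}

constexpr double i5_total  = aggregate(3.7, 4);  // about 14.8
constexpr double amd_total = aggregate(1.9, 8);  // about 15.2
```

Under that assumption the two machines land within about 3% of each other.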

Well.. I can run 4 instances of a fairly complex pad with modulations, filtering and overdrive on the i5.
And 16 on the AMD. Same settings..

My next question: is there a way to disable AVX on processors that support it? Maybe this i5 isn't at its best with it.

Postby Angus_FX » Tue Feb 23, 2016 2:42 pm

Is your i5 a true quadcore, or a 2C/4T? Judging from those numbers I suspect the latter.

