Running Slow Far Less Than Full Load

Started by Dave9, August 30, 2021, 08:08:57 PM

Dave9

I can't get Avidemux 2.7.8 (current release version) to run at full load. No other performance issues apparent with other apps on same system.  I did a forum search but didn't find a solution unless I used the wrong keywords.

With a simple task such as converting a VP9 video to Nvidia H.264 (which my GTX 750 Ti supports), I can see CPU and GPU clock speeds rise to max, but neither the GPU (Video Engine Load ~28%, all other loads 5% or less) nor the quad-core CPU (~30% load) comes close to 100%. No individual CPU core is maxed out either; the load is spread fairly evenly between them.

There is no disk bottleneck at this speed: I am reading from and writing to SSDs (not the OS SSD) with no other I/O, and the combined read and write throughput is under 2 MB/s. I have plenty of free memory too.

I have tried "High Priority" from the encoding window (confirmed it was set in Task Manager too) and it made almost no difference. The system is doing nothing else concurrently that would use more than single-digit CPU or GPU time; it is just idling along.

The really strange part is that I can run a second instance of Avidemux and encode a second video simultaneously, even add a resize filter, and it barely slows down the first instance. Video engine load might go up a little to around 33%, and CPU load does double to around 60% (I assume this is due to the resize filter), but how can a single decode/encode job be barely faster than running two of them simultaneously? I mean that running two simultaneously gets nearly twice as much video processed, which I don't understand. It has never been like this with any other video app I've worked with; those are bound by some observable bottleneck.


szlldm

I think the decoder is the bottleneck. VP9 hw decoding is not possible with your GPU, so it is done by the sw decoder. Is multi-threaded decoding enabled? (Multi-threading only works if hw accelerated decoding is disabled.)

eumagga0x2a

I'd only add that for VP9, only slice-based multi-threading is possible, i.e. the way the encoder has generated the VP9 stream determines whether multi-threaded decoding brings any benefit.

If in doubt, please use the latest nightly.

Dave9

I half understand what's been stated, but if it is single threaded, why isn't that putting any CPU cores nearer 100% if it's the bottleneck?

If I disable hardware decoding to get multi-threaded CPU decoding, does that still leave HW encoding working?

I just disabled HW decoding. Doing the same type of conversion from VP9 to Nvidia H.264, I now get about 60% CPU utilization, still with no cores maxed out, and about 36% Video Engine Load. It doesn't seem much faster, if at all, and definitely not the 200%+ improvement I was hoping for from removing whatever the bottleneck is.

Is there any way to know for sure whether everything in the nightlies is working? The last one I got, 2.7.9.21195 (aka 2.7.9 210714 in the about screen), couldn't do Nvidia H.264 encoding at all; it just stopped with an error message about "Cannot Set Up Encoder", "Failed", so I went back to 2.7.8.

[time lapse] Okay so I just tried what I thought was the latest nightly, Avidemux_2.7.9 VC++ 64bits 210704.exe, (app ver. 2.7.9.21185) and it too shows the "Cannot Set Up Encoder" message trying to use Nvidia H.264 encoding.  Exact same operation works fine on the 2.7.8 release version, except by fine I mean it completes the job, just does it slowly as mentioned in my 1st post.

Dave9

Sorry if this is a stupid question but do emails & alerts not work for this forum?  I turned them on but didn't receive any for the two replies made, checked to see if they were blocked but they weren't.

eumagga0x2a

The latest nightly is a MinGW build from Aug. 23: https://avidemux.org/nightly/win64/
Are NVENC-based encoders broken in that one too?

The number of threads that can actually be used to decode VP9 depends on how the video was encoded. The multi-threading control in Avidemux Preferences affects only the unaccelerated libavcodec-based decoder, nothing else.

Dave9

Yes, the Aug 23 build linked still has the NVENC encoders broken. Simply loading a file and trying to save using it (Video Output, Nvidia H264) immediately generates the popup window message "Video Cannot set up encoder. The configuration supplied to the encoder may be incompatible or the encoder may depend on features unavailable on this system".

I understand what you're stating about how the VP9 decoding could be using only one thread, but then if this is the bottleneck, why isn't this causing more than ~30% load on any CPU core handling that thread?  This is unlike any other encoder load I've seen, where nothing appears maxed out.

eumagga0x2a

Regarding NVENC-based encoders, your NVIDIA driver might be too old; it must be at least 456.71. The Nvidia H264 encoder works for me on Windows 7 with some 477.xx driver.

To benchmark VP9 decoding speed properly, please choose the "null" video encoder and the "Dummy" muxer with various VP9 videos from different sources (as the encoder of these videos determines whether and how much a decoder can benefit from multi-threading).

Dave9

That may be it, my driver ver is 451.48.  What is the significance, why did older drivers stop working?

It will take a while for me to do proper testing of various VP9 videos.

eumagga0x2a

Quote from: Dave9 on September 01, 2021, 01:51:12 AM
What is the significance, why did older drivers stop working?

It has been built against a more current version of the NVENC API.
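
The driver requirement quoted above comes down to a plain version comparison. A minimal sketch in Python, assuming driver versions are simple dotted numbers; the helper names are made up for illustration, and "477.00" is only a stand-in for the "477.xx" driver mentioned earlier:

```python
# Minimum NVIDIA driver version quoted in this thread for the 2.7.9 nightlies.
MIN_NVENC_DRIVER = "456.71"


def parse_version(text):
    """Turn a dotted version string such as '456.71' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def driver_is_new_enough(installed, minimum=MIN_NVENC_DRIVER):
    """Return True if the installed driver meets the quoted minimum."""
    return parse_version(installed) >= parse_version(minimum)


print(driver_is_new_enough("451.48"))  # Dave9's driver: False
print(driver_is_new_enough("477.00"))  # placeholder for "477.xx": True
```

Comparing tuples of integers avoids the pitfalls of comparing version strings character by character.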

knosso2919

Hi,
I am in the same situation as the OP. I almost missed this thread because whatever query I made gave either no results or irrelevant ones.

I'm trying to convert a 4k H264 file into a bunch of H265 files at various resolutions.
I have a 6 core Phenom 1090t that is being used between 30% and 50% and a GTX 1050ti that is barely being used at 10%. The SSD is barely being used as well.

I don't think the OP's video card is the bottleneck, as I have the latest video drivers and the same problem. Besides, I knew that the hardware capabilities of the GPU affect the number of concurrent streams, max encoding resolution, quality, and file size, but I thought that any program compatible with NVENC was compatible with all NVENC-capable GPUs.

Also, since I had a bunch of files in the queue, I tried to start a second one while the first was being rendered to work around the strange situation, but there does not seem to be a straightforward way to do so (the OP did it with 2 instances but with no queue).

I'm running Avidemux 2.7.8. Paradoxically, the rendering framerate is the same as with HandBrake, but HandBrake brings my CPU to its knees and handles full-range 8-bit way worse than Avidemux.

Dave, are you handling full color range files as I do, or partial range ones?

Do you have any ideas of what we can try to fix this problem? If a single encode cannot be parallelized further, is it possible to at least reach Nvidia's artificial limit of 3 concurrent encodes for consumer GPUs?

Thanks in advance for your help.

eumagga0x2a

Do not run multiple instances of Avidemux concurrently. It may work, but this is entirely unsupported.

Quote from: knosso2919 on September 07, 2021, 06:01:26 PM
I'm running Avidemux 2.7.8.

You might want to update to the latest nightly just for the sheer number of fixes and enhancements accumulated since the last release.

Quote from: knosso2919 on September 07, 2021, 06:01:26 PM
Paradoxically, the rendering framerate is the same as with HandBrake, but HandBrake brings my CPU to its knees and handles full-range 8-bit way worse than Avidemux.

I think NVENC-based encoders in Avidemux don't handle full range at all.

Quote from: knosso2919 on September 07, 2021, 06:01:26 PM
I don't think the OP's video card is the bottleneck, as I have the latest video drivers and the same problem.

Sufficiently recent drivers are a prerequisite for encoding in hardware via the NVENC interface. It is not a question of performance: either it works or it doesn't (you get an error message trying to initialize the encoder).

Quote from: knosso2919 on September 07, 2021, 06:01:26 PM
Also, since I had a bunch of files in the queue

Do you use the command-line version to process the job queue or the Qt one? The CLI version will decode video on the CPU; the Qt one can decode video via DXVA2 on the graphics card. If DXVA2 hw accelerated decoding is enabled, multi-threaded decoding has to be disabled. You should test which is more beneficial for performance in your use case: single-threaded decoding on the GPU or multi-threaded decoding on the CPU. By the way, with a 4k source, memory transfer can be a significant limiting factor.

Quote from: knosso2919 on September 07, 2021, 06:01:26 PM
HandBrake brings my CPU to its knees

If the CPU load with Avidemux is rather low, it may be due to the H.264 source being decoded in hardware.


knosso2919

Thanks eumagga0x2a for your reply and clarification.

I will try the nightly as soon as possible.
I'm sorry to hear that running multiple instances of the program is not supported and may break in the future or in some use cases, because the software really shines in speed and quality.

In the meantime I experimented a bit with the version I currently have installed. The problem of not using all the cores was solved by forcing the number of cores in the options instead of leaving it on auto, and the encoding speed went up from 22 fps to about 30 fps.

By running multiple instances of the program but leaving the core count on auto, I can squeeze a total of 40 fps out of 3 concurrent encodes.

Turning on full hardware acceleration makes the program shine. The fps is then indeed limited by what a single core can do, which in my case is 50 fps. GPU utilisation jumps to around 50-60% with a 4k file.

2 concurrent instances are capable of doing 35 fps each  thus reaching 60fps and the GPU stays at a comfortable 80% (still at 4k).

3 concurrent instances do 25fps each so 75fps total and the GPU jumps around 90-95%.

Since these are all tests with 4k H264 sources converted to 4k H265 outputs, it seems reasonable that I would almost always benefit from using 3 instances: I often resize videos to lower resolutions, so I would rarely be bottlenecked by the encoder.

As for the color range, it seems strange to me that it is not supported, because I often use OBS with NVENC and it does support it. In any case, even if Avidemux does not support it, it must do an exceptional conversion, because brightness, contrast, vibrancy, and overall sharpness are not affected at all. How can I verify whether the output video is full or partial range? Unfortunately I don't see it in the metadata.

Also, sorry, I forgot to tell you my software configuration when I described the hardware. I am running Windows 10 21H1 and not using the CLI (though I might in the future to speed things up, as long as I find a way to keep using multiple instances and full hardware encoding).


Do you think that if I start applying some denoise filters, I might benefit from multi-threading again? Are there any denoise filters that run on the GPU? Currently I do the scaling with what appears to be a software scaler (spline); is there a way to resize using the GPU?

Thanks and I wish all a good day.

eumagga0x2a

Quote from: knosso2919 on September 08, 2021, 02:29:03 PM
The problem of not using all the cores was solved by forcing the number of cores in the options instead of leaving it on auto, and the encoding speed went up from 22 fps to about 30 fps.

If you mean the multi-threading settings in Avidemux Preferences, these settings affect the decoding part only and have no effect whatsoever as long as hw accelerated decoding (DXVA2 on Windows) remains enabled. If video decoding via DXVA2 has already been disabled, the maximum number of libavcodec threads Avidemux requests from the library is capped at 8, while the auto setting means it is equal to the number of CPU cores or 8, whichever is smaller.
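
The rule described above can be sketched in a few lines of Python. This only restates the behaviour as explained in this post; it is not Avidemux code, and the function and parameter names are made up for illustration:

```python
MAX_DECODER_THREADS = 8  # cap mentioned in this post


def requested_decoder_threads(setting, cpu_cores, hw_decoding=False):
    """Return the libavcodec thread count as described in this post.

    `setting` is either the string "auto" or a number chosen by the user.
    Returns None when hw accelerated decoding is enabled, because the
    multi-threading setting then has no effect at all.
    """
    if hw_decoding:
        return None
    if setting == "auto":
        return min(cpu_cores, MAX_DECODER_THREADS)
    return max(1, min(int(setting), MAX_DECODER_THREADS))


print(requested_decoder_threads("auto", cpu_cores=6))   # 6
print(requested_decoder_threads("auto", cpu_cores=16))  # 8
print(requested_decoder_threads(12, cpu_cores=16))      # 8
print(requested_decoder_threads("auto", cpu_cores=6, hw_decoding=True))  # None
```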

Quote from: knosso2919 on September 08, 2021, 02:29:03 PM
As for the color range, it seems strange to me that it is not supported, because I often use OBS with NVENC and it does support it. In any case, even if Avidemux does not support it, it must do an exceptional conversion, because brightness, contrast, vibrancy, and overall sharpness are not affected at all.

No support for full range means streams which use the full range are not marked as such in the VUI (video usability information) part of the codec extradata. Decoders usually assume limited range when full range is not explicitly specified, resulting in highlights and shadows clipped by the contrast enhancement necessary to stretch limited color range to full at the display stage.
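
To illustrate the clipping, here is a minimal Python sketch of the stretch a player applies to 8-bit luma it believes to be limited range (16 to 235). It is a simplified model of the display stage, not code from any particular player, and it ignores chroma:

```python
def limited_to_full_luma(y):
    """Stretch an 8-bit luma value from limited range (16..235) to 0..255.

    Values outside 16..235 end up outside 0..255 and are clipped.
    """
    stretched = round((y - 16) * 255 / 219)
    return max(0, min(255, stretched))


# Correctly flagged limited-range content maps cleanly:
print(limited_to_full_luma(16), limited_to_full_luma(235))   # 0 255

# Full-range content wrongly treated as limited range loses detail:
# distinct shadow values collapse to 0, distinct highlights to 255.
print(limited_to_full_luma(0), limited_to_full_luma(10))     # 0 0
print(limited_to_full_luma(240), limited_to_full_luma(255))  # 255 255
```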

Quote from: knosso2919 on September 08, 2021, 02:29:03 PM
How can I verify whether the output video is full or partial range?

Strictly speaking, this would require decoding all pictures of the video and checking the highest and lowest luma and chroma values. But usually players just apply whatever color information the container level (when available) and the codec extradata advertise.
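
The strict check could look roughly like this in Python, assuming the pictures have already been decoded into flat sequences of 8-bit luma samples (the decoding itself is left out, and the names are hypothetical). It is only a heuristic: limited-range video can contain occasional overshoots, and full-range video does not have to touch the extremes:

```python
def luma_extent(frames):
    """Return (lowest, highest) luma value seen across all frames."""
    lowest, highest = 255, 0
    for frame in frames:
        for y in frame:
            lowest = min(lowest, y)
            highest = max(highest, y)
    return lowest, highest


def looks_full_range(frames):
    """Guess full range if any 8-bit luma sample lies outside 16..235."""
    lowest, highest = luma_extent(frames)
    return lowest < 16 or highest > 235


# Two tiny made-up "frames" of luma samples for demonstration.
limited = [[16, 80, 235], [20, 128, 200]]
full = [[0, 80, 255], [3, 128, 250]]
print(looks_full_range(limited))  # False
print(looks_full_range(full))     # True
```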

Quote from: knosso2919 on September 08, 2021, 02:29:03 PM
Do you think that if I start applying some denoise filters, I might benefit from multi-threading again?

Currently, no denoise filters are multi-threaded, but video filters generally run in a separate thread.

Quote from: knosso2919 on September 08, 2021, 02:29:03 PM
Are there any denoise filters that run on the GPU?

No, there aren't any.

Quote from: knosso2919 on September 08, 2021, 02:29:03 PM
Currently I do the scaling with what appears to be a software scaler (spline); is there a way to resize using the GPU?

On Linux only, using either VA-API or VDPAU.

knosso2919

Thanks for your kind reply.
It gave me more insight into the inner workings of the program, so I can understand its strengths and limitations better.

Is the VA-API/VDPAU support on Linux enabled by default on all architectures (ARMHF, ARM64, x64), or does it need some compiling? Does the quality of the GPU matter? For example, I've tried the hardware encoder of the Raspberry Pi 4 and it's even worse than the one in my old smartphone, while the GPU of my desktop does a pretty fine job. Does the same apply to scaling as well?


This topic talks about creating a script to batch convert files in folders (PS: tell me if I'm going off-topic and I shall move there): https://avidemux.org/smif/index.php/topic,19507.0.html

Can I adapt the script to the following use case? (I mean, of course I can, but given my almost non-existent knowledge of Python, how hard is it?)

I have a folder called "temp" with a bunch of subfolders "film1", "film2", "filmwhatever", and so on. Each subfolder has any number of mkv files from 1 to about a dozen. The name of the files inside is sequential but does not match the containing folder.

I would like to re-encode every file in the subfolders, writing the output to a different path and naming it after the containing folder.

Something like this:
basepath/temp/HUNGER_GAMES/crappyname-01.mkv
basepath/re-encoded/HUNGER_GAMES/HUNGER_GAMES_01.m4v


Thanks again for your time. Tell me if this is inappropriate here and I'll delete the comment and move it to the other post.
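
The renaming part of this request can be sketched in plain Python without touching Avidemux at all. The following is a minimal sketch, assuming the layout described above (`temp` and `re-encoded` under a common base path, `.mkv` inputs, `.m4v` outputs); the function name is made up, and the actual re-encoding step is deliberately left out because the Avidemux scripting calls are not shown in this thread:

```python
from pathlib import Path


def plan_outputs(basepath, src_dir="temp", dst_dir="re-encoded", ext=".m4v"):
    """Map every .mkv in basepath/src_dir/<folder>/ to an output path
    basepath/dst_dir/<folder>/<folder>_NN<ext>.

    Files are numbered by their sorted order within the folder, not by
    any number that happens to be in the original file name.
    """
    base = Path(basepath)
    plan = []
    folders = sorted(p for p in (base / src_dir).iterdir() if p.is_dir())
    for folder in folders:
        sources = sorted(folder.glob("*.mkv"))
        for index, src in enumerate(sources, start=1):
            name = f"{folder.name}_{index:02d}{ext}"
            plan.append((src, base / dst_dir / folder.name / name))
    return plan
```

Calling `plan_outputs("basepath")` would then yield pairs such as `basepath/temp/HUNGER_GAMES/crappyname-01.mkv` and `basepath/re-encoded/HUNGER_GAMES/HUNGER_GAMES_01.m4v`. Each pair can be handed to whatever encoding step is used, after creating the output folder with `dst.parent.mkdir(parents=True, exist_ok=True)`.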