Hello,
I have 5 years of experience in Fortran software development and 1 year in CUDA.
From what I have seen of your code, I am qualified to do this work.
However, I'm not sure how much time you would save by rewriting this function in CUDA. The time saved on vector operations, such as additions and assignments, is very small compared with the time spent in fop%op (if it is not parallelized). So my advice is to consider migrating fop%op to CUDA as well.
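To put rough numbers on this, here is a small estimate based on Amdahl's law. The figures are hypothetical and not measured on your code: I assume the vector operations take 10% of the runtime and fop%op takes the other 90%. Here p is the fraction of runtime that is accelerated and s is the speedup of that part.

```latex
% Amdahl's law: overall speedup S when a fraction p of the runtime is sped up by a factor s
S = \frac{1}{(1 - p) + \dfrac{p}{s}}

% Assumed case 1: only the vector operations are moved to the GPU (p = 0.1, s \to \infty)
S \to \frac{1}{1 - 0.1} \approx 1.11

% Assumed case 2: fop%op is moved to the GPU and runs 10 times faster (p = 0.9, s = 10)
S = \frac{1}{0.1 + 0.09} \approx 5.26
```

Under these assumptions, even an infinitely fast version of the vector operations would give only about an 11% gain, while accelerating fop%op would give a much larger one. Profiling your code would tell us the real fractions.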
Another point is that there are two different ways to migrate this function to CUDA. The first, which is the easiest and the one I recommend, is to write these subroutines in C and launch the CUDA kernels from there. The second is to use CUDA Fortran with the PGI compiler. Installing the PGI compiler will take time, effort, and troubleshooting, and you will need to request a license.
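To give an idea of the first approach, here is a minimal sketch of a C wrapper that launches a CUDA kernel. The names axpy_kernel and axpy_cuda, and the operation itself (y = y + a*x), are hypothetical examples of a vector operation and are not taken from your code.

```cuda
#include <cuda_runtime.h>
#include <stddef.h>

// Hypothetical element-wise kernel: y[i] = y[i] + a * x[i]
__global__ void axpy_kernel(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] += a * x[i];
    }
}

// C entry point (hypothetical name). extern "C" disables C++ name mangling,
// so the symbol can be bound from Fortran. Returns 0 on success, 1 on error.
extern "C" int axpy_cuda(int n, double a, const double *x, double *y)
{
    if (n <= 0) return 0;

    size_t bytes = (size_t)n * sizeof(double);
    double *d_x = NULL;
    double *d_y = NULL;

    if (cudaMalloc(&d_x, bytes) != cudaSuccess) return 1;
    if (cudaMalloc(&d_y, bytes) != cudaSuccess) { cudaFree(d_x); return 1; }

    // Copy the host arrays to the device.
    cudaError_t err = cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
        axpy_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
        err = cudaGetLastError();                  // catches launch configuration errors
    }

    // Copy the result back; this call also waits for the kernel to finish.
    if (err == cudaSuccess)
        err = cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_x);
    cudaFree(d_y);
    return err == cudaSuccess ? 0 : 1;
}
```

On the Fortran side, this function would be declared in an interface block with bind(C, name="axpy_cuda") from iso_c_binding, passing n and a by value and the arrays as real(c_double). Note that this sketch copies the arrays to and from the GPU on every call, which can easily cost more than the vector operation itself. In a real migration the data should stay on the device between calls.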
If you are interested in my services, write to me so we can discuss the best solution for your needs.
Best regards,
Ricardo