[Mesa-users] Experience with Docker Container and r 11554

Ian Foley ifoley2008 at gmail.com
Sun Apr 14 18:33:50 EDT 2019


Hi,

As an example of what I described in my previous message (quoted below), I have
just done a 2M evolution from pre-MS to WD in about 6500 steps in the Docker
container with 7 GB of memory. Without that code, it would have needed 65
garbage collects at 100-step intervals in order not to crash. With the code
only 3 garbage collects were needed, and the memory came down to only 200 MB.
At one point the run required a total allocation of about 1.1 GB in the
previous 100 steps and then, after a garbage collect, a total of 6.1 GB in the
next 100 steps. That means that, to be totally safe, I need a minimum of 1.2 GB
of free memory to trigger garbage collection, not 1 GB.

Kind regards
ian


On Mon, 15 Apr 2019 at 07:41, Ian Foley <ifoley2008 at gmail.com> wrote:

> Hi,
>
> I thought this might be valuable for users running MESA r11554 on
> limited-memory computers. My computer has 8 GB RAM and I am also using the
> Docker container on Windows 10 Professional. This release of MESA adds
> additional EOS data files; these can push available memory to the limit and
> can be turned off if necessary, but I wanted to use them if possible.
>
> This release of MESA adds the inlist control "num_steps_for_garbage_collection",
> which defaults to 1000 and is useful for removing EOS data that is no longer
> needed and is taking up too much memory. The problem is that the added EOS
> data files can total close to 1 GB within 100 steps. If we set garbage
> collection to occur every 100 steps, it causes a significant performance hit,
> since re-allocating the large EOS data files takes time. It is also true that
> much of the evolution does not require a big jump in EOS data.
>
> A better approach is to track free memory and trigger garbage collection only
> when it drops below 1 GB. The code below, which can be added to
> run_star_extras, achieves this, and I have found it very useful.
>
>          ! For version 10398 and onwards.
>          ! For version 11554 onwards there is an inlist control to do garbage
>          ! collection, defaulting to every 1000 models. However, this is a
>          ! fixed setting and will often lead to extra cost when doing the
>          ! garbage collection. So now we track memory and only do garbage
>          ! collection when free memory drops below a certain minimum. This
>          ! frees all the eos memory then re-initializes it; when MESA takes
>          ! the next step it reloads only the eos data it needs then.
>          ! Garbage collection code courtesy of Rob Farmer, 18 April 2018.
>          !
>          ! eos_shutdown and eos_init come from eos_lib, and getpid() is a
>          ! compiler extension (provided by gfortran). The enclosing routine
>          ! needs local declarations along the lines of:
>          !   character (len=80) :: string1, string2, string3
>          !   integer :: int1, int2, int3, int4, int5, int6, int7, int8, int9, free
>
>          if (mod(s% model_number,100) == 0) then
>             write(*,*) 'Process id ', getpid()
>             write(*,*) 'Output from Linux free command'
>             call execute_command_line('free >memory.txt')
>             call execute_command_line('free')
>             open(100, file='memory.txt', status='old', iostat=ierr)
>             read(100,*) string3                                                  ! header line
>             read(100,'(a8,6i12)') string1, int1, int2, int3, int4, int5, int6    ! Mem: row
>             read(100,'(a7,3i12)') string2, int7, int8, int9                      ! Swap: row
>             close(100)
>             free = int3 + int9    ! free RAM + free swap, in kB
>             write(*,*) 'Model ', s% model_number
>             write(*,*) 'Total free memory = ', free
>             if (free < 1000000) then
>                write(*,*) 'Do garbage collection'
>                call eos_shutdown()
>                call eos_init(s% job% eos_file_prefix, s% job% eosDT_cache_dir, &
>                     s% job% eosPT_cache_dir, &
>                     s% job% eosDE_cache_dir, .true., ierr)
>                call execute_command_line('free')
>             endif
>          endif
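>
> (Note that free reports its numbers in kB, so the test free < 1000000
> corresponds to roughly 1 GB of free RAM plus swap; for a larger safety margin
> of, say, 1.2 GB, the threshold simply becomes free < 1200000.)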
>
> Since there is no Fortran command (or other language command) to find free
> memory, what we do above is execute the Linux free command every 100 steps,
> redirecting its output to a text file. We then open the text file and parse
> its contents to retrieve the amount of free memory. If it is < 1 GB we
> trigger garbage collection (using code originally from Rob Farmer - thanks).
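>
> For reference, the free output being parsed looks like the samples further
> down this thread, e.g.:
>
>               total        used        free      shared  buff/cache   available
> Mem:        3056888     2885884       84456           0       86548       30556
> Swap:       4194300     1475904     2718396
>
> The first read skips the header line, the second picks up int1-int6 from the
> "Mem:" row and the third picks up int7-int9 from the "Swap:" row, so
> free = int3 + int9 is the free RAM plus the free swap.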
>
> I hope this will be useful to some of you.
>
> Kind regards
> Ian
>
>
>
>
> On Tue, 19 Mar 2019 at 09:21, Ian Foley <ifoley2008 at gmail.com> wrote:
>
>> Thanks for your advice.
>>
>> Ian
>>
>> On Tue, 19 Mar 2019 at 8:01 am, Evan Bauer <ebauer at physics.ucsb.edu>
>> wrote:
>>
>>> Hi Ian,
>>>
>>> Increasing the frequency of garbage collection sounds like a good idea
>>> to me, especially if your star is evolving through new EOS regions quickly.
>>> There really isn’t much downside to this other than a small speed hit.
>>>
>>> If you’re very memory constrained and want to go back to the old way of
>>> doing things, you also have the option of turning off the new EOS tables
>>> with
>>> use_eosDT2 = .false.
>>> use_eosELM = .false.
>>>
>>> Cheers,
>>> Evan
>>>
>>>
>>> On Mar 18, 2019, at 10:46 AM, Ian Foley <ifoley2008 at gmail.com> wrote:
>>>
>>> Thanks Rob for the detailed explanation. I will follow your suggestion
>>> and check for memory leaks. btw I'm using Windows 10 Professional.
>>>
>>> I may also have to increase the frequency of garbage collection to avoid
>>> a crash. 2 GB is a lot of extra memory to need over 400 models when we are a
>>> long way into the evolution, and I have a limit of 8 GB of real memory.
>>>
>>> Kind regards
>>> ian
>>>
>>>
>>> On Mon, 18 Mar 2019 at 20:49, Rob Farmer <r.j.farmer at uva.nl> wrote:
>>>
>>>> Hi,
>>>> > Num EOS files loaded       13000           7           0          17          12          17
>>>> > Num EOS files loaded       13001           0           0          10           4          17
>>>>
>>>> The ordering of the numbers is given by line 410 of
>>>> star/job/run_star_support.f90:
>>>>
>>>> write(*,*) "Num EOS files loaded", s%model_number, num_DT, num_PT, &
>>>>                               num_DT2, num_PTEH, num_ELM
>>>>
>>>> So it's telling you how many of each type of EOS file is currently loaded
>>>> into memory. Then, by comparing the counts before and after the garbage
>>>> collection, we can see whether we removed any EOS files.
>>>>
>>>> So in your case we removed 7 eosDT files, 7 eosDT2 files, 8 eosPTEH files
>>>> and no eosELM or eosPT files. This is only meant as a diagnostic, but it
>>>> does show that in this case you removed ~40% of the loaded EOS files, which
>>>> should be a good memory saving.
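>>>>
>>>> Laid out against that write statement, the two diagnostic lines read:
>>>>
>>>>              model    DT    PT   DT2   PTEH   ELM   total
>>>>    before    13000     7     0    17     12    17      53
>>>>    after     13001     0     0    10      4    17      31
>>>>    removed              7     0     7      8     0      22
>>>>
>>>> i.e. 22 of the 53 loaded files (~40%) were removed.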
>>>>
>>>> >What amazed me was that between model 12460 and 12810 MESA has needed
>>>> nearly 2 MB of memory! which it has had to grab from the swap space leaving
>>>> less than 1MB available. That seems a huge amount over a short evolution
>>>> period. (1475904 to 3373664)
>>>>
>>>> I assume you meant GB here? What is likely happening is that your model is
>>>> entering a new region of parameter space, so we need to load in more EOS
>>>> data files.
>>>>
>>>> But to check that it's not a memory leak, run the model once up to some
>>>> model number and record the approximate memory used at the end. Then do a
>>>> restart from, say, 1000 steps before the end and record its memory usage at
>>>> the end. If they are about the same, then that is just normal MESA memory
>>>> usage for this problem. If the first run uses a lot more memory, then we
>>>> have leaked memory somewhere.
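>>>>
>>>> As a rough sketch of that check (assuming the standard ./rn and ./re
>>>> scripts and that a photo was saved ~1000 steps before the end; the photo
>>>> name below is just a placeholder):
>>>>
>>>>    ./rn > full_run.log         # full run to the end
>>>>    free                        # record memory once it finishes
>>>>    ./re x12000 > restart.log   # restart ~1000 steps before the end
>>>>    free                        # compare with the first measurement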
>>>>
>>>> Also, are you using the Windows Home (or Pro?) Docker container? If Home,
>>>> you can configure the memory it uses: if you look in the
>>>> win_home_dockerMESA.sh file at the docker-machine create line, you can set
>>>> the memory it gets with --virtualbox-memory=2048 (in MB). You may need to
>>>> delete the old virtual machine first with the utils/uninstall_win_home.sh
>>>> script if you change the memory value.
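>>>>
>>>> For example (the machine name and memory value here are just placeholders),
>>>> that docker-machine create line would look something like:
>>>>
>>>>    docker-machine create -d virtualbox --virtualbox-memory=6144 <machine-name>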
>>>>
>>>> Rob
>>>>
>>>>
>>>> On Mon, 18 Mar 2019 at 04:31, Ian Foley via Mesa-users <
>>>> mesa-users at lists.mesastar.org> wrote:
>>>>
>>>>> Hi Evan,
>>>>>
>>>>> Thanks for setting up r11554 in the MESA-Docker container. I have
>>>>> deleted older versions as you suggested. Everything seems to be working
>>>>> well, except that a 1M model evolution crashed, in a way that looks like
>>>>> running out of memory, near model 13000. I've attached files I think are
>>>>> sufficient for you to reproduce the effect.
>>>>>
>>>>> Memory is cleaned up at models 12,000 and 13,000 because of the
>>>>> following settings. I might have been able to prevent the crash by
>>>>> decreasing num_steps_for_garbage_collection.
>>>>>       num_steps_for_garbage_collection = 1000
>>>>>       report_garbage_collection = .true.
>>>>> After the crash, I restarted the run at model 12,000 and since I
>>>>> modified "re" and "rn" to run star in the background, I can monitor the
>>>>> memory with "free". I entered the model number in the terminal so I can
>>>>> record when I executed "free".
>>>>>
>>>>> What amazed me was that between model 12460 and 12810 MESA has needed
>>>>> nearly 2 MB of memory! which it has had to grab from the swap space leaving
>>>>> less than 1MB available. That seems a huge amount over a short evolution
>>>>> period. (1475904 to 3373664)
>>>>>
>>>>> This is the report_garbage_collection output at model 13000. I haven't
>>>>> yet gone to the source code to find out what the numbers mean.
>>>>>
>>>>>  Num EOS files loaded       13000           7           0          17          12          17
>>>>>  Num EOS files loaded       13001           0           0          10           4          17
>>>>>
>>>>> Terminal output for run from model 12000 to 13010.
>>>>>
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ 12450
>>>>> -bash: 450: command not found
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ free
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:        3056888     2885884       84456           0       86548       30556
>>>>> Swap:       4194300     1475904     2718396
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ 12460
>>>>> -bash: 12460: command not found
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ 12810
>>>>> -bash: 12810: command not found
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ free
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:        3056888     2895980       76212           0       84696       21444
>>>>> Swap:       4194300     3373664      820636
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ 12900
>>>>> -bash: 12900: command not found
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ free
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:        3056888     2893880       69184           0       93824       18968
>>>>> Swap:       4194300     3348584      845716
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ 12990
>>>>> -bash: 12990: command not found
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ free
>>>>>               total        used        free      shared  buff/cache   available
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ free
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:        3056888     2883048       79752           0       94088       29472
>>>>> Swap:       4194300     3935380      258920
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ 13010
>>>>> -bash: 13010: command not found
>>>>> docker@a9e770e1dc66:~/docker_work/1M$ free
>>>>>               total        used        free      shared  buff/cache   available
>>>>> Mem:        3056888     2905660       75560           0       75668       16104
>>>>> Swap:       4194300     2024256     2170044
>>>>>
>>>>> The use of such a large chunk of memory over such a small number of models
>>>>> is what is concerning me. Should I expect this with r11554, or is there
>>>>> some bug?
>>>>>
>>>>> The attached file re2.txt is the redirected terminal output.
>>>>> The photo is for model 12,000, used for the restart, on my Windows 10
>>>>> Professional software environment.
>>>>> I hope that is all you need.
>>>>>
>>>>> kind regards
>>>>> Ian
>>>>>
>>>>>
>>>>> On Sun, 17 Mar 2019 at 06:34, Evan Bauer <ebauer at physics.ucsb.edu>
>>>>> wrote:
>>>>>
>>>>>> Hi Ian,
>>>>>>
>>>>>> 11554 should be ready to go if you just “git pull” in the MESA-docker
>>>>>> repository to update. Let me know if that isn’t working for you. I
>>>>>> definitely recommend the upgrade.
>>>>>>
>>>>>> While you’re at it, I’ll also remind you that it’s probably a good
>>>>>> idea to clean up your older docker images to save hard drive space. You can
>>>>>> remove the image of 11532 with this command:
>>>>>> docker rmi evbauer/mesa_lean:11532.01
>>>>>>
>>>>>> You can also check what other older images might be sitting around
>>>>>> (and how much space they’re using) with this command:
>>>>>> docker images
>>>>>>
>>>>>> If you’re not regularly using the older MESA versions in those
>>>>>> images, you should probably get rid of them too with the “docker rmi”
>>>>>> command.
>>>>>>
>>>>>> Cheers,
>>>>>> Evan
>>>>>>
>>>>>>

