[Mesa-users] Problem with MESA version 10398.

Ian Foley ifoley2008 at gmail.com
Tue Apr 17 17:53:26 EDT 2018


Hi Rob,

I have tested this proposal and it works really well! Thank you! In fact,
for the 11M star run I sent in a previous email, memory use is not much
more than 2 GB with the mod you suggest, and the 100-model interval works
well too. This reveals that the memory cost of the eos data can approach
3 GB overall. What appeared to be a major obstacle to running MESA under
Docker for Windows now becomes much more feasible on small computers with
around 8 GB of RAM. I have configured Docker for Windows with 3 GB of
memory and 4 GB of swap space; for larger nets I might need 4 GB of memory.
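
For anyone following along, here is a minimal sketch of how this might look
inside extras_check_model, adapted from the snippet you posted below (the
exact extras_check_model interface, the s% job% cache-directory names and
the keep_going/terminate constants are assumed to match the standard
run_star_extras.f for this version, so check them against your own file):

   integer function extras_check_model(id, id_extra)
      use eos_lib, only: eos_shutdown, eos_init
      integer, intent(in) :: id, id_extra
      integer :: ierr
      type (star_info), pointer :: s
      extras_check_model = keep_going
      ierr = 0
      call star_ptr(id, s, ierr)
      if (ierr /= 0) return
      ! every 100 models: free all the eos table memory, then re-initialize,
      ! so that the next step reloads only the eos data it actually needs
      if (mod(s% model_number, 100) == 0) then
         call eos_shutdown()
         call eos_init(s% job% eos_file_prefix, s% job% eosDT_cache_dir, &
            s% job% eosPT_cache_dir, s% job% eosDE_cache_dir, .true., ierr)
         if (ierr /= 0) extras_check_model = terminate
      end if
   end function extras_check_model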

This outcome is better than I imagined, and I have learned much from the
experience. I suggest it might help users of Docker for Windows to have
these things explained so they don't have to jump through all the hoops I
have gone through. Out-of-memory crashes are particularly difficult to
resolve.

Kind regards and thank you again,
Ian

On 18 April 2018 at 02:08, Rob Farmer <r.j.farmer at uva.nl> wrote:

> So I came up with something that might help; add this to your
> extras_check_model:
>
> use eos_lib
>
>
> if(mod(s%model_number,100)==0) then
>    call eos_shutdown()
>    call eos_init(s% job% eos_file_prefix,s% job% eosDT_cache_dir,&
>                s% job% eosPT_cache_dir, &
>                s% job% eosDE_cache_dir, .true.,ierr)
> end if
>
> Basically, every 100 steps it will free all the eos memory and then
> re-initialize it; when MESA takes the next step it will reload only the
> eos data it needs.
>
> Rob
>
>
> On 17 April 2018 at 13:14, Rob Farmer <r.j.farmer at uva.nl> wrote:
>
>> So there are no known leaks at the moment (though there may be unknown
>> memory leaks in the code). I'm running your model under valgrind right
>> now, but that will take time.
>>
>> The problem is mostly that there are more eos files (i.e. a finer grid
>> in the X,Z plane), which is driving up the memory usage; the PTEH eos
>> files are also larger than the older eos files. For now your best bet is
>> to turn off the new eos files (or find a machine with more RAM).
>> Increasing your swap space even further might help to relieve some of the
>> memory pressure, but I'm not sure what the performance will be; you
>> probably want to make swap = the amount of RAM you have.
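>>
>> (By turning off the new eos files I mean reverting to something like the
>> two inlist switches from your first message in this thread, i.e.
>>
>>       use_eosPTEH_for_low_density = .false. ! default .true.
>>       use_eosPTEH_for_high_Z = .false. ! default .true.
>>
>> in the relevant inlist section.)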
>>
>> If you're feeling adventurous, the current svn head (svn co
>> svn://svn.code.sf.net/p/mesa/code/trunk mesa) has some fixes for this
>> which may help you, but I'm not sure whether they would be enough in your
>> case.
>>
>> Rob
>>
>> On 17 April 2018 at 04:33, Ian Foley <ifoley2008 at gmail.com> wrote:
>>
>>> Hi Rob and Evan,
>>>
>>> I've been following up on your suggestions, which have been very
>>> helpful (thank you!). Rob, following your recommendation, I have since
>>> set the environment variable you suggested, but problems remain.
>>>
>>> Evan, you made suggestions about disk space. Disk space on my hard
>>> drive is not a problem. However, I am not clear on the relationship
>>> between disk space and the size of the container. When I reported the
>>> problem I had Docker for Windows configured with 3 GB of memory and 1 GB
>>> of swap space. Increasing the swap space seemed to improve the situation,
>>> but not eliminate the problem. The documentation on the swap space is
>>> very vague; I'm guessing it is reserved disk space that can be used in
>>> lieu of memory, and that using it slows execution.
>>>
>>> So I first increased the swap space to 2 GB and then 3 GB and still had
>>> the problem. Then I realised that the problem was probably memory and
>>> that I needed to track what was happening with it, so I reconfigured "rn"
>>> to run star as a background process and regularly monitored memory using
>>> "free". I reconfigured Docker for Windows first with 3.5 GB, then 4 GB
>>> and then 5 GB of memory, with swap space at 1 GB. At 5 GB, which is right
>>> near the limit of what my 8 GB laptop can take, the program just made it.
>>> I have provided a trace of memory usage below, copied from the terminal
>>> output, with a few inserted comments.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ ps
>>>    PID TTY          TIME CMD
>>>     43 pts/0    00:00:00 bash
>>>     57 pts/0    02:42:49 star
>>>     81 pts/0    00:00:00 pgxwin_server
>>>     95 pts/0    00:00:00 ps
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     3844488      171136         764     1033544      941636
>>> Swap:       1048572       32960     1015612
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4043912      130348         764      874908      743832
>>> Swap:       1048572       33416     1015156
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4242800      101772         764      704596      546720
>>> Swap:       1048572       33960     1014612
>>>
>>> Comment: the big drop in available memory here was associated with
>>> decreasing mesh_delta_coeff and thereby doubling the number of zones to
>>> cater for super-AGB behavior.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4640768      113848         756      294552      155820
>>> Swap:       1048572       36852     1011720
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4849480      111144         276       88544       15976
>>> Swap:       1048572      294752      753820
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4854980      113520         276       80668       15320
>>> Swap:       1048572      292968      755604
>>>
>>> Comment: the big drop in available memory here was associated with
>>> placing many eosPTEH files into the cache and was accompanied by low CPU
>>> usage while it was happening. When the available memory became low, the
>>> swap space came into play. The drop in available memory plus the amount
>>> added to the cache suggests that about 800K was needed. This was around
>>> model 1600.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4838820      129980         276       80368       31604
>>> Swap:       1048572      291796      756776
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4838736      129728         276       80704       31908
>>> Swap:       1048572      288176      760396
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4839172      128292         276       81704       30756
>>> Swap:       1048572      288080      760492
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4840036      127472         276       81660       30032
>>> Swap:       1048572      288072      760500
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4839896      120628         276       88644       26488
>>> Swap:       1048572      288056      760516
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4840088      120144         276       88936       26024
>>> Swap:       1048572      288056      760516
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4861784      113448          36       73936       11292
>>> Swap:       1048572      490492      558080
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4817124      106900          12      125144       30308
>>> Swap:       1048572      839220      209352
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4865112      101184          12       82872        4236
>>> Swap:       1048572      822080      226492
>>>
>>> Comment: around model 3000 the run crashed with a "timestep too small"
>>> error, but it is clear that available memory was right at the limit, with
>>> almost none left and most of the swap space used.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168      171544     4792304          12       85320     4696848
>>> Swap:       1048572      159496      889076
>>>
>>> Comment: restoration of memory after the crash
>>>
>>> So the extra requirements of using the latest eosPTEH features in
>>> version 10398 are right on the edge for me, needing Docker for Windows
>>> configured with 5 GB of memory. Are there bugs or memory leaks in the
>>> code which are adding to the memory allocation problem?
>>>
>>> Since I was able to run the inlist with Docker for Windows configured
>>> with 3 GB of memory before, it seems the extra memory needed to use the
>>> eosPTEH feature is about 2 GB or more of memory, not disk space, unless
>>> the performance hit of heavy swap usage is accepted.
>>>
>>> Is there a way around this?
>>>
>>> Kind regards
>>> Ian
>>>
>>> On 13 April 2018 at 19:32, Rob Farmer <r.j.farmer at uva.nl> wrote:
>>>
>>>> > If the process has been killed I am confident that it is not a
>>>> > shortage of memory problem since I increased docker for windows
>>>> > allocation from 3GB to 3.5GB and still had the same problem with the
>>>> > same inlist.
>>>>
>>>> Others have had problems recently with machines that only have 4 GB of
>>>> RAM. The problem occurs because we move the cache files around, which
>>>> requires a call to fork(); that needs to copy the address space, so your
>>>> machine temporarily needs twice the memory.
>>>>
>>>> You may find it helpful to set the environment variable
>>>>
>>>> export MESA_TEMP_CACHES_DISABLE=1
>>>>
>>>> which stops us moving the cache files around.
>>>>
>>>> Rob
>>>>
>>>>
>>>>
>>>> On 13 April 2018 at 10:43, Ian Foley via Mesa-users <
>>>> mesa-users at lists.mesastar.org> wrote:
>>>>
>>>>> Hi Evan,
>>>>>
>>>>> Correction: I can increase the swap space in 0.5 GB chunks. The disk
>>>>> image can be increased to 64 GB and I am only using 31 GB, so that's
>>>>> not the limit. I will see what increasing the swap space does.
>>>>>
>>>>> Kind regards
>>>>> Ian
>>>>>
>>>>> On 13 April 2018 at 18:38, Ian Foley <ifoley2008 at gmail.com> wrote:
>>>>>
>>>>>> Hi Evan,
>>>>>>
>>>>>> Is that what defines the "swap space" limit setting in Docker for
>>>>>> Windows? Currently it is set at 1 GB and can only be increased in 1 GB
>>>>>> chunks. I have plenty of hard drive space. I will give your suggestion
>>>>>> a try, see what happens, and let you know.
>>>>>>
>>>>>> Kind regards
>>>>>> Ian
>>>>>>
>>>>>> On 13 April 2018 at 16:36, Evan Bauer <ebauer at physics.ucsb.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ian,
>>>>>>>
>>>>>>> How is your disk usage? Caching the PTEH files can require several
>>>>>>> GB of disk space, so I wonder if this is a symptom of the container running
>>>>>>> out of available space to write the cache files?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Evan
>>>>>>>
>>>>>>>
>>>>>>> > On Apr 12, 2018, at 9:18 PM, Ian Foley via Mesa-users <
>>>>>>> mesa-users at lists.mesastar.org> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I am running Docker for Windows on Windows 10 Pro on a machine
>>>>>>> > with 4 cores and 8 GB of RAM. I have been working on developing
>>>>>>> > inlists to evolve 1M to 15M stars from pre_ms to the end (wd or
>>>>>>> > cc). I have been doing this since version 7624, and previously, if
>>>>>>> > a run failed, MESA would always give a reason related to the chosen
>>>>>>> > inlist, e.g. a timestep limit.
>>>>>>> >
>>>>>>> > However, with version 10398, runs have failed quite often, either
>>>>>>> > by just hanging with the screen displays intact but CPU usage
>>>>>>> > dropping to <10% instead of >50% (program in a loop?), or by the
>>>>>>> > operating system killing the run. When the process has been killed,
>>>>>>> > I am confident that it is not a memory shortage, since I increased
>>>>>>> > the Docker for Windows allocation from 3 GB to 3.5 GB and still had
>>>>>>> > the same problem with the same inlist.
>>>>>>> >
>>>>>>> > In all these cases, the model had a large envelope with very low
>>>>>>> > surface pressure and density, and the run needed to cache new
>>>>>>> > eosPTEH files.
>>>>>>> >
>>>>>>> > I have attached all the files necessary for users to execute a
>>>>>>> > rerun and hopefully reproduce the error and locate the problem. I
>>>>>>> > have also attached the full terminal output (11M.txt) and two files
>>>>>>> > showing the last screens from pgstar. This run hung, as can be seen
>>>>>>> > by looking at the final output in 11M.txt.
>>>>>>> >
>>>>>>> > It should perhaps be noted that by changing the inlist to set
>>>>>>> >
>>>>>>> >       use_eosPTEH_for_low_density = .false. ! default .true.
>>>>>>> >       use_eosPTEH_for_high_Z = .false. ! default .true.
>>>>>>> >
>>>>>>> > I was able to complete the run, with the final outcome being a wd -
>>>>>>> > only just. The He core was 1.33M.
>>>>>>> >
>>>>>>> > Please note that run_star_extras.f does quite a few things whose
>>>>>>> > intention is to obtain a successful run without having to stop
>>>>>>> > mid-way and change parameters. One of the strategies is to
>>>>>>> > dynamically change var_control according to the number of retries
>>>>>>> > in the last 10 models, thus allowing larger delta changes in
>>>>>>> > parameters and letting var_control set the size of logdt. This
>>>>>>> > usually works, but can fail when there are a large number of
>>>>>>> > retries in 10 models and var_control does not adapt fast enough.
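>>>>>>> >
>>>>>>> > Purely as an illustration of the idea (the actual logic is in the
>>>>>>> > attached run_star_extras.f), a sketch of this kind of adjustment
>>>>>>> > inside extras_check_model might look like the following, where
>>>>>>> > retries_in_last_10 is a hypothetical counter of retries over the
>>>>>>> > last 10 models, s% varcontrol_target is the control I refer to as
>>>>>>> > var_control above, and the bounds are arbitrary illustrative
>>>>>>> > values:
>>>>>>> >
>>>>>>> >    ! every 10 models: loosen varcontrol when the step is
>>>>>>> >    ! struggling, tighten it again once the model settles down
>>>>>>> >    if (mod(s% model_number, 10) == 0) then
>>>>>>> >       if (retries_in_last_10 > 3) then
>>>>>>> >          s% varcontrol_target = min(1d-3, 2d0*s% varcontrol_target)
>>>>>>> >       else if (retries_in_last_10 == 0) then
>>>>>>> >          s% varcontrol_target = max(1d-4, 5d-1*s% varcontrol_target)
>>>>>>> >       end if
>>>>>>> >       retries_in_last_10 = 0   ! reset the counter for the next window
>>>>>>> >    end if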
>>>>>>> >
>>>>>>> > Kind regards
>>>>>>> > Ian
>>>>>>> >
>>>>>>> >
>>>>>>> > <11M.txt> <history_columns.list> <ian40r.net> <inlist_11M.0>
>>>>>>> > <profile_columns.list> <run_star_extras.f> <grid6_001290.png>
>>>>>>> > <profile_Panels3_001290.png>
>>>>>>> > _______________________________________________
>>>>>>> > mesa-users at lists.mesastar.org
>>>>>>> > https://lists.mesastar.org/mailman/listinfo/mesa-users
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>