[Mesa-users] Problem with MESA version 10398.
Ian Foley
ifoley2008 at gmail.com
Tue Apr 17 17:53:26 EDT 2018
Hi Rob,
I have tested this proposal and it works really well! Thank you! In fact,
for the 11M star runs I sent in a previous email, memory use is not much more
than 2 GB with the mod you suggest. I think the 100-model interval works
well too. This suggests that the memory cost of the eos data can come close to
3 GB overall. What appeared to be a major obstacle to using MESA with Docker
for Windows now becomes much more feasible on small computers with around 8 GB
of RAM. I have configured Docker for Windows with 3 GB of memory and 4 GB of
swap space. For larger nets I might need 4 GB of memory.

This outcome is better than I imagined and I have learned much from the
experience. I suggest it might help users of Docker for Windows to explain
these things so they don't have to jump through all the hoops I have gone
through. Out-of-memory crashes are particularly difficult to resolve.
Kind regards and thank you again,
Ian
On 18 April 2018 at 02:08, Rob Farmer <r.j.farmer at uva.nl> wrote:
> So I came up with something that might help; add this to your
> extras_check_model:
>
>    use eos_lib
>
>    if (mod(s% model_number, 100) == 0) then
>       call eos_shutdown()
>       call eos_init(s% job% eos_file_prefix, s% job% eosDT_cache_dir, &
>          s% job% eosPT_cache_dir, &
>          s% job% eosDE_cache_dir, .true., ierr)
>    end if
>
> Basically, every 100 steps it frees all the eos memory and then
> re-initializes it; when MESA takes the next step, it reloads only the eos
> data it actually needs.
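>
> For reference, here is a minimal sketch of how that snippet might sit inside
> extras_check_model in run_star_extras.f. The surrounding code is just the
> standard work-directory template (star_ptr, keep_going and terminate come
> from the module-level use star_lib and use star_def); only the eos calls are
> the suggested change:
>
>    integer function extras_check_model(id, id_extra)
>       use eos_lib
>       integer, intent(in) :: id, id_extra
>       integer :: ierr
>       type (star_info), pointer :: s
>       ierr = 0
>       call star_ptr(id, s, ierr)
>       if (ierr /= 0) return
>       extras_check_model = keep_going
>       ! every 100 models, free the eos tables and re-initialize them, so
>       ! only the eos files needed for the next step get reloaded
>       if (mod(s% model_number, 100) == 0) then
>          call eos_shutdown()
>          call eos_init(s% job% eos_file_prefix, s% job% eosDT_cache_dir, &
>             s% job% eosPT_cache_dir, &
>             s% job% eosDE_cache_dir, .true., ierr)
>          if (ierr /= 0) extras_check_model = terminate
>       end if
>    end function extras_check_model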
>
> Rob
>
>
> On 17 April 2018 at 13:14, Rob Farmer <r.j.farmer at uva.nl> wrote:
>
>> So there are no known leaks at the moment (though there may be unknown
>> memory leaks in the code). I'm running your model under valgrind at the
>> moment, but that will take time.
>>
>> The problem is mostly that there are more eos files (i.e. a finer grid in
>> the X,Z plane), which is driving up the memory usage, and the PTEH eos files
>> are also larger than the older eos files. For now your best bet is to turn
>> off the new eos files (or find a machine with more RAM). Increasing your
>> swap space even further might help relieve some of the memory pressure, but
>> I'm not sure what the performance will be; you probably want to make swap
>> equal to the amount of RAM you have.
>>
>> If you're feeling adventurous, the current svn head
>> (svn co svn://svn.code.sf.net/p/mesa/code/trunk mesa) has some fixes for
>> this which may help, but I'm not sure they would be enough in your case.
>>
>> Rob
>>
>> On 17 April 2018 at 04:33, Ian Foley <ifoley2008 at gmail.com> wrote:
>>
>>> Hi Rob and Evan,
>>>
>>> I've been following up on your suggestions, which have been very helpful
>>> (thank you!). Rob, following your recommendation, I have since set the
>>> environment variable you suggested, but problems remain.
>>>
>>> Evan, you made suggestions in the area of disk space. Disk space on my
>>> hard drive is not a problem. However, I am not clear on the relationship
>>> between disk space and the size of the container. When I reported the
>>> problem I had Docker for Windows configured with 3 GB of memory and 1 GB
>>> of swap space. Increasing the swap space seemed to improve the situation,
>>> but not eliminate it. The documentation on the swap space is very vague.
>>> I'm guessing it is reserved disk space that can be used in lieu of memory,
>>> and that using it will slow execution speed.
>>>
>>> So I first increased the swap space to 2 GB and then 3 GB and still had
>>> the problem. Then I realised that the problem was probably memory and that
>>> I needed to be able to track what was happening with it, so I reconfigured
>>> "rn" to run star as a background process while regularly monitoring memory
>>> with "free". I reconfigured Docker for Windows first with 3.5 GB, then 4 GB
>>> and then 5 GB of memory, with swap space at 1 GB. At 5 GB, which is right
>>> near the limit of what my 8 GB laptop can take, the program just made it.
>>> I have provided a trace of memory usage below, copied from the terminal
>>> output, with a few inserted comments.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ ps
>>>   PID TTY          TIME CMD
>>>    43 pts/0    00:00:00 bash
>>>    57 pts/0    02:42:49 star
>>>    81 pts/0    00:00:00 pgxwin_server
>>>    95 pts/0    00:00:00 ps
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     3844488      171136         764     1033544      941636
>>> Swap:       1048572       32960     1015612
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4043912      130348         764      874908      743832
>>> Swap:       1048572       33416     1015156
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4242800      101772         764      704596      546720
>>> Swap:       1048572       33960     1014612
>>>
>>> Comment: the big drop here was associated with decreasing mesh_delta_coeff
>>> and thereby doubling the number of zones to cater for super-AGB behavior.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4640768      113848         756      294552      155820
>>> Swap:       1048572       36852     1011720
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4849480      111144         276       88544       15976
>>> Swap:       1048572      294752      753820
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4854980      113520         276       80668       15320
>>> Swap:       1048572      292968      755604
>>>
>>> Comment: the big drop in available memory here was associated with placing
>>> many eosPTEH files into the cache, and it was accompanied by low CPU usage
>>> while it was happening. When the available memory became low, the swap
>>> space came into play. The drop in available memory plus the amount added
>>> to the cache suggests about 800K was needed. This was around model 1600.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4838820      129980         276       80368       31604
>>> Swap:       1048572      291796      756776
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4838736      129728         276       80704       31908
>>> Swap:       1048572      288176      760396
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4839172      128292         276       81704       30756
>>> Swap:       1048572      288080      760492
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4840036      127472         276       81660       30032
>>> Swap:       1048572      288072      760500
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4839896      120628         276       88644       26488
>>> Swap:       1048572      288056      760516
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4840088      120144         276       88936       26024
>>> Swap:       1048572      288056      760516
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4861784      113448          36       73936       11292
>>> Swap:       1048572      490492      558080
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4817124      106900          12      125144       30308
>>> Swap:       1048572      839220      209352
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168     4865112      101184          12       82872        4236
>>> Swap:       1048572      822080      226492
>>>
>>> Comment: around model 3000 the run crashed with a "timestep too small"
>>> error, but it is clear that available memory was right at the limit, with
>>> almost none left and most of the swap space used.
>>>
>>> docker@023217a8d7fb:~/docker_work/11M$ free
>>>               total        used        free      shared  buff/cache   available
>>> Mem:        5049168      171544     4792304          12       85320     4696848
>>> Swap:       1048572      159496      889076
>>>
>>> Comment: restoration of memory after the crash
>>>
>>> So the extra requirements of using the latest eosPTEH features in version
>>> 10398 are right on the edge for me, needing Docker for Windows configured
>>> with 5 GB of memory. Are there bugs or memory leaks in the code that are
>>> adding to the memory allocation problem?
>>>
>>> Since I was able to run the inlist with Docker for Windows configured with
>>> 3 GB of memory before, it seems that the extra memory needed for the
>>> eosPTEH feature is about 2 GB or more of memory, not disk space, unless
>>> the performance hit of heavy disk usage is accepted.
>>>
>>> Is there a way around this?
>>>
>>> Kind regards
>>> Ian
>>>
>>> On 13 April 2018 at 19:32, Rob Farmer <r.j.farmer at uva.nl> wrote:
>>>
>>>> > If the process has been killed I am confident that it is not a
>>>> shortage of memory problem since I increased docker for windows allocation
>>>> from 3GB to 3.5GB and still had the same problem with the same inlist.
>>>>
>>>> Others have had problems recently with machines that only have 4 GB of
>>>> RAM. The problem occurs because we move the cache files around, which
>>>> requires a call to fork(), which needs to copy the address space, so your
>>>> machine temporarily needs twice the memory.
>>>>
>>>> You may find it helpful to set the environment variable
>>>>
>>>> export MESA_TEMP_CACHES_DISABLE=1
>>>>
>>>> which stops us moving the cache files around.
>>>>
>>>> Rob
>>>>
>>>>
>>>>
>>>> On 13 April 2018 at 10:43, Ian Foley via Mesa-users <
>>>> mesa-users at lists.mesastar.org> wrote:
>>>>
>>>>> Hi Evan,
>>>>>
>>>>> Correction: I can increase the swap space in 0.5 GB chunks. The disk
>>>>> image can be increased to 64 GB and I am only using 31 GB, so that's not
>>>>> the limit. I will see what increasing the swap space does.
>>>>>
>>>>> Kind regards
>>>>> Ian
>>>>>
>>>>> On 13 April 2018 at 18:38, Ian Foley <ifoley2008 at gmail.com> wrote:
>>>>>
>>>>>> Hi Evan,
>>>>>>
>>>>>> Is that what defines the "swap space" limit setting in Docker for
>>>>>> Windows? Currently it is set at 1 GB and can only be increased in 1 GB
>>>>>> chunks. I have plenty of hard drive disk space. I will give your
>>>>>> suggestion a try, see what happens, and let you know.
>>>>>>
>>>>>> Kind regards
>>>>>> Ian
>>>>>>
>>>>>> On 13 April 2018 at 16:36, Evan Bauer <ebauer at physics.ucsb.edu>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ian,
>>>>>>>
>>>>>>> How is your disk usage? Caching the PTEH files can require several
>>>>>>> GB of disk space, so I wonder if this is a symptom of the container running
>>>>>>> out of available space to write the cache files?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Evan
>>>>>>>
>>>>>>>
>>>>>>> > On Apr 12, 2018, at 9:18 PM, Ian Foley via Mesa-users <
>>>>>>> mesa-users at lists.mesastar.org> wrote:
>>>>>>> >
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > I am running Docker for Windows on Windows 10 Pro on a machine with
>>>>>>> > 4 cores and 8 GB of RAM. I have been working on developing inlists
>>>>>>> > to evolve 1M to 15M stars from pre_ms to the end (wd or cc). I have
>>>>>>> > been doing this since version 7624, and previously, if a run failed,
>>>>>>> > MESA would always give a reason related to the inlist chosen, e.g. a
>>>>>>> > timestep limit.
>>>>>>> >
>>>>>>> > However, with version 10398, runs have failed quite often, either by
>>>>>>> > just hanging with the screen displays intact but CPU usage dropping
>>>>>>> > to <10% instead of >50% (program in a loop?), or by the operating
>>>>>>> > system killing the run. If the process has been killed, I am
>>>>>>> > confident that it is not a memory shortage problem, since I increased
>>>>>>> > the Docker for Windows allocation from 3 GB to 3.5 GB and still had
>>>>>>> > the same problem with the same inlist.
>>>>>>> >
>>>>>>> > In all these cases, the run environment was one with a large
>>>>>>> > envelope and very low surface pressure and density, and the run
>>>>>>> > needed to cache new eosPTEH files.
>>>>>>> >
>>>>>>> > I have attached all the files necessary for users to execute a rerun
>>>>>>> > and hopefully reproduce the error and locate the problem. I have also
>>>>>>> > attached all the terminal output (11M.txt) and two files showing the
>>>>>>> > last screens from pgstar. This run hung, as can be seen from the
>>>>>>> > final output in 11M.txt.
>>>>>>> >
>>>>>>> > It perhaps should be noted that by changing the inlist to set
>>>>>>> >
>>>>>>> > use_eosPTEH_for_low_density = .false. ! default .true.
>>>>>>> > use_eosPTEH_for_high_Z = .false. ! default .true.
>>>>>>> >
>>>>>>> > I was able to complete the run, with the final outcome being a wd,
>>>>>>> > only just. The He core was 1.33M.
>>>>>>> >
>>>>>>> > Please note that run_star_extras.f does quite a few things whose
>>>>>>> > intention is to get a successful run without having to stop midway
>>>>>>> > and change parameters. One of the strategies is to dynamically change
>>>>>>> > var_control according to the number of retries over the last 10
>>>>>>> > models, thus allowing larger delta changes in parameters and letting
>>>>>>> > var_control set the size of logdt; a rough sketch of the idea is
>>>>>>> > below. This usually works, but can fail when there are a large number
>>>>>>> > of retries in 10 models and var_control does not adapt fast enough.
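>>>>>>> >
>>>>>>> > In rough outline, the adjustment looks something like the sketch
>>>>>>> > below; the thresholds, bounds and helper variable names here are
>>>>>>> > illustrative only, not the exact ones in the attached file, and it
>>>>>>> > relies on the star_info fields s% num_retries and
>>>>>>> > s% varcontrol_target:
>>>>>>> >
>>>>>>> >    ! declarations at the top of extras_check_model
>>>>>>> >    integer, save :: retries_at_last_check = 0
>>>>>>> >    integer :: recent_retries
>>>>>>> >
>>>>>>> >    ! every 10 models, count the retries since the last check and
>>>>>>> >    ! loosen or tighten varcontrol_target accordingly
>>>>>>> >    if (mod(s% model_number, 10) == 0) then
>>>>>>> >       recent_retries = s% num_retries - retries_at_last_check
>>>>>>> >       retries_at_last_check = s% num_retries
>>>>>>> >       if (recent_retries > 3) then
>>>>>>> >          ! many retries: allow larger changes per step
>>>>>>> >          s% varcontrol_target = min(1d-3, 2d0*s% varcontrol_target)
>>>>>>> >       else if (recent_retries == 0) then
>>>>>>> >          ! running smoothly: tighten back toward the baseline
>>>>>>> >          s% varcontrol_target = max(1d-4, 0.5d0*s% varcontrol_target)
>>>>>>> >       end if
>>>>>>> >    end if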
>>>>>>> >
>>>>>>> > Kind regards
>>>>>>> > Ian
>>>>>>> >
>>>>>>> >
>>>>>>> > <11M.txt><history_columns.list><ian40r.net><inlist_11M.0><profile_columns.list><run_star_extras.f><grid6_001290.png><profile_Panels3_001290.png>
>>>>>>> >
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>