[Mesa-users] Problem with MESA version 10398.

Rob Farmer r.j.farmer at uva.nl
Tue Apr 17 07:14:50 EDT 2018


So there are no known memory leaks at the moment (though there may be unknown
ones in the code). I'm running your model under valgrind right now, but that
will take time.
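
(If anyone wants to run the same kind of check themselves, the usual approach is
something along the lines of

    valgrind --leak-check=full ./star

run from inside the work directory, though be warned it is many times slower
than a normal run.)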

The problem is mostly that there are more eos files (i.e. a finer grid in the
X,Z plane), which drives up the memory usage; the PTEH eos files are also
larger than the older eos files. For now your best bet is to turn off the new
eos files (or find a machine with more RAM). Increasing your swap space even
further might relieve some of the memory pressure, but I'm not sure what the
performance will be like; you probably want to make swap equal to the amount
of RAM you have.
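
For reference, a minimal sketch of turning the new eos files off, using the same
controls Ian mentions later in this thread (I believe they go in the &controls
section of the inlist):

      use_eosPTEH_for_low_density = .false. ! default .true.
      use_eosPTEH_for_high_Z = .false. ! default .true.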

If you're feeling adventurous, the current svn head (svn co
svn://svn.code.sf.net/p/mesa/code/trunk mesa) has some fixes for this which may
help you, but I'm not sure if they would be enough in your case.
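
A minimal sketch of trying that, assuming a standard from-source setup (the
checkout path is up to you):

    svn co svn://svn.code.sf.net/p/mesa/code/trunk mesa
    export MESA_DIR=/path/to/mesa   # point MESA_DIR at the new checkout
    cd $MESA_DIR
    ./clean
    ./install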

Rob

On 17 April 2018 at 04:33, Ian Foley <ifoley2008 at gmail.com> wrote:

> Hi Rob and Evan,
>
> I've been following up on your suggestions, which have been very helpful
> (thank you!). Rob, I have since set the environment variable you recommended,
> but problems remain.
>
> Evan, you made suggestions in the area of disk space. Disk space on my hard
> drive is not a problem. However, I am not clear on the relationship between
> disk space and the size of the container. When I reported the problem I had
> docker for windows configured with 3 GB of memory and 1 GB of swap space.
> Increasing the swap space seemed to improve the situation, but did not
> eliminate it. The documentation on the swap space is very vague; I'm guessing
> it is reserved disk space that can be used in lieu of memory, and that using
> it slows execution.
>
> So I first increased the swap space to 2 GB and then 3 GB and still had the
> problem. Then I realised that the problem was probably memory and that I
> needed to track what was happening with it, so I reconfigured "rn" to run
> star as a background process while regularly monitoring memory with "free".
> I reconfigured docker for windows first with 3.5 GB, then 4 GB and then 5 GB
> of memory, with swap space at 1 GB. At 5 GB, which is right at the limit of
> what my 8 GB laptop can take, the program just made it. I have provided a
> trace of memory usage below, copied from the terminal output, with a few
> inserted comments.
>
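> (In essence, and only as an illustrative sketch rather than the exact commands
> used, this amounted to something like
>
>     ./rn > 11M.txt 2>&1 &
>
> and then running "free" at the shell prompt from time to time, as in the
> trace that follows.)
>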
> docker at 023217a8d7fb:~/docker_work/11M$ ps
>    PID TTY          TIME CMD
>     43 pts/0    00:00:00 bash
>     57 pts/0    02:42:49 star
>     81 pts/0    00:00:00 pgxwin_server
>     95 pts/0    00:00:00 ps
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     3844488      171136         764     1033544      941636
> Swap:       1048572       32960     1015612
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4043912      130348         764      874908      743832
> Swap:       1048572       33416     1015156
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4242800      101772         764      704596      546720
> Swap:       1048572       33960     1014612
>
> Comment: the big drop in available memory here was associated with decreasing
> the mesh_delta_coeff and thereby doubling the number of zones to cater for
> super AGB behavior.
>
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4640768      113848         756      294552      155820
> Swap:       1048572       36852     1011720
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4849480      111144         276       88544       15976
> Swap:       1048572      294752      753820
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4854980      113520         276       80668       15320
> Swap:       1048572      292968      755604
>
> Comment: the big drop in available memory here was associated with placing
> many eosPTEH files into the cache, and was accompanied by low CPU usage while
> it was happening. When the available memory became low, the swap space came
> into play. The drop in available memory plus the amount added to the cache
> suggests about 800K was needed. This was around model 1600.
>
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4838820      129980         276       80368       31604
> Swap:       1048572      291796      756776
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4838736      129728         276       80704       31908
> Swap:       1048572      288176      760396
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4839172      128292         276       81704       30756
> Swap:       1048572      288080      760492
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4840036      127472         276       81660       30032
> Swap:       1048572      288072      760500
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4839896      120628         276       88644       26488
> Swap:       1048572      288056      760516
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4840088      120144         276       88936       26024
> Swap:       1048572      288056      760516
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4861784      113448          36       73936       11292
> Swap:       1048572      490492      558080
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4817124      106900          12      125144       30308
> Swap:       1048572      839220      209352
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4865112      101184          12       82872        4236
> Swap:       1048572      822080      226492
>
> Comment: around model 3000 the run crashed with a "timestep too small" error,
> but it is clear that available memory was right at the limit, with almost
> none left and most of the swap space used.
>
> docker at 023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168      171544     4792304          12       85320     4696848
> Swap:       1048572      159496      889076
>
> Comment: restoration of memory after the crash
>
> So the extra memory required to use the latest eosPTEH features in version
> 10398 is right on the edge for me, needing docker for windows to be
> configured with 5 GB of memory. Are there bugs or memory leaks in the code
> which are adding to the memory allocation problem?
>
> Since I was able to run the inlist before with docker for windows configured
> with 3 GB of memory, it seems that the extra requirement for the eosPTEH
> feature is about 2 GB or more of memory, not disk space, unless the
> performance hit of heavy disk (swap) usage is accepted.
>
> Is there a way around this?
>
> Kind regards
> Ian
>
> On 13 April 2018 at 19:32, Rob Farmer <r.j.farmer at uva.nl> wrote:
>
>> > If the process has been killed I am confident that it is not a shortage
>> of memory problem since I increased docker for windows allocation from 3GB
>> to 3.5GB and still had the same problem with the same inlist.
>>
>> Others have recently had problems on machines with only 4 GB of RAM. The
>> problem occurs because we move the cache files around, which requires a call
>> to fork(); fork() needs to copy the address space, so your machine
>> temporarily needs twice the memory.
>>
>> You may find it helpful to set the environment variable
>>
>> export MESA_TEMP_CACHES_DISABLE=1
>>
>> which stops us moving the cache files around.
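>>
>> (If you are doing this inside the docker image, one way to make the setting
>> stick is to pass it when starting the container, e.g.
>>
>>     docker run -e MESA_TEMP_CACHES_DISABLE=1 ...
>>
>> or to add the export line to ~/.bashrc inside the container; that is just a
>> general shell/docker suggestion, not anything MESA-specific.)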
>>
>> Rob
>>
>>
>>
>> On 13 April 2018 at 10:43, Ian Foley via Mesa-users <
>> mesa-users at lists.mesastar.org> wrote:
>>
>>> Hi Evan,
>>>
>>> Correction. Can increase the swap space in 0.5 GB chunks. The disk image
>>> can be increased to 64 GB and I am only using 31 GB, so that's not the
>>> limit. Will see what increasing the swap space does.
>>>
>>> Kind regards
>>> Ian
>>>
>>> On 13 April 2018 at 18:38, Ian Foley <ifoley2008 at gmail.com> wrote:
>>>
>>>> Hi Evan,
>>>>
>>>> Is that what defines the "swap space" limit setting in docker for
>>>> windows? Currently it is set at 1GB and can only be increased in 1GB
>>>> chunks. I have plenty of hard drive disk space. I will give your suggestion
>>>> a try and see what happens and let you know.
>>>>
>>>> Kind regards
>>>> Ian
>>>>
>>>> On 13 April 2018 at 16:36, Evan Bauer <ebauer at physics.ucsb.edu> wrote:
>>>>
>>>>> Hi Ian,
>>>>>
>>>>> How is your disk usage? Caching the PTEH files can require several GB
>>>>> of disk space, so I wonder if this is a symptom of the container running
>>>>> out of available space to write the cache files?
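>>>>>
>>>>> (A quick way to check from inside the container: df -h for overall disk
>>>>> space, and something like du -sh $MESA_DIR/data/*/cache for the MESA cache
>>>>> directories; the cache location here is just my assumption of a standard
>>>>> MESA layout.)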
>>>>>
>>>>> Cheers,
>>>>> Evan
>>>>>
>>>>>
>>>>> > On Apr 12, 2018, at 9:18 PM, Ian Foley via Mesa-users <
>>>>> mesa-users at lists.mesastar.org> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I am running docker for windows on Windows 10 Pro on a machine with
>>>>> > 4 cores and 8 GB RAM. I have been working on developing inlists to
>>>>> > evolve 1M to 15M stars from pre_ms to the end (wd or cc). I have been
>>>>> > doing this since version 7624, and whenever a run failed MESA would
>>>>> > always give a reason related to the chosen inlist, e.g. a timestep limit.
>>>>> >
>>>>> > However, with version 10398 the run has quite often failed, either by
>>>>> > just hanging with the screen displays intact but CPU dropping to <10%
>>>>> > instead of >50% (program in a loop?), or by the operating system killing
>>>>> > the run. If the process has been killed I am confident that it is not a
>>>>> > shortage of memory problem, since I increased the docker for windows
>>>>> > allocation from 3GB to 3.5GB and still had the same problem with the
>>>>> > same inlist.
>>>>> >
>>>>> > In all these cases, the run involved a model with a large envelope and
>>>>> > very low surface pressure and density, and it needed to cache new
>>>>> > eosPTEH files.
>>>>> >
>>>>> > I have attached all the files necessary for users to execute a rerun
>>>>> > and hopefully reproduce the error and locate the problem. I have also
>>>>> > attached all the terminal output (11M.txt) and two files showing the
>>>>> > last screens from pgstar. This run hung, as can be seen by looking at
>>>>> > the final output in 11M.txt.
>>>>> >
>>>>> > It perhaps should be noted that by changing the inlist to set
>>>>> >
>>>>> >       use_eosPTEH_for_low_density = .false. ! default .true.
>>>>> >       use_eosPTEH_for_high_Z = .false. ! default .true.
>>>>> >
>>>>> > I was able to complete the run, with the final outcome being a wd (only
>>>>> > just). The He core was 1.33M.
>>>>> >
>>>>> > Please note that run_star_extras.f does quite a few things intended to
>>>>> > get a successful run without having to stop mid-way and change
>>>>> > parameters. One of the strategies is to dynamically change var_control
>>>>> > according to the number of retries in the last 10 models, thus allowing
>>>>> > larger delta changes in parameters and letting var_control set the size
>>>>> > of log dt. This usually works, but can fail when there are a large
>>>>> > number of retries in 10 models and var_control does not adapt fast
>>>>> > enough.
>>>>> >
>>>>> > Kind regards
>>>>> > Ian
>>>>> >
>>>>> >
>>>>> > <11M.txt><history_columns.list><ian40r.net><inlist_11M.0><profile_columns.list><run_star_extras.f><grid6_001290.png><profile_Panels3_001290.png>
>>>>> > _______________________________________________
>>>>> > mesa-users at lists.mesastar.org
>>>>> > https://lists.mesastar.org/mailman/listinfo/mesa-users
>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>
>