[Mesa-users] Problem with MESA version 10398.

Ian Foley ifoley2008 at gmail.com
Mon Apr 16 22:33:55 EDT 2018


Hi Rob and Evan,

I've been following up on your suggestions, which have been very helpful
(thank you!). Rob, following your recommendation, I have since set the
environment variable you suggested, but problems remain.

Evan, you made suggestions in the area of disk space. Disk space on my hard
drive is not a problem. However, I am not clear on the relationship between
disk space and the size of the container. When I reported the problem I had
docker for windows configured with 3 GB of memory and 1 GB of swap space.
Increasing the swap space seemed to improve the situation but did not
eliminate it. The documentation on the swap space is very vague. My guess is
that it is reserved disk space that can be used in lieu of memory, and that
using it will slow execution.
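
For what it's worth, a quick way to check that guess from inside the
container is to compare what free, df and /proc/swaps report (a rough
sketch; availability of the tools and the exact paths depend on the image):

    free -m          # Mem: = RAM given to the docker VM; Swap: = disk-backed swap
    df -h /          # disk space actually available to the container's filesystem
    cat /proc/swaps  # the swap device/file behind the Swap: row (path will vary)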

So I first increased the swap space to 2 GB and then 3 GB and still had the
problem. I then realised that the problem was probably memory and that I
needed to track what was happening with it, so I reconfigured "rn" to run
star as a background process while I regularly monitored memory with "free".
I reconfigured docker for windows first with 3.5 GB, then 4 GB and then 5 GB
of memory, with swap space at 1 GB. At 5 GB, which is right near the limit
of what my 8 GB laptop can take, the program just made it. I have provided a
trace of memory usage below, copied from the terminal output with a few
inserted comments.
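
(For reference, the monitoring was done roughly along the following lines;
this is a minimal sketch rather than the exact "rn" wrapper, and the output
file name is only illustrative.)

    ./star > 11M.txt 2>&1 &          # run star in the background
    star_pid=$!
    while kill -0 "$star_pid" 2>/dev/null; do
        free                         # system-wide memory usage
        sleep 300                    # poll every 5 minutes
    done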

docker@023217a8d7fb:~/docker_work/11M$ ps
   PID TTY          TIME CMD
    43 pts/0    00:00:00 bash
    57 pts/0    02:42:49 star
    81 pts/0    00:00:00 pgxwin_server
    95 pts/0    00:00:00 ps
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     3844488      171136         764     1033544      941636
Swap:       1048572       32960     1015612
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4043912      130348         764      874908      743832
Swap:       1048572       33416     1015156
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4242800      101772         764      704596      546720
Swap:       1048572       33960     1014612

Comment: the big drop here was associated with decreasing mesh_delta_coeff
and thereby doubling the number of zones to cater for super-AGB behavior.

docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4640768      113848         756      294552      155820
Swap:       1048572       36852     1011720
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4849480      111144         276       88544       15976
Swap:       1048572      294752      753820
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4854980      113520         276       80668       15320
Swap:       1048572      292968      755604

Comment: the big drop in available memory here was associated with placing
many eosPTEH files into the cache and was accompanied by low CPU usage while
it was happening. When the available memory became low, the swap space came
into play. Judging from the drop in available memory and the growth in swap
usage, roughly 800 MB appears to have been needed. This was around model 1600.

docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4838820      129980         276       80368       31604
Swap:       1048572      291796      756776
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4838736      129728         276       80704       31908
Swap:       1048572      288176      760396
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4839172      128292         276       81704       30756
Swap:       1048572      288080      760492
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4840036      127472         276       81660       30032
Swap:       1048572      288072      760500
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4839896      120628         276       88644       26488
Swap:       1048572      288056      760516
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4840088      120144         276       88936       26024
Swap:       1048572      288056      760516
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4861784      113448          36       73936       11292
Swap:       1048572      490492      558080
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4817124      106900          12      125144       30308
Swap:       1048572      839220      209352
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4865112      101184          12       82872        4236
Swap:       1048572      822080      226492

Comment: around model 3000 the run crashed with a timestep-too-small error,
but it is clear that memory was right near the limit, with almost no
available memory left and most of the swap space used.

docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168      171544     4792304          12       85320     4696848
Swap:       1048572      159496      889076

Comment: restoration of memory after the crash

So the extra memory required by the latest eosPTEH features in version 10398
is right on the edge for me, needing docker for windows to be configured
with 5 GB of memory. Are there bugs or memory leaks in the code that are
making the memory allocation problem worse?
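
One way I could try to separate a genuine leak from the one-off cost of
loading the eos tables would be to track the resident size of the star
process itself rather than the system-wide numbers from free, along these
lines (a rough sketch, assuming the process is named star as in the ps
output above):

    star_pid=$(pgrep -x star)
    while kill -0 "$star_pid" 2>/dev/null; do
        ps -o rss=,vsz= -p "$star_pid"   # resident and virtual size in KiB
        sleep 60
    done

A steady upward drift in rss would point to a leak, whereas a one-time jump
would just mean the eosPTEH tables are large.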

Since I was able to run this inlist before with docker for windows
configured with 3 GB of memory, it seems the extra memory needed to use the
eosPTEH feature is about 2 GB or more of memory, not disk space, unless the
performance hit of heavy swap usage is accepted.

Is there a way around this?

Kind regards
Ian

On 13 April 2018 at 19:32, Rob Farmer <r.j.farmer at uva.nl> wrote:

> > If the process has been killed I am confident that it is not a shortage
> of memory problem since I increased docker for windows allocation from 3GB
> to 3.5GB and still had the same problem with the same inlist.
>
> Others have had problems recently with machines when they only have 4Gb of
> ram. The problem occurs due to us moving the cache files around which
> requires a call to fork(), which needs to copy the address space, so
> your machine temporarily needs 2x the memory.
>
> You may find setting the environment variable:
> export MESA_TEMP_CACHES_DISABLE=1
>
> helpful, which stops us moving the cache files around.
>
> Rob
>
>
>
> On 13 April 2018 at 10:43, Ian Foley via Mesa-users <
> mesa-users at lists.mesastar.org> wrote:
>
>> Hi Evan,
>>
>> Correction. Can increase the swap space in 0.5 GB chunks. The disk image
>> can be increased to 64 GB and I am only using 31 GB, so that's not the
>> limit. Will see what increasing the swap space does.
>>
>> Kind regards
>> Ian
>>
>> On 13 April 2018 at 18:38, Ian Foley <ifoley2008 at gmail.com> wrote:
>>
>>> Hi Evan,
>>>
>>> Is that what defines the "swap space" limit setting in docker for
>>> windows? Currently it is set at 1GB and can only be increased in 1GB
>>> chunks. I have plenty of hard drive disk space. I will give your suggestion
>>> a try and see what happens and let you know.
>>>
>>> Kind regards
>>> Ian
>>>
>>> On 13 April 2018 at 16:36, Evan Bauer <ebauer at physics.ucsb.edu> wrote:
>>>
>>>> Hi Ian,
>>>>
>>>> How is your disk usage? Caching the PTEH files can require several GB
>>>> of disk space, so I wonder if this is a symptom of the container running
>>>> out of available space to write the cache files?
>>>>
>>>> Cheers,
>>>> Evan
>>>>
>>>>
>>>> > On Apr 12, 2018, at 9:18 PM, Ian Foley via Mesa-users <
>>>> mesa-users at lists.mesastar.org> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I am running docker for windows on Windows 10 Pro on a machine with 4
>>>> cores and 8GB RAM. I have been working on developing inlists to evolve 1M
>>>> to 15M stars from pre_ms to end (wd or cc). I have been doing this since
>>>> version 7624, and whenever a run failed MESA would give a reason related
>>>> to the inlist chosen, e.g. a timestep limit.
>>>> >
>>>> > However, with version 10398, the run has quite often failed, either by
>>>> hanging with the screen displays intact while CPU usage drops to <10%
>>>> instead of >50% (program in a loop?), or by the operating system killing
>>>> the run. If the process has been killed, I am confident that it is not a
>>>> shortage-of-memory problem, since I increased the docker for windows
>>>> allocation from 3GB to 3.5GB and still had the same problem with the same
>>>> inlist.
>>>> >
>>>> > In all these cases, the run environment was one with a large envelope
>>>> and very low surface pressure and density, and the run needed to cache
>>>> new eosPTEH files.
>>>> >
>>>> > I have attached all the files necessary for users to execute a rerun
>>>> and hopefully reproduce the error and locate the problem. I have also
>>>> attached all the terminal output (11M.txt) and two files showing the last
>>>> screens from pgstar. This run hung, as can be seen by looking at the
>>>> final output in 11M.txt.
>>>> >
>>>> > It perhaps should be noted that by changing the inlist to set
>>>> >
>>>> >       use_eosPTEH_for_low_density = .false. ! default .true.
>>>> >       use_eosPTEH_for_high_Z = .false. ! default .true.
>>>> >
>>>> > I was able to complete the run with the final outcome being wd -
>>>> just. The He core was 1.33M.
>>>> >
>>>> > Please note that run_star_extras.f does quite a few things whose
>>>> intention is to try to get a successful run without having to stop mid-way
>>>> and change parameters. One of the strategies here is to dynamically change
>>>> var_control according to the number of retries in 10 models and thus allow
>>>> larger delta changes in parameters and let var_control determine the size
>>>> of logdt.
>>>> This usually works, but can fail when there are a large number of retries
>>>> in 10 models and var_control does not adapt fast enough.
>>>> >
>>>> > Kind regards
>>>> > Ian
>>>> >
>>>> >
>>>> > <11M.txt><history_columns.list><ian40r.net><inlist_11M.0><profile_columns.list><run_star_extras.f><grid6_001290.png><profile_Panels3_001290.png>
>>>> > _______________________________________________
>>>> > mesa-users at lists.mesastar.org
>>>> > https://lists.mesastar.org/mailman/listinfo/mesa-users
>>>> >
>>>>
>>>>
>>>
>>
>

