[Mesa-users] Problem with MESA version 10398.
Ian Foley
ifoley2008 at gmail.com
Mon Apr 16 22:33:55 EDT 2018
Hi Rob and Evan,

I've been following up on your suggestions, which have been very helpful
(thank you!), but problems remain. Rob, following your recommendation, I
have since set the environment variable you suggested, yet the run still fails.
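For completeness, this is roughly how I have been setting it inside the
container before starting the run (a minimal sketch; the echo is only there
to confirm the variable is visible to the shell that launches the run):

export MESA_TEMP_CACHES_DISABLE=1
echo $MESA_TEMP_CACHES_DISABLE   # should print 1
./rn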
Evan, you made suggestions in the area of disk space. Disk space on my
hard drive is not a problem. However, I am not clear on the relationship
between disk space and the size of the container. When I reported the
problem I had Docker for Windows configured with 3 GB of memory and 1 GB of
swap space. Increasing the swap space seemed to improve the situation, but
did not eliminate the problem. The documentation on the swap space setting is
very vague; my guess is that it is reserved disk space that can be used in
lieu of memory, and that using it slows execution.
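For what it is worth, these are the sort of checks that can be run inside
the container to look at disk, memory, and swap separately (a sketch; the
path is just my work directory):

df -h ~/docker_work     # disk space available to the container's filesystem
free                    # memory and swap, reported in KiB (as in the trace below)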
So I first increased the swap space to 2 GB and then 3 GB and still had the
problem. I then realised that the problem was probably memory and that I
needed to track what was happening with it, so I reconfigured "rn" to run
star as a background process while regularly monitoring memory using "free".
I reconfigured Docker for Windows first with 3.5 GB, then 4 GB and then 5 GB
of memory, with swap space at 1 GB. At 5 GB, which is right near the limit
of what my 8 GB laptop can take, the program just made it.

I have provided a trace of memory usage below, copied from the terminal
output, with a few inserted comments.
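For context, here is a minimal sketch of the sort of change I made to "rn"
so that star runs in the background (the logging loop is only an
illustration of the idea; as the trace shows, in practice I simply ran free
by hand at the prompt):

#!/bin/bash
# sketch: start star in the background, then sample memory while it runs
./star &
star_pid=$!
while kill -0 "$star_pid" 2>/dev/null; do
    { date; free; echo; } >> memory.log   # free reports KiB by default
    sleep 60                              # arbitrary sampling interval
done
wait "$star_pid"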
docker@023217a8d7fb:~/docker_work/11M$ ps
  PID TTY          TIME CMD
   43 pts/0    00:00:00 bash
   57 pts/0    02:42:49 star
   81 pts/0    00:00:00 pgxwin_server
   95 pts/0    00:00:00 ps
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     3844488      171136         764     1033544      941636
Swap:       1048572       32960     1015612
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4043912      130348         764      874908      743832
Swap:       1048572       33416     1015156
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4242800      101772         764      704596      546720
Swap:       1048572       33960     1014612
Comment: the big drop here was associated with decreasing mesh_delta_coeff
and thereby doubling the number of zones to cater for super AGB behavior.
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4640768      113848         756      294552      155820
Swap:       1048572       36852     1011720
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4849480      111144         276       88544       15976
Swap:       1048572      294752      753820
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4854980      113520         276       80668       15320
Swap:       1048572      292968      755604
Comment: the big drop in available memory here was associated with placing
many eosPTEH files into the cache, and was accompanied by a low CPU load
while it was happening. When the available memory became low, the swap space
came into play. The drop in available memory, together with the change in
buff/cache, suggests about 800,000 KiB (roughly 800 MB) was needed. This was
around model 1600.
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4838820      129980         276       80368       31604
Swap:       1048572      291796      756776
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4838736      129728         276       80704       31908
Swap:       1048572      288176      760396
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4839172      128292         276       81704       30756
Swap:       1048572      288080      760492
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4840036      127472         276       81660       30032
Swap:       1048572      288072      760500
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4839896      120628         276       88644       26488
Swap:       1048572      288056      760516
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4840088      120144         276       88936       26024
Swap:       1048572      288056      760516
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4861784      113448          36       73936       11292
Swap:       1048572      490492      558080
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4817124      106900          12      125144       30308
Swap:       1048572      839220      209352
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168     4865112      101184          12       82872        4236
Swap:       1048572      822080      226492
Comment: around model 3000 the run crashed with a timestep-too-small error,
but it is clear that memory is right near the limit, with almost no
available memory left and most of the swap space used.
docker@023217a8d7fb:~/docker_work/11M$ free
              total        used        free      shared  buff/cache   available
Mem:        5049168      171544     4792304          12       85320     4696848
Swap:       1048572      159496      889076
Comment: restoration of memory after the crash
So the extra memory required by the latest eosPTEH features in version 10398
is right on the edge for me, needing Docker for Windows configured with 5 GB
of memory. Are there bugs or memory leaks in the code which are increasing
the memory requirement?

Since I was able to run the inlist before with Docker for Windows configured
with 3 GB of memory, it seems the extra memory needed to use the eosPTEH
feature is about 2 GB or more, and it has to be memory rather than disk
space, unless the performance hit of heavy swap usage is accepted. Is there
a way around this?
Kind regards
Ian
On 13 April 2018 at 19:32, Rob Farmer <r.j.farmer at uva.nl> wrote:
> > If the process has been killed I am confident that it is not a shortage
> of memory problem since I increased docker for windows allocation from 3GB
> to 3.5GB and still had the same problem with the same inlist.
>
> Others have had problems recently with machines that have only 4 GB of
> RAM. The problem occurs because we move the cache files around, which
> requires a call to fork(); that needs to copy the address space, so your
> machine temporarily needs 2x the memory.
>
> You may find setting the environment variable:
> export MESA_TEMP_CACHES_DISABLE=1
>
> helpful, which stops us moving the cache files around.
>
> Rob
>
>
>
> On 13 April 2018 at 10:43, Ian Foley via Mesa-users <
> mesa-users at lists.mesastar.org> wrote:
>
>> Hi Evan,
>>
>> Correction. Can increase the swap space in 0.5 GB chunks. The disk image
>> can be increased to 64 GB and I am only using 31 GB, so that's not the
>> limit. Will see what increasing the swap space does.
>>
>> Kind regards
>> Ian
>>
>> On 13 April 2018 at 18:38, Ian Foley <ifoley2008 at gmail.com> wrote:
>>
>>> Hi Evan,
>>>
>>> Is that what defines the "swap space" limit setting in docker for
>>> windows? Currently it is set at 1GB and can only be increased in 1GB
>>> chunks. I have plenty of hard drive disk space. I will give your suggestion
>>> a try and see what happens and let you know.
>>>
>>> Kind regards
>>> Ian
>>>
>>> On 13 April 2018 at 16:36, Evan Bauer <ebauer at physics.ucsb.edu> wrote:
>>>
>>>> Hi Ian,
>>>>
>>>> How is your disk usage? Caching the PTEH files can require several GB
>>>> of disk space, so I wonder if this is a symptom of the container running
>>>> out of available space to write the cache files?
>>>>
>>>> Cheers,
>>>> Evan
>>>>
>>>>
>>>> > On Apr 12, 2018, at 9:18 PM, Ian Foley via Mesa-users <
>>>> mesa-users at lists.mesastar.org> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > I am running Docker for Windows on Windows 10 Pro on a machine with 4
>>>> > cores and 8 GB of RAM. I have been working on developing inlists to
>>>> > evolve 1M to 15M stars from pre_ms to the end (wd or cc). I have been
>>>> > doing this since version 7624, and whenever a run failed, MESA always
>>>> > gave a reason related to the inlist chosen, e.g. a timestep limit.
>>>> >
>>>> > However, with version 10398, the run has failed quite often, either by
>>>> > just hanging with the screen displays intact while CPU usage drops to
>>>> > <10% instead of >50% (program stuck in a loop?), or because the
>>>> > operating system kills the run. When the process has been killed, I am
>>>> > confident it is not a shortage-of-memory problem, since I increased the
>>>> > Docker for Windows allocation from 3 GB to 3.5 GB and still had the same
>>>> > problem with the same inlist.
>>>> >
>>>> > In all these cases, the run environment was one with a large envelope
>>>> > and very low surface pressure and density, and the run needed to cache
>>>> > new eosPTEH files.
>>>> >
>>>> > I have attached all the files necessary for users to execute a rerun
>>>> > and hopefully reproduce the error and locate the problem. I have also
>>>> > attached all the terminal output (11M.txt) and two files showing the
>>>> > last screens from pgstar. This run hung, as can be seen by looking at
>>>> > the final output in 11M.txt.
>>>> >
>>>> > It perhaps should be noted that by changing the inlist to set
>>>> >
>>>> > use_eosPTEH_for_low_density = .false. ! default .true.
>>>> > use_eosPTEH_for_high_Z = .false. ! default .true.
>>>> >
>>>> > I was able to complete the run with the final outcome being wd -
>>>> just. The He core was 1.33M.
>>>> >
>>>> > Please note that run_star_extras.f does quite a few things whose
>>>> > intention is to get a successful run without having to stop mid-way and
>>>> > change parameters. One of the strategies is to dynamically change
>>>> > var_control according to the number of retries in the last 10 models,
>>>> > thus allowing larger delta changes in parameters and letting var_control
>>>> > determine the size of logdt. This usually works, but can fail when there
>>>> > are a large number of retries in 10 models and var_control does not
>>>> > adapt fast enough.
>>>> >
>>>> > Kind regards
>>>> > Ian
>>>> >
>>>> >
>>>> > <11M.txt><history_columns.list><ian40r.net><inlist_11M.0><profile_columns.list>
>>>> > <run_star_extras.f><grid6_001290.png><profile_Panels3_001290.png>
>>>> >
>>>>
>>>>
>>>
>>
>