[Mesa-users] Problem with MESA version 10398.
Rob Farmer
r.j.farmer at uva.nl
Tue Apr 17 07:14:50 EDT 2018
There are no known memory leaks at the moment (though there may be
unknown ones in the code); I'm running your model under valgrind now,
but that will take time.
The problem is mostly that there are more eos files (i.e. a finer grid in
the X,Z plane), which drives up the memory usage; the PTEH eos files are
also larger than the older eos files. For now your best bet is to turn off
the new eos files (or find a machine with more RAM). Increasing your swap
space even further might help relieve some of the memory pressure, though
I'm not sure what the performance will be like; you probably want to make
swap equal to the amount of RAM you have.
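
Concretely, turning off the new eos files means setting in your inlist the
two flags Ian quotes further down in this thread, i.e. something like:

  use_eosPTEH_for_low_density = .false. ! default .true.
  use_eosPTEH_for_high_Z = .false.      ! default .true.
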
If you're feeling adventurous, the current svn head
(svn co svn://svn.code.sf.net/p/mesa/code/trunk mesa) has some fixes for
this which may help, but I'm not sure whether they would be enough in your
case.
Rob
On 17 April 2018 at 04:33, Ian Foley <ifoley2008 at gmail.com> wrote:
> Hi Rob and Evan,
>
> I've been following up on your suggestions, which have been very helpful
> (thank you!). Rob, following your recommendation, I have since set the
> environment variable you suggested, but problems remain.
>
> Evan, you made suggestions in the area of disk space. Disk space on my
> hard drive is not a problem. However, I am not clear on the relationship
> between disk space and the size of the container. When I reported the
> problem I had docker for windows configured with 3 GB of memory and 1 GB
> of swap space. Increasing the swap space seemed to improve the situation,
> but not eliminate it. The documentation on the swap space is very vague;
> I'm guessing it is reserved disk space that can be used in lieu of memory,
> and that using it slows execution.
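>
> For what it's worth, the limits the container actually sees can be checked
> from inside it with standard Linux tools (nothing MESA-specific; output
> details vary by setup):
>
>   free -h        # the Mem: and Swap: totals are the container's limits
>   swapon --show  # lists the swap devices backing that swap, if any exist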
>
> So I first increased the swap space to 2 GB and then 3 GB and still had
> the problem. Then I realised that the problem was probably memory and that
> I needed to be able to track what was happening with it, so I reconfigured
> "rn" to run star as a background process while regularly monitoring memory
> with "free". I reconfigured docker for windows first with 3.5 GB, then
> 4 GB and then 5 GB of memory, with swap space at 1 GB. At 5 GB, which is
> right near the limit of what my 8 GB laptop can take, the program just
> made it. I have provided a trace of memory usage below, copied from the
> terminal output, with a few inserted comments.
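>
> The monitoring itself was nothing fancy; roughly this (the log file name
> and the one-minute interval are only illustrative):
>
>   ./star > 11M.txt 2>&1 &          # run star in the background
>   while sleep 60; do free; done    # print memory usage once a minute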
>
> docker@023217a8d7fb:~/docker_work/11M$ ps
>   PID TTY          TIME CMD
>    43 pts/0    00:00:00 bash
>    57 pts/0    02:42:49 star
>    81 pts/0    00:00:00 pgxwin_server
>    95 pts/0    00:00:00 ps
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     3844488      171136         764     1033544      941636
> Swap:       1048572       32960     1015612
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4043912      130348         764      874908      743832
> Swap:       1048572       33416     1015156
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4242800      101772         764      704596      546720
> Swap:       1048572       33960     1014612
>
> Comment: the big drop here was associated with decreasing the
> mesh_delta_coeff
> and thereby doubling the number of zones to cater for super AGB behavior.
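>
> (For anyone reproducing this: the control I changed is mesh_delta_coeff
> in &controls; smaller values give a finer mesh, so halving it roughly
> doubles the number of zones. The value below is only illustrative.)
>
>   mesh_delta_coeff = 0.5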
>
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4640768      113848         756      294552      155820
> Swap:       1048572       36852     1011720
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4849480      111144         276       88544       15976
> Swap:       1048572      294752      753820
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4854980      113520         276       80668       15320
> Swap:       1048572      292968      755604
>
> Comment: the big drop in available memory here was associated with placing
> many eosPTEH files into the cache, and it was accompanied by low CPU usage
> while it was happening. When the available memory became low, the swap
> space came into play. The drop in available memory plus the amount added
> to the cache suggests that about 800K was needed. This was around model
> 1600.
>
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4838820      129980         276       80368       31604
> Swap:       1048572      291796      756776
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4838736      129728         276       80704       31908
> Swap:       1048572      288176      760396
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4839172      128292         276       81704       30756
> Swap:       1048572      288080      760492
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4840036      127472         276       81660       30032
> Swap:       1048572      288072      760500
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4839896      120628         276       88644       26488
> Swap:       1048572      288056      760516
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4840088      120144         276       88936       26024
> Swap:       1048572      288056      760516
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4861784      113448          36       73936       11292
> Swap:       1048572      490492      558080
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4817124      106900          12      125144       30308
> Swap:       1048572      839220      209352
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168     4865112      101184          12       82872        4236
> Swap:       1048572      822080      226492
>
> Comment: around model 3000 the run crashed with a timestep-too-small
> error, but it is clear that memory is right at the limit, with almost no
> available memory left and most of the swap space used.
>
> docker@023217a8d7fb:~/docker_work/11M$ free
>               total        used        free      shared  buff/cache   available
> Mem:        5049168      171544     4792304          12       85320     4696848
> Swap:       1048572      159496      889076
>
> Comment: restoration of memory after the crash
>
> So the extra requirements of using the latest eosPTEH features in version
> 10398 are right on the edge for me, needing docker for windows configured
> with 5 GB of memory. Are there bugs or memory leaks in the code which are
> adding to the memory allocation problem?
>
> Since I was able to run the inlist with docker for windows configured with
> 3 GB of memory before, it seems the extra memory needed to use the eosPTEH
> feature is about 2 GB or more of memory, not disk space, unless the
> performance hit of heavy swapping to disk is accepted.
>
> Is there a way around this?
>
> Kind regards
> Ian
>
> On 13 April 2018 at 19:32, Rob Farmer <r.j.farmer at uva.nl> wrote:
>
>> > If the process has been killed I am confident that it is not a shortage
>> > of memory problem since I increased docker for windows allocation from
>> > 3GB to 3.5GB and still had the same problem with the same inlist.
>>
>> Others have had problems recently with machines that have only 4 GB of
>> RAM. The problem occurs because we move the cache files around, which
>> requires a call to fork(); that needs to copy the address space, so your
>> machine temporarily needs 2x the memory.
>>
>> You may find it helpful to set the environment variable:
>>
>> export MESA_TEMP_CACHES_DISABLE=1
>>
>> which stops us moving the cache files around.
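>>
>> In practice that just means exporting it in the same shell before starting
>> the run (the ./rn script here is the usual work-directory one):
>>
>>   export MESA_TEMP_CACHES_DISABLE=1
>>   ./rn
>>
>> or adding the export line to ~/.bashrc in the container so it persists.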
>>
>> Rob
>>
>>
>>
>> On 13 April 2018 at 10:43, Ian Foley via Mesa-users <
>> mesa-users at lists.mesastar.org> wrote:
>>
>>> Hi Evan,
>>>
>>> Correction: I can increase the swap space in 0.5 GB chunks. The disk
>>> image can be increased to 64 GB and I am only using 31 GB, so that's not
>>> the limit. I will see what increasing the swap space does.
>>>
>>> Kind regards
>>> Ian
>>>
>>> On 13 April 2018 at 18:38, Ian Foley <ifoley2008 at gmail.com> wrote:
>>>
>>>> Hi Evan,
>>>>
>>>> Is that what defines the "swap space" limit setting in docker for
>>>> windows? Currently it is set at 1 GB and can only be increased in 1 GB
>>>> chunks. I have plenty of hard drive disk space. I will give your
>>>> suggestion a try, see what happens, and let you know.
>>>>
>>>> Kind regards
>>>> Ian
>>>>
>>>> On 13 April 2018 at 16:36, Evan Bauer <ebauer at physics.ucsb.edu> wrote:
>>>>
>>>>> Hi Ian,
>>>>>
>>>>> How is your disk usage? Caching the PTEH files can require several GB
>>>>> of disk space, so I wonder if this is a symptom of the container running
>>>>> out of available space to write the cache files?
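>>>>>
>>>>> A quick way to check from inside the container (the cache path below is
>>>>> my guess at the usual layout under $MESA_DIR; adjust if yours differs):
>>>>>
>>>>>   df -h .                        # free space where you are writing
>>>>>   du -sh $MESA_DIR/data/*/cache  # size of each module's cache directory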
>>>>>
>>>>> Cheers,
>>>>> Evan
>>>>>
>>>>>
>>>>> > On Apr 12, 2018, at 9:18 PM, Ian Foley via Mesa-users <
>>>>> mesa-users at lists.mesastar.org> wrote:
>>>>> >
>>>>> > Hi,
>>>>> >
>>>>> > I am running docker for windows on Windows 10 Pro on a machine with
>>>>> > 4 cores and 8 GB of RAM. I have been working on developing inlists to
>>>>> > evolve 1M to 15M stars from pre_ms to the end (wd or cc). I have been
>>>>> > doing this since version 7624, and whenever a run failed MESA would
>>>>> > always give a reason related to the inlist chosen, e.g. a timestep
>>>>> > limit.
>>>>> >
>>>>> > However, with version 10398, the run has failed quite often, either
>>>>> > by just hanging (screen displays left intact, but CPU drops to <10%
>>>>> > instead of >50%; program in a loop?) or by the operating system
>>>>> > killing the run. Where the process has been killed, I am confident it
>>>>> > is not a shortage-of-memory problem, since I increased the docker for
>>>>> > windows allocation from 3 GB to 3.5 GB and still had the same problem
>>>>> > with the same inlist.
>>>>> >
>>>>> > In all these cases, the run environment was one with a large
>>>>> > envelope and very low surface pressure and density, and the run was
>>>>> > needing to cache new eosPTEH files.
>>>>> >
>>>>> > I have attached all the files necessary for users to execute a rerun
>>>>> > and hopefully reproduce the error and locate the problem. I have also
>>>>> > attached all the terminal output (11M.txt) and two files showing the
>>>>> > last screens from pgstar. This run hung, as can be seen by looking at
>>>>> > the final output in 11M.txt.
>>>>> >
>>>>> > It perhaps should be noted that by changing the inlist to set
>>>>> >
>>>>> > use_eosPTEH_for_low_density = .false. ! default .true.
>>>>> > use_eosPTEH_for_high_Z = .false. ! default .true.
>>>>> >
>>>>> > I was able to complete the run with the final outcome being wd -
>>>>> > just. The He core was 1.33M.
>>>>> >
>>>>> > Please note that run_star_extras.f does quite a few things whose
>>>>> > intention is to try to get a successful run without having to stop
>>>>> > mid-way and change parameters. One of the strategies is to
>>>>> > dynamically change var_control according to the number of retries in
>>>>> > the last 10 models, and thus allow larger delta changes in parameters
>>>>> > and let var_control determine the size of log(dt). This usually
>>>>> > works, but can fail when there are a large number of retries in 10
>>>>> > models and var_control does not adapt fast enough.
>>>>> >
>>>>> > Kind regards
>>>>> > Ian
>>>>> >
>>>>> >
>>>>> > <11M.txt> <history_columns.list> <ian40r.net> <inlist_11M.0>
>>>>> > <profile_columns.list> <run_star_extras.f> <grid6_001290.png>
>>>>> > <profile_Panels3_001290.png>
>>>>> > _______________________________________________
>>>>> > mesa-users at lists.mesastar.org
>>>>> > https://lists.mesastar.org/mailman/listinfo/mesa-users
>>>>> >
>>>>>
>>>>>
>>>>
>>>
>>
>