From stefan at mdy.univie.ac.at Tue Dec 1 02:18:50 2009
From: stefan at mdy.univie.ac.at (Stefan Boresch)
Date: Tue, 1 Dec 2009 11:18:50 +0100
Subject: [Caos] a strange time (hwclock?) phenomenon
In-Reply-To: <20091130173013.GB22028@kainx.org>
References: <20091130170413.GD12477@loop.mdy.univie.ac.at> <20091130173013.GB22028@kainx.org>
Message-ID: <20091201101850.GH12477@loop.mdy.univie.ac.at>

Dear Michael,

On Mon, Nov 30, 2009 at 09:30:13AM -0800, Michael Jennings wrote:
> On Monday, 30 November 2009, at 18:04:13 (+0100),
> Stefan Boresch wrote:
>
> > As I mentioned in passing ("ahci snafu", which seems resolved in Greg's
> > experimental ISO images), I have a freshly installed caos linux
> > system where I occasionally get hung upon restart
> >
> > with
> >
> > Manual file system repair needed!"
> > Login and run 'fsck -A -T' to resolve."
> >
>
> If you keep your hardware clock in local time instead of UTC, this
> discrepancy will appear on boot. A fix for this has been committed
> and is currently in the "testing" repo.
>

Thanks!

I am running testing on this machine by default, so I got your patches
overnight. Upon my first reboot, I still encountered the problem.

I lost patience and switched to UTC=true in the new /etc/sysconfig/clock.
Since then, I have rebooted three times without error.

So, (maybe) my problem is solved. A few things still confuse me though.

1) I was never asked whether I wanted to use UTC or LOCAL (is there an
option in (the new) sidekick?). The way the old scripts work (at least
if I understand hwclock correctly), everything should have defaulted to LOCAL
(/etc/adjtime definitely contains LOCAL); thus, the fixes foremost
enable UTC support.

2) As an observation, Ubuntu (9.10) doesn't ask about UTC anymore
during install either, but silently defaults to UTC.

3) I have a second machine almost identical to the one I am having problems
with, except that it has a normal hard disk (probably in enhanced IDE mode
instead of AHCI) instead of an Intel SSD. I never encountered this problem
there, and hwclock is implicitly running with LOCAL (since /etc/adjtime
contains LOCAL).

Thanks again,

Stefan (still confused)

PS: I believe in the new sysinit the second call to hwclock now contains
--hctosys twice (once on the command line and once as part of CLOCKFLAGS
set earlier ...)

--
Stefan Boresch
Institute for Computational Biological Chemistry
University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria
Phone: -43-1-427752715 Fax: -43-1-427752790

From stefan at mdy.univie.ac.at Tue Dec 1 05:00:09 2009
From: stefan at mdy.univie.ac.at (Stefan Boresch)
Date: Tue, 1 Dec 2009 14:00:09 +0100
Subject: [Caos] sidekick openoffice installation broken
In-Reply-To: <20091126132038.GA14289@loop.mdy.univie.ac.at>
References: <20091126132038.GA14289@loop.mdy.univie.ac.at>
Message-ID: <20091201130009.GK12477@loop.mdy.univie.ac.at>

Dear everyone,

On Thu, Nov 26, 2009 at 02:20:38PM +0100, Stefan Boresch wrote:
> the 3rd party option in sidekick for openoffice is broken as it looks
> for OO 3.0.0, which is not available anymore.
>

The new sidekick in testing now knows the latest openoffice download link,
so that part works. However (on amd64) something breaks later, and one ends
up with most of the stuff in /opt/openoffice, but no soffice ...

I tried to scroll up, but can't figure out what's going on (it complains
that all the java stuff is already there, but I think the problem occurs
later ...)

Best regards,

Stefan

PS: I think it might be worthwhile to ask the user whether he wants to
save the download -- bandwidth at the office is wide and cheap, but if
I had tried this at home you'd hear me cursing very loudly ... (i.e., if
sidekick fails, the user could use the tar.gz file to attempt an
installation by hand ...)

--
Stefan Boresch
Institute for Computational Biological Chemistry
University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria
Phone: -43-1-427752715 Fax: -43-1-427752790

From stefan at mdy.univie.ac.at Tue Dec 1 05:31:53 2009
From: stefan at mdy.univie.ac.at (stefan at mdy.univie.ac.at)
Date: Tue, 1 Dec 2009 14:31:53 +0100 (CET)
Subject: [Caos] java support in firefox
Message-ID: <41405.131.130.40.60.1259674313.squirrel@www.mdy.univie.ac.at>

After a default (desktop) install of caos, there is no java support in
firefox, although java-1.6.0-sun provides an up-to-date JRE.

One would think that installing java-1.6.0-sun-plugin would do the trick.
However, the link

/usr/lib64/mozilla/plugins/libjavaplugin_jni.so -> /etc/alternatives/libjavaplugin_jni.so -> XXX

points to nowhere, i.e., java-1.6.0-sun-plugin installs libjavaplugin_jni.so
someplace else. But even afterwards, there is no java.

Apparently, plugin support is now provided by libnpjp2.so, which is
ALREADY PROVIDED by java-1.6.0-sun
(/usr/lib/jvm/java-1.6.0-sun-1.6.0.17/jre/lib/amd64/libnpjp2.so).
All one needs to do is set the link ... see:

http://www.profarius.com/content/64bit-java-flash-deathroll

So, java in firefox is just a link away, but does not work out of the box.
Further, one wonders as to the purpose of java-1.6.0-sun-plugin.

Best regards,

Stefan

From gartim at gmail.com Tue Dec 1 11:44:36 2009
From: gartim at gmail.com (gary artim)
Date: Tue, 1 Dec 2009 11:44:36 -0800
Subject: [Caos] slurm finding no nodes.
Message-ID:

Hi (newbie) --

I'm running a small stateless cluster with the default slurm config,
defined through sidekick.
When I do an sinfo or scontrol show node n0000 I get: node not found.
Has anyone experienced this?
Last week I spent hours trying, rebooting, etc., then suddenly they
appeared; now after a reboot I'm back
to no nodes. If I do perceus node list it shows:
n0000
n0001

Any help would be great, thanks,

-- Gary

From gmk at infiscale.org Tue Dec 1 17:07:52 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Tue, 1 Dec 2009 17:07:52 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References:
Message-ID: <20091202010752.GK31788@infiscale.org>

Hi Gary,

How did you configure SLURM? Is it the default sidekick config?

Check the nodes themselves to make sure they are running the slurm client
daemons and that they can reach the master by the hostname specified in the
slurm.conf.

Thanks,
Greg

On Tuesday, 01 December 2009, at 11:44:36 (-0800),
gary artim wrote:

> Hi (newbie) --
>
> I'm running a small stateless cluster with the default slurm config,
> defined through sidekick.
> When I do an sinfo or scontrol show node n0000 I get: node not found.
> Has anyone experienced this?
> Last week I spent hours trying, rebooting, etc., then suddenly they
> appeared; now after a reboot I'm back
> to no nodes. If I do perceus node list it shows:
> n0000
> n0001
>
> Any help would be great, thanks,
>
> -- Gary
> _______________________________________________
> Caos mailing list
> Caos at lists.infiscale.org
> http://lists.infiscale.org/mailman/listinfo/caos

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From gartim at gmail.com Tue Dec 1 17:44:29 2009
From: gartim at gmail.com (gary artim)
Date: Tue, 1 Dec 2009 17:44:29 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To: <20091202010752.GK31788@infiscale.org>
References: <20091202010752.GK31788@infiscale.org>
Message-ID:

Hi Greg --

Thanks much for getting back. I did check and could ping the headnode
by ipaddr, but not by
hostname. Been really frustrated by this, and began reinstalling,
doing a full install.
I had just done a cluster install only. Will email
tomorrow if this helps. With my limited knowledge of both slurm and
caos, this is what I think I
need to do to get slurm working:

1) (sidekick) perceus.
2) (sidekick) vnfs, for the compute nodes.
3) (sidekick) slurm
4) mount, cp, unmount the vnfs -- copying the /etc/slurm/* to the vnfs
5) roll out the vnfs (reboot or perceus command)

I could be all wet here and the full install does some of this.

btw, I really like the speed of caos, but now I find I'm more trigger
happy and get in my
own way. Thanks for guidance...

-- Gary

From gartim at gmail.com Tue Dec 1 17:46:38 2009
From: gartim at gmail.com (gary artim)
Date: Tue, 1 Dec 2009 17:46:38 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References: <20091202010752.GK31788@infiscale.org>
Message-ID:

slight revision -- I never tested ping by slurm.conf hostname.
will try...Gary

From stefan at mdy.univie.ac.at Wed Dec 2 01:43:18 2009
From: stefan at mdy.univie.ac.at (Stefan Boresch)
Date: Wed, 2 Dec 2009 10:43:18 +0100
Subject: [Caos] a strange time (hwclock?) phenomenon
In-Reply-To: <20091201101850.GH12477@loop.mdy.univie.ac.at>
References: <20091130170413.GD12477@loop.mdy.univie.ac.at> <20091130173013.GB22028@kainx.org> <20091201101850.GH12477@loop.mdy.univie.ac.at>
Message-ID: <20091202094318.GM12477@loop.mdy.univie.ac.at>

Michael,

just to follow up on my earlier mail: since switching to UTC in
combination with the most recent startup scripts, the problems have
disappeared. Unfortunately, this means I can't tell you whether it
would work / works for LOCAL.

Incidentally, I was setting up an Ubuntu machine yesterday, identical
to one I had set up a few days earlier, and I did encounter a similar
problem there (despite Ubuntu being UTC by default ...). So I conclude
that the interactions of hwclock with the BIOS and the kernel depend on
the phase of the moon ...

Best regards,

Stefan

--
Stefan Boresch
Institute for Computational Biological Chemistry
University of Vienna, Waehringerstr.
17 A-1090 Vienna, Austria
Phone: -43-1-427752715 Fax: -43-1-427752790

From gartim at gmail.com Wed Dec 2 10:00:11 2009
From: gartim at gmail.com (gary artim)
Date: Wed, 2 Dec 2009 10:00:11 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References: <20091202010752.GK31788@infiscale.org>
Message-ID:

Did a full install and checked whether I could ping my headnode (hn0000)
from a compute node (n0000), and yes,
it worked. Yet sinfo only shows hn0000 as available, i.e., no compute nodes.
Tried restarting all nodes; no dice.

Should I move this problem to the slurm list? thanks much...

From gmk at infiscale.org Wed Dec 2 11:19:06 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Wed, 2 Dec 2009 11:19:06 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References: <20091202010752.GK31788@infiscale.org>
Message-ID: <20091202191906.GM31788@infiscale.org>

Hi Gary,

Can you verify that the contents of /etc/slurm are the same on the master
and the nodes? If so, can you forward your slurm.conf?

Greg

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From gartim at gmail.com Wed Dec 2 13:33:06 2009
From: gartim at gmail.com (gary artim)
Date: Wed, 2 Dec 2009 13:33:06 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To: <20091202191906.GM31788@infiscale.org>
References: <20091202010752.GK31788@infiscale.org> <20091202191906.GM31788@infiscale.org>
Message-ID:

did check, diff'ed them,
attached slurm.conf

-------------- next part --------------
A non-text attachment was scrubbed...
Name: slurm.conf
Type: application/octet-stream
Size: 1977 bytes
Desc: not available
Url : http://lists.infiscale.org/pipermail/caos/attachments/20091202/faf30cf2/attachment.obj

From gartim at gmail.com Wed Dec 2 15:00:03 2009
From: gartim at gmail.com (gary artim)
Date: Wed, 2 Dec 2009 15:00:03 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References: <20091202010752.GK31788@infiscale.org> <20091202191906.GM31788@infiscale.org>
Message-ID:

Hi Greg --

scontrol update NodeName=n0000 State=resume

gets them up and available. Weird, maybe a network timing issue?

I could add a script as a workaround, but it is still an
outstanding problem.
Let me know if you want any debug output.
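
A minimal sketch of the kind of workaround script mentioned above, not a fix for the underlying problem. It only wraps the `scontrol update NodeName=... State=resume` command quoted in this message; the node names n0000 and n0001 are taken from the thread, while the function names and the DRY_RUN switch are hypothetical and exist purely for illustration.

```shell
#!/bin/sh
# Sketch: return SLURM compute nodes to service after a reboot.
# Assumptions: node names come from the thread (n0000, n0001); the
# helper names and DRY_RUN are illustrative, not part of SLURM or caos.
# With DRY_RUN=1 (the default here) the commands are only printed.

DRY_RUN="${DRY_RUN:-1}"

# Print the scontrol command line for a single node.
resume_cmd() {
    printf 'scontrol update NodeName=%s State=resume\n' "$1"
}

# Print (dry run) or execute the resume command for every node given.
resume_nodes() {
    for node in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            resume_cmd "$node"
        else
            scontrol update NodeName="$node" State=resume
        fi
    done
}

resume_nodes n0000 n0001
```

Run as is, the script only lists the commands it would issue; setting DRY_RUN=0 on the head node would execute them, which presumes scontrol is on the PATH and the caller is allowed to change node state.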
-- Gary On Wed, Dec 2, 2009 at 1:33 PM, gary artim wrote: > did check, diff'ed them, > attached slurm.com > > On Wed, Dec 2, 2009 at 11:19 AM, Greg Kurtzer wrote: >> Hi Gary, >> >> Can you verify that the contents of /etc/slurm are the same on the master >> and the nodes? If so, can you forward your slurm.conf? >> >> Greg >> >> On Wednesday, 02 December 2009, at 10:00:11 (-0800), >> gary artim wrote: >> >>> did a full install and checked if i could ping my headnode (hn0000) >>> from computenode (n0000) and yes >>> it worked. yet sinfo only shows available hn0000, ie no compute nodes. >>> tried restarting all nodes no dice. >>> >>> should I move this problem to the slurm list? thanks much... >>> >>> On Tue, Dec 1, 2009 at 5:46 PM, gary artim wrote: >>> > slight revision -- I never tested ping by slurm.conf hostname. will try...Gary >>> > >>> > On Tue, Dec 1, 2009 at 5:44 PM, gary artim wrote: >>> >> Hi Greg -- >>> >> >>> >> Thanks much for getting back. I did check and could ping the headnode >>> >> by ipaddr, but not by >>> >> hostname. Been really frustrated by this, and began reinstalling, >>> >> doing full install. >>> >> I had just done a cluster install only. Will email >>> >> tomorrow if this helps. With my limited knowledge of both slurm and >>> >> caos, this is what I think i >>> >> need to do to get slurm working: >>> >> >>> >> 1) (sidekick) perceus. >>> >> 2) (sidekick) vnfs, for the compute nodes. >>> >> 3) (sidekick) slurm >>> >> 4) mount, cp, unmount the vnfs -- copying the /etc/slurm/* to the vnfs >>> >> 5) roll out the vnfs (reboot or perceus command) >>> >> >>> >> I could be all wet here and the full install does some of this. >>> >> >>> >> btw, I really like the speed of caos, but now i find i'm more trigger >>> >> happy and get in my >>> >> own way. Thanks for guidance... >>> >> >>> >> -- Gary >>> >> >>> >> >>> >> On Tue, Dec 1, 2009 at 5:07 PM, Greg Kurtzer wrote: >>> >>> Hi Gary, >>> >>> >>> >>> How did you configure SLURM? 
Is it the default sidekick config? >>> >>> >>> >>> Check the nodes themselves to make sure they are running the slurm client >>> >>> daemons and that they can reach the master by the hostname specified in the >>> >>> slurm.conf. >>> >>> >>> >>> Thanks, >>> >>> Greg >>> >>> >>> >>> On Tuesday, 01 December 2009, at 11:44:36 (-0800), >>> >>> gary artim wrote: >>> >>> >>> >>>> HI (newbie) -- >>> >>>> >>> >>>> I'm running a small stateless cluster with the default slurm config, >>> >>>> defined through sidekick. >>> >>>> When I do an sinfo or scontrol show node n0000 I get: node not found. >>> >>>> Has anyone experienced this? >>> >>>> Last week I spend hours trying rebooting, etc, then suddenly they >>> >>>> appeared, now after a reboot I'm back >>> >>>> to no nodes. If I do perceus node list it shows: >>> >>>> n0000 >>> >>>> n0001 >>> >>>> >>> >>>> Any help would be great, thanks, >>> >>>> >>> >>>> -- Gary >>> >>>> _______________________________________________ >>> >>>> Caos mailing list >>> >>>> Caos at lists.infiscale.org >>> >>>> http://lists.infiscale.org/mailman/listinfo/caos >>> >>> >>> >>> -- >>> >>> Greg M. Kurtzer >>> >>> Chief Technology Officer >>> >>> HPC Systems Architect >>> >>> Infiscale, Inc. - http://www.infiscale.com >>> >>> _______________________________________________ >>> >>> Caos mailing list >>> >>> Caos at lists.infiscale.org >>> >>> http://lists.infiscale.org/mailman/listinfo/caos >>> >>> >>> >> >>> > >>> _______________________________________________ >>> Caos mailing list >>> Caos at lists.infiscale.org >>> http://lists.infiscale.org/mailman/listinfo/caos >> >> -- >> Greg M. Kurtzer >> Chief Technology Officer >> HPC Systems Architect >> Infiscale, Inc. 
- http://www.infiscale.com >> _______________________________________________ >> Caos mailing list >> Caos at lists.infiscale.org >> http://lists.infiscale.org/mailman/listinfo/caos >> >

From gmk at infiscale.org Wed Dec 2 21:16:34 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Wed, 2 Dec 2009 21:16:34 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References: <20091202010752.GK31788@infiscale.org> <20091202191906.GM31788@infiscale.org>
Message-ID: <20091203051634.GP31788@infiscale.org>

Hi Gary,

Can you try to change the "NodeName=" to only reference your nodes (e.g.
"n[0000-0002]") and remove the DownNodes line from the config. Then
repropagate the configuration to the nodes and restart all daemons.

Let me know if that helps!

Thanks,
Greg

On Wednesday, 02 December 2009, at 15:00:03 (-0800),
gary artim wrote:

> Hi Greg --
>
> scontrol update NodeName=n0000 State=resume
>
> get them up and available. weird, maybe a network timing issue?
>
> i could add a script as a circumvention, but still an
> outstanding problem.
> Let me know if you want any debug output.
>
> -- Gary

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From gartim at gmail.com Thu Dec 3 06:37:57 2009
From: gartim at gmail.com (gary artim)
Date: Thu, 3 Dec 2009 06:37:57 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To: <20091203051634.GP31788@infiscale.org>
References: <20091202010752.GK31788@infiscale.org> <20091202191906.GM31788@infiscale.org> <20091203051634.GP31788@infiscale.org>
Message-ID:

Hi Greg --

I did try this a while back, but can redo the test. Out of the office
till Monday, will get back then. Is there any other service that needs
to run to permit compute nodes to be auto recognized? I read the slurm
faq and troubleshooting pages and they did mention that if the nodes'
time-of-day is inconsistent with the head node, slurm would not
activate them. thanks much,

-- Gary

On Wed, Dec 2, 2009 at 9:16 PM, Greg Kurtzer wrote:
> Hi Gary,
>
> Can you try to change the "NodeName=" to only reference your nodes (e.g.
> "n[0000-0002]") and remove the DownNodes line from the config. Then
> repropagate the configuration to the nodes and restart all daemons.
>
> Let me know if that helps!
>
> Thanks,
> Greg

From stefan at mdy.univie.ac.at Thu Dec 3 07:58:58 2009
From: stefan at mdy.univie.ac.at (Stefan Boresch)
Date: Thu, 3 Dec 2009 16:58:58 +0100
Subject: [Caos] /dev/rtc* does not always have MAJOR 252
Message-ID: <20091203155858.GG5921@loop.mdy.univie.ac.at>

At least the /etc/init.d/sysinit script from the sysinit package
in caos-testing contains the line

  mknod /dev/rtc c 252 0

to enable the first call to hwclock. This is necessary since in caos'
boot sequence at this point udev is not activated yet ... It just so
happens that apparently the major device number of rtc is not set
in stone ..., on one machine here it's 251. On this machine at this
stage hwclock fails with ERRNO=14 bad address, and the first superblock
has a time in the future and fsck -a fails and yadayadayada ... back
to where I was a few days ago (on a diff. machine though).

What works for me is the following:

  major=`cat /proc/devices | awk "/rtc/ {print \\$1}"`
  echo "mknod /dev/rtc c $major 0"

[adapted from an example in the O'Reilly LinuxDeviceDrivers book].

Best regards,

Stefan

PS: Change of topic: I come mostly from the Debian/Ubuntu world, and
there it's a very strict rule to leave everything in /etc alone.
Thus, apt/aptitude and friends never overwrite (user modified)
configuration files in /etc.
Is there any similar policy in caos or
any policy regarding (configuration) files in /etc? Occasionally, I
see *.rpmnew or *.rpmsave files, but I don't recognize a system. I
definitely have a vanishing /etc/rc.local on a machine which I would
like to keep as it makes nvidia devices on a headless GPU computing
machine :-/

--
Stefan Boresch
Institute for Computational Biological Chemistry
University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria
Phone: -43-1-427752715 Fax: -43-1-427752790

From gmk at infiscale.org Thu Dec 3 08:21:31 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Thu, 3 Dec 2009 08:21:31 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To:
References: <20091202010752.GK31788@infiscale.org> <20091202191906.GM31788@infiscale.org> <20091203051634.GP31788@infiscale.org>
Message-ID: <20091203162131.GR31788@infiscale.org>

Hi!

Ah-ha. That's a good catch! Let me know if syncing the time helps!

Greg

On Thursday, 03 December 2009, at 06:37:57 (-0800),
gary artim wrote:

> Hi Greg --
>
> I did try this a while back, but can redo the test. Out of the office
> till Monday, will get back then. Is there any other service that needs
> to run to permit compute nodes to be auto recognized? I read the slurm
> faq and troubleshooting pages and they did mention that if the nodes'
> time-of-day is inconsistent with the head node, slurm would not
> activate them. thanks much,
>
> -- Gary

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc.
- http://www.infiscale.com

From gmk at infiscale.org Thu Dec 3 08:44:22 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Thu, 3 Dec 2009 08:44:22 -0800
Subject: [Caos] /dev/rtc* does not always have MAJOR 252
In-Reply-To: <20091203155858.GG5921@loop.mdy.univie.ac.at>
References: <20091203155858.GG5921@loop.mdy.univie.ac.at>
Message-ID: <20091203164422.GS31788@infiscale.org>

On Thursday, 03 December 2009, at 16:58:58 (+0100),
Stefan Boresch wrote:

>
> At least the /etc/init.d/sysinit script from the sysinit package
> in caos-testing contains the line
>
>   mknod /dev/rtc c 252 0
>
> to enable the first call to hwclock. This is necessary since in caos'
> boot sequence at this point udev is not activated yet ... It just so
> happens that apparently the major device number of rtc is not set
> in stone ..., on one machine here it's 251. On this machine at this
> stage hwclock fails with ERRNO=14 bad address, and the first superblock
> has a time in the future and fsck -a fails and yadayadayada ... back
> to where I was a few days ago (on a diff. machine though).
>
> What works for me is the following:
>
>   major=`cat /proc/devices | awk "/rtc/ {print \\$1}"`
>   echo "mknod /dev/rtc c $major 0"

Great catch! sysinit-1.2.26-1.caos includes this fix and is available in
nsa-testing.

> [adapted from an example in the O'Reilly LinuxDeviceDrivers book].
>
> Best regards,
>
> Stefan
>
> PS: Change of topic: I come mostly from the Debian/Ubuntu world, and
> there it's a very strict rule to leave everything in /etc alone.
> Thus, apt/aptitude and friends never overwrite (user modified)
> configuration files in /etc. Is there any similar policy in caos or
> any policy regarding (configuration) files in /etc? Occasionally, I
> see *.rpmnew or *.rpmsave files, but I don't recognize a system. I
> definitely have a vanishing /etc/rc.local on a machine which I would
> like to keep as it makes nvidia devices on a headless GPU computing
> machine :-/

This is specified by the SPEC file within the package itself. The list of
files to be included in the final package has a special tag for
configuration files. The sysinit package did not carry the configuration
tag for rc.local, so that was also fixed in the above version.

*.rpmnew occurs when a configuration file has been modified and a package
update is installed that includes a newer config file. The update doesn't
overwrite the existing configuration file; it appends .rpmnew to the new
one so that you can integrate the changes manually if you wish.

*.rpmsave occurs when a configuration file has been modified and then the
package has been uninstalled.

We do not have a global standard for files in /etc because not all files
there should have this feature (rc.sysinit for example).

Greg

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From mej at caoslinux.org Thu Dec 3 16:07:59 2009
From: mej at caoslinux.org (Michael Jennings)
Date: Thu, 3 Dec 2009 16:07:59 -0800
Subject: [Caos] /dev/rtc* does not always have MAJOR 252
In-Reply-To: <20091203164422.GS31788@infiscale.org>
References: <20091203155858.GG5921@loop.mdy.univie.ac.at> <20091203164422.GS31788@infiscale.org>
Message-ID: <20091204000759.GA2224@kainx.org>

On Thursday, 03 December 2009, at 08:44:22 (-0800),
Greg Kurtzer wrote:

> This is specified by the SPEC file within the package itself. When one
> lists out the files which should be included in the end package there
> is a special tag for configuration files. The sysinit package did not
> contain the configuration tag for rc.local so that was also fixed in
> the above version.
>
> *.rpmnew occurs when a configuration file has been modified and
> there is a package update installed (including a newer config
> file).
> It doesn't overwrite the existing configuration file but does
> append .rpmnew to the new one so the user can manually integrate if
> you wish.

I should also point out that this is only true for config files marked
as "noreplace." Here is a good table showing the effects of
%config(noreplace), %config, and the default disposition:

http://www-uxsup.csx.cam.ac.uk/~jw35/docs/rpm_config.html

It also shows how .rpmnew/.rpmsave files are created during upgrades
(which are, technically, installs followed by erasures).

HTH,
Michael

--
Michael Jennings (a.k.a. KainX) http://www.kainx.org/
Linux Server/Cluster Admin, LBL.gov Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
"Unix is like a Vorlon: It is incredibly powerful, gives terse,
cryptic answers, and has a lot of things going on in the background."
-- Jeff Dubrule

From stefan at mdy.univie.ac.at Thu Dec 3 23:42:26 2009
From: stefan at mdy.univie.ac.at (Stefan Boresch)
Date: Fri, 4 Dec 2009 08:42:26 +0100
Subject: [Caos] /dev/rtc* does not always have MAJOR 252
In-Reply-To: <20091203164422.GS31788@infiscale.org>
References: <20091203155858.GG5921@loop.mdy.univie.ac.at> <20091203164422.GS31788@infiscale.org>
Message-ID: <20091204074226.GL5921@loop.mdy.univie.ac.at>

On Thu, Dec 03, 2009 at 08:44:22AM -0800, Greg Kurtzer wrote:
> On Thursday, 03 December 2009, at 16:58:58 (+0100),
> Stefan Boresch wrote:
> >   major=`cat /proc/devices | awk "/rtc/ {print \\$1}"`
> >   echo "mknod /dev/rtc c $major 0"
>
> Great catch!

Thanks! That's what I like about Caos -- it gives even people with
intermediate skills like me a chance to fix things like that --- I have
completely given up trying, e.g., to understand Ubuntu's boot sequence;
the last time I tried I was dizzy after ten minutes ..

And thanks for the hints about how rpms handle files in /etc. There is
something to be said for the Debian way (strict rules), but I also do
agree that there are instances where leftover config files can be a
major pain ... (been there, done that ...)

Best regards,

Stefan

--
Stefan Boresch
Institute for Computational Biological Chemistry
University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria
Phone: -43-1-427752715 Fax: -43-1-427752790

From gmk at infiscale.org Fri Dec 4 08:37:12 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Fri, 4 Dec 2009 08:37:12 -0800
Subject: [Caos] /dev/rtc* does not always have MAJOR 252
In-Reply-To: <20091204074226.GL5921@loop.mdy.univie.ac.at>
References: <20091203155858.GG5921@loop.mdy.univie.ac.at> <20091203164422.GS31788@infiscale.org> <20091204074226.GL5921@loop.mdy.univie.ac.at>
Message-ID: <20091204163712.GU31788@infiscale.org>

On Friday, 04 December 2009, at 08:42:26 (+0100),
Stefan Boresch wrote:

> On Thu, Dec 03, 2009 at 08:44:22AM -0800, Greg Kurtzer wrote:
> > On Thursday, 03 December 2009, at 16:58:58 (+0100),
> > Stefan Boresch wrote:
> > >   major=`cat /proc/devices | awk "/rtc/ {print \\$1}"`
> > >   echo "mknod /dev/rtc c $major 0"
> >
> > Great catch!
>
> Thanks! That's what I like about Caos -- it gives even people with intermediate
> skills like me a chance to fix things like that --- I have completely given
> up trying, e.g., to understand Ubuntu's boot sequence, the last time I
> tried I was dizzy after ten minutes ..

Speaking of complicated boot sequences, I developed a parallel boot sysinit
solution that in my opinion is way better than what we have now. It is a
much simpler and more elegant solution but it needs more attention.

I would be interested to know if there are testers and/or hackers that want
to play with it and help get it ready for prime time.

Please let me know!

Greg

> And thanks for the hints about how rpms handle files in /etc.
> There is something to be said for the Debian way (strict rules), but I also do
> agree that there are instances where left over config files can be a
> major pain ... (been there, done that ...)

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From cdmaestas at gmail.com Fri Dec 4 09:36:59 2009
From: cdmaestas at gmail.com (Christopher Maestas)
Date: Fri, 4 Dec 2009 10:36:59 -0700
Subject: [Caos] /dev/rtc* does not always have MAJOR 252
In-Reply-To: <20091204163712.GU31788@infiscale.org>
References: <20091203155858.GG5921@loop.mdy.univie.ac.at> <20091203164422.GS31788@infiscale.org> <20091204074226.GL5921@loop.mdy.univie.ac.at> <20091204163712.GU31788@infiscale.org>
Message-ID:

Yes I am interested!

On Fri, Dec 4, 2009 at 9:37 AM, Greg Kurtzer wrote:
> Speaking of complicated boot sequences, I developed a parallel boot sysinit
> solution that in my opinion is way better than what we have now. It is a
> much simpler and more elegant solution but it needs more attention.
>
> I would be interested to know if there are testers and/or hackers that want
> to play with it and help get it ready for prime time.
>
> Please let me know!
>
> Greg

From slaton at berkeley.edu Sat Dec 5 18:15:03 2009
From: slaton at berkeley.edu (Slaton Lipscomb)
Date: Sat, 5 Dec 2009 18:15:03 -0800
Subject: [Caos] caos2 to caos-nsa upgrade, various questions
In-Reply-To: <89cd2f950912051813g728f0897qf7187585bb9ba862@mail.gmail.com>
References: <89cd2f950911301913w286f1d04jf2c53c07122c8b83@mail.gmail.com> <20091201062539.GI31788@infiscale.org> <89cd2f950912051813g728f0897qf7187585bb9ba862@mail.gmail.com>
Message-ID: <89cd2f950912051815m16664169qf82bd1c9c0d66318@mail.gmail.com>

Greg, thanks!

Here are a few more issues and observations I've run across after
playing with caos-nsa this week, plus a question about VNFS capsules
at the bottom.


(1) Motherboard has two eth ports (eth0 & eth1). During install I also
had a dual Intel pro/1000 card installed (eth2 & eth3).

This card was subsequently pulled and a single-port pro/1000 card
installed. However, upon reboot eth2 & eth3 are not relinquished and
this new card has been assigned eth4.

e1000 seems to have it right during boot:

  e1000 0000:03:04.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
  e1000: 0000:03:04.0: e1000_probe: (PCI:33MHz:32-bit) 00:1b:21:16:75:61
  e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
  e1000 0000:02:03.0: PCI INT A -> GSI 27 (level, low) -> IRQ 27
  e1000: 0000:02:03.0: e1000_probe: (PCI:66MHz:32-bit) 00:d0:68:06:b0:8c
  e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
  e1000 0000:02:04.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
  e1000: 0000:02:04.0: e1000_probe: (PCI:66MHz:32-bit) 00:d0:68:06:b0:8d
  e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection

but then udev does some weird stuff:

  udev: renamed network interface eth0 to eth4
  udev: renamed network interface eth2_rename to eth1
  udev: renamed network interface eth1_rename to eth0

I assume this is happening because of some rules in
/etc/udev/rules.d/70-persistent-net.rules that were created during my
install.

My question is, is there a way to regenerate this file based on the
currently installed hardware, or is the proper approach here to edit
it by hand? It does seem like this is something that should be handled
automatically.


(2) 'bind' mounts (mount --bind, such as used for nfs4 exports) placed
in /etc/fstab don't seem to automatically mount upon boot. User has to
issue a "mount -a" after boot for these to be mounted. Example:

  /dev/sda1     /sw            ext3     defaults     0 0
  /sw           /exports/sw    none     bind         0 0

Upon boot, only /sw is mounted. But after "mount -a", /exports/sw is
also mounted.


(3) Something is not being unmounted cleanly on shutdown, but I can't
figure out what it is because the status messages scroll by so
quickly. Likewise I get this on every boot:

  EXT3-fs: INFO: recovery required on readonly filesystem.
  EXT3-fs: write access will be enabled during recovery.
  kjournald starting.  Commit interval 5 seconds
  EXT3-fs: recovery complete.

It flies by in a hurry. Any suggestions on how to track this one down?


(4) This kernel is quite insistent on loading ieee1394 & ohci1394 (on
a machine with no firewire hardware), which leads to ohci1394 probe
errors in my logs throughout the day. So I added these to
/etc/modprobe.d/blacklist, which had the desired effect.


One more thing - are there any alternative VNFS capsules available for
caos-nsa, something with X11 and some basic user apps required on an
interactive node? I only see a basic node VNFS here:

  http://mirror.caoslinux.org/Caos-NSA-1.0/vnfs/x86_64/

"Roll your own" is an acceptable answer but if there's already
something like this out there it would save a bit of time.

thanks
slaton

On Mon, Nov 30, 2009 at 10:25 PM, Greg Kurtzer wrote:
>
> On Monday, 30 November 2009, at 19:13:05 (-0800),
> Slaton Lipscomb wrote:
>
> > Hi guys,
>
> Hiya Slaton!
>
> > So I'm finally upgrading our cluster from caos2 & warewulf 2.6, to
> > caos-nsa / perceus. Many thanks for all the hard work you guys have
> > put into this.
>
> :)
>
> > I've spent the last couple of days getting up to speed on the docs and
> > listserv, and doing the initial master node setup. In the process I've
> > racked up a few questions.
> >
> > (1) Is it possible to specify manual partitioning with the new live
> > media installer, or must I roll my own installation media with a
> > custom kickstart file?
>
> You can boot into the LiveMedia installer, get a command prompt.
> From there you can either copy from the existing kickstart templates and
> modify to suit your needs, or create a new one. Once you look at the syntax
> you will see how to create additional partitions or modify the existing
> config.
>
> > (2) I came across some nearly year old listserv messages suggesting an
> > autofs package would be coming soon, but I don't see this in testing.
> > Is this still planned? We make heavy use of autofs here.
>
> Planned?... sure. ;)
>
> > (3) The pdsh packages appears to be build without the genders module.
> > Can it be used with the "-a" flag and a simple machine file, and if so
> > where is pdsh looking for this file? I tried populating
> > /etc/hosts.pdsh which worked with caos2, but this was unsuccessful.
>
> Newer pdsh versions don't work the same way with the -a option AFAIK. I just
> wrap an alias which included "-w n[0000-xxxx]" options.
>
> > (4) I daily jump back and forth between Debian, RHEL and caos2, and
> > frankly it's hard to keep track of the exact rpc service names each
> > uses for tcp wrappers (hosts.allow/deny). Does caos-nsa want to see
> > "rpc-" prepended here, i.e. rpc.mountd, rpc.nfsd, rpc.lockd,
> > rpc.statd, rpc.idmapd, rpc.rquotad?
>
> I believe so, but I don't use tcp_wrappers often enough to remember.
>
> Thanks,
> Greg

From gmk at infiscale.org Sun Dec 6 09:25:30 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Sun, 6 Dec 2009 09:25:30 -0800
Subject: [Caos] caos2 to caos-nsa upgrade, various questions
In-Reply-To: <89cd2f950912051815m16664169qf82bd1c9c0d66318@mail.gmail.com>
References: <89cd2f950911301913w286f1d04jf2c53c07122c8b83@mail.gmail.com>
 <20091201062539.GI31788@infiscale.org>
 <89cd2f950912051813g728f0897qf7187585bb9ba862@mail.gmail.com>
 <89cd2f950912051815m16664169qf82bd1c9c0d66318@mail.gmail.com>
Message-ID: <20091206172530.GY31788@infiscale.org>

Hiya Slaton,

On Saturday, 05 December 2009, at 18:15:03 (-0800),
Slaton Lipscomb wrote:

> Greg, thanks!
>
> Here are a few more issues and observations I've run across after
> playing with caos-nsa this week, plus a question about VNFS capsules
> at the bottom.
>
>
> (1) Motherboard has two eth ports (eth0 & eth1). During install I also
> had a dual Intel pro/1000 card installed (eth2 & eth3).
>
> This card was subsequently pulled and a single-port pro/1000 card
> installed. However, upon reboot eth2 & eth3 are not relinquished and
> this new card has been assigned eth4.
>
> e1000 seems to have it right during boot:
>
>   e1000 0000:03:04.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
>   e1000: 0000:03:04.0: e1000_probe: (PCI:33MHz:32-bit) 00:1b:21:16:75:61
>   e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
>   e1000 0000:02:03.0: PCI INT A -> GSI 27 (level, low) -> IRQ 27
>   e1000: 0000:02:03.0: e1000_probe: (PCI:66MHz:32-bit) 00:d0:68:06:b0:8c
>   e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
>   e1000 0000:02:04.0: PCI INT A -> GSI 24 (level, low) -> IRQ 24
>   e1000: 0000:02:04.0: e1000_probe: (PCI:66MHz:32-bit) 00:d0:68:06:b0:8d
>   e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
>
> but then udev does some weird stuff:
>
>   udev: renamed network interface eth0 to eth4
>   udev: renamed network interface eth2_rename to eth1
>   udev: renamed network interface eth1_rename to eth0

Hehe, I see you have found one of the new'ish UDEV features. ;)

> I assume this is happening because of some rules in
> /etc/udev/rules.d/70-persistent-net.rules that were created during my
> install.

Udev creates this automatically, not Caos.

> My question is, is there a way to regenerate this file based on the
> currently installed hardware, or is the proper approach here to edit
> it by hand? It does seem like this is something that should be handled
> automatically.

Yes, it can be regenerated but that isn't the design goal of the persistent
rules. Perhaps there can be a sidekick module to remove these which will
force a reprobe...?
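
A minimal sketch of the "remove and let udev reprobe" route mentioned
above, assuming the udev build on the machine ships the usual
persistent-net generator rule (which "udev creates this automatically"
suggests). The function name and the .bak suffix are illustrative
choices, not anything shipped by Caos or udev. The function only moves
the rules file aside; a fresh one would then be written for the
hardware found on the next boot.

```shell
# Sketch only: move the persistent-net rules aside so they can be regenerated.
# reset_persistent_net_rules is a hypothetical helper, not a Caos or udev tool.
reset_persistent_net_rules() {
    # Default to the path named in this thread; a different path can be passed in.
    rules="${1:-/etc/udev/rules.d/70-persistent-net.rules}"
    if [ ! -f "$rules" ]; then
        echo "no rules file at $rules" >&2
        return 1
    fi
    # Keep a copy so hand-assigned interface names can be restored later.
    mv "$rules" "$rules.bak"
    echo "moved $rules to $rules.bak; reboot to let udev regenerate it"
}

# Example (as root, then reboot):
#   reset_persistent_net_rules
```

udev only reads files whose names end in .rules, so the .bak copy is
ignored at boot but stays available if the old names are needed again.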
> (2) 'bind' mounts (mount --bind, such as used for nfs4 exports) placed
> in /etc/fstab don't seem to automatically mount upon boot. User has to
> issue a "mount -a" after boot for these to be mounted. Example:
>
>   /dev/sda1   /sw           ext3   defaults   0 0
>   /sw         /exports/sw   none   bind       0 0
>
> Upon boot, only /sw is mounted. But after "mount -a", /exports/sw is
> also mounted.

Gotcha, this can be fixed in rc.sysinit:

-mount -a -t tmpfs,ext2,ext3,ext4,reiserfs,xfs,jfs,btrfs,iso9660 -O no_netdev
+mount -a -t tmpfs,ext2,ext3,ext4,reiserfs,xfs,jfs,btrfs,iso9660,bind -O no_netdev

> (3) Something is not being unmounted cleanly on shutdown, but I can't
> figure out what it is because the status messages scroll by so
> quickly. Likewise I get this on every boot:
>
>   EXT3-fs: INFO: recovery required on readonly filesystem.
>   EXT3-fs: write access will be enabled during recovery.
>   kjournald starting.  Commit interval 5 seconds
>   EXT3-fs: recovery complete.
>
> It flies by in a hurry. Any suggestions on how to track this one down?

In /etc/init.d/sysdown you will find a umount command that is redirecting
its stdout/stderr to /dev/null. Try letting it print what it is failing
on and add a sleep statement.

> (4) This kernel is quite insistent on loading ieee1394 & ohci1394 (on
> a machine with no firewire hardware), which leads to ohci1394 probe
> errors in my logs throughout the day. So I added these to
> /etc/modprobe.d/blacklist, which had the desired effect.

My guess is that this is also a UDEV feature. Can you run:

   # /sbin/detect

to see if Caos probes a firewire device on the PCI?

What does the entry in your blacklist file look like?

> One more thing - are there any alternative VNFS capsules available for
> caos-nsa, something with X11 and some basic user apps required on an
> interactive node?
> I only see a basic node VNFS here:
>
>   http://mirror.caoslinux.org/Caos-NSA-1.0/vnfs/x86_64/
>
> "Roll your own" is an acceptable answer but if there's already
> something like this out there it would save a bit of time.

Once you get a base VNFS imported, I would recommend to use smart to install
additional packages. You can also use metapkg's to get groups of packages:

# smart -o rpm-root=/mnt/vnfs install metapkg-xorg

> thanks
> slaton

Anytime!

Greg

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From slaton at berkeley.edu Sun Dec 6 10:57:51 2009
From: slaton at berkeley.edu (Slaton Lipscomb)
Date: Sun, 6 Dec 2009 10:57:51 -0800
Subject: [Caos] caos2 to caos-nsa upgrade, various questions
In-Reply-To: <89cd2f950912061056x7f2c07a0h73f9c92cd90a4ba@mail.gmail.com>
References: <89cd2f950911301913w286f1d04jf2c53c07122c8b83@mail.gmail.com>
 <20091201062539.GI31788@infiscale.org>
 <89cd2f950912051813g728f0897qf7187585bb9ba862@mail.gmail.com>
 <89cd2f950912051815m16664169qf82bd1c9c0d66318@mail.gmail.com>
 <20091206172530.GY31788@infiscale.org>
 <89cd2f950912061056x7f2c07a0h73f9c92cd90a4ba@mail.gmail.com>
Message-ID: <89cd2f950912061057n352ac4d8ta58e551dee8736b6@mail.gmail.com>

> Yes, it can be regenerated but that isn't the design goal of the persistent
> rules. Perhaps there can be a sidekick module to remove these which will
> force a reprobe...?

Easy enough to fix by hand, but yes some kind of reprobe...

>> Upon boot, only /sw is mounted. But after "mount -a", /exports/sw is
>> also mounted.
>
> Gotcha, this can be fixed in rc.sysinit:
>
> -mount -a -t tmpfs,ext2,ext3,ext4,reiserfs,xfs,jfs,btrfs,iso9660 -O no_netdev
> +mount -a -t tmpfs,ext2,ext3,ext4,reiserfs,xfs,jfs,btrfs,iso9660,bind -O no_netdev

Should work, but for some reason it does not. I get the same behavior
after making this change.

>> It flies by in a hurry. Any suggestions on how to track this one down?
>
> In /etc/init.d/sysdown you will find a umount command that is redirecting
> its stdout/stderr to /dev/null. Try letting it print what it is failing
> on and add a sleep statement.

Thanks. It appeared /tmp (which I have on a separate filesystem) was
the culprit:

  umount: /tmp: device is busy.

lsof and fuser both give no output here, though. I thought it likely
perceus still has a lock on /tmp because there is a /tmp/.perceusd
file whose modification date is in the last couple of minutes.

Giving /tmp back to the root filesystem was not an effective
work-around, though. The stderr output (device busy) from the umount
command in sysdown goes away, but when the system comes back up there
is still output from an ext3 recovery in the kernel msg buffer.
Puzzled.

> Can you run:
>
>    # /sbin/detect
>
> to see if Caos probes a firewire device on the PCI?

nope:

  BRIDGE amd76xrom
  SCSI pata_amd
  SCSI amd74xx
  BRIDGE i2c-amd756
  BRIDGE amd-rng
  BRIDGE k8temp
  BRIDGE k8temp
  SCSI 3w-xxxx
  NETWORK e1000
  NETWORK e1000
  NETWORK e1000

> What does the entry in your blacklist file look like?

  blacklist ieee1394
  blacklist ohci1394

>> "Roll your own" is an acceptable answer but if there's already
>> something like this out there it would save a bit of time.
>
> Once you get a base VNFS imported, I would recommend to use smart to install
> additional packages. You can also use metapkg's to get groups of packages:
>
> # smart -o rpm-root=/mnt/vnfs install metapkg-xorg

This is the approach I'm taking -- err, trying to take ;) (see my
email to the perceus list for details)

thanks
slaton

From gartim at gmail.com Mon Dec 7 13:19:16 2009
From: gartim at gmail.com (gary artim)
Date: Mon, 7 Dec 2009 13:19:16 -0800
Subject: [Caos] second nic for nfs mount on compute nodes.
Message-ID:

Hi --

is the best way to mount a nfs drive on an alternate nic/server (ie
not eth0/not head-node) to run a dhcp server on whoever is hosting the
nfs drive. Or is there a better way?
My nfs drive is on a separate server, not the headnode and would like
data access through an alternate nic. thanks for any advice...


-- Gary

From gmk at infiscale.org Mon Dec 7 13:37:57 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Mon, 7 Dec 2009 13:37:57 -0800
Subject: [Caos] second nic for nfs mount on compute nodes.
In-Reply-To:
References:
Message-ID: <20091207213757.GB31788@infiscale.org>

Hi Gary,

This is really up to you. It sounds like this is a cluster, if running
Perceus I would recommend to use the "ipaddr" Perceus module.

Thanks,
Greg

On Monday, 07 December 2009, at 13:19:16 (-0800),
gary artim wrote:

> Hi --
>
> is the best way to mount a nfs drive on an alternate nic/server (ie
> not eth0/not head-node) to run a dhcp server on whoever is hosting the
> nfs drive. Or is there a better way?
> My nfs drive is on a separate server, not the headnode and would like
> data access through an alternate nic. thanks for any advice...
>
>
> -- Gary
> _______________________________________________
> Caos mailing list
> Caos at lists.infiscale.org
> http://lists.infiscale.org/mailman/listinfo/caos

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From gartim at gmail.com Mon Dec 7 16:46:39 2009
From: gartim at gmail.com (gary artim)
Date: Mon, 7 Dec 2009 16:46:39 -0800
Subject: [Caos] second nic for nfs mount on compute nodes.
In-Reply-To: <20091207213757.GB31788@infiscale.org>
References: <20091207213757.GB31788@infiscale.org>
Message-ID:

Hi Greg --

I changed the ipaddr control file like so:
* eth0:[default]/[default] eth1:dhcp

then:

perceus module deactivate ipaddr
perceus module activate ipaddr


this pretty much disabled both interfaces, couldn't get to compute
nodes even though they would boot fine.

I think what it did was look for dhcp address on the head node's eth1
and this may be why.

What I want is to be able to mount from a non-headnode nfs server.
I think messing with ipaddr fouls up my connections to the compute
nodes. Maybe I should qualify and only say the compute nodes' eth1
should use dhcp. Something like

* eth0:[default]/[default]
hn0000 eth1:[default]/[default]
n0000-n0001 eth1:dhcp

Does this sound right?...thanks much!

-- Gary



On Mon, Dec 7, 2009 at 1:37 PM, Greg Kurtzer wrote:
> Hi Gary,
>
> This is really up to you. It sounds like this is a cluster, if running
> Perceus I would recommend to use the "ipaddr" Perceus module.
>
> Thanks,
> Greg
>
> On Monday, 07 December 2009, at 13:19:16 (-0800),
> gary artim wrote:
>
>> Hi --
>>
>> is the best way to mount a nfs drive on an alternate nic/server (ie
>> not eth0/not head-node) to run a dhcp server on whoever is hosting the
>> nfs drive. Or is there a better way?
>> My nfs drive is on a separate server, not the headnode and would like
>> data access through an alternate nic. thanks for any advice...
>>
>>
>> -- Gary
>> _______________________________________________
>> Caos mailing list
>> Caos at lists.infiscale.org
>> http://lists.infiscale.org/mailman/listinfo/caos
>
> --
> Greg M. Kurtzer
> Chief Technology Officer
> HPC Systems Architect
> Infiscale, Inc. - http://www.infiscale.com
> _______________________________________________
> Caos mailing list
> Caos at lists.infiscale.org
> http://lists.infiscale.org/mailman/listinfo/caos
>

From glen at callident.com Mon Dec 7 21:42:11 2009
From: glen at callident.com (Glen Otero)
Date: Mon, 7 Dec 2009 21:42:11 -0800
Subject: [Caos] nvidia driver issues
Message-ID: <89886A3B-FF30-43D6-AE29-7683DD825805@callident.com>

Hi-

Just installed Caos-NSA-1.0 and my cuda capable Nvidia card works fine, X starts fine, and XFCE is launched no problem. I do the package upgrade via Sidekick and install nvidia-cuda and nvidia-devel, and my display no longer shows anything after the boot messages. The X server is started, but no gui or command line results.
I tried manually installing the 190.18 nvidia driver with the same results. Installing the older 185 driver didn't work either. Tinkered with nvidia-xconfig and nvidia-settings with no improvement. Suggestions?

Thanks!

Glen

From gmk at infiscale.org Mon Dec 7 22:14:49 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Mon, 7 Dec 2009 22:14:49 -0800
Subject: [Caos] nvidia driver issues
In-Reply-To: <89886A3B-FF30-43D6-AE29-7683DD825805@callident.com>
References: <89886A3B-FF30-43D6-AE29-7683DD825805@callident.com>
Message-ID: <20091208061448.GE31788@infiscale.org>

Hi Glen,

I usually debug X issues by logging into the text console and running X
by hand and capturing the output messages or checking /var/log/Xorg.0.log.

Are there any errors present that seem like they may be causing issues?

Also, please feel free to give our latest installers under testing a shot:

http://altruistic.infiscale.org/~gmk/isos

Good luck!

Greg

On Monday, 07 December 2009, at 21:42:11 (-0800),
Glen Otero wrote:

> Hi-
>
> Just installed Caos-NSA-1.0 and my cuda capable Nvidia card works fine, X starts fine, and XFCE is launched no problem. I do the package upgrade via Sidekick and install nvidia-cuda and nvidia-devel, and my display no longer shows anything after the boot messages. The X server is started, but no gui or command line results. I tried manually installing the 190.18 nvidia driver with the same results. Installing the older 185 driver didn't work either. Tinkered with nvidia-xconfig and nvidia-settings with no improvement. Suggestions?
>
> Thanks!
>
> Glen
> _______________________________________________
> Caos mailing list
> Caos at lists.infiscale.org
> http://lists.infiscale.org/mailman/listinfo/caos

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc.
- http://www.infiscale.com

From glen at callident.com Mon Dec 7 22:34:49 2009
From: glen at callident.com (Glen Otero)
Date: Mon, 7 Dec 2009 22:34:49 -0800
Subject: [Caos] nvidia driver issues
In-Reply-To: <20091208061448.GE31788@infiscale.org>
References: <89886A3B-FF30-43D6-AE29-7683DD825805@callident.com>
 <20091208061448.GE31788@infiscale.org>
Message-ID: <9BBB4847-7165-4F6D-9508-78C7EF0D6F48@callident.com>

On Dec 7, 2009, at 10:14 PM, Greg Kurtzer wrote:

> Hi Glen,
>
> I usually debug X issues by logging into the text console and running X
> by hand and capturing the output messages or checking /var/log/Xorg.0.log.

There's no text console. But when I ssh into the box, I get kicked off in about 10 minutes and can't reconnect. Running X by hand showed that there was an X server already running. I'd kill that, but restarting doesn't help.

> Are there any errors present that seem like they may be causing issues?

I did notice that $DISPLAY isn't defined for any users, but xorg.conf seemed normal.

Only things I see in the Xorg.0.log are:

(WW) Dec 07 21:29:38 NVIDIA(0): Failed to allocate GLX video capture device array.
...
(WW) Couldn't load XKB keymap, falling back to pre-XKB keymap
...
[config/hal] couldn't initialise context: (null) ((null))

Also:

$ sudo dmesg | grep nvidia
nvidia: module license 'NVIDIA' taints kernel.
nvidia 0000:05:00.0: PCI INT A -> Link[LK2E] -> GSI 22 (level, high) -> IRQ 22
nvidia 0000:05:00.0: setting latency timer to 64
NVRM: This can occur when a driver such as rivafb, nvidiafb or
NVRM: Try unloading the rivafb, nvidiafb or rivatv kernel module
NVRM: (and/or reconfigure your kernel without rivafb/nvidiafb

Maybe should revert to older kernel (didn't see that in GRUB) or try to unload rivafb and rivatv?

Thanks!

Glen

>
> Also, please feel free to give our latest installers under testing a shot:
>
> http://altruistic.infiscale.org/~gmk/isos
>
> Good luck!
>
> Greg
>
> On Monday, 07 December 2009, at 21:42:11 (-0800),
> Glen Otero wrote:
>
>> Hi-
>>
>> Just installed Caos-NSA-1.0 and my cuda capable Nvidia card works fine, X starts fine, and XFCE is launched no problem. I do the package upgrade via Sideckick and install nvidia-cuda and nvidia-devel, and my display no longer shows anything after the boot messages. The X server is started, but no gui or command line results. I tried manually installing the 190.18 nvidia driver with the same results. Installing the older 185 driver didn't work either. Tinkered with nvidia-xconfig and nvidia-settings with no improvement. Suggestions?
>>
>> Thanks!
>>
>> Glen
>> _______________________________________________
>> Caos mailing list
>> Caos at lists.infiscale.org
>> http://lists.infiscale.org/mailman/listinfo/caos
>
> --
> Greg M. Kurtzer
> Chief Technology Officer
> HPC Systems Architect
> Infiscale, Inc. - http://www.infiscale.com
> _______________________________________________
> Caos mailing list
> Caos at lists.infiscale.org
> http://lists.infiscale.org/mailman/listinfo/caos
>

From gmk at infiscale.org Mon Dec 7 23:52:58 2009
From: gmk at infiscale.org (Greg Kurtzer)
Date: Mon, 7 Dec 2009 23:52:58 -0800
Subject: [Caos] nvidia driver issues
In-Reply-To: <9BBB4847-7165-4F6D-9508-78C7EF0D6F48@callident.com>
References: <89886A3B-FF30-43D6-AE29-7683DD825805@callident.com>
 <20091208061448.GE31788@infiscale.org>
 <9BBB4847-7165-4F6D-9508-78C7EF0D6F48@callident.com>
Message-ID: <20091208075258.GF31788@infiscale.org>

On Monday, 07 December 2009, at 22:34:49 (-0800),
Glen Otero wrote:

>
> On Dec 7, 2009, at 10:14 PM, Greg Kurtzer wrote:
>
> > Hi Glen,
> >
> > I usually debug X issues by logging into the text console and running X
> > by hand and capturing the output messages or checking /var/log/Xorg.0.log.
>
> There's no text console. But when I ssh into the box, I get kicked off in about 10 minutes and can't reconnect.
> Running X by hand showed that there was an X server already running. I'd kill that, but restarting doesn't help.
>
> > Are there any errors present that seem like they maybe causing issues?
>
> I did notice that $DISPLAY isn't defined for any users, but xorg.conf seemed normal.
>
> Only things I see in the Xorg.0.log are:
>
> (WW) Dec 07 21:29:38 NVIDIA(0): Failed to allocate GLX video capture device array.
> ...
> (WW) Couldn't load XKB keymap, falling back to pre-XKB keymap
> ...
> [config/hal] couldn't initialise context: (null) ((null))
>
> Also:
>
> $ sudo dmesg | grep nvidia
> nvidia: module license 'NVIDIA' taints kernel.
> nvidia 0000:05:00.0: PCI INT A -> Link[LK2E] -> GSI 22 (level, high) -> IRQ 22
> nvidia 0000:05:00.0: setting latency timer to 64
> NVRM: This can occur when a driver such as rivafb, nvidiafb or
> NVRM: Try unloading the rivafb, nvidiafb or rivatv kernel module
> NVRM: (and/or reconfigure your kernel without rivafb/nvidiafb

Maybe try to add those modules to a blacklist entry in
/etc/modprobe.conf. e.g.:

blacklist rivafb
blacklist nvidiafb
blacklist rivatv

> Maybe should revert to older kernel (didn't see that in GRUB) or try to unload rivafb and rivatv?

What output do you get from "/sbin/detect"?

Good luck!
Greg

--
Greg M. Kurtzer
Chief Technology Officer
HPC Systems Architect
Infiscale, Inc. - http://www.infiscale.com

From stefan at mdy.univie.ac.at Tue Dec 8 08:08:04 2009
From: stefan at mdy.univie.ac.at (Stefan Boresch)
Date: Tue, 8 Dec 2009 17:08:04 +0100
Subject: [Caos] LANG="en_US.UTF-8" causing pain ...
Message-ID: <20091208160804.GA6128@loop.mdy.univie.ac.at>

Given that I occasionally need "Umlaute" (a,o,u with dots above them
;-), I thought / hoped changing LANG=C to LANG="en_US.UTF-8" in
/etc/sysconfig/i18n would do the trick. Unfortunately, no.

1. Once this is set, /etc/profile.d/lang.sh wants to call
/sbin/consoletype which is not present (one to two errors in each new
shell).
Installing that from a manual compile of the centos initscripts
removes this error.

2. However, e.g., in mutt any non-ascii characters are completely
garbled. I start to suspect, however, that this is not a mutt problem
but something deeper (mutt uses utf-8 as charset!). Locales look OK,
i.e.,

[stefan at loop ~]$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

and perl -e "" works (as opposed to env LC_ALL=nocharset perl -e "").

However, both xfce and Eterm seem incapable of showing non-ascii
characters. Following a test from
http://wiki.mutt.org/?MuttFaq/Charset

[stefan at loop ~]$ touch ???

(in case this gets out garbled the "filename" consists of three
"Umlauts") leads to a file that can be displayed neither in Eterm
nor in the xfce terminal. Interestingly, a plain xterm works!!
Similarly, only an xterm can display the arrows used by mutt in thread
mode. It almost looks like Eterm and the xfce terminal don't
understand a utf-8 charset ...

However, even when running mutt in an xterm I have no "Umlauts" ...

Does anyone have any insights (this really is a showstopper for me!)
-- thanks in advance!

Stefan

--
Stefan Boresch
Institute for Computational Biological Chemistry
University of Vienna, Waehringerstr. 17 A-1090 Vienna, Austria
Phone: -43-1-427752715 Fax: -43-1-427752790

From gartim at gmail.com Tue Dec 8 09:52:20 2009
From: gartim at gmail.com (gary artim)
Date: Tue, 8 Dec 2009 09:52:20 -0800
Subject: [Caos] second nic for nfs mount on compute nodes.
In-Reply-To:
References: <20091207213757.GB31788@infiscale.org>
Message-ID:

yes, this worked. just an fyi...
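
For reference, the per-node form that worked can be sketched as below.
The scratch file is a stand-in: the real location of the ipaddr
control file is not given in this thread, so the snippet only writes
and prints the three lines, and the perceus commands from the earlier
message are left as comments.

```shell
# Scratch copy only; the real ipaddr control file path is not named in the thread.
conf=$(mktemp)
cat > "$conf" <<'EOF'
* eth0:[default]/[default]
hn0000 eth1:[default]/[default]
n0000-n0001 eth1:dhcp
EOF
cat "$conf"

# After editing the real control file, re-read it as in the earlier message:
#   perceus module deactivate ipaddr
#   perceus module activate ipaddr
```

As described earlier in the thread, the intent is that eth0 keeps its
defaults everywhere, the head node's eth1 is left at its defaults, and
only the compute nodes n0000-n0001 bring up eth1 via dhcp.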
-- Gary

On Mon, Dec 7, 2009 at 4:46 PM, gary artim wrote:
> Hi Greg --
>
> I changed the ipaddr control file like so:
> * eth0:[default]/[default] eth1:dhcp
>
> then:
>
> perceus module deactivate ipaddr
> perceus module activate ipaddr
>
>
> this pretty much disabled both interfaces, couldn't get to compute
> nodes even though they
> would boot fine.
>
> I think what it did was look for dhcp address on the head node's eth1
> and this may be why.
>
> What I want is to be able to mount from a non-headnode nfs server. I think
> messing with ipaddr
> fouls up my connections to the compute nodes. Maybe I should qualify
> and only say the compute
> nodes' eth1 should use dhcp. Something like
>
> * eth0:[default]/[default]
> hn0000 eth1:[default]/[default]
> n0000-n0001 eth1:dhcp
>
> Does this sound right?...thanks much!
>
> -- Gary
>
>
>
> On Mon, Dec 7, 2009 at 1:37 PM, Greg Kurtzer wrote:
>> Hi Gary,
>>
>> This is really up to you. It sounds like this is a cluster, if running
>> Perceus I would recommend to use the "ipaddr" Perceus module.
>>
>> Thanks,
>> Greg
>>
>> On Monday, 07 December 2009, at 13:19:16 (-0800),
>> gary artim wrote:
>>
>>> Hi --
>>>
>>> is the best way to mount a nfs drive on an alternate nic/server (ie
>>> not eth0/not head-node) to run a dhcp server on whoever is hosting the
>>> nfs drive. Or is there a better way?
>>> My nfs drive is on a separate server, not the headnode and would like
>>> data access through an alternate nic. thanks for any advice...
>>>
>>>
>>> -- Gary
>>> _______________________________________________
>>> Caos mailing list
>>> Caos at lists.infiscale.org
>>> http://lists.infiscale.org/mailman/listinfo/caos
>>
>> --
>> Greg M. Kurtzer
>> Chief Technology Officer
>> HPC Systems Architect
>> Infiscale, Inc.
>> - http://www.infiscale.com
>> _______________________________________________
>> Caos mailing list
>> Caos at lists.infiscale.org
>> http://lists.infiscale.org/mailman/listinfo/caos
>>
>

From mej at caoslinux.org Tue Dec 8 22:06:03 2009
From: mej at caoslinux.org (Michael Jennings)
Date: Tue, 8 Dec 2009 22:06:03 -0800
Subject: [Caos] LANG="en_US.UTF-8" causing pain ...
In-Reply-To: <20091208160804.GA6128@loop.mdy.univie.ac.at>
References: <20091208160804.GA6128@loop.mdy.univie.ac.at>
Message-ID: <20091209060603.GC16792@kainx.org>

On Tuesday, 08 December 2009, at 17:08:04 (+0100),
Stefan Boresch wrote:

> Given that I occasionally need "Umlaute" (a,o,u with dots above them
> ;-), I thought / hoped changing LANG=C to LANG="en_US.UTF-8" in
> /etc/sysconfig/i18n would do the trick. Unfortunately, no.

If that's all you need, why not use ISO-8859-1?

> However, both xfce and Eterm seem incapable of showing non-ascii
> characters. Following a test from
> http://wiki.mutt.org/?MuttFaq/Charset

Eterm definitely does not support UTF-8. It's possible XFCE doesn't
either. It doesn't sound like you actually need UTF-8, but if you do,
xterm -u8 should support it just fine.

Unfortunately we don't have a lot of internationalization experience
currently, so assistance with that is most welcome.

Michael

--
Michael Jennings (a.k.a. KainX) http://www.kainx.org/
Linux Server/Cluster Admin, LBL.gov Author, Eterm (www.eterm.org)
-----------------------------------------------------------------------
"A kiss is a lovely trick designed by nature to stop speech when
words become superfluous." -- Ingrid Bergman

From stefan at mdy.univie.ac.at Wed Dec 9 01:00:22 2009
From: stefan at mdy.univie.ac.at (stefan at mdy.univie.ac.at)
Date: Wed, 9 Dec 2009 10:00:22 +0100 (CET)
Subject: [Caos] LANG="en_US.UTF-8" causing pain ...
In-Reply-To: <20091209060603.GC16792@kainx.org>
References: <20091208160804.GA6128@loop.mdy.univie.ac.at>
 <20091209060603.GC16792@kainx.org>
Message-ID: <49454.131.130.40.53.1260349222.squirrel@www.mdy.univie.ac.at>

Michael,

thanks for the reply!

> On Tuesday, 08 December 2009, at 17:08:04 (+0100),
> Stefan Boresch wrote:
>
>> Given that I occasionally need "Umlaute" (a,o,u with dots above them
>> ;-), I thought / hoped changing LANG=C to LANG="en_US.UTF-8" in
>> /etc/sysconfig/i18n would do the trick. Unfortunately, no.
>
> If that's all you need, why not use ISO-8859-1?
>

we are (workstation wise) an Ubuntu shop; my Caos desktop is the
"stray dog". Thus, everyone else (including me when I am on other
machines) is on UTF-8, so I'd really like to avoid mixing locales ...

> Eterm definitely does not support UTF-8. It's possible XFCE doesn't
> either. It doesn't sound like you actually need UTF-8, but if you do,
> xterm -u8 should support it just fine.
>

I figured that out soon after I wrote the mail. xterm, uxterm and
xterm -u8 all work fine.

Unfortunately, having sorted through utf-8 issues in terminals, I now
really seem to be stuck with mutt not doing what it should. If I run
mutt in an xterm on an Ubuntu machine (8.04, 9.10) everything is
fine. If I run mutt on caos (in xterm) all Umlauts (and similar stuff)
are garbled in mails. So far no luck figuring this out. The
differences between Ubuntu's and Caos' mutt seem irrelevant (Ubuntu
supports add. encryption libs and lib_idn), and using Ubuntu's
/etc/Muttrc makes the display nicer, but still no Umlauts ...

Should anyone have an idea, please drop me a line ...

Stefan

PS: I realize now that consoletype is not really needed for
term. emulators, but its absence still strikes me as a bug ...

From gartim at gmail.com Wed Dec 9 08:06:06 2009
From: gartim at gmail.com (gary artim)
Date: Wed, 9 Dec 2009 08:06:06 -0800
Subject: [Caos] slurm finding no nodes.
In-Reply-To: <20091203162131.GR31788@infiscale.org> References: <20091202191906.GM31788@infiscale.org> <20091203051634.GP31788@infiscale.org> <20091203162131.GR31788@infiscale.org> Message-ID: Hi -- Time syncing didn't seem to make a difference. More food for thought. At least I can get them online with a scontrol command. cheers, -- gary On Thu, Dec 3, 2009 at 8:21 AM, Greg Kurtzer wrote: > Hi! > > Ah-ha. Thats a good catch! Let me know if syncing the time helps! > > Greg > > On Thursday, 03 December 2009, at 06:37:57 (-0800), > gary artim wrote: > >> Hi Greg -- >> >> I did try this a while back, but can redo the test. Out of the office >> till Monday, will get back then. Is there any other service that needs >> to run to permit compute nodes to be auto recognized? I read the slurm >> faq and trouble shooting pages and they did mention that if the nodes >> time-of-day are inconsistent with the head node the slurm would not >> activate them. thanks much, >> >> -- Gary >> >> On Wed, Dec 2, 2009 at 9:16 PM, Greg Kurtzer wrote: >> > Hi Gary, >> > >> > Can you try to change the "NodeName=" to only reference your nodes (e.g. >> > "n[0000-0002]") and remove the DownNodes line from the config. Then >> > repropogate the configuration to the nodes and restart all daemons. >> > >> > Let me know if that helps! >> > >> > Thanks, >> > Greg >> > >> > >> > On Wednesday, 02 December 2009, at 15:00:03 (-0800), >> > gary artim wrote: >> > >> >> Hi Greg -- >> >> >> >> scontrol update NodeName=n0000 State=resume >> >> >> >> get them up and available. wierd, maybe a network timing issue? >> >> >> >> i could script add a script as a circumvention, but still an >> >> outstanding problem. >> >> Let me know if you want any debug output. 
>> >> >> >> -- Gary >> >> >> >> >> >> >> >> On Wed, Dec 2, 2009 at 1:33 PM, gary artim wrote: >> >> > did check, diff'ed them, >> >> > attached slurm.com >> >> > >> >> > On Wed, Dec 2, 2009 at 11:19 AM, Greg Kurtzer wrote: >> >> >> Hi Gary, >> >> >> >> >> >> Can you verify that the contents of /etc/slurm are the same on the master >> >> >> and the nodes? If so, can you forward your slurm.conf? >> >> >> >> >> >> Greg >> >> >> >> >> >> On Wednesday, 02 December 2009, at 10:00:11 (-0800), >> >> >> gary artim wrote: >> >> >> >> >> >>> did a full install and checked if i could ping my headnode (hn0000) >> >> >>> from computenode (n0000) and yes >> >> >>> it worked. yet sinfo only shows available hn0000, ie no compute nodes. >> >> >>> tried restarting all nodes no dice. >> >> >>> >> >> >>> should I move this problem to the slurm list? thanks much... >> >> >>> >> >> >>> On Tue, Dec 1, 2009 at 5:46 PM, gary artim wrote: >> >> >>> > slight revision -- I never tested ping by slurm.conf hostname. will try...Gary >> >> >>> > >> >> >>> > On Tue, Dec 1, 2009 at 5:44 PM, gary artim wrote: >> >> >>> >> Hi Greg -- >> >> >>> >> >> >> >>> >> Thanks much for getting back. I did check and could ping the headnode >> >> >>> >> by ipaddr, but not by >> >> >>> >> hostname. Been really frustrated by this, and began reinstalling, >> >> >>> >> doing full install. >> >> >>> >> I had just done a cluster install only. Will email >> >> >>> >> tomorrow if this helps. With my limited knowledge of both slurm and >> >> >>> >> caos, this is what I think i >> >> >>> >> need to do to get slurm working: >> >> >>> >> >> >> >>> >> 1) (sidekick) perceus. >> >> >>> >> 2) (sidekick) vnfs, for the compute nodes. >> >> >>> >> 3) (sidekick) slurm >> >> >>> >> 4) mount, cp, unmount the vnfs -- copying the /etc/slurm/* to the vnfs >> >> >>> >> 5) roll out the vnfs (reboot or perceus command) >> >> >>> >> >> >> >>> >> I could be all wet here and the full install does some of this. 
>> >> >>> >> >> >> >>> >> btw, I really like the speed of caos, but now i find i'm more trigger >> >> >>> >> happy and get in my >> >> >>> >> own way. Thanks for guidance... >> >> >>> >> >> >> >>> >> -- Gary >> >> >>> >> >> >> >>> >> >> >> >>> >> On Tue, Dec 1, 2009 at 5:07 PM, Greg Kurtzer wrote: >> >> >>> >>> Hi Gary, >> >> >>> >>> >> >> >>> >>> How did you configure SLURM? Is it the default sidekick config? >> >> >>> >>> >> >> >>> >>> Check the nodes themselves to make sure they are running the slurm client >> >> >>> >>> daemons and that they can reach the master by the hostname specified in the >> >> >>> >>> slurm.conf. >> >> >>> >>> >> >> >>> >>> Thanks, >> >> >>> >>> Greg >> >> >>> >>> >> >> >>> >>> On Tuesday, 01 December 2009, at 11:44:36 (-0800), >> >> >>> >>> gary artim wrote: >> >> >>> >>> >> >> >>> >>>> HI (newbie) -- >> >> >>> >>>> >> >> >>> >>>> I'm running a small stateless cluster with the default slurm config, >> >> >>> >>>> defined through sidekick. >> >> >>> >>>> When I do an sinfo or scontrol show node n0000 I get: node not found. >> >> >>> >>>> Has anyone experienced this? >> >> >>> >>>> Last week I spend hours trying rebooting, etc, then suddenly they >> >> >>> >>>> appeared, now after a reboot I'm back >> >> >>> >>>> to no nodes. If I do perceus node list it shows: >> >> >>> >>>> n0000 >> >> >>> >>>> n0001 >> >> >>> >>>> >> >> >>> >>>> Any help would be great, thanks, >> >> >>> >>>> >> >> >>> >>>> -- Gary >> >> >>> >>>> _______________________________________________ >> >> >>> >>>> Caos mailing list >> >> >>> >>>> Caos at lists.infiscale.org >> >> >>> >>>> http://lists.infiscale.org/mailman/listinfo/caos >> >> >>> >>> >> >> >>> >>> -- >> >> >>> >>> Greg M. Kurtzer >> >> >>> >>> Chief Technology Officer >> >> >>> >>> HPC Systems Architect >> >> >>> >>> Infiscale, Inc. 
- http://www.infiscale.com >> >> >>> >>> _______________________________________________ >> >> >>> >>> Caos mailing list >> >> >>> >>> Caos at lists.infiscale.org >> >> >>> >>> http://lists.infiscale.org/mailman/listinfo/caos >> >> >>> >>> >> >> >>> >> >> >> >>> > >> >> >>> _______________________________________________ >> >> >>> Caos mailing list >> >> >>> Caos at lists.infiscale.org >> >> >>> http://lists.infiscale.org/mailman/listinfo/caos >> >> >> >> >> >> -- >> >> >> Greg M. Kurtzer >> >> >> Chief Technology Officer >> >> >> HPC Systems Architect >> >> >> Infiscale, Inc. - http://www.infiscale.com >> >> >> _______________________________________________ >> >> >> Caos mailing list >> >> >> Caos at lists.infiscale.org >> >> >> http://lists.infiscale.org/mailman/listinfo/caos >> >> >> >> >> > >> >> _______________________________________________ >> >> Caos mailing list >> >> Caos at lists.infiscale.org >> >> http://lists.infiscale.org/mailman/listinfo/caos >> > >> > -- >> > Greg M. Kurtzer >> > Chief Technology Officer >> > HPC Systems Architect >> > Infiscale, Inc. - http://www.infiscale.com >> > _______________________________________________ >> > Caos mailing list >> > Caos at lists.infiscale.org >> > http://lists.infiscale.org/mailman/listinfo/caos >> > >> _______________________________________________ >> Caos mailing list >> Caos at lists.infiscale.org >> http://lists.infiscale.org/mailman/listinfo/caos > > -- > Greg M. Kurtzer > Chief Technology Officer > HPC Systems Architect > Infiscale, Inc. - http://www.infiscale.com > _______________________________________________ > Caos mailing list > Caos at lists.infiscale.org > http://lists.infiscale.org/mailman/listinfo/caos > From mej at caoslinux.org Wed Dec 9 14:30:06 2009 From: mej at caoslinux.org (Michael Jennings) Date: Wed, 9 Dec 2009 14:30:06 -0800 Subject: [Caos] LANG="en_US.UTF-8" causing pain ... 
In-Reply-To: <49454.131.130.40.53.1260349222.squirrel@www.mdy.univie.ac.at> References: <20091208160804.GA6128@loop.mdy.univie.ac.at> <20091209060603.GC16792@kainx.org> <49454.131.130.40.53.1260349222.squirrel@www.mdy.univie.ac.at> Message-ID: <20091209223006.GB16721@kainx.org> On Wednesday, 09 December 2009, at 10:00:22 (+0100), stefan at mdy.univie.ac.at wrote: > Unfortunately, having sorted through utf-8 issues in terminals, I > now really seem to be stuck with mutt not doing what it should. If I > run mutt in an xterm on an Ubuntu machine (8.04, 9.10) everything is > fine. If I run mutt on caos (in xterm) all Umlauts (and similar > stuff) are garbled in mails. So far no luck figuring this out. The > differences between Ubuntu's and Caos' mutt seem irrelevant (Ubuntu > supports add. encryption libs and lib_idn), and using Ubuntu's > /etc/Muttrc makes the display nicer, but still no Umlauts ... If you run xterm on Ubuntu, ssh to Caos, and run mutt, what happens? How about going the other way? > PS: I realize now that consoletype is not really needed for term. emulators, > but its absence still strikes me as bug ... What do you mean by "consoletype?" Michael -- Michael Jennings (a.k.a. KainX) http://www.kainx.org/ Linux Server/Cluster Admin, LBL.gov Author, Eterm (www.eterm.org) ----------------------------------------------------------------------- "Even in my heart I see you're not being true to me. Deep within my soul I feel nothing's like it used to be." -- Backstreet Boys, "Quit Playing Games (With My Heart)" From mehna.jain at orkash.com Wed Dec 9 20:09:39 2009 From: mehna.jain at orkash.com (Mehna) Date: Thu, 10 Dec 2009 09:39:39 +0530 Subject: [Caos] How to start with Abstractual?? Message-ID: <4B207483.70503@orkash.com> Hi I have installed the abstractual tool for cloud virtualization on my Caos NSA 1.0.25 using smart installer. But I am not able to start with it. Is there any documentation for it?? Can any of you provide me any help in this regard. 
Best Regards Mehna Jain From stefan at mdy.univie.ac.at Wed Dec 9 23:40:52 2009 From: stefan at mdy.univie.ac.at (Stefan Boresch) Date: Thu, 10 Dec 2009 08:40:52 +0100 Subject: [Caos] LANG="en_US.UTF-8" causing pain ... In-Reply-To: <20091209223006.GB16721@kainx.org> References: <20091208160804.GA6128@loop.mdy.univie.ac.at> <20091209060603.GC16792@kainx.org> <49454.131.130.40.53.1260349222.squirrel@www.mdy.univie.ac.at> <20091209223006.GB16721@kainx.org> Message-ID: <20091210074052.GB18124@loop.mdy.univie.ac.at> Dear Michael, replies in reverse order! On Wed, Dec 09, 2009 at 02:30:06PM -0800, Michael Jennings wrote: > On Wednesday, 09 December 2009, at 10:00:22 (+0100), > stefan at mdy.univie.ac.at wrote: > > PS: I realize now that consoletype is not really needed for term. emulators, > > but its absence still strikes me as bug ... > > What do you mean by "consoletype?" > See beginning of first mail in the thread. If LANG is set to utf8, then /etc/profile.d/lang.sh wants to do something with /sbin/consoletype which is not present in caos. (it can be found in the initscripts package of centos/RH). However, I believe this has nothing to do with my problem, this would apply to enabling utf-8 support on non-X consoles ... > > Unfortunately, having sorted through utf-8 issues in terminals, I [snip] > > If you run xterm on Ubuntu, ssh to Caos, and run mutt, what happens? > How about going the other way? > caos/xterm -> ubuntu, mutt there == everything works I did not try the other direction, and, given some pending deadlines I have reverted my workstation to Ubuntu. I have a secondary machine on which I can eventually test this ... Best regards, Stefan -- Stefan Boresch Institute for Computational Biological Chemistry University of Vienna, Waehringerstr. 
17 A-1090 Vienna, Austria Phone: -43-1-427752715 Fax: -43-1-427752790 From sound.insulation at gmx.com Thu Dec 10 01:50:49 2009 From: sound.insulation at gmx.com (sound.insulation at gmx.com) Date: Thu, 10 Dec 2009 10:50:49 +0100 Subject: [Caos] iscsi automatic login at startup Message-ID: <20091210095049.85480@gmx.net> Hi, thank you very much for providing Caos + Perceus. I am setting up a small compute cluster (master + 7 nodes) and everything is working so far. When I wanted to add a iscsi-to-sas box (infortrend S16E-R1130) I found a problem setting up the iscsi connection. Node discovery was ok, login works, everything was fine, except that the "node.startup = automatic" setting in /etc/iscsi/iscsi.conf and the respective settings in /var/lib/iscsi/nodes/... /default had no effect and there was no automatic startup at boot or start/restart of iscsi. I think I solved the problem by fixing /etc/init.d/iscsi in line 42 the "&&" must be replaced by "||", else the script exits at this line in every case Thank you very much, Ennes From gmk at infiscale.org Thu Dec 10 20:23:54 2009 From: gmk at infiscale.org (Greg Kurtzer) Date: Thu, 10 Dec 2009 20:23:54 -0800 Subject: [Caos] iscsi automatic login at startup In-Reply-To: <20091210095049.85480@gmx.net> References: <20091210095049.85480@gmx.net> Message-ID: <20091211042354.GG31788@infiscale.org> Hi Ennes, Please test iscsi-initiator-2.0.870.3-2.caos in the "testing" package repository. http://wiki.caoslinux.org/Activating_nsa-testing_repo_with_Sidekick Thanks for pointing this out, and welcome to the club! :) Greg On Thursday, 10 December 2009, at 10:50:49 (+0100), sound.insulation at gmx.com wrote: > Hi, > > thank you very much for providing Caos + Perceus. I am setting up a small compute cluster (master + 7 nodes) and everything is working so far. > When I wanted to add a iscsi-to-sas box (infortrend S16E-R1130) I found a problem setting up the iscsi connection. 
> > Node discovery was ok, login works, everything was fine, except that the > > "node.startup = automatic" setting in /etc/iscsi/iscsi.conf and the respective settings in /var/lib/iscsi/nodes/... /default had no effect and there was no automatic startup at boot or start/restart of iscsi. > > I think I solved the problem by fixing /etc/init.d/iscsi > > in line 42 the "&&" must be replaced by "||", else the script exits at this line in every case > > Thank you very much, > > Ennes > _______________________________________________ > Caos mailing list > Caos at lists.infiscale.org > http://lists.infiscale.org/mailman/listinfo/caos -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com From stefan at mdy.univie.ac.at Mon Dec 14 00:04:22 2009 From: stefan at mdy.univie.ac.at (Stefan Boresch) Date: Mon, 14 Dec 2009 09:04:22 +0100 Subject: [Caos] postfix cron job Message-ID: <20091214080422.GA18968@loop.mdy.univie.ac.at> I have finally set up a caos system far enough that mail is sent as from any other workstation in the local network. In doing so, I replaced the caos/rpm provided main.cf by our custom main.cf (basically doing very little aside from specifying a smart host). However, the cron.daily/postfix job coming with the postfix rpm relies on command_directory being specified in main.cf, which we do not do, as everything is installed in the default location anyway. Thus, I now receive daily mails from a broken cron job. The quick fix is easy: replace the fancy sed stuff by the hardwired path (exec /usr/sbin/postfix check). However, I assume this cron job is going to be overwritten by the next postfix update rpm ... To me, the sed trickery seems a bit overkill and should maybe be only used if the postfix executable cannot be found otherwise .... I guess if /usr/sbin/postfix exists, it should be called directly; if not, then one could use the trick to search for it ...
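A minimal sketch of that fallback order, under stated assumptions: the helper name find_postfix and its two optional arguments are made up for illustration, the sed expression is a guess at what such a lookup might look like (it is not the one shipped in the rpm's cron.daily/postfix), and it does not expand $variables inside main.cf.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the postfix binary, or fail.
#   $1 = default binary location (assumed: /usr/sbin/postfix)
#   $2 = main.cf consulted as a fallback (assumed: /etc/postfix/main.cf)
find_postfix() {
    default_bin=${1:-/usr/sbin/postfix}
    main_cf=${2:-/etc/postfix/main.cf}

    # Case 1: the binary sits in the default location; use it directly.
    if [ -x "$default_bin" ]; then
        printf '%s\n' "$default_bin"
        return 0
    fi

    # Case 2: fall back to command_directory from main.cf, if it is set.
    # The last matching line wins; surrounding whitespace is trimmed.
    cmd_dir=$(sed -n '/^command_directory[[:space:]]*=/{s/^[^=]*=[[:space:]]*//;s/[[:space:]]*$//;p;}' \
        "$main_cf" 2>/dev/null | tail -n 1)
    if [ -n "$cmd_dir" ] && [ -x "$cmd_dir/postfix" ]; then
        printf '%s\n' "$cmd_dir/postfix"
        return 0
    fi

    # Neither location worked.
    return 1
}

# In the cron job one would then run something like:
#   bin=$(find_postfix) && exec "$bin" check
```

With this ordering a main.cf without command_directory no longer breaks the job, as long as the binary is in the default place.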
Even this may be too much, according to my understanding of the postfix system, the postfix executable is expected to be in root's path... -- i.e., if someone has postfix living in some obscure location, why would s/he keep main.cf in /etc/postfix ...? Just my five cents ... Stefan From slaton at berkeley.edu Mon Dec 14 10:54:21 2009 From: slaton at berkeley.edu (Slaton Lipscomb) Date: Mon, 14 Dec 2009 10:54:21 -0800 Subject: [Caos] Ganglia annoyances Message-ID: <89cd2f950912141054l112e8f16l28634a897658884c@mail.gmail.com> I'm experiencing some annoyance with getting Ganglia working on my perceus cluster. I'm using the packages provided by caos-nsa. Specifically, I've installed are ganglia, ganglia-gmetad, and ganglia-web (master node), and ganglia (interactive & compute nodes). On the master node, gmond appears to function properly, but it is writing this error x5 to syslog every 30 seconds: slurpfile() open() error on file /proc/net/dev: No such file or directory A quick search revealed only a brief thread on the ganglia list from a couple of years ago (from Bernard, actually) which suggests that this is caused by "the fact that the kernel doesn't have 'cpufreq' ACPI governor loaded". http://www.mail-archive.com/ganglia-developers at lists.sourceforge.net/msg03204.html There is indeed a message that appears during my master node's boot about a cpu governor not being loaded, although I don't have access to the console at the moment and can't fetch the exact error. Is this a concern, or can this be ignored? If the latter, Bernard (or anyone else) is there a way to suppress this error? On the compute nodes, gmond initially refused to start. Running with --debug 1 gave: udp_recv_channel mcast_join=NULL mcast_if=NULL port=8649 bind=10.0.7.250 Error creating UDP server on port 8649 bind=10.0.7.250. Exiting. I was able to work around this by setting deaf = true in gmond.conf, which is acceptable as the compute nodes don't need to listen to any other ganglia nodes. 
The next issue was that by default ganglia services run as user 'nobody'. This user didn't (apparently) initially have write permission to /var/lib/ganglia/rrds/[clustername], and so wasn't creating the rrd files used to track the various ganglia metrics. As a result all metrics were listed as '0', boot time is December 1969, etc. The workaround was for each compute node, to start gmond manually as root just once so that these files could be created. Subsequently gmond ran properly as user nobody. So I consider this resolved on the compute nodes. There are also PHP issues with the ganglia web frontend. It fills apache errors logs with thousands of messages like this: PHP Warning: gettimeofday() [function.gettimeofday]: It is not safe to rely on the system's timezone settings. You are *required* to use the date.timezone setting or the date_default_timezone_set() function. In case you used any of those methods and you are still getting this warning, you most likely misspelled the timezone identifier. We selected 'America/Los_Angeles' for 'PST/-8.0/no DST' instead in /srv/www/html/ganglia/ganglia.php on line 340, referer: http://10.0.3.250/ PHP Notice: Undefined index: cpu_num in /srv/www/html/ganglia/cluster_view.php on line 86, referer: http://10.0.3.250/ PHP Notice: Undefined index: load_one in /srv/www/html/ganglia/cluster_view.php on line 88, referer: http://10.0.3.250/ PHP Warning: number_format() expects parameter 1 to be double, string given in /srv/www/html/ganglia/graph.d/metric.php on line 52, referer: http://10.0.3.250/ganglia/ ERROR: opening '/ql01-admin/load_one.rrd': No such file or directory' Any suggestions on how to correct (or suppress) these errors? As an aside, it seems perhaps that the ganglia-web package should also have php-httpd as a dependency. 
thanks slaton From gmk at infiscale.org Mon Dec 14 11:53:07 2009 From: gmk at infiscale.org (Greg Kurtzer) Date: Mon, 14 Dec 2009 11:53:07 -0800 Subject: [Caos] Ganglia annoyances In-Reply-To: <89cd2f950912141054l112e8f16l28634a897658884c@mail.gmail.com> References: <89cd2f950912141054l112e8f16l28634a897658884c@mail.gmail.com> Message-ID: <20091214195307.GM31788@infiscale.org> Hi Slaton, On Monday, 14 December 2009, at 10:54:21 (-0800), Slaton Lipscomb wrote: > I'm experiencing some annoyance with getting Ganglia working on my > perceus cluster. I'm using the packages provided by caos-nsa. > Specifically, I've installed are ganglia, ganglia-gmetad, and > ganglia-web (master node), and ganglia (interactive & compute nodes). > > > On the master node, gmond appears to function properly, but it is > writing this error x5 to syslog every 30 seconds: > > slurpfile() open() error on file /proc/net/dev: No such file or directory > You may want to give the ganglia user (nobody, or create a new user for it) access to /proc by adding them to the "secadm" group in /etc/group. > A quick search revealed only a brief thread on the ganglia list from a > couple of years ago (from Bernard, actually) which suggests that this > is caused by "the fact that the kernel doesn't have 'cpufreq' ACPI > governor loaded". > > http://www.mail-archive.com/ganglia-developers at lists.sourceforge.net/msg03204.html > > There is indeed a message that appears during my master node's boot > about a cpu governor not being loaded, although I don't have access to > the console at the moment and can't fetch the exact error. Is this a > concern, or can this be ignored? If the latter, Bernard (or anyone > else) is there a way to suppress this error? > > > On the compute nodes, gmond initially refused to start. Running with > --debug 1 gave: > > udp_recv_channel mcast_join=NULL mcast_if=NULL port=8649 bind=10.0.7.250 > Error creating UDP server on port 8649 bind=10.0.7.250. Exiting. 
> > I was able to work around this by setting deaf = true in gmond.conf, > which is acceptable as the compute nodes don't need to listen to any > other ganglia nodes. I have no idea why that would be happening. .... weird > The next issue was that by default ganglia services run as user > 'nobody'. This user didn't (apparently) initially have write > permission to /var/lib/ganglia/rrds/[clustername], and so wasn't > creating the rrd files used to track the various ganglia metrics. As a > result all metrics were listed as '0', boot time is December 1969, > etc. Gotcha. That should be fixed in packaging so that the directory is owned by the correct user. Can you create a ticket for this at http://bugs.infiscale.org? > The workaround was for each compute node, to start gmond manually as > root just once so that these files could be created. Subsequently > gmond ran properly as user nobody. So I consider this resolved on the > compute nodes. > > > There are also PHP issues with the ganglia web frontend. It fills > apache errors logs with thousands of messages like this: > > PHP Warning: gettimeofday() [function.gettimeofday]: It is not > safe to rely on the system's timezone settings. You are *required* to > use the date.timezone setting or the date_default_timezone_set() > function. In case you used any of those methods and you are still > getting this warning, you most likely misspelled the timezone > identifier.
We selected 'America/Los_Angeles' for 'PST/-8.0/no DST' > instead in /srv/www/html/ganglia/ganglia.php on line 340, referer: > http://10.0.3.250/ > > PHP Notice: Undefined index: cpu_num in > /srv/www/html/ganglia/cluster_view.php on line 86, referer: > http://10.0.3.250/ > > PHP Notice: Undefined index: load_one in > /srv/www/html/ganglia/cluster_view.php on line 88, referer: > http://10.0.3.250/ > > PHP Warning: number_format() expects parameter 1 to be double, string > given in /srv/www/html/ganglia/graph.d/metric.php on line 52, referer: > http://10.0.3.250/ganglia/ > ERROR: opening '/ql01-admin/load_one.rrd': No such file or directory' > > > Any suggestions on how to correct (or suppress) these errors? These will have to be answered by a Ganglia user. Perhaps Bernard is lurking. ;) > As an aside, it seems perhaps that the ganglia-web package should also > have php-httpd as a dependency. Can you create a ticket for this as well in bugs.infiscale.org? Thanks! -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com From slaton at berkeley.edu Mon Dec 14 18:43:31 2009 From: slaton at berkeley.edu (Slaton Lipscomb) Date: Mon, 14 Dec 2009 18:43:31 -0800 Subject: [Caos] Ganglia annoyances In-Reply-To: <89cd2f950912141842r74ced6ddv41764c56366b566f@mail.gmail.com> References: <89cd2f950912141054l112e8f16l28634a897658884c@mail.gmail.com> <20091214195307.GM31788@infiscale.org> <89cd2f950912141842r74ced6ddv41764c56366b566f@mail.gmail.com> Message-ID: <89cd2f950912141843q70f6937fn5535990f9e80f497@mail.gmail.com> Hi Greg, >> On the master node, gmond appears to function properly, but it is >> writing this error x5 to syslog every 30 seconds: >> >> ?slurpfile() open() error on file /proc/net/dev: No such file or directory >> > You may want to give the ganglia user (nobody, or create a new user for it) > access to /proc by adding them to the "secadm" group in /etc/group. 
That sounded like just the thing, but alas after adding nobody to 'secadm' and restarting gmond & gmetad, the errors continue. slurpfile() open() error on file /proc/net/dev: No such file or directory update_file() got an error from slurpfile() reading /proc/net/dev >> The next issue was that by default ganglia services run as user >> 'nobody'. This user didn't (apparently) initially have write >> permission to /var/lib/ganglia/rrds/[clustername], and so wasn't >> creating the rrd files used to track the various ganglia metrics. As a >> result all metrics were listed as '0', boot time is December 1969, >> etc. > > Gotcha. That should be fixed in packaging so that the directory is owned by > the correct user. Okay, this issue isn't as I originally characterized it. The issue recurs anytime the cluster nodes are rebooted. gmond service starts automatically, but beyond checking in, the master can't get any other data from the nodes. A manual 'ssh nXX /etc/init.d/gmond restart' is all that is required to get everything working. I hacked /etc/init.d/gmond to redirect output to a file so I could investigate it on the nodes, and this turns out to be the same issue I'm seeing on the master. Lots of these: slurpfile() open() error on file /proc/net/dev: No such file or directory update_file() got an error from slurpfile() reading /proc/net/dev Upon performing the manual restart, these errors go away and the node communicates properly with the master. Thinking this had something to do with the order gmond was being started, I renamed all the S70gmond scripts in rc.d to S99gmond, but this didn't make a difference. I also tested whether the issue was (1) with starting gmond during boot, or (2) that the 1st time gmond runs it always throws these errors. I disabled the service from starting at boot in vnfs, rebooted a node, and started the service manually. No errors. So there is something that prevents gmond from reading /proc/net/dev during the boot process.
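One way to narrow this down might be to poll the file from the init script before gmond is launched and log how long it takes to become readable. The following is only a diagnostic sketch under assumptions: wait_readable is a hypothetical helper, not something shipped with ganglia or caos, and whether a delay like this would actually help on these nodes is untested.

```shell
#!/bin/sh
# Hypothetical diagnostic helper, not part of the ganglia init script.
# Tries to read FILE once per second, up to TRIES times, and reports
# how many retries were needed.
#   $1 = file to probe (e.g. /proc/net/dev)
#   $2 = maximum number of attempts (default: 10)
wait_readable() {
    file=$1
    tries=${2:-10}
    i=0
    while [ "$i" -lt "$tries" ]; do
        # Actually open the file, as gmond would, instead of only
        # checking permission bits.
        if cat "$file" > /dev/null 2>&1; then
            echo "$file readable after $i retries"
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "$file still not readable after $tries tries" >&2
    return 1
}

# Possible use near the top of /etc/init.d/gmond (assumption):
#   wait_readable /proc/net/dev 30 >> /tmp/gmond-boot.log 2>&1
```

If the log shows a non-zero retry count at boot but zero on a manual start, that would point at something becoming available late in the boot sequence rather than at gmond itself.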
Configuring gmond to run as root makes all errors go away (on master and compute nodes), but that's not a good solution of course. How does the mechanism work whereby members of secadm group can access /proc? I assume it's through grsec. Is there a reason why this mechanism wouldn't be functional during the boot process? >> There are also PHP issues with the ganglia web frontend. It fills >> apache errors logs with thousands of messages like this: >> >> PHP Warning: gettimeofday() [function.gettimeofday]: It is not >> safe to rely on the system's timezone settings. You are *required* to >> use the date.timezone setting or the date_default_timezone_set() >> function. In case you used any of those methods and you are still >> getting this warning, you most likely misspelled the timezone >> identifier. We selected 'America/Los_Angeles' for 'PST/-8.0/no DST' >> instead in /srv/www/html/ganglia/ganglia.php on line 340, referer: >> http://10.0.3.250/ >> >> PHP Notice: Undefined index: cpu_num in >> /srv/www/html/ganglia/cluster_view.php on line 86, referer: >> http://10.0.3.250/ >> >> PHP Notice: Undefined index: load_one in >> /srv/www/html/ganglia/cluster_view.php on line 88, referer: >> http://10.0.3.250/ >> >> PHP Warning: number_format() expects parameter 1 to be double, string >> given in /srv/www/html/ganglia/graph.d/metric.php on line 52, referer: >> http://10.0.3.250/ganglia/ >> ERROR: opening '/ql01-admin/load_one.rrd': No such file or directory' >> >> >> Any suggestions on how to correct (or suppress) these errors? > > These will have to be answered by a Ganglia user. Perhaps Bernard is lurking. > > ;) indeed... ;) >> As an aside, it seems perhaps that the ganglia-web package should also >> have php-httpd as a dependency. > > Can you create a ticket for this as well in bugs.infiscale.org? No problem.
thanks slaton From sound.insulation at gmx.com Wed Dec 16 05:21:14 2009 From: sound.insulation at gmx.com (sound.insulation at gmx.com) Date: Wed, 16 Dec 2009 14:21:14 +0100 Subject: [Caos] iscsi automatic login at startup In-Reply-To: <20091211042354.GG31788@infiscale.org> References: <20091210095049.85480@gmx.net> <20091211042354.GG31788@infiscale.org> Message-ID: <20091216132114.120590@gmx.net> Hi Greg, the package from "testing" seems to work. Thank you for fixing this. BTW: What is the preferred way to mount the iscsi-dev automatically after automatic iscsi login? I inserted: sleep 10; mount -a -O _netdev in /etc/init.d/iscsi and use that together with suitable fstab entries. But this is surely not a clean way to accomplish mounting. The problem is that it may take some time until the iscsi device becomes available for mounting even after the iscsiadm autologin command exits, so it may fail if mounting is tried too soon after iscsi login. Ennes > Hi Ennes, > > Please test iscsi-initiator-2.0.870.3-2.caos in the "testing" package > repository. > > http://wiki.caoslinux.org/Activating_nsa-testing_repo_with_Sidekick > > Thanks for pointing this out, and welcome to the club! :) > > Greg > > > > On Thursday, 10 December 2009, at 10:50:49 (+0100), > sound.insulation at gmx.com wrote: > > > Hi, > > > > thank you very much for providing Caos + Perceus. I am setting up a > small compute cluster (master + 7 nodes) and everything is working so far. > > When I wanted to add a iscsi-to-sas box (infortrend S16E-R1130) I found > a problem setting up the iscsi connection. > > > > Node discovery was ok, login works, everything was fine, except that the > > > > "node.startup = automatic" setting in /etc/iscsi/iscsi.conf and the > respective settings in /var/lib/iscsi/nodes/... /default had no effect and > there was no automatic startup at boot or start/restart of iscsi.
> > > > I think I solved the problem by fixing /etc/init.d/iscsi > > > > in line 42 the "&&" must be replaced by "||", else the script exits at > this line in every case > > > > Thank you very much, > > > > Ennes > > _______________________________________________ > > Caos mailing list > > Caos at lists.infiscale.org > > http://lists.infiscale.org/mailman/listinfo/caos > > -- > Greg M. Kurtzer > Chief Technology Officer > HPC Systems Architect > Infiscale, Inc. - http://www.infiscale.com > _______________________________________________ > Caos mailing list > Caos at lists.infiscale.org > http://lists.infiscale.org/mailman/listinfo/caos From slaton at berkeley.edu Wed Dec 16 12:57:25 2009 From: slaton at berkeley.edu (Slaton Lipscomb) Date: Wed, 16 Dec 2009 12:57:25 -0800 Subject: [Caos] Ganglia annoyances In-Reply-To: <89cd2f950912141843q70f6937fn5535990f9e80f497@mail.gmail.com> References: <89cd2f950912141054l112e8f16l28634a897658884c@mail.gmail.com> <20091214195307.GM31788@infiscale.org> <89cd2f950912141842r74ced6ddv41764c56366b566f@mail.gmail.com> <89cd2f950912141843q70f6937fn5535990f9e80f497@mail.gmail.com> Message-ID: <89cd2f950912161257tc1233c1o21d1afc5fea373f3@mail.gmail.com> While I'm still hoping for additional input on these issues (perhaps I should post to the perceus list), I thought I'd add one small clarification regarding the outstanding issue of running ganglia on the compute nodes. Even when running as root, ganglia doesn't initialize or set itself up properly on a first-time provisioned compute node when it is registered as a service to start at boot. The various rrd files under /var/lib/ganglia/rrds/[clustername] are not created for this node. root needs to log into the compute node, stop the service, and restart it manually. This causes these files to be created. Although I find this behavior odd, I don't mind adding it to my own protocol for the provisioning of a new node. 
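As a rough sketch, that provisioning step could be scripted with the same /etc/init.d/gmond restart call over ssh. The function name restart_gmond and the RUNNER variable are made up for illustration, and the node names passed in are placeholders.

```shell
#!/bin/sh
# Hypothetical provisioning helper: restart gmond on each node given
# as an argument. RUNNER defaults to ssh; it is only a variable so the
# loop can be dry-run with something harmless such as echo.
restart_gmond() {
    runner=${RUNNER:-ssh}
    for node in "$@"; do
        # The init script's restart action stops and then starts the service.
        "$runner" "$node" /etc/init.d/gmond restart ||
            echo "gmond restart failed on $node" >&2
    done
}

# Example (placeholder node names):
#   restart_gmond n0000 n0001
# Dry run that only prints what would be executed:
#   RUNNER=echo restart_gmond n0000 n0001
```

A failure on one node is reported on stderr and does not stop the loop, so the remaining nodes are still handled.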
What I'm most interested in at this point is making these errors go away, on both the ganglia master and clients: slurpfile() open() error on file /proc/net/dev: No such file or directory update_file() got an error from slurpfile() reading /proc/net/dev ...as well as the incredibly verbose PHP errors and warnings generated by php-web. thanks slaton >>> The next issue was that by default ganglia services run as user >>> 'nobody'. This user didn't (apparently) initially have write >>> permission to /var/lib/ganglia/rrds/[clustername], and so wasn't >>> creating the rrd files used to track the various ganglia metrics. As a >>> result all metrics were listed as '0', boot time is December 1969, >>> etc. >> >> Gotcha. That should be fixed in packaging so that the directory is owned by >> the correct user. > > Okay, this issue isn't as I originally characterized it. > > The issue recurs anytime the cluster nodes are rebooted. gmond service > starts automatically, but beyond checking in, the master can't get any > other data from the nodes. A manual 'ssh nXX /etc/init.d/gmond > restart' is all that is required to get everything working. > > I hacked /etc/init.d/gmond to redirect output to a file so I could > investigate it on the nodes, and this turns out to be the same issue > I'm seeing on the master. Lots of these: > > slurpfile() open() error on file /proc/net/dev: No such file or directory > update_file() got an error from slurpfile() reading /proc/net/dev > > Upon performing the manual restart, these errors go away and the node > communicates properly with the master. > > Thinking this had something to do with the order gmond was being > started, I renamed all the S70gmond scripts in rc.d to S99gmond, but > this didn't make a difference. > > I also tested whether the issue was (1) with starting gmond during > boot, or (2) that the 1st time gmond runs it always throws these > errors.
I disabled the service from starting at boot in vnfs, rebooted > a node, and started the service manually. No errors. So there is > something that prevents gmond from reading /proc/net/dev during the > boot process. > > Configuring gmond to run as root makes all errors go away (on master > and compute nodes), but that's not a good solution of course. > > How does the mechanism work whereby members of secadm group can access > /proc? I assume it's through grsec. Is there a reason why this > mechanism wouldn't be functional during the boot process? From mrozek at chemia.uj.edu.pl Mon Dec 21 03:43:33 2009 From: mrozek at chemia.uj.edu.pl (Janusz Mrozek) Date: Mon, 21 Dec 2009 12:43:33 +0100 Subject: [Caos] upgrade problem Message-ID: <4B2F5F65.2030801@chemia.uj.edu.pl> I am trying to set up a small cluster with a head node based on an Asus P5Q WS motherboard (Intel P45 chipset, Intel Q9400 processor) and two compute nodes (with Gigabyte EG45M-DS2H motherboard and Intel Q9400 processor). It was running without any problems when set up with an "old" CAOS NSA installer v. 1.0.8. I had to repartition the disk on the head node and decided to use the new and improved v. 1.0.25 installer, which started from a DVD disk without any problems, but when attempting to install the system the installer simply crashed. After reading the recent letters by Stefan Boresch and Greg Kurtzer I took the experimental 1.0.29 installer which, again, was able to run smoothly from a DVD, but when ordered to install the system gave a message: "Unable to mount CDROM" and quit. Exactly the same behaviour was observed when trying to use this installer on one of the compute nodes. We are about to get a couple of new compute nodes based on P55 chipsets and I7 processors so I thought it would be wise to use the most recent kernel. I went back to the working 1.0.8 installer and after the CAOS system was installed and running I tried a "smart upgrade" option.
Everything appeared to work smoothly with one exception: eth0 is able to communicate only with our local network; it does not reach the gateway or anything beyond it. I checked the iptables rules and they are empty, so it does not look like a firewall problem. What else should I check? With best regards Janusz From mrozek at chemia.uj.edu.pl Mon Dec 21 06:40:14 2009 From: mrozek at chemia.uj.edu.pl (Janusz Mrozek) Date: Mon, 21 Dec 2009 15:40:14 +0100 Subject: [Caos] upgrade problem - solved Message-ID: <4B2F88CE.9080207@chemia.uj.edu.pl> It turned out to be a wrong routing setting. Still do not know how this happened. With best regards Janusz From gmk at infiscale.org Mon Dec 21 09:22:45 2009 From: gmk at infiscale.org (Greg Kurtzer) Date: Mon, 21 Dec 2009 09:22:45 -0800 Subject: [Caos] upgrade problem - solved In-Reply-To: <4B2F88CE.9080207@chemia.uj.edu.pl> References: <4B2F88CE.9080207@chemia.uj.edu.pl> Message-ID: <20091221172244.GB6697@infiscale.org> Hi Janusz, I am glad to hear you solved the routing issue, but I am curious as to what the problem was with the installer. Can you elaborate a bit further as to what happened with the DVD install of 1.0.29 and the hardware information? Thanks, Greg On Monday, 21 December 2009, at 15:40:14 (+0100), Janusz Mrozek wrote: > It turned out to be a wrong routing setting. Still do not know how this > happened. > > With best regards > Janusz > > _______________________________________________ > Caos mailing list > Caos at lists.infiscale.org > http://lists.infiscale.org/mailman/listinfo/caos -- Greg M. Kurtzer Chief Technology Officer HPC Systems Architect Infiscale, Inc. - http://www.infiscale.com