Tuesday, 2010-05-18

*** nkinkade has joined #cc14:52
*** Pascalcmoi has joined #cc15:22
PascalcmoiDoes a website using the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License mean that only USA residents can legally read the page?15:23
paroneayeaPascalcmoi: no, it just means that the legal code is tuned particularly to the legal system of United States law15:23
Pascalcmoithanks paroneayea15:24
paroneayeankinkade: Yo15:31
paroneayeaah, never mind15:31
paroneayeawe're running sanity again15:38
paroneayeanow with caching of RDF queries and New Zealand added to jurisdictions.rdf15:39
nkinkadeparoneayea: What's the bit about cc.engine taking a while to start up after restarting Apache?16:53
nkinkadeHow long?16:53
paroneayeankinkade: a few seconds16:53
nkinkadeOh, that's not too bad.  Do you think the problems with paster running away with memory should be gone too?16:54
nkinkadeBefore, a script had to run every few minutes to check memory usage and would have to kill paster about every 7 or 8 hours to get it to release memory.16:54
paroneayeaI hope so, but I'm not sure what would have been causing the memory load in paster16:55
nkinkadeNRY had some ideas, but I never knew/understood just what they were.16:55
nkinkadeWe'll know soon enough, because I get mails when the script reloads paster, but I suspect I'll need to change the script, since pid files may have moved around.16:55
paroneayeathat's part of the thing16:56
paroneayeawe switched this over to fastcgi so16:56
paroneayeathere's no separate daemon16:56
paroneayeaapache / mod_fcgid is starting and managing the process16:56
paroneayeaand handling forking and etc16:56
nkinkadeparoneayea: Cool and I see that the cc.engine pid is still at /var/run/cc.engine.pid16:57
nkinkadeAnd the script should still work if necessary.16:57
paroneayeahm, no.. I don't think that does anything16:57
paroneayeaI just haven't shut off the old system16:57
paroneayeain case we need to switch things over fast16:57
paroneayeaso technically we are running doubletime16:58
paroneayeatwo cc.engines are running right now16:58
nkinkadeHow long does it take to shut the old system down?  Nothing more than /etc/init.d/cc-engine stop/start, right?16:58
paroneayeayou have to do it from the old cc.engine directory16:58
nkinkadeThe old init script no longer works?16:58
paroneayeait does16:58
paroneayeabut you always seemed to need to run it from there16:59
paroneayeaif I ran it from anywhere else it did nothing16:59
nkinkadeI never found that to be true.  Hmm.16:59
nkinkadeCan I try it now?16:59
paroneayeago for it16:59
paroneayeashouldn't affect the running engine at all16:59
nkinkadeI just ran it from my home directory and it worked.17:00
nkinkadeIt also just reclaimed about 500M of memory.17:00
paroneayeaah :)17:01
nkinkadeWhich can now be used for other things.17:01
nkinkadeLike disc caching, etc.17:01
paroneayeaanyway cc.engine has been running all morning, I've noticed no problems, and nobody's emailed webmaster with any new problems17:02
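The memory watchdog nkinkade describes earlier (check usage every few minutes, restart when it grows too large) could look roughly like the sketch below. It is not the script that was actually in use: it assumes a Linux host with /proc, reuses the /var/run/cc.engine.pid path mentioned above, and the 500 MB limit and function names are hypothetical.

```python
import re

# Hypothetical threshold; the real script's limit is not given in the log.
LIMIT_KB = 500 * 1024


def parse_rss_kb(status_text):
    """Return the resident set size in kB from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def should_restart(rss_kb, limit_kb=LIMIT_KB):
    """True when the process is known to be using more memory than the limit."""
    return rss_kb is not None and rss_kb > limit_kb


def read_status(pid_file="/var/run/cc.engine.pid"):
    """Read the pid file, then return that process's /proc status text."""
    with open(pid_file) as f:
        pid = int(f.read().strip())
    with open("/proc/%d/status" % pid) as f:
        return f.read()


if __name__ == "__main__":
    rss = parse_rss_kb(read_status())
    if should_restart(rss):
        # A real watchdog would restart the service and send mail here.
        print("memory limit exceeded: %s kB" % rss)
```

Run from cron every few minutes, this only reports; the restart and notification steps are left out because the log does not say how they were done.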
paroneayeaheading out to lunch17:11
nkinkadeI just realized that Gmail for Google Apps has been sending all info@ and webmaster@ emails to spam!17:57
paroneayeankinkade: oh no! :(18:14
nkinkadeIt's not uncommon for Google to send legitimate email to spam, which is why I check over it every single day.  I'm just not sure why I didn't catch it earlier.18:16
*** akozak has joined #cc18:50
dithyramblehey akozak18:57
akozakdithyramble, just wondering when you were flying out to see if we could get to the airport together18:58
paulproteusakozak: FWIW I'm heading to greg-g's place Thu night18:58
dithyrambleI'm flying out of LAN at 7:11pm (on United 5702)18:58
paulproteusWanna join us? (-:18:58
paulproteus(Then I'm flying out of DTW that night, I think 7 PM)18:59
paulproteus(er, I'm flying out of DTW Fri night at 7 PM)18:59
paulproteusLet me rephrase it. You should join us.18:59
paulproteusYou can work remotely on Friday, surely.19:00
akozakpaulproteus, thanks, but I shouldn't be away from home that long... will still be in the process of moving.19:00
paulproteusAww, okaaaaayyyyyy.19:00
paroneayeankinkade: regarding cpu usage on a519:27
paroneayeait looks like most often it's all the apache processes19:27
nkinkadeparoneayea: What do you think?19:27
paroneayealooking at top19:27
paroneayeankinkade: what do you say we switch over to nginx19:27
paroneayeakidding kidding kidding19:28
paroneayeain seriousness though, it could simply be that a5 just gets a lot of http traffic all the time19:29
nkinkadeparoneayea: So does a8, much more in fact, but a8 is less loaded.19:29
nkinkadeGranted a8 is mostly small static files.19:29
paroneayeais a8 where the buttons are then?19:29
paroneayeathe embeddable images I mean19:30
paroneayeaI've always thought that must be a huge amount of http traffic19:30
nkinkadeYeah, a8 hosts all those icons and buttons.19:31
nkinkadea5 right now is pumping out steadily around 600KB/s to 800KB/s19:31
paroneayeawhat could make apache really cpu intensive?19:31
paroneayeamaybe a lot of rewrite rules and etc?19:32
nkinkadeVarnish's hitrate seems to hover around 95%, which doesn't seem *too* bad, though I guess it could be better.19:34
nkinkadeAnd we have APC caching PHPs opcode and that all looks nice: http://a5.creativecommons.org/apc.php19:35
nkinkadeA 100% hit rate for APC.19:35
nkinkadea5 does have quite a few rewrite rules.19:35
paroneayeaI turned on the rewrite log for a bit19:36
paroneayeait looked like every static file hit hits a LOT of rules19:36
nkinkadeparoneayea: Yeah, don't leave the log on for long.19:39
nkinkadeEspecially if your loglevel is more than 5 or 619:39
nkinkadeIt will produce vast amounts of data and just slow things down even more.19:39
nkinkadeTo see how many rewrite rules it has to hit, one just needs to look at the vhost config in confg/19:40
nkinkadeWe are also gzipping things when a client can accept it, so that may be taking some CPU, but I would expect Varnish to cache the gzipped response.19:41
paroneayearunning pages from the new license engine through the validator19:42
paroneayeait's not validating, though the templates should be the same as the old engine19:42
paroneayeathough other pages on cc.org aren't validating either :(19:42
nkinkadeMost pages on CC.org have never validated.19:46
paroneayeaLooks like there's another RDF issue with the new engine19:47
paroneayeaas in, since the new engine uses the RDF as the database19:47
paroneayeait's exposing problems we have with missing things in our RDF19:47
nkinkadeparoneayea: Did you just see that webmaster@ email?19:48
paroneayeahttp://creativecommons.org/international/br/ <- the licenses on here not showing up19:48
paroneayeaand here's why:19:49
paroneayeawe only have RDF for 2.5 licenses w/ brazil19:49
paroneayeanot 3.019:49
nkinkadeSeems like maybe the unit tests should check this ... run through each jurisdiction and fetch the deed.  If a 404 comes back, then the test fails.19:49
paroneayeabut based on what data?19:51
paroneayeaaccording to the RDF this is correct19:51
paroneayeawe don't have those licenses in the RDF is what19:51
paroneayeaI could put together a scraper possibly that checks pages like: http://creativecommons.org/international/br/19:51
paroneayeaand looks for all the licenses they're expecting to exist19:52
nkinkadeparoneayea: It's ugly but it wouldn't be hard to scrape international/19:52
nkinkadeThen again paulproteus might just find it beautiful.19:52
nkinkadeHe's like that. :-)19:52
paroneayeawhy don't I roll back the engine and write a tool/test to do that19:53
paroneayeaso we can make sure we're not screwing over any other jurisdictions with missing RDF data19:53
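The check being discussed (fetch each expected deed and fail on a 404) might be sketched as follows. This is only an illustration under assumptions: the list of expected URLs would have to come from somewhere other than the RDF, such as the scraped /international/ pages, and the function names and example URLs are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url):
    """Return the HTTP status code for url; HTTP errors are returned as codes."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def missing_deeds(deed_urls, get_status=http_status):
    """Return the URLs that come back 404, in the order given."""
    return [url for url in deed_urls if get_status(url) == 404]


if __name__ == "__main__":
    # Hypothetical example URL, not taken from the log.
    expected = ["http://creativecommons.org/licenses/by/3.0/br/"]
    for url in missing_deeds(expected):
        print("missing deed:", url)
```

Passing the status lookup in as a parameter lets a test suite swap in a fake, so the check itself can be exercised without network access.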
nkinkadeparoneayea: As a suggestion, you could also look into grabbing the data directly from the WP database.19:58
nkinkadeBut I guess a list of the juris. is not what you need.19:59
nkinkadeAnd the db has nothing about version number, I think.19:59
nkinkadeWhich apparently is what you need.20:01
nkinkadepaulproteus: dithyramble:  Where on a6 is the crawl data being stored?20:04
nkinkadeDo you feel that it's vital that it be backed up?20:04
paulproteusnkinkade: We haven't really done anything with a6. We're developing locally.20:04
nkinkadeSomewhere something is eating up a lot of space in the last week.20:05
paulproteusBased on this conversation, what I'll do is remember to tell you when we deploy and start wanting backups.20:05
nkinkadepaulproteus: Soon it shouldn't matter.20:05
paulproteus(Because of the Rapture?)20:05
nkinkadeOnce we move over to hosting at the ISC (hopefully) we'll have more bandwidth and also a lot more disc space and I intend to backup / from top to bottom.20:06
paulproteusOh my GOD20:06
* paulproteus is so jealous.20:06
nkinkadeBut for the moment, we are still in the CC office and disc space is down to 35G.20:06
paulproteusOh, you just mean backups' hosting?20:06
nkinkadepaulproteus: I was being unfair to you.  I noticed it go up and up over the past week and I automatically assumed it coincided with your return. :-)20:07
paulproteusnkinkade: Heh (-:20:07
nkinkadeIn the past, with resource issues, I could usually single you out and be right about 50% of the time. :-)20:07
nkinkadeWith that type of percentage, I usually just shot first and asked questions later.20:08
nkinkadeAlthough a6 *is* looking suspicious:20:09
nkinkadebackup:/media/1TB/backups/creativecommons# du -sh *20:09
nkinkadeNot to say 281G is unreasonable, but a5 being our "main" machine and using only 20% of the disc space that a6 is using seems odd.20:10
paulproteusI'll leave this to you, nkinkade (-:20:10
nkinkadepaulproteus: So to sum up ... you know of no places on a6 that may have lately been loaded up with lots of data.20:10
paulproteusI know of nothing related to me lately on a6.20:10
paroneayeankinkade: I'm going to start the old engine back up, just fyi20:11
nkinkadeparoneayea: Cool.20:12
paroneayeapaulproteus: I have a question for you, scraping related20:36
paroneayeahttp://creativecommons.org/international/ what's the best way to scrape for the licenses that appear under Completed Licenses vs the ones that appear under Project Jurisdictions?20:36
paroneayeamy guess is that since they aren't distinguished by appearing in separate divs and etc there's no way to really do things via xpath20:37
paroneayeaso will I just have to "iterate until I hit that point"?20:37
paulproteusYeah, I think that's what you have to do.20:39
paulproteusYou could also change the template so they have a class or something.20:39
paroneayeayes I suppose I could look to change that page itself20:39
paroneayeankinkade: are all the /international/ pages managed by wordpress, I assume?20:40
nkinkadeparoneayea: Yes.20:40
paulproteusIf they're managed by WordPress, then I would treat them as unchangeable.20:40
paulproteusAnd just scrape messily.20:40
nkinkadeparoneayea: How about just selecting all divs with class of ifloat in the first div with class icontainer?20:40
nkinkadeI feel like BeautifulSoup would allow for that, but I haven't used it in a while.20:41
paroneayeabecause they're in the same div20:41
nkinkadeparoneayea: From what I can tell, they are in two separate divs.20:42
paroneayeayeah you're right20:42
paroneayeaI was being too reliant on the highlighting via firebug's inspector :)20:43
paroneayeawhich made it look like the div that held those jurisdiction icons was just a block above them20:43
paroneayeawell never mind then, this should be very easy :)20:43
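The selection nkinkade suggests (every div with class ifloat inside the first div with class icontainer) can be sketched without BeautifulSoup, using only Python's standard-library html.parser so the example has no dependencies. The class names come from the discussion above; that the first container holds the Completed Licenses, and that each ifloat div wraps a link, are assumptions about the page.

```python
from html.parser import HTMLParser


class FirstContainerScraper(HTMLParser):
    """Collect link hrefs from div.ifloat elements in the first div.icontainer."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0
        self.container_depth = None  # depth of the first icontainer while it is open
        self.container_done = False  # set once the first icontainer has closed
        self.ifloat_depth = None     # depth of the ifloat div currently open, if any
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self.div_depth += 1
            if (self.container_depth is None and not self.container_done
                    and "icontainer" in classes):
                self.container_depth = self.div_depth
            elif (self.container_depth is not None and self.ifloat_depth is None
                    and "ifloat" in classes):
                self.ifloat_depth = self.div_depth
        elif tag == "a" and self.ifloat_depth is not None and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag != "div":
            return
        if self.div_depth == self.ifloat_depth:
            self.ifloat_depth = None
        if self.div_depth == self.container_depth:
            self.container_depth = None
            self.container_done = True
        self.div_depth -= 1


def scrape_first_container(html):
    """Return hrefs found in ifloat divs of the first icontainer div."""
    scraper = FirstContainerScraper()
    scraper.feed(html)
    return scraper.links
```

Tracking div depth is what lets the parser tell when the first container closes, so entries in the second container are ignored.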
jed_just fancied up a tool that i've been using to test some of my work http://code.creativecommons.org/~john/21:57
*** jed_ is now known as JED321:57
nkinkadeJED3: How is this?  .... http://us3.php.net/manual/en/function.mysql-fetch-assoc.php : "does not contain CC-REL Metadata."22:05
nkinkadeOoops ... wrong URL paste. :-)22:05
nkinkadeThere it is.22:05
JED3nkinkade: it worked?22:07
nkinkadeJED3: It said: "does not contain CC-REL Metadata."22:07
nkinkadeBut that shouldn't be right for the deeds.22:07
JED3well, i don't believe the deeds have any self-referential rel=license's do they?22:08
JED3nkinkade: ^^22:09
JED3so i guess that message of "no cc-rel metadata" is a bit misleading22:10
JED3perhaps it would be better suited as "no CC rel=license link found"22:10
paroneayeaJED3: that's awesome!22:11
paroneayeait's super pretty :)22:11
paroneayeaand minimal and nice working :)22:11
JED3paroneayea: thanks!22:11
JED3this is the same thing we use on the deeds for the referer checking22:11
nkinkadeJED3: That could be, but the deeds do contain plenty of  cc-rel metadata.22:12
nkinkadeMaybe I've just misunderstood what the tool does and is for.22:12
JED3nkinkade: it's for extracting CC metadata from a page and displaying it in a human-readable form22:13
JED3try inputting "http://joi.ito.com/" as an example22:13
nkinkadeJED3: So it will only extract the metadata under certain circumstances?22:14
JED3nkinkade: it extracts everything, but will display when it's able to make assertions from a work's triples graph22:15
JED3for instance if you include cc:attributionName or cc:attributionURL on a page but are not specifying a license for that work, those 2 triples are worthless for our sake22:16
nkinkadeCool.  So it's not an all purpose cc-rel metadata extractor, but meant more for checking, for example, the marking on a site, perhaps using the chooser HTML or something similar.22:19
JED3nkinkade: correct22:27
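To make the rel=license point concrete, here is a much-simplified sketch of the first step such a checker needs: finding links marked rel="license" that point at creativecommons.org. It is not JED3's tool and does no RDFa or triple parsing; the function names and the hostname filter are assumptions made for illustration.

```python
from html.parser import HTMLParser


class RelLicenseFinder(HTMLParser):
    """Collect hrefs of <a> and <link> elements carrying rel="license"."""

    def __init__(self):
        super().__init__()
        self.license_urls = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("a", "link"):
            return
        attrs = dict(attrs)
        rels = (attrs.get("rel") or "").lower().split()
        href = attrs.get("href") or ""
        # Simplification: treat any creativecommons.org href as a CC license.
        if "license" in rels and "creativecommons.org/" in href:
            self.license_urls.append(href)


def find_license_links(html):
    """Return the CC license URLs asserted by the page, in document order."""
    finder = RelLicenseFinder()
    finder.feed(html)
    return finder.license_urls


def describe(html):
    """Summarize the result using the message wording suggested above."""
    links = find_license_links(html)
    if not links:
        return "no CC rel=license link found"
    return "license: " + ", ".join(links)
```

A page that carries cc:attributionName but no license link would fall into the first branch, which matches JED3's explanation of why the deeds produce no output.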
mralexJED3: if you really wanted to procrastinate, you could add :hover and :active states for that Scrape button ;))23:23
JED3mralex: oOo good idea23:23
mralexmmm, css323:25
JED3mralex: mmm html5+webgl http://www.youtube.com/watch?v=OxoFcyKYwr0&fmt=2223:28
mralexif webgl ever goes anywhere23:30
mralexor the zombie uprising of vrml23:30
akozakdithyramble, you're flying out of GRR right?23:34
akozakpaulproteus, do you happen to know what airport he's flying out of on the 17th? :P23:37
akozakI forgot to ask23:37
akozakpaulproteus, oh nevermind23:37
akozakhe says LAN23:37
paulproteusWow I'm Full23:51
dithyrambleakozak: at least it's not called BRR23:51

