Wednesday, 2008-06-04

*** rejon has quit IRC00:25
*** rejon has joined #cc00:50
*** ourbunny has joined #cc00:54
*** pmiller has quit IRC00:56
*** ourbunny has left #cc01:03
*** UltraMagnus has joined #cc01:07
*** presroi has joined #cc01:21
*** Yaco has quit IRC02:06
*** edward has left #cc02:45
*** hdworak has joined #cc02:49
hdworakwhat a beautiful *morning*02:49
*** presroi has quit IRC02:59
*** presroi has joined #cc03:06
*** pmiller has joined #cc03:22
*** pmiller has left #cc03:22
*** presroi has quit IRC03:51
*** UltraMagnus has quit IRC04:21
*** bheekling has joined #cc05:36
*** bheekling has quit IRC06:01
hdworakhi, bring306:19
hdworakpaulproteus: which of the license embedding methods did the old ccValidator support?06:20
hdworakpaulproteus: I assume that no RDFa06:20
*** tvol has joined #CC06:25
*** bheekling has joined #cc06:42
*** tvol has quit IRC06:50
*** rohitj has joined #cc06:56
*** bheekling has quit IRC06:56
*** bheekling has joined #cc07:04
*** grahl has joined #cc07:24
*** rohitj has quit IRC07:34
*** tvol has joined #CC07:40
*** strayd has joined #cc07:54
*** strayd has left #cc07:54
*** ankitg has joined #cc08:16
hdworakpaulproteus: what if RDF is broken (not the HTML/XHTML/RSS it is embedded in)? stop parsing?08:19
ankitgHi hdworak ... I see you still busy with badgering paulproteus with questions ... how goes?08:21
hdworakankitg: I'm trying to understand what should be done08:22
hdworakhello :)08:22
ankitgaren't we all ...08:22
ankitgAfter two sleepless nights of trying to meet this deadline, I was beat by the computational inability of our modern day computers, and I just asked my prof. for an extension and am now waiting for a reply ...08:24
hdworakwhat/which deadline?08:25
*** presroi has joined #cc08:25
hdworakfor the research project on USA patents in China?08:25
ankitgI was taking this course - an independent study ... ^^ yes that one ...08:26
ankitgand I am supposed to submit a term paper for it ... I am still extracting data coz we got it so darn late and it was to say the least in a bad format ...08:26
hdworakJPEGs in a PDF?08:28
hdworakencrypted ofc.08:29
ankitgthey had tiff versions too ... but we purchased a text version which was in a proprietary software [windows only] ... which allowed for extraction only to text ... with a limitation of 500 records a time ... and when I wrote a script to extract it, I find what, corrupt records ...08:38
ankitgoh the 500 record limit is important since there were some 350,000 + records ... so that is 716 extractions ...08:38
hdworakpaulproteus/nathany: in ccValidator there is an inaccuracy:08:39
hdworakENC_REGEX = '(<meta http-equiv="Content-Type".*?charset=)(.*?)(".*?>)'08:39
hdworakto check for the encoding in META08:39
hdworakthe problem here is that the http-equiv attribute does not have to come first (right after meta and a single space)08:40
hdworakit can be <meta   dir="ltr" (...)08:41
hdworakplus this might simply appear in a CDATA block, so using regex should be unreliable IMHO08:42
hdworakor even in a simple comment08:42
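The attribute-order and comment problems described above can be sidestepped by letting an HTML parser tokenize the markup. A minimal sketch using the standard-library `html.parser` (a swapped-in technique, not what ccValidator does); the names `MetaCharsetFinder` and `find_meta_charset` are hypothetical:

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Record the charset of the first <meta http-equiv="Content-Type"> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # Attribute order and extra whitespace no longer matter here.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("http-equiv", "").lower() != "content-type":
            return
        # The content attribute looks like "text/html; charset=utf-8".
        for part in attributes.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.strip().lower() == "charset" and value:
                self.charset = value.strip().strip("\"'")


def find_meta_charset(markup):
    """Return the declared charset, or None if no usable <meta> tag exists."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    finder.close()
    return finder.charset
```

Because the parser hands comments to a separate callback, a `<meta>` tag that only appears inside `<!-- ... -->` is not picked up.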
hdworakankitg: definitely not a nice situation, I'm sorry08:43
ankitghdworak: all's well, my supervisor is nice =)08:44
*** kristallpirat has joined #cc08:53
ankitghdworak: I don't think they are up and about yet ... it's nearing 7 am there now ...08:53
hdworakoh, so you have just woke up08:57
hdworakif u go to bed at 6am08:58
ankitghdworak: it's 7 am in SFO ... it's 22:00 hrs here ...08:59
*** sama has joined #cc09:06
hdworak4pm here :)09:07
*** BjornW has joined #cc09:09
*** tvol_ has joined #CC09:15
*** tvol has quit IRC09:15
*** bheekling has quit IRC09:26
*** sama_ has joined #cc09:52
*** sama has quit IRC09:56
hdworakwas there ever any test suite for ccValidator or ccRdf?10:13
*** BobChao has joined #cc10:40
hdworakwhat is "deprecation date" for a license?10:41
hdworakthe date after the license expires?10:41
paulproteushdworak, BTW, I'm waking up now, and I'll be in the office in 30-40m and we can talk then.10:41
paulproteushdworak, As for deprecation date - some licenses have been "retired".10:41
hdworakok, don't forget to take the bike10:42
paulproteusThey can still be used but we don't promote them anymore.10:42
hdworak"CC0" - huh????10:43
paulproteusCC Zero10:43
paulproteusA new in-development public domain dedication10:43
paulproteus fwiw10:44
hdworakok, has a Wiki page10:44
hdworakis this legal?10:44
hdworak'cause on public domain-licensed files on Wiki there's an annotation10:45
hdworakthat if under a given jurisdiction waiving all the rights is not possible10:45
paulproteusRight - there's discussion of that going on...10:45
hdworakthe author grants all possible rights permitted by law10:45
hdworakbut not everything10:45
paulproteusThat's part of what makes it "in-development".10:45
*** sama_ has quit IRC10:51
*** kristallpirat has quit IRC10:57
*** bheekling has joined #cc10:58
*** Yaco has joined #cc11:02
*** kristallpirat has joined #cc11:06
*** Yaco has quit IRC11:11
*** tvol has joined #CC11:23
*** tvol_ has quit IRC11:23
*** grahl has quit IRC11:25
*** bovinity has joined #cc11:30
*** tvol has quit IRC11:39
*** tvol has joined #CC11:39
*** tvol_ has joined #CC11:40
*** tvol has quit IRC11:40
*** tvol_ has quit IRC11:42
*** tvol has joined #CC11:42
*** stevel has joined #cc12:03
*** rejon has quit IRC12:22
*** rejon has joined #cc12:25
*** kristallpirat has quit IRC12:26
hdworakin ccRdf package, file which relies on regular expression to parse files12:37
hdworakrel_regex = re.compile('rel="meta"',12:37
hdworak"rel" is an attribute that accepts a space-separated list12:37
paulproteushdworak, Since it's pulling the data out of HTML comments, there's not a *much* better way to do it.12:38
paulproteusHe could use an SGML parser and look at the comment blocks, though.12:38
hdworakI'm talking about the <link rel="meta" stuff12:38
hdworaknot the comments12:38
paulproteusOh, I see.12:38
paulproteusWell, that sucks.12:38
paulproteusWouldn't that fail on <link rel='meta' ..>?12:38
hdworakof course it would12:39
paulproteusThe least it could do is use BeautifulSoup to identify the <link> tags.12:39
paulproteusBut this might predate BeautifulSoup.12:39
paulproteus2004 - No, BS was around back then, but maybe NY didn't know about it.12:40
hdworakw/o BS or not12:40
hdworakthis regex is insufficient12:40
hdworaknot that it's a big deal12:40
paulproteusAgreed.  I'm saying BeautifulSoup is a good way to do it instead of the regex.12:40
hdworakbut I'm just reporting a bug12:40
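A sketch of the parser-based alternative to `rel_regex`, assuming the standard-library `html.parser` in place of BeautifulSoup so the example has no dependencies. It treats `rel` as a whitespace-separated token list and accepts either quote style; `RelCollector` and `find_rel_targets` are hypothetical names, not part of ccRdf:

```python
from html.parser import HTMLParser


class RelCollector(HTMLParser):
    """Collect href values of <link>/<a> tags whose rel list has a given token."""

    def __init__(self, wanted_rel, tags=("link", "a")):
        super().__init__()
        self.wanted_rel = wanted_rel.lower()
        self.tags = tags
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        attributes = {name: (value or "") for name, value in attrs}
        # rel is a whitespace-separated list of tokens, compared case-insensitively.
        rel_tokens = attributes.get("rel", "").lower().split()
        if self.wanted_rel in rel_tokens and "href" in attributes:
            self.hrefs.append(attributes["href"])


def find_rel_targets(markup, wanted_rel):
    """Return the href of every <link> or <a> carrying wanted_rel."""
    collector = RelCollector(wanted_rel)
    collector.feed(markup)
    collector.close()
    return collector.hrefs
```

The same helper covers both `rel="meta"` on `<link>` and `rel="license"` on anchors, including forms such as `<link rel='meta' ..>` and `rel="nofollow license"`.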
hdworakas for the comments12:41
hdworakRDF in comments12:41
hdworakthat's tricky, indeed12:41
hdworakdid you ever consider RDF in CDATA sections?12:41
hdworakor isn't this a historical method?12:41
hdworak(CDATA is allowed in XHTML)12:41
paulproteusThat would escape it, though.12:41
paulproteusSo you'd see it.12:41
paulproteusWhich would be a little odd.12:42
*** nathany has joined #cc12:42
hdworakno, no12:42
hdworakno problem12:42
hdworakI'm doing a validator that honours current and deprecated methods12:43
* paulproteus nods12:43
hdworakI'm not doing a validator that parses every trick a human can think of12:43
paulproteusSo right - that's not a historical method. (-:12:43
hdworakwhen I'm writing XHTML, I try to come up with a nice code12:44
hdworakI'm not writing <p>hey, unescaped >>>> </p>12:44
hdworakbecause I simply can12:44
hdworakI've posted some questions while you were sleeping12:45
hdworakpaulproteus: which of the license embedding methods did the old ccValidator support?12:45
hdworakpaulproteus: what if RDF is broken (not the HTML/XHTML/RSS it is embedded in)? stop parsing?12:46
hdworakpaulproteus/nathany: in ccValidator there is an inaccuracy:12:46
hdworakENC_REGEX = '(<meta http-equiv="Content-Type".*?charset=)(.*?)(".*?>)'12:46
hdworakto check for the encoding in META; the problem here is that the http-equiv attribute does not have to come first (right after meta and a single space); it can be <meta   dir="ltr" (...)12:46
hdworakwith regex the basic problem are comments and CDATA IMHO12:47
hdworak'cause they are captured, too12:47
nathanyhdworak: good point12:47
hdworakhistorically, we should capture RDF inside comments12:47
hdworakoh, nathany12:47
hdworakhi there12:47
hdworakbut we never capture CDATA12:47
nathanymy suggestion is to fix it in your new implementation ;)12:47
hdworakthese days I'm analysing what can I use for your software12:48
hdworaknathany: which of the license embedding methods did the old ccValidator support?12:48
nathanyRDF-in-a-comment, linked RDF, meta tags12:48
hdworakw/o data: URI ?12:49
nathanyi think the data: URI was suggested by someone after the initial validator was done12:49
hdworakI haven't seen this in the source code12:49
nathanyCC never gave out data: URIs or encouraged their use12:49
hdworakoh, the same bug with rel_regex = re.compile('rel="license"',12:49
hdworakfor anchor elements12:49
paulproteushdworak, If you find that bugfixing nathany's code is a useful thing to do, then by all means go down that path.12:50
* hdworak notes to himself: include that in unit tests12:50
hdworakI mean... which of these projects are still active?12:51
hdworakand which are dead?12:51
paulproteusI think none have been maintained since the (C) date listed at the top.12:51
hdworakSTART_TAG = '<rdf:rdf'12:51
hdworakrdf is a reserved namespace as far as I recall12:52
hdworakpoints out to
paulproteusI don't recall it being reserved, but I don't know for sure.12:53
paulproteusI can read that link.12:53
nathanyreserved is  inaccurate; it does point to that URI, though, IIRC12:53
nathanybtw, hdworak, i have a couple things that i have to focus on today so please don't alert my IRC nick unless you actually need my input12:53
hdworakok, sorry, I was reporting bugs I found in your software12:54
paulproteushdworak, I'll handle them for now (-:12:54
nathanyand that's awesome12:54
paulproteusI committed a FIXME comment to that one you mentioned earlier.12:54
paulproteusIf you like, you're welcome to have commit access.12:54
nathanyi'm just saying i don't need to do it immediately :)12:54
nathanyemail, via paulproteus, etc is fine :)12:54
hdworakbut is it legal to do xmlns:foo=""12:55
nathanyhdworak: sure12:55
hdworakand then use <foo:rdf12:55
paulproteusJust superfluous.12:55
nathanyi make a mistake in a lot of that old code that's pretty common -- ignoring namespaces12:55
hdworakand it's perfectly valid RDF for describing a license among other things?12:55
nathanywe should probably warn about it since lots of people parse RDF in a namespace ignorant way12:56
nathanybut it's legal, yes12:56
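To illustrate the namespace point: a namespace-aware parser matches on the namespace URI, so the prefix (`rdf`, `foo`, or anything else) is irrelevant. A minimal sketch with the standard-library `xml.etree.ElementTree`; `find_rdf_roots` is a hypothetical helper, and the sample documents in the usage note are made up for illustration:

```python
import xml.etree.ElementTree as ET

# The RDF syntax namespace URI; the prefix bound to it is arbitrary.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_rdf_roots(xml_text):
    """Return every RDF element in the RDF namespace, whatever its prefix."""
    root = ET.fromstring(xml_text)
    wanted = "{%s}RDF" % RDF_NS
    # iter() walks the root element and all of its descendants.
    return [element for element in root.iter() if element.tag == wanted]
```

With this, `<foo:RDF xmlns:foo="http://www.w3.org/1999/02/22-rdf-syntax-ns#">` is found, while an element written `rdf:RDF` but bound to some other URI is not, which is the opposite of what a string search for `<rdf:rdf` does.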
*** nathany is now known as nathany_focused12:56
paulproteusThat was my server for most of Tuesday.12:56
hdworakgetEncoding in (ccValidator) ignored the means of expressing the character encoding that has the highest priority - HTTP header12:57
ankitgheh ... console shuts up ...12:57
paulproteusYeah, zing.12:57
hdworak(when we are considering remote files, not direct input)12:57
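A sketch of the priority order being described, where the HTTP `Content-Type` header wins over anything declared inside the document. The function names and the `default` fallback value are assumptions for illustration, not ccValidator's API:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    return message.get_content_charset()


def choose_encoding(http_content_type=None, declared_in_document=None,
                    default="utf-8"):
    """Pick an encoding: HTTP header first, then the in-document declaration."""
    if http_content_type:
        charset = charset_from_content_type(http_content_type)
        if charset:
            return charset
    if declared_in_document:
        return declared_in_document.lower()
    # Hypothetical fallback; a real validator may prefer to report an error.
    return default
```

For direct input there is no header, so `http_content_type` stays `None` and only the in-document declaration is consulted.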
*** Mihai` has joined #cc12:57
hdworakhtml_quote - what's the point of escaping (as entities) quotes and apostrophes?12:59
hdworak(quotes are actually escaped here)12:59
nathany_focusedpaulproteus: your email forward has changed; please confirm that you received my confirmation msg as expected12:59
paulproteusnathany_focused, Got it, thanks.12:59
hdworakwhat is magnet (magnetic)?13:00
hdworakmag_regex = re.compile('urn:sha1:[a-zA-Z0-9]*')13:00
hdworakis this an eMule link?13:00
paulproteus ?13:00
paulproteusSeems to be
hdworakah, it's KaZaA, not eMule13:01
hdworakthanks for the link13:01
paulproteushdworak, Agreed, that getEncoding thing is a bug.13:01
hdworakin autolink we have13:02
hdworakre.compile('[a-z]*://[^ \t\n\r\f\v<>"]*')13:02
hdworakThe scheme name consist of a letter followed by any combination of letters, digits, and the plus ("+"), period ("."), or hyphen ("-") characters; and is terminated by a colon (":").13:03
hdworakbut then again, we should have a plus sign not an asterisk13:03
hdworakplus double slashes are not always present, right?13:03
*** tvol_ has joined #CC13:04
hdworaklike in mailto13:04
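The scheme grammar quoted above (a letter, then letters, digits, "+", "." or "-", terminated by a colon) can be checked like this. A minimal sketch; `split_scheme` is a hypothetical helper and does not attempt the rest of what autolink does:

```python
import re

# A letter followed by any mix of letters, digits, "+", "." and "-".
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")


def split_scheme(uri):
    """Return (scheme, rest) if uri starts with a valid scheme, else None."""
    scheme, separator, rest = uri.partition(":")
    if separator and SCHEME_RE.fullmatch(scheme):
        return scheme.lower(), rest
    return None
```

The double slash is deliberately not required, so `mailto:` addresses pass, and an empty scheme such as `://host` is rejected, which the `[a-z]*` in the original pattern would accept.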
*** tvol has quit IRC13:04
hdworakso big thing is, how are we going to handle comments and CDATA13:06
paulproteusCDATA is part of none of the specs we've published, I think.13:06
hdworakI suggest stripping all CDATA right in the first parse13:06
hdworak'cause they were not a part of the specification?13:06
paulproteusOkay, sure - but can't you just ignore them instead of removing them?13:07
hdworakbut wait, they can be inside RDFa stuff, for instance13:07
paulproteusIt's no big deal, I'm just curious.13:07
paulproteusFeel free to say, "Whatever, not worth arguing about".13:07
hdworakno, actually we're both wrong imho13:07
paulproteusWhy's that?13:07
hdworakjust a sec, I'm gonna find an example of RDFa13:08
paulproteusBTW, can we communicate more in the form of code and less in the form of IRC?13:08
paulproteusI guess we'll get to that eventually.13:08
paulproteusBut this is the reason I like tests, you see.13:08
paulproteusWhile this *is* logged, I don't find IRC logs particularly easy to digest afterwards anyway.13:09
paulproteusWhich is why I want to make sure what we discuss here goes onto a wiki page.13:09
hdworakeverything relevant goes into unit tests and on the Wiki13:09
hdworakdo not worry about that13:09
paulproteusOkay, great.13:09
hdworakbut I'm not gonna post on Wiki one-sentence questions13:10
hdworak'cause I think it's IRC that should be used for that13:10
hdworakbut the OUTCOMES of these questions - sure13:10
paulproteusThat makes good sense.13:10
hdworakother way, cc Wiki turns into a phpBB forum13:10
paulproteusThe nice thing about writing the questions on the wiki is I can answer them in batch.13:11
paulproteusBut that's not so important, and besides, that makes me more likely to put it off. (-;13:11
hdworakin RDFa, is the actual TEXT (not attribute values) ever used? (in terms of parsing for license)13:11
paulproteusThe text inside a tag, you mean?13:12
paulproteusLike "text" in <tag>text</tag> ?13:12
hdworak<a rel="license" href="">and this????</a>13:12
paulproteusNot for rel="license", I think, but it can be for other RDFa things.13:12
hdworakok, so then CDATA will affect RDFa, too13:12
hdworakbut it will 100% sure affect RDF13:12
hdworakthis is why we cannot ignore it or remove it13:13
paulproteusRight, so just treat it normally.13:13
hdworakbecause in RDF the object does matter13:13
paulproteusIgnore it when it's irrelevant, and don't when it's not. (-;13:13
hdworakand object can be in CDATA13:13
hdworaklike the example I extracted from the data: URI by na-tha-ny yesterday13:14
paulproteusYou can write "NY" for short.13:15
hdworakit has <dc:title>Compilers in the Key of C</dc:title>13:15
hdworakbut it (="Compilers in the Key of C") could easily be in CDATA as well13:15
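A small check of the CDATA point, assuming an XML parser does the reading: the parser hands back the same text whether the title is written plainly or wrapped in a CDATA section, so there is nothing to strip or ignore. The `<work>` wrapper element and `get_title` are made up for this sketch:

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def get_title(xml_text):
    """Return the text of the first dc:title child of the root, or None."""
    root = ET.fromstring(xml_text)
    title = root.find("{%s}title" % DC_NS)
    return None if title is None else title.text


PLAIN = ('<work xmlns:dc="http://purl.org/dc/elements/1.1/">'
         '<dc:title>Compilers in the Key of C</dc:title></work>')
CDATA = ('<work xmlns:dc="http://purl.org/dc/elements/1.1/">'
         '<dc:title><![CDATA[Compilers in the Key of C]]></dc:title></work>')

# Both spellings yield the same object text.
print(get_title(PLAIN) == get_title(CDATA))
```

This only holds for content parsed as XML; RDF pulled out of an HTML comment with a regex would still need to be passed through an XML parser first.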
hdworakso then again, were there ever any unit tests for ccValidator and related?13:16
hdworakso that I could use them now13:16
paulproteusI don't think so. )-:13:16
hdworakok :)13:16
hdworakI've been looking for software for feed parsing13:16
hdworakand I've stumbled upon the feedparser by Mark Pilgrim13:17
paulproteusWidely-used, well-maintained as I recall.13:17
hdworakthe documentation says it allows extracting license info13:17
paulproteusSo BTW, I never actually managed to get to the office.  I should probably take 15m and go do that.13:17
hdworakbut there are two problems with it:13:17
paulproteus(Let me know when you get to an ok stopping point)13:18
*** ajbrooks has quit IRC13:18
hdworak(btw relevant part of the doc is )13:18
hdworakproblem #1: the author says there are 3000 unit tests for that. unfortunately, the word "license" cannot be found in any of them13:19
paulproteus(I'm sure he'd say, "Then submit one!")13:20
paulproteus(But that's pretty surprising.)13:20
hdworakproblem #2: it ignores ATOM means of expressing a license and xmlns:creativeCommons for RSS 2.013:21
hdworakif he would add that, I could use his code for feeds, huh?13:21
paulproteusThat's probably true, but you might do well to implement that logic yourself since it's so easy.13:22
paulproteusIt would be good to submit a patch + unit test for feedparser, though!13:23
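A sketch of how small that logic could be, assuming the Atom form is `<link rel="license" href="..."/>` and the RSS 2.0 form is a `creativeCommons:license` element from the Creative Commons RSS module. `find_feed_licenses` is a hypothetical function, not part of feedparser:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
CC_RSS_NS = "http://backend.userland.com/creativeCommonsRssModule"


def find_feed_licenses(feed_xml):
    """Return license URLs found in an Atom or RSS 2.0 feed, in document order."""
    root = ET.fromstring(feed_xml)
    licenses = []
    for element in root.iter():
        if element.tag == "{%s}link" % ATOM_NS:
            # Atom: the license URL is the href of a rel="license" link.
            if element.get("rel") == "license" and element.get("href"):
                licenses.append(element.get("href"))
        elif element.tag == "{%s}license" % CC_RSS_NS:
            # RSS 2.0: the license URL is the element's text content.
            text = (element.text or "").strip()
            if text:
                licenses.append(text)
    return licenses
```

The function does not distinguish feed-level from entry-level licenses; a validator would probably want to keep track of which element each URL was attached to.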
hdworakis this namespace deprecated
paulproteusSo you would do well to point out its use and suggest a move.13:27
*** tvol_ has quit IRC13:32
*** tvol has joined #CC13:34
*** tvol has quit IRC13:39
*** tvol has joined #CC13:39
*** BobChao has quit IRC13:57
paulproteustt y'all later, lunchtime13:57
*** tvol has quit IRC14:08
*** sama has joined #cc14:08
*** tvol has joined #CC14:10
*** Yaco has joined #cc14:12
*** tvol has quit IRC14:27
*** presroi has quit IRC14:45
*** presroi has joined #cc14:50
*** ankitg is now known as ankitg|away14:52
*** presroi has quit IRC15:10
*** sama has quit IRC15:16
* paulproteus waves.15:32
hdworakback from lunch?15:32
*** ajbrooks has joined #cc15:38
hdworakok, so I should annotate with FIXME whenever I see a bug?15:39
hdworakbut are those cc* packages at the git repository?15:39
paulproteusYou probably don't have commit access to the Subversion repository.15:39
paulproteusI can put your key in there, just a sec.15:39
*** tie has joined #cc15:40
*** tie has left #cc15:41
hdworakah, it's on the SVN15:41
paulproteusWhat username do you want to show up as in the logs?15:42
paulproteushugo, or hdworak, or something else?15:42
hdworakHugo Dworak15:43
hdworakisn't possible?15:43
paulproteusWell, this is the "UNIX username" field.15:43
paulproteusI don't think spaces are allowed, and everyone else's name is lowercase.15:43
hdworakhdworak then :)15:43
hdworakno need to make exceptions15:43
paulproteusAnd BTW, in addition to "fixme" it would probably be good to write (1) your name and (2) a long description of the bug, and maybe also (3) the date in ISO date format (e.g. 2008-06-04) that you added the remark.15:44
paulproteusYou can now check out things using the svn+ssh access described at
paulproteusAnd commit too.15:45
paulproteusUsing either of the keys that have access to git.15:45
paulproteusnathany_focused, If you have a sec to set up another API install on a8, that'd be great.15:45
hdworakokay, thanks15:46
nathany_focusedpaulproteus: i don't at this moment, trying to get NO launched15:46
paulproteusCarry on then!15:46
paulproteusnathany_focused, Would now be an okay time for me to transition the a8 sites back onto a8, or does that conflict with your NO launch?15:54
nathany_focusedshould be fine15:54
hdworakgood night :)15:57
*** hdworak has quit IRC15:57
*** pmiller has joined #cc17:28
*** pmiller has quit IRC17:36
*** pmiller has joined #cc17:39
*** Mihai` has quit IRC17:52
*** nathany_focused has quit IRC18:30
*** UltraMagnus has joined #cc18:55
*** ajbrooks has quit IRC19:04
*** bovinity has quit IRC19:30
*** ankitg|away is now known as ankitg19:40
*** rejon has quit IRC19:45
*** rejon has joined #cc19:48
*** rejon has quit IRC19:57
*** rejon has joined #cc20:00
*** UltraMagnus has quit IRC20:20
*** stevel has quit IRC20:35
*** ajbrooks has joined #cc20:38
*** jgay has joined #cc20:44
*** UncleCJ has quit IRC21:10
*** UncleCJ has joined #cc21:10
*** jgay has quit IRC21:21
*** ankitg has quit IRC22:20
