Wednesday, 2008-06-04

*** rejon has quit IRC		00:25
*** rejon has joined #cc		00:50
*** ourbunny has joined #cc		00:54
*** pmiller has quit IRC		00:56
*** ourbunny has left #cc		01:03
*** UltraMagnus has joined #cc		01:07
*** presroi has joined #cc		01:21
*** Yaco has quit IRC		02:06
*** edward has left #cc		02:45
*** hdworak has joined #cc		02:49
hdworak	hello	02:49
hdworak	what a beautiful morning	02:49
*** presroi has quit IRC		02:59
*** presroi has joined #cc		03:06
*** pmiller has joined #cc		03:22
*** pmiller has left #cc		03:22
*** presroi has quit IRC		03:51
*** UltraMagnus has quit IRC		04:21
*** bheekling has joined #cc		05:36
*** bheekling has quit IRC		06:01
bring3	hello	06:17
hdworak	hi, bring3	06:19
hdworak	paulproteus: which of the license embedding methods did the old ccValidator support?	06:20
hdworak	paulproteus: I assume that no RDFa	06:20
*** tvol has joined #CC		06:25
*** bheekling has joined #cc		06:42
*** tvol has quit IRC		06:50
*** rohitj has joined #cc		06:56
*** bheekling has quit IRC		06:56
*** bheekling has joined #cc		07:04
*** grahl has joined #cc		07:24
*** rohitj has quit IRC		07:34
*** tvol has joined #CC		07:40
*** strayd has joined #cc		07:54
*** strayd has left #cc		07:54
*** ankitg has joined #cc		08:16
hdworak	paulproteus: what if RDF is broken (not the HTML/XHTML/RSS it is embedded in)? stop parsing?	08:19
ankitg	Hi hdworak ... I see you still busy with badgering paulproteus with questions ... how goes?	08:21
hdworak	ankitg: I'm trying to understand what should be done	08:22
hdworak	hello :)	08:22
ankitg	aren't we all ...	08:22
hdworak	sup?	08:22
ankitg	After two sleepless nights of trying to met this deadline, I was beat by the computational inability of our modern day computers, and I just asked my prof. for an extension and am now waiting for a reply ...	08:24
ankitg	*s/met/meet	08:24
hdworak	what/which deadline?	08:25
*** presroi has joined #cc		08:25
hdworak	for the research project on USA patents in China?	08:25
ankitg	I was taking this course - an independent study ... ^^ yes that one ...	08:26
ankitg	and I am supposed to submit a term paper for it ... I am still extracting data coz we got it so darn late and it was to say the least in a bad format ...	08:26
hdworak	JPEGs in a PDF?	08:28
hdworak	encrypted ofc.	08:29
hdworak	;)	08:31
ankitg	they had tiff versions too ... but we purchased a text version which was in a proprietary software [windows only] ... which allowed for extraction only to text ... with a limitation of 500 records a time ... and when I wrote a script to extract it, I find what, corrupt records ...	08:38
ankitg	oh the 500 record limit is important since there were some 350,000 + records ... so that is 716 extractions ...	08:38
hdworak	paulproteus/nathany: in ccValidator in support.py there is an inaccuracy:	08:39
hdworak	ENC_REGEX = '(<meta http-equiv="Content-Type".*?charset=)(.*?)(".*?>)'	08:39
hdworak	to check for the encoding in META	08:39
hdworak	the problem here is that the http-equiv attribute does not have to come first (right after meta and a single space)	08:40
hdworak	it can be <meta dir="ltr" (...)	08:41
hdworak	plus this might simply appear in a CDATA block, so using regex should be unreliable IMHO	08:42
hdworak	or even in a simple comment	08:42
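
A minimal sketch of the parser-based alternative to the encoding regex discussed above, assuming Python's standard html.parser is acceptable. The names MetaCharsetFinder and find_meta_charset are hypothetical and are not part of ccValidator. Because the parser reports attributes as a set of name/value pairs, attribute order no longer matters, and markup inside an HTML comment is not reported as a start tag.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remembers the first charset declared via <meta http-equiv="Content-Type">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        # Attribute names arrive lowercased; order in the source is irrelevant.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("http-equiv", "").lower() != "content-type":
            return
        # The content attribute looks like "text/html; charset=utf-8".
        for part in attr_map.get("content", "").split(";"):
            key, _, value = part.strip().partition("=")
            if key.lower() == "charset" and value:
                self.charset = value.strip().lower()


def find_meta_charset(markup):
    """Return the declared charset in lowercase, or None if none is found."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    finder.close()
    return finder.charset
```

With this approach a tag such as <meta dir="ltr" ... http-equiv="Content-Type"> is handled the same way as one where http-equiv comes first.
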
hdworak	ankitg: definitely not a nice situation, I'm sorry	08:43
ankitg	hdworak: all's well, my supervisor is nice =)	08:44
*** kristallpirat has joined #cc		08:53
ankitg	hdworak: I don't think they are up and about yet ... it's nearing 7 am there now ...	08:53
hdworak	oh, so you have just woken up	08:57
hdworak	if u go to bed at 6am	08:58
ankitg	hdworak: it's 7 am in SFO ... it's 22:00 hrs here ...	08:59
*** sama has joined #cc		09:06
hdworak	4pm here :)	09:07
*** BjornW has joined #cc		09:09
*** tvol_ has joined #CC		09:15
*** tvol has quit IRC		09:15
*** bheekling has quit IRC		09:26
*** sama_ has joined #cc		09:52
*** sama has quit IRC		09:56
hdworak	was there ever any test suite for ccValidator or ccRdf?	10:13
*** BobChao has joined #cc		10:40
hdworak	what is "deprecation date" for a license?	10:41
hdworak	http://wiki.creativecommons.org/images/7/73/20080415_OpenWeb_2008.pdf	10:41
hdworak	the date after the license expires?	10:41
paulproteus	hdworak, BTW, I'm waking up now, and I'll be in the office in 30-40m and we can talk then.	10:41
paulproteus	hdworak, As for deprecation date - some licenses have been "retired".	10:41
hdworak	ok, don't forget to take the bike	10:42
paulproteus	They can still be used but we don't promote them anymore.	10:42
paulproteus	http://creativecommons.org/retiredlicenses	10:42
hdworak	"CC0" - huh????	10:43
paulproteus	CC Zero	10:43
paulproteus	A new in-development public domain dedication	10:43
paulproteus	http://wiki.creativecommons.org/CC0 fwiw	10:44
hdworak	ok, has a Wiki page	10:44
hdworak	is this legal?	10:44
paulproteus	Huh?	10:45
hdworak	'cause on public domain-licensed files on Wiki there's an annotation	10:45
hdworak	that if under a given jurisdiction waiving all the rights is not possible	10:45
paulproteus	Right - there's discussion of that going on...	10:45
hdworak	the author grants all possible rights permitted by law	10:45
hdworak	but not everything	10:45
paulproteus	That's part of what makes it "in-development".	10:45
*** sama_ has quit IRC		10:51
*** kristallpirat has quit IRC		10:57
*** bheekling has joined #cc		10:58
*** Yaco has joined #cc		11:02
*** kristallpirat has joined #cc		11:06
*** Yaco has quit IRC		11:11
*** tvol has joined #CC		11:23
*** tvol_ has quit IRC		11:23
*** grahl has quit IRC		11:25
*** bovinity has joined #cc		11:30
*** tvol has quit IRC		11:39
*** tvol has joined #CC		11:39
*** tvol_ has joined #CC		11:40
*** tvol has quit IRC		11:40
*** tvol_ has quit IRC		11:42
*** tvol has joined #CC		11:42
*** stevel has joined #cc		12:03
*** rejon has quit IRC		12:22
*** rejon has joined #cc		12:25
*** kristallpirat has quit IRC		12:26
hdworak	in ccRdf package, file rdfextract.py which relies on regular expression to parse files	12:37
hdworak	rel_regex = re.compile('rel="meta"',	12:37
hdworak	"rel" is an attribute that accepts a space-separated list	12:37
paulproteus	hdworak, Since it's pulling the data out of HTML comments, there's not a much better way to do it.	12:38
paulproteus	He could use an SGML parser and look at the comment blocks, though.	12:38
hdworak	I'm talking about the <link rel="meta" stuff	12:38
hdworak	not the comments	12:38
paulproteus	Oh, I see.	12:38
paulproteus	Well, that sucks.	12:38
paulproteus	Wouldn't that fail on <link rel='meta' ..>?	12:38
hdworak	of course it would	12:39
hdworak	http://www.google.com/codesearch?hl=pl&q=show:izjB3CA13SM:7wGNlOa5HrM:WfIWOsJXViE&sa=N&ct=rd&cs_p=http://mirrors.creativecommons.org/software/ccrdf/download/ccrdf-0.5.0.dev-r669.tar.gz&cs_f=ccrdf-0.5.0.dev-r669/src/ccrdf/rdfextract.py&start=1	12:39
paulproteus	The least it could do is use BeautifulSoup to identify the <link> tags.	12:39
paulproteus	But this might predate BeautifulSoup.	12:39
paulproteus	2004 - No, BS was around back then, but maybe NY didn't know about it.	12:40
hdworak	w/o BS or not	12:40
hdworak	this regex is insufficient	12:40
hdworak	not that it's a big deal	12:40
paulproteus	Agreed. I'm saying BeautifulSoup is a good way to do it instead of the regex.	12:40
hdworak	but I'm just reporting a bug	12:40
hdworak	correct	12:40
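
A rough sketch of the point about rel being a space-separated list, and about single-quoted attributes. paulproteus suggests BeautifulSoup; the sketch below swaps in the standard library's html.parser only to stay self-contained. RelLinkCollector and find_rel_links are hypothetical names, not code from ccRdf.

```python
from html.parser import HTMLParser


class RelLinkCollector(HTMLParser):
    """Collects href values of <link>/<a> elements whose rel list has a given token."""

    def __init__(self, wanted_rel, tags=("link", "a")):
        super().__init__()
        self.wanted_rel = wanted_rel.lower()
        self.tags = tags
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.tags:
            return
        attr_map = {name: (value or "") for name, value in attrs}
        # rel holds a space-separated list of tokens, so compare token by token.
        rel_tokens = attr_map.get("rel", "").lower().split()
        if self.wanted_rel in rel_tokens and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


def find_rel_links(markup, wanted_rel):
    """Return hrefs of matching elements in document order."""
    collector = RelLinkCollector(wanted_rel)
    collector.feed(markup)
    collector.close()
    return collector.hrefs
```

The same helper would cover both the rel="meta" case on link elements and a rel="license" check on anchors, since the quoting style and the position of the token in the list no longer matter.
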
hdworak	as for the comments	12:41
hdworak	RDF in comments	12:41
hdworak	that's tricky, indeed	12:41
hdworak	did you ever consider RDF in CDATA sections?	12:41
hdworak	or isn't this a historical method?	12:41
paulproteus	Interesting.	12:41
hdworak	(CDATA is allowed in XHTML)	12:41
paulproteus	That would escape it, though.	12:41
paulproteus	So you'd see it.	12:41
paulproteus	Which would be a little odd.	12:42
*** nathany has joined #cc		12:42
hdworak	no, no	12:42
hdworak	no problem	12:42
hdworak	I'm doing a validator that honours current and deprecated methods	12:43
* paulproteus nods		12:43
hdworak	I'm not doing a validator that parses every trick a human can think of	12:43
paulproteus	So right - that's not a historical method. (-:	12:43
hdworak	:)	12:43
hdworak	when I'm writing XHTML, I try to come up with a nice code	12:44
hdworak	I'm not writing <p>hey, unescaped >>>> </p>	12:44
hdworak	because I simply can	12:44
hdworak	ok	12:45
hdworak	I've posted some questions while you were sleeping	12:45
hdworak	paulproteus: which of the license embedding methods did the old ccValidator support?	12:45
hdworak	paulproteus: what if RDF is broken (not the HTML/XHTML/RSS it is embedded in)? stop parsing?	12:46
hdworak	paulproteus/nathany: in ccValidator in support.py there is an inaccuracy:	12:46
hdworak	ENC_REGEX = '(<meta http-equiv="Content-Type".*?charset=)(.*?)(".*?>)'	12:46
hdworak	to check for the encoding in META; the problem here is that the http-equiv attribute does not have to come first (right after meta and a single space); it can be <meta dir="ltr" (...)	12:46
hdworak	with regex the basic problem is comments and CDATA IMHO	12:47
hdworak	'cause they are captured, too	12:47
nathany	hdworak: good point	12:47
hdworak	historically, we should capture RDF inside comments	12:47
hdworak	oh, nathany	12:47
hdworak	hi there	12:47
hdworak	but we never capture CDATA	12:47
nathany	my suggestion is to fix it in your new implementation ;)	12:47
hdworak	these days I'm analysing what can I use for your software	12:48
hdworak	for=from	12:48
hdworak	nathany: which of the license embedding methods did the old ccValidator support?	12:48
nathany	RDF-in-a-comment, linked RDF, meta tags	12:48
nathany	(IIRC)	12:48
hdworak	w/o data: URI ?	12:49
nathany	i think the data: URI was suggested by someone after the initial validator was done	12:49
hdworak	I haven't seen this in the source code	12:49
nathany	CC never gave out data: URIs or encouraged their use	12:49
hdworak	oh, the same bug with rdfextract.py: rel_regex = re.compile('rel="license"',	12:49
hdworak	for anchor elements	12:49
paulproteus	hdworak, If you find that bugfixing nathany's code is a useful thing to do, then by all means go down that path.	12:50
* hdworak notes to himself: include that in unit tests		12:50
hdworak	I mean... which of these projects are still active?	12:51
hdworak	and which are dead?	12:51
paulproteus	I think none have been maintained since the (C) date listed at the top.	12:51
hdworak	START_TAG = '<rdf:rdf'	12:51
hdworak	rdf is a reserved namespace as far as I recall	12:52
hdworak	points to http://www.w3.org/1999/02/22-rdf-syntax-ns#	12:52
hdworak	right?	12:52
paulproteus	I don't recall it being reserved, but I don't know for sure.	12:53
paulproteus	I can read that link.	12:53
nathany	reserved is inaccurate; it does point to that URI, though, IIRC	12:53
hdworak	http://www.w3.org/TR/REC-rdf-syntax/#rdfmodel	12:53
nathany	btw, hdworak, i have a couple things that i have to focus on today so please don't alert my IRC nick unless you actually need my input	12:53
hdworak	ok, sorry, I was reporting bugs I found in your software	12:54
paulproteus	hdworak, I'll handle them for now (-:	12:54
nathany	and that's awesome	12:54
paulproteus	I committed a FIXME comment to that one you mentioned earlier.	12:54
paulproteus	If you like, you're welcome to have commit access.	12:54
nathany	i'm just saying i don't need to do it immediately :)	12:54
nathany	email, via paulproteus, etc is fine :)	12:54
hdworak	but is it legal to do xmlns:foo="http://www.w3.org/1999/02/22-rdf-syntax-ns#"	12:55
nathany	hdworak: sure	12:55
hdworak	and then use <foo:rdf	12:55
paulproteus	Just superfluous.	12:55
nathany	i make a mistake in a lot of that old code that's pretty common -- ignoring namespaces	12:55
hdworak	and it's perfectly valid RDF for describing a license among other things?	12:55
nathany	we should probably warn about it since lots of people parse RDF in a namespace ignorant way	12:56
nathany	but it's legal, yes	12:56
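
To illustrate the namespace point, here is a small sketch that looks for the RDF root element by namespace URI rather than by the literal prefix "rdf". It assumes the RDF block has already been extracted as well-formed XML; find_rdf_roots is a hypothetical helper, not something from the old code.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def find_rdf_roots(xml_text):
    """Return every element named RDF in the RDF namespace, whatever its prefix."""
    root = ET.fromstring(xml_text)
    # ElementTree expands prefixes to "{namespace-uri}localname",
    # so <foo:RDF> and <rdf:RDF> compare equal when bound to the same URI.
    return list(root.iter("{%s}RDF" % RDF_NS))
```

A document that binds xmlns:foo to the RDF namespace and uses <foo:RDF> is found, while an element that merely uses the prefix "rdf" bound to some other URI is not.
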
*** nathany is now known as nathany_focused		12:56
paulproteus	BTW: http://paulproteus.acm.jhu.edu/screenshot1.jpg	12:56
paulproteus	That was my server for most of Tuesday.	12:56
hdworak	getEncoding in support.py (ccValidator) ignored the means of expressing the character encoding that has the highest priority - HTTP header	12:57
ankitg	heh ... console shuts up ...	12:57
paulproteus	Yeah, zing.	12:57
hdworak	(when we are considering remote files, not direct input)	12:57
*** Mihai` has joined #cc		12:57
hdworak	html_quote - what's the point of escaping (as entities) quotes and apostrophes?	12:59
hdworak	(quotes are actually escaped here)	12:59
nathany_focused	paulproteus: your email forward has changed; please confirm that you received my confirmation msg as expected	12:59
paulproteus	nathany_focused, Got it, thanks.	12:59
hdworak	what is magnet (magnetic)?	13:00
hdworak	mag_regex = re.compile('urn:sha1:[a-zA-Z0-9]*')	13:00
hdworak	is this an eMule link?	13:00
hdworak	:)	13:00
paulproteus	http://www.magnetlink.org/ ?	13:00
paulproteus	Seems to be http://en.wikipedia.org/wiki/Magnet:_URI_scheme	13:00
hdworak	ah, it's KaZaA, not eMule	13:01
hdworak	thanks for the link	13:01
paulproteus	hdworak, Agreed, that getEncoding thing is a bug.	13:01
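
A sketch of the priority order hdworak describes for remote files: the charset from the HTTP Content-Type header wins, and a charset declared in the document is only a fallback. The function names and the "utf-8" default are assumptions made for illustration and do not come from support.py.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not header_value:
        return None
    message = Message()
    message["Content-Type"] = header_value
    # Returns the charset in lowercase, or None when the parameter is absent.
    return message.get_content_charset()


def pick_encoding(http_content_type, meta_charset, default="utf-8"):
    """Prefer the HTTP header, then the in-document declaration, then a default."""
    return charset_from_content_type(http_content_type) or meta_charset or default
```

For direct input there is no HTTP header, so passing None as the first argument falls through to the in-document declaration.
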
hdworak	in autolink we have	13:02
hdworak	re.compile('[a-z]*://[^ \t\n\r\f\v<>"]*')	13:02
hdworak	http://en.wikipedia.org/wiki/URI_scheme	13:03
hdworak	The scheme name consists of a letter followed by any combination of letters, digits, and the plus ("+"), period ("."), or hyphen ("-") characters; and is terminated by a colon (":").	13:03
hdworak	but then again, we should have a plus sign not an asterisk	13:03
hdworak	plus double slashes are not always present, right?	13:03
paulproteus	Agreed.	13:04
*** tvol_ has joined #CC		13:04
hdworak	like in mailto	13:04
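
The scheme rule quoted above can be sketched as a regular expression: one letter, then any mix of letters, digits, "+", "." and "-", ended by a colon, with no requirement for "//". SCHEME_RE and uri_scheme are hypothetical names and this is not the autolink code itself.

```python
import re

# One leading letter, then letters, digits, "+", "." or "-", then the colon.
SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.\-]*:")


def uri_scheme(candidate):
    """Return the lowercased scheme of a URI-like string, or None if there is none."""
    match = SCHEME_RE.match(candidate)
    if match is None:
        return None
    return match.group(0)[:-1].lower()
```

Because the pattern stops at the colon, it accepts mailto: addresses as well as schemes followed by a double slash.
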
*** tvol has quit IRC		13:04
hdworak	so big thing is, how are we going to handle comments and CDATA	13:06
paulproteus	CDATA is part of none of the specs we've published, I think.	13:06
hdworak	I suggest stripping all CDATA right in the first parse	13:06
paulproteus	Why?	13:06
hdworak	'cause they were not a part of the specification?	13:06
paulproteus	Okay, sure - but can't you just ignore them instead of removing them?	13:07
hdworak	but wait, they can be inside RDFa stuff, for instance	13:07
paulproteus	It's no big deal, I'm just curious.	13:07
paulproteus	Feel free to say, "Whatever, not worth arguing about".	13:07
hdworak	no, actually we're both wrong imho	13:07
paulproteus	Why's that?	13:07
hdworak	just a sec, I'm gonna find an example of RDFa	13:08
paulproteus	BTW, can we communicate more in the form of code and less in the form of IRC?	13:08
paulproteus	I guess we'll get to that eventually.	13:08
paulproteus	But this is the reason I like tests, you see.	13:08
paulproteus	While this is logged, I don't find IRC logs particularly easy to digest afterwards anyway.	13:09
paulproteus	Which is why I want to make sure what we discuss here goes onto a wiki page.	13:09
hdworak	everything relevant goes into unit tests and on the Wiki	13:09
hdworak	do not worry about that	13:09
paulproteus	Okay, great.	13:09
hdworak	but I'm not gonna post on Wiki one-sentence questions	13:10
hdworak	'cause I think it's IRC that should be used for that	13:10
hdworak	but the OUTCOMES of these questions - sure	13:10
paulproteus	That makes good sense.	13:10
hdworak	otherwise, cc Wiki turns into a phpBB forum	13:10
paulproteus	The nice thing about writing the questions on the wiki is I can answer them in batch.	13:11
paulproteus	But that's not so important, and besides, that makes me more likely to put it off. (-;	13:11
hdworak	in RDFa, is the actual TEXT (not attribute values) ever used? (in terms of parsing for license)	13:11
hdworak	?	13:11
paulproteus	The text inside a tag, you mean?	13:12
paulproteus	Like "text" in <tag>text</tag> ?	13:12
hdworak	<a rel="license" href="http://this.is.very.important">and this????</a>	13:12
hdworak	yes	13:12
paulproteus	Not for rel="license", I think, but it can be for other RDFa things.	13:12
hdworak	ok, so then CDATA will affect RDFa, too	13:12
hdworak	but it will 100% sure affect RDF	13:12
hdworak	this is why we cannot ignore it or remove it	13:13
paulproteus	Right, so just treat it normally.	13:13
hdworak	because in RDF the object does matter	13:13
paulproteus	Ignore it when it's irrelevant, and don't when it's not. (-;	13:13
hdworak	and object can be in CDATA	13:13
hdworak	like the example I extracted from the data: URI by na-tha-ny yesterday	13:14
hdworak	http://pastebin.com/f143f5fdf	13:14
paulproteus	You can write "NY" for short.	13:15
hdworak	it has <dc:title>Compilers in the Key of C</dc:title>	13:15
paulproteus	Ya.	13:15
hdworak	but it (="Compilers in the Key of C") could easily be in CDATA as well	13:15
paulproteus	Agreed.	13:15
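
A small sketch of why CDATA can simply be "treated normally" once an XML parser is used: the parser hands back the same text whether or not the title is wrapped in a CDATA section. The sample documents are made up around the dc:title from the conversation, and titles is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def titles(xml_text):
    """Return the text of every dc:title element in the document."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("{%s}title" % DC_NS)]


PLAIN = (
    '<Work xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Compilers in the Key of C</dc:title></Work>"
)
WITH_CDATA = (
    '<Work xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title><![CDATA[Compilers in the Key of C]]></dc:title></Work>"
)
```

Both sample documents yield the same single title, so the validator would not need a separate code path for CDATA in the RDF object.
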
hdworak	so then again, were there ever any unit tests for ccValidator and related?	13:16
hdworak	so that I could use them now	13:16
paulproteus	I don't think so. )-:	13:16
hdworak	ok :)	13:16
hdworak	I've been looking for software for feed parsing	13:16
hdworak	and I've stumbled upon the feedparser by Mark Pilgrim	13:17
paulproteus	Widely-used, well-maintained as I recall.	13:17
hdworak	the documentation says it allows extracting license info	13:17
paulproteus	So BTW, I never actually managed to get to the office. I should probably take 15m and go do that.	13:17
hdworak	but there are two problems with it:	13:17
paulproteus	(Let me know when you get to an ok stopping point)	13:18
*** ajbrooks has quit IRC		13:18
hdworak	(btw relevant part of the doc is http://www.feedparser.org/docs/reference-feed-license.html )	13:18
hdworak	problem #1: the author says there are 3000 unit tests for that. unfortunately, the word "license" cannot be found in any of them	13:19
paulproteus	(I'm sure he'd say, "Then submit one!")	13:20
paulproteus	(But that's pretty surprising.)	13:20
hdworak	problem #2: it ignores ATOM means of expressing a license and xmlns:creativeCommons for RSS 2.0	13:21
hdworak	if he would add that, I could use his code for feeds, huh?	13:21
paulproteus	That's probably true, but you might do well to implement that logic yourself since it's so easy.	13:22
paulproteus	It would be good to submit a patch + unit test for feedparser, though!	13:23
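
A rough sketch of the "implement that logic yourself" option for feeds, using only the standard library. It assumes an Atom link element with rel="license" and the RSS 2.0 creativeCommons:license element; the two namespace URIs below are my understanding of those formats and should be checked against the specs. feed_license_urls is a hypothetical name and this is not feedparser's API.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs; verify against the Atom and creativeCommons module specs.
ATOM_NS = "http://www.w3.org/2005/Atom"
CC_RSS_NS = "http://backend.userland.com/creativeCommonsRssModule"


def feed_license_urls(feed_xml):
    """Collect license URLs declared in an Atom or RSS 2.0 feed, in document order per kind."""
    root = ET.fromstring(feed_xml)
    urls = []
    # Atom: <link rel="license" href="..."/>
    for link in root.iter("{%s}link" % ATOM_NS):
        if "license" in link.get("rel", "").split() and link.get("href"):
            urls.append(link.get("href"))
    # RSS 2.0: <creativeCommons:license>URL</creativeCommons:license>
    for element in root.iter("{%s}license" % CC_RSS_NS):
        if element.text and element.text.strip():
            urls.append(element.text.strip())
    return urls
```

A unit test built on small sample feeds like this could also serve as the starting point for a patch to feedparser.
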
hdworak	is this namespace deprecated http://web.resource.org/cc/	13:26
hdworak	?	13:26
paulproteus	Yes.	13:26
paulproteus	So you would do well to point out its use and suggest a move.	13:27
*** tvol_ has quit IRC		13:32
*** tvol has joined #CC		13:34
*** tvol has quit IRC		13:39
*** tvol has joined #CC		13:39
*** BobChao has quit IRC		13:57
paulproteus	tt y'all later, lunchtime	13:57
*** tvol has quit IRC		14:08
*** sama has joined #cc		14:08
*** tvol has joined #CC		14:10
*** Yaco has joined #cc		14:12
*** tvol has quit IRC		14:27
*** presroi has quit IRC		14:45
*** presroi has joined #cc		14:50
*** ankitg is now known as ankitg\|away		14:52
*** presroi has quit IRC		15:10
*** sama has quit IRC		15:16
* paulproteus waves.		15:32
hdworak	:)	15:32
hdworak	back from lunch?	15:32
paulproteus	Ya.	15:37
*** ajbrooks has joined #cc		15:38
hdworak	ok, so I should annotate with FIXME whenever I see a bug?	15:39
hdworak	but are those cc* packages at the git repository?	15:39
paulproteus	You probably don't have commit access to the Subversion repository.	15:39
paulproteus	I can put your key in there, just a sec.	15:39
*** tie has joined #cc		15:40
*** tie has left #cc		15:41
hdworak	ah, it's on the SVN	15:41
paulproteus	What username do you want to show up as in the logs?	15:42
paulproteus	hugo, or hdworak, or something else?	15:42
hdworak	Hugo Dworak	15:43
hdworak	isn't possible?	15:43
paulproteus	Well, this is the "UNIX username" field.	15:43
paulproteus	I don't think spaces are allowed, and everyone else's name is lowercase.	15:43
hdworak	hdworak then :)	15:43
hdworak	no need to make exceptions	15:43
paulproteus	And BTW, in addition to "fixme" it would probably be good to write (1) your name and (2) a long description of the bug, and maybe also (3) the date in ISO date format (e.g. 2008-06-04) that you added the remark.	15:44
paulproteus	You can now check out things using the svn+ssh access described at code.creativecommons.org.	15:44
paulproteus	And commit too.	15:45
paulproteus	Using either of the keys that have access to git.	15:45
paulproteus	nathany_focused, If you have a sec to set up another API install on a8, that'd be great.	15:45
hdworak	okay, thanks	15:46
nathany_focused	paulproteus: i don't at this moment, trying to get NO launched	15:46
paulproteus	Carry on then!	15:46
paulproteus	nathany_focused, Would now be an okay time for me to transition the a8 sites back onto a8, or does that conflict with your NO launch?	15:54
nathany_focused	should be fine	15:54
paulproteus	Great.	15:54
hdworak	good night :)	15:57
*** hdworak has quit IRC		15:57
paulproteus	Night!	15:59
*** pmiller has joined #cc		17:28
*** pmiller has quit IRC		17:36
*** pmiller has joined #cc		17:39
*** Mihai` has quit IRC		17:52
*** nathany_focused has quit IRC		18:30
*** UltraMagnus has joined #cc		18:55
*** ajbrooks has quit IRC		19:04
*** bovinity has quit IRC		19:30
*** ankitg\|away is now known as ankitg		19:40
*** rejon has quit IRC		19:45
*** rejon has joined #cc		19:48
*** rejon has quit IRC		19:57
*** rejon has joined #cc		20:00
*** UltraMagnus has quit IRC		20:20
*** stevel has quit IRC		20:35
*** ajbrooks has joined #cc		20:38
*** jgay has joined #cc		20:44
*** UncleCJ has quit IRC		21:10
*** UncleCJ has joined #cc		21:10
*** jgay has quit IRC		21:21
*** ankitg has quit IRC		22:20
