<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Pelago :: web design &#38; development blog &#187; regular expressions</title>
	<atom:link href="http://www.pelagodesign.com/blog/tag/regular-expressions/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.pelagodesign.com/blog</link>
	<description>Santa Barbara Web Design and Web Development Blog on the web world and other randoms</description>
	<lastBuildDate>Wed, 09 Mar 2011 16:31:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1</generator>
		<item>
		<title>ISO 8601 Date Validation That Doesn&#8217;t Suck</title>
		<link>http://www.pelagodesign.com/blog/2009/05/20/iso-8601-date-validation-that-doesnt-suck/</link>
		<comments>http://www.pelagodesign.com/blog/2009/05/20/iso-8601-date-validation-that-doesnt-suck/#comments</comments>
		<pubDate>Wed, 20 May 2009 16:38:25 +0000</pubDate>
		<dc:creator>Cameron</dc:creator>
				<category><![CDATA[Creative Engineering]]></category>
		<category><![CDATA[iso-8601]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[regular expressions]]></category>
		<category><![CDATA[validation]]></category>

		<guid isPermaLink="false">http://www.pelagodesign.com/blog/?p=448</guid>
		<description><![CDATA[UPDATED February 19th, 2010: As BobM pointed out, the original solution to this problem didn&#8217;t account for decimal fractions. Originally I didn&#8217;t include them because Intervals didn&#8217;t require that level of precision, but apparently decimal fractions are quite common elsewhere. Because of that, I&#8217;ve updated this post, along with the regex, to include support for [...]]]></description>
			<content:encoded><![CDATA[<p><strong>UPDATED February 19th, 2010:</strong> <em>As <a href="#comment-6503" title="View BobM's comment">BobM</a> pointed out, the original solution to this problem didn&#8217;t account for decimal fractions. Originally I didn&#8217;t include them because Intervals didn&#8217;t require that level of precision, but apparently decimal fractions are quite common elsewhere. Because of that, I&#8217;ve updated this post, along with the regex, to include support for decimal fractions.</em></p>
<p>For the Intervals API, we&#8217;re wrestling with issues surrounding data input validation. This recently became interesting when the matter of date validation came up. Ordinarily, Intervals allows many, many different date formats, depending on the locale that the customer is using (for example, Intervals may expect the date format &#8216;mm/dd/yyyy&#8217; for customers in the US and &#8216;dd.mm.yy&#8217; for customers in Austria).</p>
<p>For our API developers, we wanted to use a common, universal format, one that would be easily compatible with our application and database layers. For that we selected ISO 8601, which is great in terms of widespread use, but not so great in terms of how complicated its specifications are.</p>
<p>Generally, ISO 8601 looks something like &#8216;2009-05-20&#8217; for dates and &#8216;2009-05-20 12:30:30&#8217; for date/time combinations. These two examples encompass 98% of the user input we&#8217;re likely to encounter. But we wanted to make sure that if we told developers they could use ISO 8601 dates, our system would support it. <span id="more-448"></span>Unfortunately, there&#8217;s not a lot of code out there for the validation of ISO 8601 dates (especially regular expressions), and most of the stuff that <em>is</em> out there doesn&#8217;t encompass the entirety of the ISO 8601 spec.</p>
<p>Starting off, here are some dates that the validator <strong>should</strong> match (all these are valid ISO 8601 dates to the best of my knowledge):</p>
<p>2009-12T12:34<br />
2009<br />
2009-05-19<br />
20090519<br />
2009123<br />
2009-05<br />
2009-123<br />
2009-222<br />
2009-001<br />
2009-W01-1<br />
2009-W51-1<br />
2009-W511<br />
2009-W33<br />
2009W511<br />
2009-05-19 00:00<br />
2009-05-19 14<br />
2009-05-19 14:31<br />
2009-05-19 14:39:22<br />
2009-05-19T14:39Z<br />
2009-W21-2<br />
2009-W21-2T01:22<br />
2009-139<br />
2009-05-19 14:39:22-06:00<br />
2009-05-19 14:39:22+0600<br />
2009-05-19 14:39:22-01<br />
20090621T0545Z<br />
2007-04-06T00:00<br />
2007-04-05T24:00</p>
<p><em>Added Feb 19 2010:</em><br />
2010-02-18T16:23:48.5<br />
2010-02-18T16:23:48,444<br />
2010-02-18T16:23:48,3-06:00<br />
2010-02-18T16:23.4<br />
2010-02-18T16:23,25<br />
2010-02-18T16:23.33+0600<br />
2010-02-18T16.23334444<br />
2010-02-18T16,2283<br />
2009-05-19 143922.500<br />
2009-05-19 1439,55</p>
<p>And here are some of the strings that the validator <strong>should not</strong> match (i.e., should reject):</p>
<p>200905<br />
2009367<br />
2009-<br />
2007-04-05T24:50<br />
2009-000<br />
2009-M511<br />
2009M511<br />
2009-05-19T14a39r<br />
2009-05-19T14:3924<br />
2009-0519<br />
2009-05-1914:39<br />
2009-05-19 14:<br />
2009-05-19r14:39<br />
2009-05-19 14a39a22<br />
200912-01<br />
2009-05-19 14:39:22+06a00</p>
<p><em>Added Feb 19 2010:</em><br />
2009-05-19 146922.500<br />
2010-02-18T16.5:23.35:48<br />
2010-02-18T16:23.35:48<br />
2010-02-18T16:23.35:48.45<br />
2009-05-19 14.5.44<br />
2010-02-18T16:23.33.600<br />
2010-02-18T16,25:23:48,444</p>
<p>The code we came up with was the following:</p>
<p><em>Updated Feb 19 2010:</em><br />
<code>^([\+-]?\d{4}(?!\d{2}\b))((-?)((0[1-9]|1[0-2])(\3([12]\d|0[1-9]|3[01]))?|W([0-4]\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)([\.,]\d+(?!:))?)?(\17[0-5]\d([\.,]\d+)?)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?)?)?$</code></p>
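<p>As a rough sketch of how the pattern might be used from PHP: the <code>ISO8601_PATTERN</code> constant and the <code>is_iso8601()</code> helper below are names made up for illustration and are not part of Intervals. The pattern itself is unchanged apart from the <code>/</code> delimiters that <code>preg_match()</code> requires.</p>

```php
<?php
// The regex from the post, wrapped in "/" delimiters.
// Single quotes keep every backslash literal for PCRE.
const ISO8601_PATTERN = '/^([\+-]?\d{4}(?!\d{2}\b))((-?)((0[1-9]|1[0-2])(\3([12]\d|0[1-9]|3[01]))?|W([0-4]\d|5[0-2])(-?[1-7])?|(00[1-9]|0[1-9]\d|[12]\d{2}|3([0-5]\d|6[1-6])))([T\s]((([01]\d|2[0-3])((:?)[0-5]\d)?|24\:?00)([\.,]\d+(?!:))?)?(\17[0-5]\d([\.,]\d+)?)?([zZ]|([\+-])([01]\d|2[0-3]):?([0-5]\d)?)?)?)?$/';

// Hypothetical helper: true only when the whole string matches the pattern.
function is_iso8601(string $value): bool
{
    // preg_match() returns 1 on a match, 0 on no match, false on error.
    return preg_match(ISO8601_PATTERN, $value) === 1;
}

// A few strings taken from the "should match" and "should not match" lists.
$samples = [
    '2009-05-19',
    '2009-W21-2T01:22',
    '2010-02-18T16:23:48,444',
    '200905',
    '2009-05-19T14a39r',
];

foreach ($samples as $sample) {
    printf("%-26s %s\n", $sample, is_iso8601($sample) ? 'valid' : 'rejected');
}
```

<p>Comparing against <code>1</code> rather than treating the return value as a boolean means a PCRE error is counted as a rejection instead of slipping through unnoticed.</p>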
<p>I should add the caveat that this code doesn&#8217;t support the time interval or duration parts of the ISO 8601 spec. It also only supports dates or date/times, not standalone times, since right now we don&#8217;t have to deal with time input (for the Intervals API, all time is input in decimal format, rather than ISO 8601). But it should support everything else. Please let me know whether this works for you, or if you can fine-tune it.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pelagodesign.com/blog/2009/05/20/iso-8601-date-validation-that-doesnt-suck/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>WTF!? preg_replace() returns null?</title>
		<link>http://www.pelagodesign.com/blog/2008/01/25/wtf-preg_replace-returns-null/</link>
		<comments>http://www.pelagodesign.com/blog/2008/01/25/wtf-preg_replace-returns-null/#comments</comments>
		<pubDate>Sat, 26 Jan 2008 00:09:17 +0000</pubDate>
		<dc:creator>Cameron</dc:creator>
				<category><![CDATA[Creative Engineering]]></category>
		<category><![CDATA[PCRE]]></category>
		<category><![CDATA[perl]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[preg_replace]]></category>
		<category><![CDATA[regex]]></category>
		<category><![CDATA[regular expressions]]></category>

		<guid isPermaLink="false">http://www.pelagodesign.com/blog/2008/01/25/wtf-preg_replace-returns-null/</guid>
		<description><![CDATA[On one of our sites we were running into a problem when we tried to pass HTML content from a database through an email obfuscation function to prevent spiders from scraping our clients&#8217; email addresses. We quickly discovered that some of the longer pages were showing up completely blank. The preg_replace() function we were using [...]]]></description>
			<content:encoded><![CDATA[<p>On one of our sites we were running into a problem when we tried to pass HTML content from a database through an email obfuscation function to prevent spiders from scraping our clients&#8217; email addresses. We quickly discovered that some of the longer pages were showing up completely blank. The preg_replace() function we were using to run the obfuscation code on email addresses was returning null. After some hunting I found the answer.</p>
<p><span id="more-384"></span></p>
<p>In PHP 5.2, the Perl Compatible Regular Expressions (PCRE) extension introduced, with little fanfare, a PHP setting called pcre.backtrack_limit, which, for the first time, set a limit on the number of backtracks a regular expression could perform before it stops operating and reports an error. Unfortunately, when PCRE hits this limit, it doesn&#8217;t raise a notice, warning, or error. All preg_replace() does is return NULL, something that the preg family of functions normally never does. There were a lot of entries on the PHP.net site reporting this behavior as a bug, and sites that are regex heavy (like Wiki sites) scrambled to figure out WTF was going on.</p>
<p>The only way to actually determine that this type of PCRE error took place in your code is to call preg_last_error() after you&#8217;ve tried to run your regex. Of course, before PHP 5.2, backtrack errors were handled much more gracefully (if they were even triggered), by returning the original string that was passed to the regex function.</p>
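<p>A minimal sketch of that check might look like the following. The <code>pcre_error_name()</code> helper, the sample text, and the simplified email pattern are all invented for illustration; only <code>preg_replace()</code>, <code>preg_last_error()</code>, and the <code>PREG_*</code> constants come from PHP itself.</p>

```php
<?php
// Hypothetical helper: turn a preg_last_error() code into a readable name.
function pcre_error_name(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'PREG_NO_ERROR',
        PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
        PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
        PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
        PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
    ];

    return $names[$code] ?? 'unknown PCRE error';
}

// Illustrative input and a deliberately simplified email pattern.
$text   = 'Contact: someone@example.com';
$result = preg_replace('/[\w.+-]+@[\w-]+(\.[\w-]+)+/', '[email hidden]', $text);

if ($result === null) {
    // null means PCRE gave up; ask it why before doing anything else.
    error_log('preg_replace() failed: ' . pcre_error_name(preg_last_error()));
} else {
    echo $result, "\n";
}
```

<p>The comparison uses <code>=== null</code> on purpose, since an empty string is a legitimate result and should not be mistaken for a failure.</p>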
<p>To get around this backtrack limit when you&#8217;re running regex on large pages (or really long strings), increase the backtrack limit in your php.ini settings. I increased ours from 100,000 to 1,000,000. Of course, you still run the risk of producing an error on really, <em>really</em> long strings, and that&#8217;s why a second step you should take is to add better error handling any place where you might run a PCRE function on a really long string. Should an error be produced, it&#8217;s up to you how to handle it, whether that be returning the original string, or breaking your string up into smaller pieces and running them separately.</p>
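<p>Both steps can be sketched roughly as follows. The function names <code>replace_or_original()</code> and <code>replace_line_by_line()</code> are hypothetical, and splitting on newlines is just one assumed way of breaking a string into smaller pieces; it only makes sense when the pattern never needs to match across a line break.</p>

```php
<?php
// Raise the limit at runtime. The php.ini equivalent would be:
//   pcre.backtrack_limit = 1000000
ini_set('pcre.backtrack_limit', '1000000');

// Hypothetical wrapper: fall back to the untouched input when
// preg_replace() signals a PCRE error by returning null.
function replace_or_original(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);

    if ($result === null) {
        error_log('PCRE error code ' . preg_last_error() . ', returning original string');
        return $subject;
    }

    return $result;
}

// Hypothetical alternative: run the regex on one line at a time so that
// each call sees a much shorter string.
function replace_line_by_line(string $pattern, string $replacement, string $subject): string
{
    $lines = explode("\n", $subject);

    foreach ($lines as $i => $line) {
        $lines[$i] = replace_or_original($pattern, $replacement, $line);
    }

    return implode("\n", $lines);
}

echo replace_line_by_line('/\d+/', '#', "page 1\npage 22"), "\n";
```

<p>Returning the original string keeps the page from rendering blank, at the cost of silently skipping the replacement, so logging the failure is worth keeping.</p>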
<p>Ultimately the best thing you can do (and should always do) is optimize your regex as much as possible, and for some people that just means knowing when to use regex and when a simple str_replace() will suffice.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.pelagodesign.com/blog/2008/01/25/wtf-preg_replace-returns-null/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>

