Introduction
Special care may be needed when loading documents into MarkLogic Server if the document URI contains one or more special characters. This article walks through a scenario in which exceptions are thrown when such a URI is not handled properly, and then shows how to handle these URIs. It uses built-in functions (and the encode method of the java.net.URLEncoder class) in a couple of samples written with XCC/J to illustrate the scenario and the suggested approach.
Relationship between URI and URL
A Uniform Resource Identifier (URI) is a compact sequence of characters that identifies an abstract or physical resource. The most common form of URI is the Uniform Resource Locator (URL).
A URL is a URI that, in addition to identifying a web resource, specifies the means of acting upon or obtaining its representation, specifying both its primary access mechanism and its network location. For example, the URL 'http://example.org/wiki/Main_Page' refers to a resource identified as /wiki/Main_Page whose representation, in the form of HTML and related code, is obtainable via the HyperText Transfer Protocol (http) from a network host whose domain name is example.org.
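The decomposition described above can be reproduced with the standard java.net.URI class. This is only an illustrative sketch; the class and method names (UrlParts, describe) are made up for this example.

```java
import java.net.URI;

class UrlParts {
    // Splits a URL into its access mechanism (scheme), network location (host)
    // and resource identifier (path).
    static String describe(String url) {
        URI uri = URI.create(url);
        return "scheme=" + uri.getScheme()
                + ", host=" + uri.getHost()
                + ", path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // prints: scheme=http, host=example.org, path=/wiki/Main_Page
        System.out.println(describe("http://example.org/wiki/Main_Page"));
    }
}
```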
Although it is possible to load documents into MarkLogic Server with unencoded special characters in the document URI, it is recommended to follow best practice and URL-encode document URIs. Doing so helps you design robust applications that are free from the side effects such characters can cause in other areas of your application stack.
Importance of URL encoding
URL encoding is often required to convert special characters (such as "/", "&", "#", ...), because special characters:
have special meaning in some contexts; or
are not valid characters in a URL; or
could be altered during transfer.
For instance, the "#" character needs to be encoded because it has the special meaning of an HTML anchor (the fragment identifier). The <space> character needs to be encoded because it is not a valid URL character. Also, some characters, such as "~", might not be transported properly across the internet.
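The special meaning of "#" can be observed with java.net.URI. In the sketch below (the URL, class name and method names are illustrative assumptions), an unencoded "#" cuts the query short and starts a fragment, while the encoded form "%23" stays inside the query.

```java
import java.net.URI;

class FragmentDemo {
    // Returns the query component as the parser sees it (still percent-encoded).
    static String rawQuery(String url) {
        return URI.create(url).getRawQuery();
    }

    // Returns the fragment component, or null when the URL has none.
    static String fragment(String url) {
        return URI.create(url).getRawFragment();
    }

    public static void main(String[] args) {
        String unencoded = "http://example.org/search?title=C#sharp";
        String encoded = "http://example.org/search?title=C%23sharp";
        System.out.println(rawQuery(unencoded) + " | " + fragment(unencoded)); // title=C | sharp
        System.out.println(rawQuery(encoded) + " | " + fragment(encoded));     // title=C%23sharp | null
    }
}
```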
Consider an example where a parameter is supplied in a URL and the parameter value contains special characters:
The parameter is "movie1" and its value is "Fast & Furious".
The parameter may be submitted via a URL such as "http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast & Furious".
In this example, the space and "&" characters need special handling; otherwise the URL may not be interpreted properly and, for example, the associated GET request may fail.
These characters can be encoded as follows:
Space as '%20' or '+'
'&' as '%26'
And thus the URL, after encoding, would look like 'http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast+%26+Furious'.
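The encoded URL in this example can be produced with java.net.URLEncoder, which encodes a space as '+' and '&' as '%26'. The following is a minimal sketch; the MovieUrl class and its helper methods are hypothetical names used only for illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class MovieUrl {
    // Encodes one query-string component as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    // Appends a single encoded name=value pair to the base URL.
    static String build(String base, String name, String value) {
        return base + "?" + encode(name) + "=" + encode(value);
    }

    public static void main(String[] args) {
        String base = "http://www.awebsite.com/encodingurls/submitmoviename.html";
        // prints: http://www.awebsite.com/encodingurls/submitmoviename.html?movie1=Fast+%26+Furious
        System.out.println(build(base, "movie1", "Fast & Furious"));
    }
}
```

Only the parameter name and value are encoded here. Encoding the whole URL would also escape the ':' and '/' characters that carry structural meaning.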
What is URL encoding?
URL encoding is the process of converting a string into a valid URL format. Valid URL format means that the URL contains only "alpha | digit | safe | extra | escape" characters. Several established standards cover URL specifications, including the following W3C documents:
http://www.w3.org/Addressing/URL/url-spec.html
http://www.w3.org/International/francois.yergeau.html
Safe and unsafe characters
Based on web standards, the following quick reference explains which characters are “safe” and which characters should be encoded in URLs.
Safe characters (no encoding required): alphanumerics [0-9a-zA-Z], the special characters $-_.+!*'(), and reserved characters used for their reserved purposes (e.g., a question mark used to denote a query string)
ASCII control characters (encoding required): the ISO-8859-1 (ISO-Latin) character ranges 00-1F hex (0-31 decimal) and 7F hex (127 decimal)
* Note: Reserved characters only need encoding when not used for their defined, reserved purposes.
For complete details on these character classifications, see RFC1738.
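As a rough illustration of the classification above, the following sketch checks a single character against the safe set and the control-character ranges listed in the chart. The class and method names are hypothetical, and the sketch makes a simplifying assumption: reserved characters are always reported as needing encoding, because whether they serve their reserved purpose depends on where they appear in the URL.

```java
class Rfc1738Chars {
    // Special characters listed as safe in the chart above.
    private static final String SAFE_SPECIALS = "$-_.+!*'()";

    // True for alphanumerics and the safe special characters.
    static boolean isAlwaysSafe(char c) {
        boolean alphanumeric = (c >= '0' && c <= '9')
                || (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z');
        return alphanumeric || SAFE_SPECIALS.indexOf(c) >= 0;
    }

    // True for the control ranges 00-1F hex and 7F hex.
    static boolean isControl(char c) {
        return c <= 0x1F || c == 0x7F;
    }

    // Simplification: anything outside the always-safe set is encoded.
    static boolean mustEncode(char c) {
        return !isAlwaysSafe(c);
    }

    public static void main(String[] args) {
        for (char c : "a&b c~".toCharArray()) {
            System.out.println("'" + c + "' -> " + (mustEncode(c) ? "encode" : "safe"));
        }
    }
}
```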
Walkthrough of an example scenario using XCC/J
Let's take a look at a sample created to connect to MarkLogic Server using the XCC/J connector.
We will start with a case where a document URI contains a special character that is not handled safely while the document is loaded into MarkLogic Server. We will then resolve the problem by encoding the URI.
Consider the following code:
In the above code we run a newAdhocQuery that calls xdmp:document-insert, passing in the URI (with a special character). The request is submitted in a try-catch block to handle any exception raised while submitting it.
Running this code produces the following exception:
Full adHocQuery being executed: xdmp:document-insert("&.xml", <test/>)
com.marklogic.xcc.exceptions.XQueryException: XDMP-ENTITYREF: (err:XPST0003) Invalid entity reference ".xml"
[Session: user=[user], cb={default} [ContentSource: user=admin, cb={none} [provider: address=localhost/127.0.0.1:8000, pool=1/64]]]
[Client: XCC/8.0-1, Server: XDBC/8.0-1.1]
in /eval, on line 1
expr:
Notice that the '&' character does not appear in the exception message. '&' is a special character (in an XQuery string literal it introduces an entity reference), and it was not handled properly. To resolve this issue, we can use the encode method of the java.net.URLEncoder class to encode such characters. Now consider the example below:
As you can see in the above example, we have encoded the URI that contains the special character:
String badUri = "&.xml";
String goodUri = URLEncoder.encode(badUri, "UTF-8");
Running this code successfully loads the document with the encoded URI %26.xml.
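To make the difference concrete, the sketch below builds the ad hoc query text for both the unencoded and the encoded URI. It is not the article's original sample: the class and method names are hypothetical, and submitting the query through an XCC session is deliberately left out so that the sketch depends only on the Java standard library.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class InsertQueryBuilder {
    // Percent-encodes a document URI as UTF-8.
    static String encodeUri(String uri) {
        try {
            return URLEncoder.encode(uri, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    // Builds the XQuery text that would be passed to the ad hoc query.
    static String insertQuery(String uri) {
        return "xdmp:document-insert(\"" + uri + "\", <test/>)";
    }

    public static void main(String[] args) {
        String badUri = "&.xml";
        String goodUri = encodeUri(badUri);
        // Bare '&' inside the string literal: rejected as an invalid entity reference.
        System.out.println(insertQuery(badUri));  // xdmp:document-insert("&.xml", <test/>)
        // No '&' left in the literal, so the query text parses cleanly.
        System.out.println(insertQuery(goodUri)); // xdmp:document-insert("%26.xml", <test/>)
    }
}
```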
Another example scenario using curl
In this example, we use curl to load a simple XML document whose URI contains a special character (ム). The scenario is similar to the one above, except that this time curl is used to load the document into MarkLogic.
Consider the following curl command:
curl --anyauth --user username:password -X PUT -T ./test.xml -i -H "Content-type: application/xml" http://localhost:8000/v1/documents?uri=/%e3%83%a0.xml
Here are the contents of test.xml: <test><sample>test 1</sample></test>
Running the above curl command to load a simple XML document whose URI contains a special character (ム) fails with "400 Bad Request":
{"errorResponse":{"statusCode":400, "status":"Bad Request", "messageCode":"REST-INVALIDPARAM", "message":"REST-INVALIDPARAM: (err:FOER0000) Invalid parameter: invalid uri: /πâá.xml"}}
To resolve this issue, we can use the --data-urlencode option provided by curl to encode the data. Now consider the example below:
curl --anyauth --user username:password -X PUT -T ./test.xml -i -H "Content-type: application/xml" http://localhost:8000/v1/documents --data-urlencode uri=/%e3%83%a0.xml -G
--data-urlencode is used to encode the uri parameter, and -G makes curl append that encoded data to the request URL as a query string instead of sending it in the request body.
Running this command successfully loads the document with the encoded URI /%e3%83%a0.xml.
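The effect of encoding the uri parameter can be approximated in Java. The sketch below (class and method names are hypothetical, and curl's exact output may differ in details such as hex case) shows that the '%' characters in the value are themselves escaped as '%25', so a single round of decoding on the server side yields the literal string /%e3%83%a0.xml as the document URI.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class RestUriParam {
    // Percent-encodes a parameter value as UTF-8.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Reverses encode(); stands in for the single decoding step done by a server.
    static String decode(String value) {
        try {
            return URLDecoder.decode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Appends the encoded document URI as the "uri" query parameter.
    static String documentsUrl(String endpoint, String docUri) {
        return endpoint + "?uri=" + encode(docUri);
    }

    public static void main(String[] args) {
        String url = documentsUrl("http://localhost:8000/v1/documents", "/%e3%83%a0.xml");
        System.out.println(url); // http://localhost:8000/v1/documents?uri=%2F%25e3%2583%25a0.xml
        System.out.println(decode("%2F%25e3%2583%25a0.xml")); // /%e3%83%a0.xml
        // Encoding the raw character instead yields its UTF-8 bytes in percent form.
        System.out.println(encode("/\u30e0.xml")); // %2F%E3%83%A0.xml
    }
}
```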
Conclusion
Although it is possible to load documents into MarkLogic Server with unencoded special characters in the document URI, it is recommended to follow best practice and URL-encode document URIs. Doing so helps you design robust applications that are free from the side effects such characters can cause in other areas of your application stack.
References
I. http://www.permadi.com/tutorial/urlEncoding/
II. http://perishablepress.com/stop-using-unsafe-characters-in-urls/
III. http://www.ietf.org/rfc/rfc3986.txt (RFC3986 on URI)
IV. http://www.ietf.org/rfc/rfc1738.txt (RFC1738 on URL)
V. http://developer.marklogic.com/products/xcc
VI. http://docs.oracle.com/javase/7/docs/api/java/net/URLEncoder.html
VII. http://en.wikipedia.org/wiki/Uniform_resource_identifier#The_relationship_between_URIs.2C_URLs.2C_and_URNs
VIII. https://ec.haxx.se/http-post.html