Viewing 8 posts - 1 through 8 (of 8 total)
    #34011

    Instead of “First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å|ā”
    I receive: “First Pageä|ä|á|Ä…|â|à |ả|ã|ạ|ă|ằ|Ạ̄|ẳ|áºμ|ặ|ầ|ấ|ẩ|ẫ|Ẕ

    Please, help me, how can I fix this?
    My code:
    PD4ML pd4ml = new PD4ML();
    String html = "First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å|ā";
    System.out.println(html);
    byte[] myBytes = html.getBytes(StandardCharsets.UTF_8);
    InputStream stream = new ByteArrayInputStream(myBytes);
    pd4ml.useTTF("C:\\Windows\\Fonts", true);
    pd4ml.readHTML(stream);
    String output_path = "C:\\test\\test.pdf";
    try (OutputStream outputStream = new FileOutputStream(output_path)) {
        pd4ml.writePDF(outputStream);
    }
    Desktop.getDesktop().open(new File(output_path));

    #34012

    The string you received is typical of a charset mismatch.

    As you can see, your test string is not well-formed HTML. You may solve the issue by prefixing the string with an HTML header that defines the correct charset. In your case it should be UTF-8.

    Alternatively, you can use the readHTML() API method that takes an encoding parameter:
    https://pd4ml.tech/javadoc/com/pd4ml/PD4ML.html#readHTML-java.io.InputStream-java.net.URL-java.lang.String-
    Try specifying “UTF8” there.
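
    To illustrate what a charset mismatch looks like, here is a minimal plain-Java sketch (no PD4ML involved). It encodes a character as UTF-8 and decodes the same bytes with a different charset. The class name MojibakeDemo and the choice of windows-1252 as the mismatching charset are assumptions for illustration only; the string uses Unicode escapes so the example does not depend on how the source file is saved.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Encode the text as UTF-8, then decode those bytes with a different charset.
    static String garble(String text, Charset wrongCharset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, wrongCharset);
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 is the charset the bytes were wrongly decoded with.
        Charset cp1252 = Charset.forName("windows-1252");
        // U+0105 is "a with ogonek", one of the characters from the test string.
        String garbled = garble("\u0105", cp1252);
        // Print the code points so the result is readable on any console.
        garbled.chars().forEach(c -> System.out.printf("U+%04X%n", c));
    }
}
```

    For U+0105 this yields the two characters U+00C4 and U+2026, which is the same pair that appears in place of that letter in the garbled output quoted in the first post.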

    #34013

    I get the same result with well-formed HTML, the charset set in the HTML header, and readHTML called with the encoding parameter.
    PD4ML pd4ml = new PD4ML();
    String html = "<html>"
            + " <head>\n"
            + " <meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\"/>\n"
            + " </head>"
            + " <body>"
            + " First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å"
            + " </body>"
            + "</html>";
    System.out.println(html);
    byte[] myBytes = html.getBytes(StandardCharsets.UTF_8);
    System.out.println(new String(myBytes, StandardCharsets.UTF_8));
    InputStream stream = new ByteArrayInputStream(myBytes);
    pd4ml.overrideDocumentEncoding("utf-8");
    pd4ml.useTTF("C:\\Windows\\Fonts", true);
    pd4ml.readHTML(stream, new URL("https://google.com"), "utf-8");
    String output_path = "C:\\test\\zxccc.pdf";
    try (OutputStream outputStream = new FileOutputStream(output_path)) {
        pd4ml.writePDF(outputStream);
    }
    Desktop.getDesktop().open(new File(output_path));

    #34014

    Also, I receive a Java error if I use: <h1>First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å</h1>

    java.lang.StringIndexOutOfBoundsException: String index out of range: -1

    #34015

    If I use pd4ml.readHTML(new URL("file:///C:/test/test.html"));
    to read from a local file, everything works, even without well-formed HTML.

    test.html file content:
    <h1>First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å|ā</h1>

    #34016

    We’ll analyze the issue and let you know.

    BTW: does the output change if you drop the charset argument, i.e. change
    byte[] myBytes = html.getBytes(StandardCharsets.UTF_8);
    to
    byte[] myBytes = html.getBytes();
    ?
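
    As background for that question: getBytes() without an argument encodes with the JVM default charset, so its result depends on the environment. The sketch below is only a diagnostic aid, not part of PD4ML; the class name DefaultCharsetCheck and the helper hex() are made up for illustration. It prints the default charset and the bytes each variant produces for one sample character.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DefaultCharsetCheck {
    // Render the encoded bytes as space-separated hex, e.g. "C4 85".
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "\u0105"; // "a with ogonek", written as an escape
        Charset def = Charset.defaultCharset();
        System.out.println("default charset: " + def);
        System.out.println("UTF-8 bytes:     " + hex(sample, StandardCharsets.UTF_8));
        // Same bytes that sample.getBytes() would produce on this JVM.
        System.out.println("default bytes:   " + hex(sample, def));
    }
}
```

    If the two byte lines differ, getBytes() and getBytes(StandardCharsets.UTF_8) hand PD4ML different input. On JDK 18 and later the default is UTF-8 unless it is overridden; older JDKs on Windows typically default to a legacy code page.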

    #34017

    OMG, it helped, thank you.
    Except for the last character, ā, which is now “�?”

    Full output, with Windows fonts:
    First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å|�?
    Full output, without Windows fonts:
    First Pageä|ä|á|?|â|à|?|ã|?|?|?|?|?|?|?|?|?|?|?|?|å|??

    I think that means some fonts are missing, right?
    Where else could I get or download them?
    Can I use “pd4ml.useTTF” function multiple times?

    Working code:
    PD4ML pd4ml = new PD4ML();
    String html = "First Pageä|ä|á|ą|â|à|ả|ã|ạ|ă|ằ|ắ|ẳ|ẵ|ặ|ầ|ấ|ẩ|ẫ|ậ|å|ā";
    byte[] myBytes = html.getBytes();
    InputStream stream = new ByteArrayInputStream(myBytes);
    pd4ml.useTTF("C:\\Windows\\Fonts", true);
    pd4ml.readHTML(stream);
    String output_path = "C:\\test\\zxccc.pdf";
    try (OutputStream outputStream = new FileOutputStream(output_path)) {
        pd4ml.writePDF(outputStream);
    }
    Desktop.getDesktop().open(new File(output_path));

    #34018

    Unfortunately, we could not reproduce either the original issue or the last one you reported.

    I suspect you saved the Java source file as UTF-8, but your build assumes a different source encoding (passed as the -encoding parameter of javac or inherited from the OS). That would go some way towards explaining the oddities you ran into, but some questions remain open.
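
    The suspected scenario can be simulated in plain Java. This is a sketch under assumptions, not a confirmed diagnosis: it assumes the compiler read the UTF-8 source as windows-1252 and that the HTML bytes end up being parsed as UTF-8. The class SourceEncodingRoundTrip and its two helpers are hypothetical names used only here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SourceEncodingRoundTrip {
    // What the compiler would put into the literal if it read UTF-8 source bytes
    // using another charset.
    static String misread(String literal, Charset assumedSourceCharset) {
        return new String(literal.getBytes(StandardCharsets.UTF_8), assumedSourceCharset);
    }

    // What a UTF-8 parser would see after the misread literal is turned into bytes
    // with the given charset (the getBytes(...) call in the posted code).
    static String reencode(String misreadLiteral, Charset usedForGetBytes) {
        return new String(misreadLiteral.getBytes(usedForGetBytes), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed build/OS charset
        String[] samples = {"\u0105", "\u0101"};          // the ogonek a and the macron a
        for (String original : samples) {
            String seen = misread(original, cp1252);
            boolean viaUtf8 = reencode(seen, StandardCharsets.UTF_8).equals(original);
            boolean viaDefault = reencode(seen, cp1252).equals(original);
            System.out.printf("U+%04X  getBytes(UTF_8) ok: %b  getBytes(cp1252) ok: %b%n",
                    (int) original.charAt(0), viaUtf8, viaDefault);
        }
    }
}
```

    Under these assumptions, getBytes(StandardCharsets.UTF_8) preserves the already garbled literal, while encoding with the same charset the compiler assumed undoes the damage for characters such as U+0105. A character whose UTF-8 form contains a byte with no mapping in the assumed charset would not survive the round trip, which might be what happened to the last character, though that is a guess.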

    To avoid the dependency on the build environment and to match typical PD4ML usage scenarios, save the text or HTML content to an external file and refer to it from the Java code.
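
    A minimal sketch of that approach, using only the JDK: the HTML is written to a file as UTF-8 and a file: URL is produced for it. The class ExternalHtmlFile, the helper writeUtf8() and the temp-file name are illustrative assumptions; in practice the file would simply be saved as UTF-8 by an editor. The PD4ML call is left as a comment so the snippet runs without the library.

```java
import java.io.IOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ExternalHtmlFile {
    // Write the HTML as UTF-8 bytes and return a file: URL pointing at it.
    static URL writeUtf8(Path file, String html) throws IOException {
        Files.write(file, html.getBytes(StandardCharsets.UTF_8));
        return file.toUri().toURL();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("pd4ml-test", ".html");
        // Unicode escapes keep this literal independent of the source file encoding.
        String html = "<html><head><meta charset=\"utf-8\"/></head>"
                + "<body><h1>First Page \u0105|\u0101</h1></body></html>";
        URL url = writeUtf8(file, html);
        System.out.println(url);
        // The URL could then be passed to pd4ml.readHTML(url), the call that an
        // earlier post in this thread reports as rendering correctly.
    }
}
```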

    FYI: the ‘true’ parameter in pd4ml.useTTF("C:\\Windows\\Fonts", true) tells PD4ML to reindex all system fonts with every conversion call, which is not a good idea from a performance perspective. Index once and reuse the font mapping data. See https://pd4ml.tech/pdf-fonts/

    You may call pd4ml.useTTF() multiple times, but I guess the wrong rendering of the last char is not caused by a missing font; something is wrong with the encodings.
