External editors REALLY needed for special use cases (i.e. "non-print" control codes)?

Spliff · #1 05-22-2023, 01:09 PM

I am (again) after something important here, since currently, or rather, just applying current user knowledge, it's not possible, with the integrated MS rtf editor (mine is RichEdit20W6, but I could change within the available alternatives; I don't need tables in Content e.g. ...), to search for, and thus, also to replace, or insert by replace, those "non-printable control-codes" (newlines, tabs in particular).

I speak of these, in particular, and for which some other - external then - editors provide special ways of input, with \something, `some, ^some or similar, and which seem (sic!) to be unattainable within UR's integrated MS rtf editor(s), or then, in UR's Data Explorer (which I call "tree", since I can remember that term oh so much more easily... ;-) )

I speak of these here:

typical Windows newline: CRLF:
CR = \r = dec 13 = hex 0D = oct 015 = ^M = bin 00001101 (carriage return)
LF = \n = dec 10 = hex 0A = oct 012 = ^J = bin 00001010 (line feed)
and also of importance:
HT = \t = dec 9 = hex 09 = oct 011 = ^I = bin 00001001 (tab)
EmEditor calls it 000B ("Insert - Special Characters"; ^M and ^J not found in EmEditor)

newline in PlanMaker (^{enter}, i.e. Control-Enter): 00001010 (checked by binary editor)
in Excel (by !{enter}, i.e. Alt-Enter): &CHAR(10) = line feed (see above)
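For anyone who wants to double-check the notations listed above, here is a minimal Python sketch that derives them from the characters themselves. The helper name `describe` is just made up for this illustration; the caret letter is simply the code plus 64, which is why tab comes out as ^I.

```python
# Print the usual notations for the three control codes discussed above.
CODES = {"CR": "\r", "LF": "\n", "HT": "\t"}

def describe(name, ch):
    """Return one line with decimal, hex, octal, caret and binary notation."""
    n = ord(ch)
    caret = "^" + chr(n + 64)  # caret notation: control code + 64 gives the letter
    return f"{name} = dec {n} = hex {n:02X} = oct {n:03o} = {caret} = bin {n:08b}"

for name, ch in CODES.items():
    print(describe(name, ch))
```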

Their possible use is threefold (at least):

- Imagine you want to integrate TAGS into your UR item titles, in order to both enter them very easily, AND retrieve them without any effort ("Search - Search titles only"), BUT without being bothered, visually, by those tags (e.g. in the form {.some .someother}) - obviously, a - loooong - tab is the solution... AND it's available, oh yes indeed: Just insert a tab (with the {tab} key) into the Content field somewhere, then ^c, then ^v within the "F2" rename within the Data Explorer... and you're done... so: technically, these "non-printable control codes" ARE available indeed...

and ditto for entering the clipboard content, i.e. that "bloody" tab, into the MS "Find" field: now, it works, so WHY wouldn't there be ANY - crazy, be it: I wouldn't mind: "enter" string, to enter into that "find" field, and which would give the very same, and positive, result?:

- You want to FIND those control-codes within your Content (i.e. "ItemText"), by ^f...: I've found no way yet, but there possibly IS some way?

- And, much more important, and IF we have found a way to identify / designate these control codes within MS Editor's find/s&r dialogs: We would then be able to replace e.g.

"Character: Some dialog"

with

"CHARACTER
The same dialog"

at last, WITHIN UR (1 item by 1 item: ok, but then so what...), and similar tasks asking for identification and/or replacement or inserting of those control codes?
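To make the wanted replacement concrete, here is a rough Python sketch of it, run on plain text OUTSIDE of UR (e.g. on exported item text); it says nothing about what UR's own find/replace dialog can do. The function name `speaker_to_heading` and the "Name: text" line pattern are assumptions for this illustration only.

```python
import re

def speaker_to_heading(text):
    """Turn 'Name: spoken text' lines into 'NAME' + newline + 'spoken text'."""
    def repl(match):
        # group 1 = text before the first colon, group 2 = rest of the line
        return match.group(1).upper() + "\n" + match.group(2)
    return re.sub(r"^([^:\n]+): (.+)$", repl, text, flags=re.MULTILINE)

print(speaker_to_heading("Character: Some dialog"))
```

The point is only that the replacement string contains a real newline, which is exactly the character that cannot be typed into the native dialog.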

Wouldn't there be any way? I have to admit that I tried similar (and other) things with line feed, and with carriage return, to no avail, but once we could identify them in any way, there should be a way? And even entering (by macro then anyway) a 12-char string: where would be the problem indeed?

And once the identification problem, and the problem of "how to then enter those bloody codes", would have been resolved, you could write the necessary code of just stopping the screen update and keyboard-n-mouse interaction, while processing, replacing it with a progress bar plus "Please wait - multi-item search-n-replace processing" message... and UR would get the multi-item S&R some Mac-only applications, not being backed up by SQLite (but presenting quite a bulk of other glitches if I may say so...), come up with...

Background of my research, question and development:

I currently have tried "hierarchical csv input into UR tree", and it works perfectly!

See here: https://www.kinook.com/Forum/showthread.php?t=2483 - "Importing hierarchical information from a csv file", and bear in mind that over there, the "Note: an IndentLevel value of 0 indicates no indent - the order of the records in the csv file is critical to defining the relationships between the records (when a record has a non-zero indentlevel value, it will become a child of the last preceding record with an indentlevel value one less than the record's value)." is not perfectly worded, since you might be inclined to assume indentlevel then was relative, whilst, very thankfully, it's absolute within the sub-tree newly created, just relative to the (existing) parent item of the new sub-tree: you import at item x, which is deemed to be indentlevel 0, and then any indentlevel in your csv / new sub-tree will have, relative to 0, exactly the number of more-indentations its indentlevel field notifies;

and yes, UR's csv import dialog lets you choose the field divider (which in most cases would be {tab}), but not also the record divider, so that one will be `r`n or `n, i.e. some newline, and thus, for the Content column / fields, you'll need the "" "text" indicator, and then it'll be up to your "original", source, application's export's control code's encodings, AND to the capabilities of your intermediate, editing applications, if your "newlines" will / CAN be correctly recognized by UR's csv import function, within the double-quoted (or otherwise defined) "Content" (i.e. "ItemText") and/or "Notes" columns - those columns in which newlines can create real problems indeed.

In any case, any fellow user is well advised to first create a tiny but significant "dummy" csv file to import, in order to then revise their csv data accordingly.

And, Kyle, thank you so very much again for your amending the "Tools - C&R" function!!!!

EDIT:
So-called "global S&R", better, multi-item S&R (i.e. "whole sub-tree"), is primordial for "writers", and, for journalists, etc., obviously, but becomes a necessity for coders - e.g., the (much lesser in every respect) "RightNote" comes with a dedicated "plain-text" template, especially for code writing... but there is no multi-item S&R either, so this, in their case, becomes, well... "weird"... and, btw, RN's "dark mode" just changes colors all AROUND the "editor" (if I'm not wrong - after my trial some years ago, I don't have access but to their "Lite" version, and their web-available documentation...) - this being said, I'm not up to denigrate other personal information managers, but I would like to emphasize that S&R, beyond the current item, would open a much larger market than just the partial market of "writers" and other text-producing people who haven't yet switched to Mac-n-Scrivener... both decisions being devoid of sense, as far as I'm concerned, btw - and I analyzed the available goodies over there thoroughly, before making my final decision that is... ;-)

Spliff · #2 05-23-2023, 06:01 AM

As to CSV import, no problem whatsoever:

In UR import, I have field delimiter TAB, text qualifier " (double-quote), and template "text"; within the column that goes to the UR content field ("ItemText"), my newlines are LF (line feed, Ascii 010, hex 0A, binary 00001010, U+000A), my record delimiter is CRLF (LF as before, and the CR is Ascii 013, hex 0D, binary 00001101, U+000D).

Btw, TABs (see above for tabs in UR's Data Explorer) are Ascii 009, hex 09, binary 00001001, U+0009; there is NO difference between the Windows-1252 char table and UTF-8 for these codes.

I also tried CRLF instead of just LF for newlines, and also TABs, both within the "-delimited content field (the "..." not being needed for regular data fields btw), and it all works as expected, so it's perfectly possible to replicate any tree within UR, preserving tabs and all sorts of newlines (CRLF, LF, CR) for the "content" if need be.
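For fellow users who generate their import file by script, a file of the shape described above (TAB as field delimiter, double-quote as text qualifier, CRLF between records, LF or TAB inside the quoted content) could be produced roughly like this with Python's standard csv module. This is only a sketch: the three columns (indent level, title, content) and the sample rows are made up, and it writes into memory instead of a real file.

```python
import csv
import io

# Hypothetical sample records: indent level, item title, item text
rows = [
    (1, "Parent", "first line\nsecond line"),  # LF inside the content field
    (2, "Child", "col1\tcol2"),                # TAB inside the content field
    (2, "Sibling", ""),                        # no content, column still present
]

buf = io.StringIO(newline="")  # no newline translation, we control CRLF ourselves
writer = csv.writer(buf, delimiter="\t", quotechar='"',
                    quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
writer.writerows(rows)
data = buf.getvalue()

# Reading it back shows that the embedded LF and TAB survive inside the quotes
parsed = list(csv.reader(io.StringIO(data, newline=""), delimiter="\t", quotechar='"'))
print(parsed)
```

With QUOTE_MINIMAL only the fields that actually contain a delimiter, a quote or a newline get the "..." around them, which matches the observation that regular data fields do not need them.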

As to find/replace the above control characters in UR Content ("Item Details" pane), I am lost, though, albeit having discovered a quite weird phenomenon:

It's possible to enter those (venerable Ascii) chars (from 1990 or before) into UR's content's "Find" dialog (e.g. Alt-10, Alt-13, on the Numpad), and they then appear as symbols (from the Eighties, Nineties...) in that dialog's "Find what" field then; ditto for "all the other" such control chars then represented by symbols (younger UR users will not even recognize anymore I suppose), and just entering Alt-0 (for Null character NUL) will have no effect in that dialog. (Doing Alt-10, Alt-13, etc, within UR's content pane, directly, will have NO effect though.)

Then, you can copy those chars from the "Find what" field (select them, then ^c), and you can then paste them into the content field (^v), where they appear as symbols exactly as you have seen them in the "Find what" field...

And then, from entering them (by ^v or by Alt-13, etc.) into the "Find what" field, the Find dialog will FIND them in Content! - i.e. find the symbols, NOT find anything those symbols represent, unfortunately...

So just the real TABs, LFs and real CRs are not found this way yet, but there should be a trick? (I also experimented with U+ and binary input, but may have missed the right way to do so?)
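A possible explanation, offered as an assumption and not as a statement about UR's code: Windows Alt codes typed WITHOUT a leading zero are commonly translated via the old OEM code page (437), where the values 1 to 31 are drawn as graphic symbols. If that is what happens in the "Find what" field, then Alt-13 does not enter a carriage return at all, but the symbol character U+266A, which would explain why only the symbols are found. The table and the helper `alt_code_glyph` below are written out by hand for this illustration.

```python
# Glyphs that code page 437 shows for the byte values 1..31 (hand-written table)
CP437_LOW = "☺☻♥♦♣♠•◘○◙♂♀♪♫☼►◄↕‼¶§▬↨↑↓→←∟↔▲▼"

def alt_code_glyph(n):
    """Character an Alt+n keystroke (no leading zero, 1 <= n <= 31) would yield under this assumption."""
    return CP437_LOW[n - 1]

sample = "line one\r\nline two"
for n in (9, 10, 13):
    glyph = alt_code_glyph(n)
    print(f"Alt-{n}: {glyph!r} = U+{ord(glyph):04X}, "
          f"real control code is U+{n:04X}, "
          f"glyph found in sample: {glyph in sample}")
```

If the assumption holds, the search string and the control code are simply two different characters, so no amount of Alt-code typing would ever match a real TAB, LF or CR.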

EDIT:

And I don't know if there is a realistic, possibly quite easy way to tweak the MS rtf control (i.e. just one of the two or more controls we could use), by something like
SetWindowText(HandleOfYourRTFControl, "Line1\nLine2")
( https://social.msdn.microsoft.com/Fo...orum=vcgeneral ) - understood that here, the problem is the search within the control, so perhaps that'll be impossible indeed?

EDIT 2:

It just occurred to me that it might be possible (?) to just create alternative "Find" and "Replace" dialogs, and which would then be triggered in lieu of the native ones... and which would send the pertinent LF, CR and TAB codes to the MS rtf control, from some code to be entered within the dialog... if (sic!) the problem is just sending the correct code to the control, instead of the dialog sending all strings literally. Just an idea...
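The "code to be entered within the dialog" part of that idea could look roughly like the following Python sketch: the user types \t, \n or \r, and the alternative dialog expands them into the real characters before searching. This only illustrates the string handling on plain text; the names `expand_escapes` and `find_all` are hypothetical, and how the result would be handed to the MS rtf control is left open.

```python
# Typed escape sequence -> real character (a doubled backslash stays a backslash)
ESCAPES = {"\\n": "\n", "\\r": "\r", "\\t": "\t", "\\\\": "\\"}

def expand_escapes(pattern):
    """Replace typed \\n, \\r, \\t (and \\\\) with the characters they stand for."""
    out = []
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if pair in ESCAPES:
            out.append(ESCAPES[pair])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

def find_all(text, pattern):
    """Return the start offsets of every occurrence of the expanded pattern."""
    needle = expand_escapes(pattern)
    positions = []
    start = text.find(needle)
    while start != -1:
        positions.append(start)
        start = text.find(needle, start + 1)
    return positions

print(find_all("a\r\nb\r\nc", r"\r\n"))
```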

Spliff · #3 05-28-2023, 06:39 PM

Re CSV Import

Having done extensive tries with "real" data now for CSV import, I have discovered that the tree replication is not without fault, example (indentation levels):

1 ok (the unique source element; the target element in UR being considered 0)
2 ok
3 ok
2 stays at 3 (instead of being positioned "up" again, with, or without, content; btw, I put systematically the tab behind the title, even when there is no content, in order to equalize the column count for every csv "record")
1 (which is logically false since 1 should not be there but once, but I changed the element-to-be-imported, in order to "see what it brings"): but goes up to 2, so this "trick" might help to preserve the tree at least a little bit better, but manual adjustments should currently be needed here in any case.

Other example (first the original indentation, then UR's one):
1 1 ok (UR target is 0)
2 2
3 3
2 3!
3 4!
2 3!
3 4!
3 4!
2 3!
2 3!

So much for the problem I discovered; it seems that there is a code problem which means that the level directly beneath the top level of the imported data set can't be reached anymore from "below". In other words, if you don't count as I do (existing UR target = 0), but count the top item of the tree to be imported as 0, then level 1 can't be reached anymore from level 2: that should be easy to detect then. ;-)
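For comparison, this is what the rule quoted from the import documentation ("child of the last preceding record with an indentlevel value one less") yields when applied literally. It is a minimal Python sketch of that rule, not UR's actual import code; the function names are made up, and the import target is represented as None.

```python
def parent_indices(levels):
    """For each record, index of its parent record (None = the existing import target)."""
    last_at_level = {}  # indent level -> index of the most recent record at that level
    parents = []
    for index, level in enumerate(levels):
        parents.append(last_at_level.get(level - 1))
        last_at_level[level] = index
    return parents

def resulting_depths(parents, top_level):
    """Depth of each record in the resulting tree, counting top-level records as top_level."""
    depths = []
    for parent in parents:
        depths.append(top_level if parent is None else depths[parent] + 1)
    return depths

levels = [1, 2, 3, 2, 3, 2, 3, 3, 2, 2]  # the "other example" above
parents = parent_indices(levels)
print(parents)
print(resulting_depths(parents, top_level=1))
```

Applied to the second example, the rule reproduces the original indentation exactly, so the 3!/4! results would not follow from the documented rule itself.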

When the schema is correct, the user can freely use CRLF (see above) as "new CSV row" (i.e. new record) separator, AND as newline within the content field (within "..." of course), the distinct use of CRLF there and LF here is of no practical interest/value.

But the user should be reminded that both UR's tree and content are ANSI, not UTF-8, so in order that titles and content (etc.) are rendered correctly, they must change their CSV's file's code page before import (and I have even encountered non-import, with the creation of multiple, empty "New Text" items, by trying to import in UTF-8 format); almost any editor can do that, even Windows' native "Notepad": It indicates the current code page in its status bar, and to change it if necessary, it's "File - Save AS": that dialog will then offer to change the "Encoding".
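If the conversion is scripted instead of done in an editor, it helps to see which characters will not survive it. A minimal Python sketch, assuming the Windows code page is Windows-1252 (it may differ on other systems); `to_ansi` is a hypothetical helper name.

```python
def fits(ch, codepage):
    """True if the character can be represented in the given code page."""
    try:
        ch.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

def to_ansi(text, codepage="cp1252"):
    """Encode text for the given code page; also return the characters that get lost."""
    lost = sorted({ch for ch in text if not fits(ch, codepage)})
    return text.encode(codepage, errors="replace"), lost

data, lost = to_ansi("äöü → ok")
print(data, lost)
```

Characters reported in `lost` are replaced by "?" in the output, which corresponds to the "saving will lose characters" kind of warning an editor shows.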

Also, at every import of another file, the user must set up the respective import columns again, even when they always stay at indentlevel, itemtitle and itemtext, so for importing several / multiple files, it's advisable to just rename the different files to import, into a common "dummy" file name, so that UR will preserve the target columns; as for the import's field separator (e.g. {tab}), you must re-select it every time anew.

And finally: Don't bother endlessly with "csv-enabled" editors and their possibly endless claims your code was faulty csv: Just use any "dumb" editor (as the aforementioned Notepad or similar), and check your schema visually, newlines within fields are simply "too much" for some allegedly "csv-ready" editors (names withheld here...).

EDIT: My try to do away with the "" was a failure, then, since they are needed to distinguish the crlf as newlines from the crlf as row separator; in theory, using LFs vs. CRLFs might do away with that necessity, but I think that will be futile, too.
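Why the quotes are needed can be shown in a few lines of Python with the standard csv module: a naive split at CRLF cuts a record in two, while a quote-aware reader keeps the CRLF inside the field. The sample data is made up, and this says nothing about how UR's importer is implemented.

```python
import csv
import io

# Two records; the first one has a CRLF inside its quoted content field
data = '1\tTitle A\t"line one\r\nline two"\r\n2\tTitle B\t"single line"\r\n'

# Naive approach: every CRLF is taken as a record separator
naive = data.split("\r\n")

# Quote-aware approach: CRLF inside "..." stays part of the field
records = list(csv.reader(io.StringIO(data, newline=""), delimiter="\t", quotechar='"'))

print(len(naive), naive)
print(len(records), records)
```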

EDIT 2:

In order to check if my numbering, starting at 1, was the culprit, I have done new tries, both with starting at 1, and starting at 0, and they both are identical.

Starting with 0 (so the existing UR target (=parent) item would count as "-1"):

0 ok
1 ok
2 ok
3 ok but now I go 2 up, not just 1:
1*: not 1 but 0 (!), and title/content not preserved, but "New Text"; but creates a second item (2), with title and with content of the "1" item, and:
2**
1**
After the wrong "1*" and its unwanted "2" item described above, 4 new items instead of just 2 (the above "2**" and "1**"), oscillating between 0 and 1 (!), the "0" items being "New Text" ones, and the "1" items having the titles and content of the empty "0" ones; and no difference here between 2** and 1**, i.e. both become empty 0 items, with then title and content as 1.

Obviously, I have checked and rechecked my schema, which is not at fault. EmEditor's "show all", for the "", and also for control characters, will prominently display all occurrences, with green background, so that for a short text, it's not possible to overlook unwanted, or missing quotes, tabs, or CRLFs (it also shows CRLFs and simple LFs with different symbols, I just use CRLFs now).

(In order to exclude any possible interplay with my AHK script running, I stopped that script, and the faulty import results are unchanged but exactly as before.)

I now tried the above (0123121) without content, just left the tabs (which without content don't make much sense, except for indicating there is possible content, i.e. 3 columns instead of just 2): Now it works as expected. So the existence of content makes the algorithm choke, when "going up" in tree (i.e. going down in indent level number).

EDIT 3

Doing more work currently, will post again in some hours. UTF-8 to Ansi seems to be the culprit, in combination with EmEditor - purging ("save as" with new code page) in EmEditor obviously NOT sufficient, since a second purge (again "save" and message "saving will lose characters") in (Windows') Notepad then IS sufficient, ditto for avoiding EmEditor's purge, and just doing ONE purge, in Notepad.

Obviously, EmEditor leaves special CONTROL chars within the "purged" data, which it does NOT display, neither before nor afterwards, albeit the UTF-8 format is always without (!) BOM, whilst Notepad really purges the UTF-8 into Ansi which then functions at UR import as expected (I'm processing and checking numerous real-life files, by alternatively also adding the indent-level number to the titles, so that it doesn't vanish at import and can thus be visually checked for possible faults easily).

It seems that within Firefox' (UTF-8) html bookmarks export (in my case 22,000 items), and then even in "simple"-looking excerpts of just some bookmarks, AND then correctly reformatted for UR import, there are always hidden control chars which are left over from UTF-8 to Ansi IF the re-encoding is done in EmEditor, and which then upon UR import scramble that import ONLY and whenever the import goes UP in tree hierarchy, and near the "top" of the tree which is to be imported.
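Instead of relying on what an editor chooses to display, such leftovers can be listed by script. The following Python sketch reports every control or format character other than TAB, CR and LF; it makes no claim about what EmEditor actually leaves behind, and `hidden_characters` is a hypothetical helper name.

```python
import unicodedata

ALLOWED = {"\t", "\r", "\n"}  # the control codes a csv file is expected to contain

def hidden_characters(text):
    """Return (offset, code point, name) for each unexpected control/format character."""
    hits = []
    for offset, ch in enumerate(text):
        if ch not in ALLOWED and unicodedata.category(ch) in ("Cc", "Cf"):
            name = unicodedata.name(ch, "<unnamed control>")
            hits.append((offset, f"U+{ord(ch):04X}", name))
    return hits

print(hidden_characters("\ufeff1\tTitle\tText\r\n"))
```

Run on the decoded text of a file before import, an empty list means nothing invisible is left besides tabs and newlines.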

As said, will post again in some hours.

Spliff · #4 05-29-2023, 07:36 AM

I have been able to import my 22,000 FF bookmarks, together with a perfect replication of their tree within an UR tree, without any fault whatsoever, and in ONE "jet"; my script to get from html to tree, links and link-titles/comments (i.e. doing away with all the crap, incl. icons, most referrers, etc.), and then the UR import, both didn't take more than 30 seconds.
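For readers who want to try something similar: the following is NOT the script mentioned above, just a minimal Python sketch of the general approach, assuming the usual Netscape-style bookmarks export in which folders are H3 elements, links are A elements, and each nested DL list is one level deeper. The class name `BookmarkRows` and the sample snippet are made up.

```python
from html.parser import HTMLParser

class BookmarkRows(HTMLParser):
    """Collect (indent level, title, url) rows from a bookmarks export."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # current DL nesting depth
        self.rows = []
        self._tag = None    # "h3" or "a" while inside a title
        self._href = ""
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "dl":
            self.depth += 1
        elif tag in ("h3", "a"):
            # icons and other attributes are simply ignored
            self._tag, self._href, self._text = tag, dict(attrs).get("href") or "", []

    def handle_data(self, data):
        if self._tag:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "dl":
            self.depth -= 1
        elif tag == self._tag:
            self.rows.append((self.depth, "".join(self._text).strip(), self._href))
            self._tag = None

SAMPLE = """<DL><p>
    <DT><H3>Folder</H3>
    <DL><p>
        <DT><A HREF="https://example.com/" ICON="data:abc">Example</A>
    </DL><p>
    <DT><A HREF="https://example.org/">Other</A>
</DL><p>"""

parser = BookmarkRows()
parser.feed(SAMPLE)
parser.close()
print(parser.rows)
```

Each row can then be written out as indent level, title and content columns for the csv import.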

I used Notepad++ (also free) though, for the files and for the transcription UTF-8 to Ansi, since Notepad++ copes much better than Notepad, with (html) files of more than 30 MB (the text files (UTF-8) after running the script, and then after doing the conversion (Ansi), both are just about 2.5 MB, so I could have done the conversion with Notepad indeed, but in Notepad++, it worked just as fine).

I'm positive about my allegation above though that the conversion UTF-8 to Ansi, done within / by EmEditor, leaves one or more hidden control characters unchanged, instead of purging them, with the above-described effects.

Thus, problem resolved, all the more so since both Notepad and Notepad++ are free.

(For writing (code or otherwise), Notepad is the worst thing on earth, since there is not the slightest boundary between the left window boundary and your text, so your text is scarcely readable, and you don't even identify your cursor - which was the reason that even for just some (trial) lines above, I hadn't thought of Notepad, whilst obviously, it does a perfect transmission work (if the source file isn't too "heavy").)

Spliff · #5 05-30-2023, 02:09 AM

Kyle, we / I have had similar codepage problems 1 year ago, and here https://www.kinook.com/Forum/showthr...ht=ansi&page=2 , you said, "Ultra Recall is a Unicode application, and Windows uses UTF-16 encoding for its APIs.
https://docs.microsoft.com/en-us/win...2/intl/unicode
https://docs.microsoft.com/en-us/win...ntl/code-pages
Text is converted to UTF-8 encoding when stored in the URD file."

Since that was 1 year ago, and AFTER UTF-8 import did all the äöüéàè wrong, and then my Ansi trials worked fine, I hadn't remembered that very well, but remembered in the (mistaken) form that upon export, UR used UTF-8 then (btw, UTF-8 is not found by forum search, and for Ansi, I just find the link above, and this current thread).

So I'm at a loss here, since I am 100 p.c. positive that any UTF-8 csv import does misrepresent non-ASCII characters in UR then - I saw it in the tree and in the content -, whilst Ansi csv import renders all those äöüéàè correctly, both in the tree and in the content. So can we agree that csv import is the exception to the rule you stated over there? ;-)

kinook · #6 06-04-2023, 07:34 PM

If your .csv file is encoded in UTF-8, it needs a UTF-8 BOM (byte order mark) to indicate that. Otherwise, it will be assumed to be encoded in the current Windows code page.

https://www.kinook.com/Forum/showthread.php?t=5564
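For scripted workflows, adding the missing BOM is a three-byte affair. A minimal Python sketch operating on the raw bytes of a file (the function name `ensure_utf8_bom` is made up); the last lines also reproduce the typical symptom of UTF-8 bytes being read as Windows-1252.

```python
import codecs

def ensure_utf8_bom(raw):
    """Return the bytes of a UTF-8 file, with a BOM prepended if it is missing."""
    raw.decode("utf-8")  # raises UnicodeDecodeError if the data is not valid UTF-8
    if raw.startswith(codecs.BOM_UTF8):
        return raw
    return codecs.BOM_UTF8 + raw

utf8_bytes = "äö".encode("utf-8")
with_bom = ensure_utf8_bom(utf8_bytes)
print(with_bom)

# What UTF-8 bytes look like when they are interpreted as Windows-1252 instead
misread = "ä".encode("utf-8").decode("cp1252")
print(misread)
```

In practice one would read the file in binary mode, pass the bytes through the function and write them back; Python's "utf-8-sig" codec does the same when writing text.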

Spliff · #7 06-05-2023, 01:53 AM

Oh, my! Hadn't thought of that! Even (and especially) EmEditor (which I had declared inapt for the task above) is able to save non-BOM Unicode (or anything else) as BOM-Unicode.

(Hadn't ever seen the 2019 forum thread, and it's impossible to find "csv" or "UTF" (or "UTF-8") or other (even pertinent) 3-char search terms by the forum software's search (even tried google for "Ultra Recall forum" in vain, but not with "kinook", I have to admit) - whilst "Unicode" instead of "UTF" would have worked of course, just another case of not thinking of synonyms... took me 12 hours or so then, instead, on top of the bookmarks import script (which took much less) - so much for "debugging" vs the "constructive" work... when perhaps some fellow, long-time UR user could have chimed in, notifying me of the thread I had overlooked, and which would have spared me the time.) That being said:

Thank you very much for your clarification, more than welcome to me, will ease up the necessary "workflow" a lot next time... and even (free and quite ubiquitous) Notepad++ can save UTF-8 without-BOM to UTF-8 with...