Getting readable diff displays in Mercurial on Unicode files (MS Windows)

2023-01-02 20:12 Asked by:

I'm trying to store some Windows PowerShell scripts in a Mercurial repository. It seems the PowerShell editor likes to save files as UTF-16 Unicode. This means the files contain lots of \0 bytes, and the presence of \0 bytes is what Mercurial uses to distinguish "binary" files from "text" files. I understand that this makes no difference to how Mercurial stores the data, but it does mean that it displays binary diffs, which are kind of hard to read. Is there a way to tell Mercurial that these really are text files? Presumably I would need to convince Mercurial to use an external Unicode-aware diff program for particular file types.
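A minimal sketch of why this happens, assuming the text/binary check amounts to looking for a NUL byte in the file content. The sample script line and the name looks_binary are made up for illustration and are not taken from Mercurial:

```python
import codecs

# Hypothetical one-line PowerShell script, used only for illustration.
script = 'Write-Host "hello"\r\n'

# What an editor saving as UTF-16 (little-endian, with BOM) would write.
utf16_bytes = codecs.BOM_UTF16_LE + script.encode("utf-16-le")
# The same text saved as UTF-8.
utf8_bytes = script.encode("utf-8")


def looks_binary(data):
    """NUL-byte heuristic: any zero byte marks the content as binary."""
    return b"\0" in data


# Every ASCII character gets a zero high byte in UTF-16, so the first
# check reports True while the UTF-8 copy of the same text reports False.
print(looks_binary(utf16_bytes))
print(looks_binary(utf8_bytes))
```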

This may not be relevant to you; read the last paragraph if it doesn't sound like it is.

I'm not sure whether this is what you need, but I have needed diffs of UTF-16LE content that show more than just "binary files are different". When I searched for this some months ago I found a thread and a bug report discussing it; here's part of it. I can't find the original source of this mini-extension now (though it does just what that patch does), but what I ended up with was an extension, BOM.py:

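#!/usr/bin/env python
"""Treat content that starts with a Unicode BOM as text, even with NUL bytes."""

import codecs

from mercurial import util

# Byte-order marks that identify content as Unicode text.
boms = [
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE
    ]

def binary(s):
    """Return True if s looks like binary data."""
    if s:
        for bom in boms:
            if s.startswith(bom):
                return False
        # No BOM: fall back to the usual NUL-byte check. The bytes literal
        # keeps the test valid on both Python 2 and Python 3.
        return b'\0' in s
    return False


def reposetup(ui, repo):
    # Swap in the BOM-aware check when a repository is set up.
    util.binary = binary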

This gets loaded in the .hgrc (or your users\username\mercurial.ini) like this:

[extensions]
bom = ~/.hgexts/BOM.py

Note the path will vary between Windows and Linux; on my Windows copy I put the path as \...\whatever (it's on a USB disk where the drive letter can change). Unfortunately relative paths are taken relative to the current working directory rather than the repository root or any such thing, but if you are saving it on your C: drive, you can just put the full path.

In Linux (my main development environment), this works well; in Command Prompt (which I still use regularly), it generally works well. I've never tried it in PowerShell, but I would expect it to be better than Command Prompt in its support for arbitrary null bytes in the command line.
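To see what the BOM-aware check in BOM.py decides for different inputs, it can be run on its own. The sketch below copies the function so that it runs without Mercurial installed; the sample byte strings are made up for illustration:

```python
import codecs

# Same BOM list and logic as BOM.py, copied so this runs without Mercurial.
BOMS = [
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE,
    codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE,
]


def binary(s):
    if s:
        for bom in BOMS:
            if s.startswith(bom):
                return False
        return b"\0" in s
    return False


# Hypothetical sample inputs.
with_bom = codecs.BOM_UTF16_LE + "Get-Date".encode("utf-16-le")
without_bom = "Get-Date".encode("utf-16-le")
png_header = b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"

for label, data in [
    ("UTF-16LE with BOM", with_bom),
    ("UTF-16LE without BOM", without_bom),
    ("PNG header", png_header),
    ("empty", b""),
]:
    print(label, binary(data))
```

Note that only content starting with a BOM is rescued: UTF-16 text saved without a BOM still contains NUL bytes and is still classified as binary.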

I'm not sure if this is what you want at all; from the way you've said "binary diffs" I suspect you may already either have this or be running hg diff -a, which achieves the same thing. In that case, all I can think of is writing another extension which takes the UTF-16LE content and decodes it to UTF-8. I'm not sure of the syntax for such an extension, but I might try that out.

Edit: having now trawled through the Mercurial source (commands.py, cmdutil.py, patch.py and mdiff.py), I see that binary diffs are done with a base85 encoding (patch.b85diff) rather than the normal diff. I wasn't aware of that; I thought it just forced a normal diff of the file. In that case, perhaps this answer is relevant after all. I await a response to see if it is!
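As a rough illustration of why a base85 diff is hard to read, the sketch below encodes a small UTF-16 file with Python's standard-library base64.b85encode. This is a stand-in chosen for the example, not Mercurial's own encoder, and a real binary patch contains more than the raw encoding, so the output will not match what hg prints. The sample script line is made up:

```python
import base64
import codecs

# Hypothetical UTF-16LE file content, used only for illustration.
content = codecs.BOM_UTF16_LE + 'Write-Host "hello"\r\n'.encode("utf-16-le")

# base85 maps every 4 bytes to 5 printable ASCII characters.
encoded = base64.b85encode(content)
print(encoded.decode("ascii"))

# The encoding is lossless, but the text of the script is no longer visible.
print(base64.b85decode(encoded) == content)
```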

I have worked around this by creating a new file with Notepad++ and saving it as a PowerShell file (.ps1 extension). Notepad++ creates the file as a plain-text ANSI file. Once it has been created, I can open the file in the PowerShell editor and make any changes I need without the editor modifying the file encoding.

Disclaimer: I encountered this just moments ago and so I am not sure if there are any repercussions but so far my scripts appear to work as normal and my diffs are showing up nicely.
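For scripts that already exist as UTF-16, the same re-encoding could be scripted instead of recreating each file by hand. A minimal sketch, assuming the file starts with a UTF-16 BOM and that UTF-8 without a BOM is an acceptable target (for ASCII-only scripts the bytes are the same as an ANSI file). The function name reencode_utf16_file is made up for this example:

```python
import codecs


def reencode_utf16_file(path, target_encoding="utf-8"):
    """Rewrite a BOM-marked UTF-16 file in target_encoding.

    Returns True if the file was rewritten, False if it was left alone.
    """
    with open(path, "rb") as f:
        data = f.read()

    # A UTF-32 LE BOM begins with the UTF-16 LE BOM, so rule UTF-32 out first.
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return False
    if not data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return False

    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    text = data.decode("utf-16")

    # Binary mode keeps the original line endings untouched.
    with open(path, "wb") as f:
        f.write(text.encode(target_encoding))
    return True
```

Calling it a second time on the same file returns False, because the rewritten file no longer starts with a UTF-16 BOM.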

If my other answer does not do what you want, I think this one may; although I haven't tested it on Windows at all yet, it's working well in Linux. It does what is potentially a nasty thing, in wrapping mercurial.mdiff.unidiff with a new function which converts utf-16le to utf-8. This will not affect hg st, but will affect hg diff. One potential pitfall is that the BOM will also be changed from UTF-16LE BOM to the UTF-8 BOM.

Anyway, I think it may be useful to you, so here it is.

Extension file utf16decodediff.py:

import codecs
from mercurial import mdiff

unidiff = mdiff.unidiff

def new_unidiff(a, ad, b, bd, fn1, fn2, r=None, opts=mdiff.defaultopts):
    """
    A simple wrapper around mercurial.mdiff.unidiff which first decodes
    UTF-16LE text.
    """

    if a.startswith(codecs.BOM_UTF16_LE):
        try:
            # Gets reencoded as utf-8 to be a str rather than a unicode; some
            # extensions may expect a str and may break if it's wrong.
            a = a.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass

    if b.startswith(codecs.BOM_UTF16_LE):
        try:
            b = b.decode('utf-16le').encode('utf-8')
        except UnicodeDecodeError:
            pass

    return unidiff(a, ad, b, bd, fn1, fn2, r, opts)

mdiff.unidiff = new_unidiff

In .hgrc:

[extensions]
utf16decodediff = ~/.hgexts/utf16decodediff.py

(Or equivalent paths.)
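The BOM pitfall mentioned above can be checked in isolation. The sketch below applies the same decode('utf-16le').encode('utf-8') step that the extension uses to a made-up sample, without involving Mercurial:

```python
import codecs

# Hypothetical UTF-16LE file content with a BOM.
original = codecs.BOM_UTF16_LE + "Get-Date\r\n".encode("utf-16-le")

# Same conversion as in utf16decodediff.py. The "utf-16le" codec does not
# strip the BOM, so it survives as U+FEFF and is re-encoded as a UTF-8 BOM.
converted = original.decode("utf-16le").encode("utf-8")

print(converted.startswith(codecs.BOM_UTF8))
print(converted[len(codecs.BOM_UTF8):])
```

Decoding with the "utf-16" codec instead would consume the BOM, at the cost of the diff no longer showing that the file has one.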
