SQL: Select distinct based on regular expression

2023-01-29 18:19

Basically, I'm dealing with a horribly set up table that I'd love to rebuild, but am not sure I can at this point.

So, the table holds addresses, and it has a ton of similar entries for the same address. But there are sometimes slight variations in the address (e.g., a room # is tacked on IN THE SAME COLUMN, ugh).

Like this:

id | place_name      | place_street
1  | Place Name One  | 1001 Mercury Blvd
2  | Place Name Two  | 2388 Jupiter Street
3  | Place Name One  | 1001 Mercury Blvd, Suite A
4  | Place Name, One | 1001 Mercury Boulevard
5  | Place Nam Two   | 2388 Jupiter Street, Rm 101

What I would like to do in SQL (this is MSSQL), if possible, is run a query like:

SELECT DISTINCT place_name, place_street where [the first 4 letters of the place_name are the same] && [the first 4 characters of the place_street are the same].

to, I guess at this point, get:

Plac | 1001

Plac | 2388

Then I can figure out which main addresses I have to break out into another table to normalize this, because the rest are just slight variations.

I hope that makes sense.

I've done some research and I see people using regular expressions in SQL, but a lot of them seem to be using C scripts or something. Do I have to write regex functions and save them into the SQL Server before executing any regular expressions?

Any direction on whether I can just write them in SQL or if I have another step to go through would be great.

Or on how to approach this problem.

Thanks in advance!

Use the SQL function LEFT (substitute your own table name for AddressTable):

SELECT DISTINCT LEFT(place_name, 4), LEFT(place_street, 4)
FROM AddressTable
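
A fixed width of four fits the sample rows, but house numbers are not always four digits long. As a rough sketch, assuming the street number always leads the column, the built-in PATINDEX function can find the first non-digit without any CLR or saved regex function. The inline VALUES list needs SQL Server 2008 or later, and '12 Elm St' is a made-up row added only to show the difference.

```sql
-- Compare a fixed-width prefix with a "leading digits" prefix.
-- '12 Elm St' is a hypothetical row; the other two come from the question.
SELECT place_street,
       LEFT(place_street, 4) AS fixed_prefix,
       -- PATINDEX returns the position of the first non-digit character;
       -- appending a space guarantees a match even for an all-digit value.
       LEFT(place_street,
            PATINDEX('%[^0-9]%', place_street + ' ') - 1) AS street_number
FROM (VALUES ('1001 Mercury Blvd'),
             ('2388 Jupiter Street, Rm 101'),
             ('12 Elm St')) AS t (place_street);
```

For the made-up row, fixed_prefix comes back as '12 E' while street_number is '12', so the pattern-based version groups short house numbers more cleanly.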

I don't think you need regular expressions to get the results you describe. You just want to truncate each column to its first four characters and group by the result, which effectively gives you the distinct values.

SELECT left(place_name, 4), left(place_street, 4), count(*)
FROM AddressTable
GROUP BY left(place_name, 4), left(place_street, 4)

The count(*) column isn't necessary, but it gives you some idea of which values might have the most (possibly) duplicate address rows in common.
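
To try the grouping without touching the real table, here is a small sketch that loads the five sample rows from the question into a table variable. The name @Address and the column sizes are assumptions, and the multi-row INSERT needs SQL Server 2008 or later. Run it as a single batch.

```sql
-- @Address is a hypothetical stand-in for the real address table.
DECLARE @Address TABLE (
    id           INT,
    place_name   VARCHAR(100),
    place_street VARCHAR(100)
);

-- Sample rows copied from the question.
INSERT INTO @Address (id, place_name, place_street) VALUES
    (1, 'Place Name One',  '1001 Mercury Blvd'),
    (2, 'Place Name Two',  '2388 Jupiter Street'),
    (3, 'Place Name One',  '1001 Mercury Blvd, Suite A'),
    (4, 'Place Name, One', '1001 Mercury Boulevard'),
    (5, 'Place Nam Two',   '2388 Jupiter Street, Rm 101');

-- Group on the four-character prefixes, largest groups first.
SELECT LEFT(place_name, 4)   AS name_prefix,
       LEFT(place_street, 4) AS street_prefix,
       COUNT(*)              AS row_count
FROM @Address
GROUP BY LEFT(place_name, 4), LEFT(place_street, 4)
ORDER BY COUNT(*) DESC;
-- Returns: Plac | 1001 | 3
--          Plac | 2388 | 2
```

Sorting by the count puts the prefixes with the most near-duplicate rows at the top, which is a reasonable place to start normalizing.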

I would recommend you look into fuzzy search operations in SQL Server. They can match near-duplicates much better than a fixed prefix can. Search for "SQL Server fuzzy search" to get started.

Assuming at least SQL Server 2005 for the CTE:

;with cteCommonAddresses as (
    select left(place_name, 4) as LeftName, left(place_street,4) as LeftStreet
        from Address
        group by left(place_name, 4), left(place_street,4)
        having count(*) > 1
)
select a.id, a.place_name, a.place_street
    from cteCommonAddresses c
        inner join Address a
            on c.LeftName = left(a.place_name,4)
                and c.LeftStreet = left(a.place_street,4)
    order by a.place_name, a.place_street, a.id
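
As an alternative sketch of the same idea, a windowed COUNT(*) (also available since SQL Server 2005) avoids the self-join and keeps the group size on every row. The table variable @Address, its column sizes, and row 6 are assumptions for illustration; row 6 is made up to show a row with no near-duplicates being filtered out. The multi-row INSERT needs SQL Server 2008 or later.

```sql
-- @Address is a hypothetical stand-in for the Address table used above.
DECLARE @Address TABLE (id INT, place_name VARCHAR(100), place_street VARCHAR(100));

INSERT INTO @Address (id, place_name, place_street) VALUES
    (1, 'Place Name One',  '1001 Mercury Blvd'),
    (2, 'Place Name Two',  '2388 Jupiter Street'),
    (3, 'Place Name One',  '1001 Mercury Blvd, Suite A'),
    (4, 'Place Name, One', '1001 Mercury Boulevard'),
    (5, 'Place Nam Two',   '2388 Jupiter Street, Rm 101'),
    (6, 'Other Venue',     '77 Saturn Ave');  -- made-up row, not from the question

-- Count the rows sharing each prefix pair, then keep only shared prefixes.
SELECT id, place_name, place_street, group_size
FROM (
    SELECT id, place_name, place_street,
           COUNT(*) OVER (PARTITION BY LEFT(place_name, 4),
                                       LEFT(place_street, 4)) AS group_size
    FROM @Address
) AS a
WHERE group_size > 1
ORDER BY place_name, place_street, id;
```

With this data, rows 1 through 5 come back and row 6 is dropped because its prefix pair occurs only once.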
