database - Fast way to use String.Contains with huge list C#

link管理

链接快照平台

输入网页链接，自动生成快照
标签化管理网页链接

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

The result of listUser is: user1, user2.

The problem is if i have a huge listUser (more than 10 million) and huge key_blacklist (100.000). That code is very very slow. Is have anyway to get that faster?

UPDATE: I find new solution in there. http://cc.davelozinski.com/c-sharp/fastest-way-to-check-if-a-string-occurs-within-a-string Hope that will help someone when he got in there! :)

@AndreasNiedermair, @Jon How will


    HashSet

help when you're checking whether strings contain other strings? – Rawling Jan 26, 2015 at 10:45 @tinhve As this is tagged


    database

, presumably your 10 million items are stored in a database. The correct thing to do is get the database to do this, but for us to help you with that we need to know what kind of database you're using and how you access it. – Rawling Jan 26, 2015 at 10:47 Sub-string searches are always slow. But why do you add this user in the first placel? If you don't add him you don't need to remove him later. Also, as Rawling has already mentioned, what database are you using? Do you want to delete them from the database? – Tim Schmelter Jan 26, 2015 at 10:47

If you don't have much control over how the list of users is constructed, you can at least test each item in the list in parallel, which on modern machines with multiple cores will speed up the checking a fair bit.

        listuser.AsParallel().Where(
                foreach (var key in key_blacklist)
                    if (s.Contains(key))
                        return false; //Not to be included
                return true; //To be included, as no match with the blacklist
Also - do you have to use .Contains? .Equals is going to be much much quicker, because in almost all cases a non-match will be determined when the HashCodes differ, which can be found only by an integer comparison. Super quick.
If you do need .Contains, you may want to think about restructuring the app. What do these strings in the list really represent? Separate sub-groups of users? Can I test each string, at the time it's added, for whether it represents a user on the blacklist?
UPDATE: In response to @Rawling's comment below - If you know that there is a finite set of usernames which have, say, "hacker" as a substring, that set would have to be pretty large before running a .Equals test of each username against a candidate would be slower than running .Contains on the candidate. This is because HashCode is really quick.
                From the example blacklist and user names, I think it's pretty clear that he does need Contains.
– Rawling
                Jan 26, 2015 at 10:47
                But what isn't clear is whether he can avoid doing this by organising the usernames a little differently? Why does the blacklist need to contain sub-strings instead of a complete set of those blacklisted?
– Phil Whittington
                Jan 26, 2015 at 10:49
                (I'm not saying you're wrong, I'm just saying it's not clear what the other constraints are, and whether he can make it more efficient by taking a different approach)
– Phil Whittington
                Jan 26, 2015 at 10:50
                It's a blacklist of substrings. hacker presumably will block any username with hacker in it. But I agree with "test each string, at the time it's added, for representing a user on the blacklist?"
– Rawling
                Jan 26, 2015 at 10:50
                @tinhve - also bear in mind that 100 threads is overkill. Once you have more than one thread per core, it's likely that either not all the threads are being used or that the context switching is costing you CPU cycles.
– Phil Whittington
                Jan 26, 2015 at 11:37
If you are using entity framework or linq to sql then using linq and sending the query to a server can improve the performance.
Then instead of removing the items you are actually querying for the items that fulfil the requirements, i.e. user where the name doesn't contain the banned expression:
listUser.Where(u => !key_blacklist.Any(u.Contains)).Select(u => u).ToList();
A possible solution is to use a tree-like data structure.
The basic idea is to have the blacklisted words organised like this:
| + ha
|   + hac
|     - hacker
|     - [other words beginning with hac]
| + fu
|   + fuk
|     - fukoff
|     - [other words beginning with fuk]
Then, when you check for blacklisted words, you avoid searching the whole list of words beginning with "hac" if you find out that your user string does not even contain "h".
In the example I provided, with your sample data, this does not of course make any difference, but with the real data sets this should reduce significantly the number of Contains, since you don't check against the full list of blacklisted words every time.
Here is a code example (please note that the code is pretty bad, this is just to illustrate my idea)
using System;
using System.Collections.Generic;
using System.Linq;
class Program {
    class Blacklist {
        public string Start;
        public int Level;
        const int MaxLevel = 3;
        public Dictionary<string, Blacklist> SubBlacklists = new Dictionary<string, Blacklist>();
        public List<string> BlacklistedWords = new List<string>();
        public Blacklist() {
            Start = string.Empty;
            Level = 0;
        Blacklist(string start, int level) {
            Start = start;
            Level = level;
        public void AddBlacklistedWord(string word) {
            if (word.Length > Level && Level < MaxLevel) {
                string index = word.Substring(0, Level + 1);
                Blacklist sublist = null;
                if (!SubBlacklists.TryGetValue(index, out sublist)) {
                    sublist = new Blacklist(index, Level + 1);
                    SubBlacklists[index] = sublist;
                sublist.AddBlacklistedWord(word);
            } else {
                BlacklistedWords.Add(word);
        public bool ContainsBlacklistedWord(string wordToCheck) {
            if (wordToCheck.Length > Level && Level < MaxLevel) {
                foreach (var sublist in SubBlacklists.Values) {
                    if (wordToCheck.Contains(sublist.Start)) {
                        return sublist.ContainsBlacklistedWord(wordToCheck);
            return BlacklistedWords.Any(x => wordToCheck.Contains(x));
    static void Main(string[] args) {
        List<string> listUser = new List<string>();
        listUser.Add("user1");
        listUser.Add("user2");
        listUser.Add("userhacker");
        listUser.Add("userfukoff1");
        Blacklist blacklist = new Blacklist();
        blacklist.AddBlacklistedWord("hacker");
        blacklist.AddBlacklistedWord("fukoff");
        foreach (string user in listUser) {
            if (blacklist.ContainsBlacklistedWord(user)) {
                Console.WriteLine("Contains blacklisted word: {0}", user);
                A tree works fairly well if searching at the beginning of the string, but doesn't help at all for any other case.  Consider if a blacklisted word is "hacker" and a user has the name "ImNotAHacker".  The actual instance of a simple tree being able to significantly reduce the number of Contains() calls is pretty tiny for real-world data.  Otherwise, why do you think any RDBMS has to resort to a table scan when you say WHERE column LIKE "%somevalue%"?
– dodexahedron
                Jan 26, 2015 at 18:35
                @dodexahedron: if you have a single value to check against, you can't do better than a full table scan, but in this case (100k blacklisted words) this could do a difference anyway.
– Paolo Tedesco
                Jan 26, 2015 at 19:19
You are using the wrong thing. If you have a lot of data, you should be using either HashSet<T> or SortedSet<T>. If you don't need the data sorted, go with HashSet<T>. Here is a program I wrote to demonstrate the time differences:
class Program
    private static readonly Random random = new Random((int)DateTime.Now.Ticks);
    static void Main(string[] args)
        Console.WriteLine("Creating Lists...");
        var stringList = new List<string>();
        var hashList = new HashSet<string>();
        var sortedList = new SortedSet<string>();
        var searchWords1 = new string[3];
        int ndx = 0;
        for (int x = 0; x < 1000000; x++)
            string str = RandomString(10);
            if (x == 5 || x == 500000 || x == 999999)
                str = "Z" + str;
                searchWords1[ndx] = str;
                ndx++;
            stringList.Add(str);
            hashList.Add(str);
            sortedList.Add(str);
        Console.WriteLine("Lists created!");
        var sw = new Stopwatch();
        sw.Start();
        bool search1 = stringList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("List<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
        sw.Reset();
        sw.Start();
        search1 = hashList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("HashSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
        sw.Reset();
        sw.Start();
        search1 = sortedList.Contains(searchWords1[2]);
        sw.Stop();
        Console.WriteLine("SortedSet<T> {0} ==> {1}ms", search1, sw.ElapsedMilliseconds);
    private static string RandomString(int size)
        var builder = new StringBuilder();
        char ch;
        for (int i = 0; i < size; i++)
            ch = Convert.ToChar(Convert.ToInt32(Math.Floor(26 * random.NextDouble() + 65)));
            builder.Append(ch);
        return builder.ToString();
On my machine, I got the following results:
Creating Lists...
Lists created!
List<T> True ==> 15ms
HashSet<T> True ==> 0ms
SortedSet<T> True ==> 0ms
As you can see, List<T> was extremely slow comparted to HashSet<T> and SortedSet<T>. Those were almost instantaneous.
                My point in there is we have a huge list searchWords1 ~ 100,000 elements. So I have UPDATE in 1st post about compare .contain function to another way. And I want to ask the performace of .Contain. It very slow so I need another way to change .Contain function.
– tinhve
                Jan 27, 2015 at 3:40
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.