添加链接
link管理
链接快照平台
  • 输入网页链接,自动生成快照
  • 标签化管理网页链接

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement . We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account SqlError number 4060 (user login failed as default database doesn't exist) shouldn't retry #33191 SqlError number 4060 (user login failed as default database doesn't exist) shouldn't retry #33191 DamianEdwards opened this issue Feb 28, 2024 · 20 comments

SqlServerRetryingExecutionStrategy today treats a SqlError.Number == 4060 as a transient exception that can be retried. This error occurs when the login credentials are valid but the requested default database can't be accessed (e.g. when it hasn't been created by migrations yet, see details here ). This leads to a poor developer experience in the case of an ASP.NET Core project configured with the EF Core migrations endpoint and developer exception page filter that displays an "Apply migrations" button, as the page won't be displayed until the retry loop has been completely exhausted (minutes).

The Npgsql retry strategy treats the equivalent PostgreSQL error as fatal and doesn't retry.

It's possible that other 406x error numbers should be consider fatal too.

I think one reason for this was that in SQL Server, CREATE DATABASE returns before the database is actually ready and operational, so there's a race condition when trying to use it after creation (and the retry makes that go away (but @AndriySvyryd can confirm/provide more context). Of course if this error occurs without having just done CREATE DATABASE, it makes no sense to retry - but it's a distinction we can't make (we just had a discussion around how tricky a concept "transience" is... here's yet another example).

IIRC on PostgreSQL when CREATE DATABASE returns, you're guaranteed to be able to access the database, so it never makes sense to retry after "database doesn't exist".

Fundamentally, 4060 on SQL Server is transient . There is a very good chance that if the database has just been created, then there will be a failure of this kind that is then fixed when it re-tries. I think the real issue here is that a deployment/development flow with SQL Server/Azure SQL that assumes you can reliably and quickly check if a database exists is broken. I believe we should be developing experiences around the idea that things are provisioned and then there may be many minutes before the next step can be completed.

I've been thinking about all this a bit recently. FWIW testcontainers (which we've mentioned before) has a whole "waiting strategy" infrastructure for verifying when a service is actually up and ready to accept requests ( here are all the different supported strategies ). For example, here's the strategy for SQL Server . It seems like we may need to do something a bit similar with Aspire - hopefully we're not duplicating too much work here etc. (/cc @stephentoub , we chat about this recently).

Of course, all that's about services/container availability, which is unrelated to when a specific database within SQL Server is ready.

There is a very good chance that if the database has just been created [...]

That's true, though it's not always going to be the case (a user can also just get the database name wrong, for instance). We could explore a direction where instead of treating 4060 in general as transient (and retry wherever it's encountered), only specific places where we're creating a database would retry on that error, to try to separate the "creating" scenario from others. But I don't know how feasible this actually is or what problems it could create - any thoughts @ajcvickers @AndriySvyryd ?

Of course, all that's about services/container availability, which is unrelated to when a specific database within SQL Server is ready.

For Azure SQL/SQL Server, the database is effectively another service that needs provisioning like other services. For PostgreSQL, the database isn't really a service, just something that a service owns.

FWIW testcontainers (which we've mentioned before) has a whole "waiting strategy" infrastructure for verifying when a service is actually up and ready to accept requests

Yes, I think something like this would be great. Hopefully, it's something that is being considered as a general problem for orchestration of services, rather than as a database (or testcontainer) specific problem.

Yeah, interesting conversation.

SQL Azure does support CREATE DATABASE , so there's a possible usage of SQL Azure that mirrors the on-premise usage, right? I mean, assuming that one day there will be a SQL Azure local development docker image, you could bring one of those containers up, and then create multiple databases in that one container. Assuming that will be supported, SQL Azure would behave the same way as on-perm SQL Server in that respect.

But I see your point that in at least some typical scenarios, the container and the database will be the same thing, and if your Aspire application requires multiple database, it might bring up multiple containers when doing SQL Azure local dev. We may want to reach out to the relevant people to understand what the recommended pattern will actually look like...

In any case, treating 4060 differently (as transient when creating a database, as non-transient otherwise) still seems like it makes sense at least for on-perm.

SQL Azure does support CREATE DATABASE , so there's a possible usage of SQL Azure that mirrors the on-premise usage, right?

Yes, but even with a local SQL Server installation, CREATE DATABASE returns before the database is ready. It sometimes can be many minutes on a slow machine before the database is ready. This took us ages to make reliable in our test infrastructure, and is still not reliable on LocalDb.

a user can also just get the database name wrong, for instance

I'm pretty sure there are other error codes for things like that - see error 911 for example.

https://learn.microsoft.com/en-us/sql/relational-databases/errors-events/database-engine-events-and-errors-0-to-999?view=sql-server-ver16

Azure SQL documentation explicitly lists 4060 as transient.

https://learn.microsoft.com/en-us/azure/azure-sql/database/troubleshoot-common-errors-issues?view=azuresql#list-of-transient-fault-error-codes

I'd want to understand more about how OP is coming across this; the example of migrations not being applied yet is given, but not sure if that gives the full picture?

Can the SQL team explain 4060 in more detail?

That's a good point - there's an on-prem/Azure distinction here. But do you think it still makes sense to limit retry for this particular error only to contexts where the database is actually being created?

With the default retry strategy we only retry 4060 after creating the database
efcore/src/EFCore.SqlServer/Storage/Internal/SqlServerDatabaseCreator.cs Line 300 4341961

@stevendarby Yes, but I was responding to the comment about the possibility of treating 4060 differently based on whether the target was on-prem or Azure Sql. And for on-prem one can use the default SqlServerExecutionStrategy that already only retries 4060 when the database is being created.

In general, I don't think that the developer exception page is appropriate when targeting Azure Sql. Migrations should be applied through a migrations project compiled to an exe, migrations bundle or a migrations script.

In general, I don't think that the developer exception page is appropriate when targeting Azure Sql. Migrations should be applied through a migrations project compiled to an exe, migrations bundle or a migrations script.

For my knowledge/understanding, are you saying there's something specific about Azure Sql which makes the developer page less appropriate (than e.g. on-prem SQL Server)? Or are you generally referring to migration bundles/scripts being a better way compared to applying migrations at runtime? I might be missing something about Azure Sql here.

@roji I don't think on-prem is fundamentally different from Azure SQL. We should do the same for both. What is pragmatically different is that Azure SQL shows very clearly and painfully that the database deployment experience we have, especially for cloud native, is pretty bad. The way we should fix that is to fix it for both on-prem and Azure SQL, by having a good experience that works for both. Because the experience now for cloud is so bad, it is bringing the whole issue to the fore again, and, as we have talked about for years, the current experience from on-prem is also not good except in some narrow happy paths.

Update the code in Program.cs to enable retries:

var connectionString = builder.Configuration.GetConnectionString("DefaultConnection") ?? throw new InvalidOperationException("Connection string 'DefaultConnection' not found.");
builder.Services.AddDbContext<ApplicationDbContext>(options =>
    options.UseSqlServer(connectionString, sql => sql.EnableRetryOnFailure())); // <-- Add this config lambda here
builder.Services.AddDatabaseDeveloperPageExceptionFilter();

Run the project with F5 (or Ctrl+F5)

Navigate to the Login page

Enter an email address and password, e.g. [email protected]

Click the "Log in" button

The page will then sit there spinning for a very long time (I timed 2 minutes) as the execution strategy goes through the default retry with backoffs logic (you can see the exceptions in the app's console output).

Eventually, the developer exception page will be shown with the "Apply Migrations" button, if you were patient enough to wait that long.

This impact is amplified with projects using .NET Aspire because the Aspire components for EF Core providers default to enabling retries.